Virtual Prototyping of NoC-based MPSoC : Fundamentals & Case Studies

Transcription

1 Virtual Prototyping of NoC-based MPSoC: Fundamentals & Case Studies

Virtual prototyping principles
- Virtual prototyping with TLM2.0, Laurent Maillet-Contoz, STMicro
- SystemC and the various abstraction levels for NoC-based MPSoC modelling, the corresponding simulation algorithms and the VCI/OCP communication standards, Alain Greiner, LIP6
- Design space exploration & mapping tool, Nicolas Pouillon, LIP6

Use cases
- Modelling an H264 decoder, Fabien Colas-Bigey, Thales
- Performance evaluation of a multi-core architecture, Nguyen Nam, Bull
- Prototyping a hardware-based page migration mechanism on NoC-based shared memory multicore architectures, Frederic Pétrot, TIMA

ASYNC 2010, NOCS, 6th May 2010, CEA-LETI, MINATEC, Grenoble, France

3 Virtual prototyping with TLM2.0
Laurent Maillet-Contoz

Agenda
- Motivations for virtual prototyping
- ESL standards: rationale and evolution
- TLM2 at a glance
- Comments on virtual prototyping experience
- Conclusion

4 Motivations for Virtual Prototyping
Activities served by a SoC virtual prototype:
- SoC architecture exploration
- Functional validation environment
- Pre-silicon software development
Alternatives (HDL simulation, FPGA prototype): accurate (+), but available too late (-) and either too slow or too costly (-).
[Figure: SoC virtual prototyping. Labels: SW, input devices, TLM models (HW blocks), H.M.P. SoC, output devices.]

5 Rationale to support standards
Model interoperability
- Integrate models coming from different IP suppliers
- Deliver subsystems and/or virtual platforms to customers
CAD tools support
- Benefit from best-in-class tools from various providers without migration campaigns
SystemC and IP-XACT standards are required and complementary.

6 Adopting standards
OSCI/SystemC
- A single language for modeling hardware/software systems
- Supports multiple abstraction levels
- An object-oriented approach built on top of C++ as a set of classes
SPIRIT/IP-XACT
- Covers HW IP interfaces, register banks and configurations
- Supports RTL and TLM abstractions
- Based on XML
Benefits of standards
- Enable competition between suppliers
- Avoid dependency on the proprietary formats of suppliers
- Enable adoption of new approaches inside the company

SystemC standards evolution
- OSCI TLM1 (PV, PVT): core TLM interfaces (transport, put, get); payload: REQ, RESP
- OSCI TLM2 (LT, AT): TLM interfaces (b_transport, nb_transport); payload: tlm_transaction; extension
- IEEE: complements; incl. TLM1 and TLM2

7 TLM2 content
- Targets interoperability of TLM models
- Dedicated to memory-mapped bus communication
- 2 modeling styles
  - Loosely Timed (LT)
  - Approximately Timed (AT)
- Transaction-level communication APIs
  - Generic transaction payload
  - Transport
  - Backdoor and debug
- Sockets
  - Expose all interfaces
  - Simplify binding

8 Modeling styles
Loosely Timed (LT)
- Sufficient to develop embedded software and functional verification test suites
- Able to boot an OS and run multi-core systems
- The system synchronization scheme has to be modelled
- Easy and fast to set up
Approximately Timed (AT)
- Cycle-approximate or cycle-count-accurate
- Suitable for architectural exploration
- Supports pipelining and out-of-order transactions
- Requires significant modeling effort

TLM2 payload
- Typical attributes of memory-mapped buses: command, address, data, byte enables, single word transfers, burst transfers, streaming, response status
- Standard-defined extension mechanism
Generic payload
- Can be used as is for the LT style (abstract bus)
  - Define ignorable extensions for simulation artefacts
- A starting point to model specific bus protocols
  - Define mandatory extensions to cover missing attributes
  - Compile-time type checking to avoid incompatibility

9 Extension mechanism
- A way to extend the generic payload definition
  - Ignorable: binding of heterogeneous models allowed
  - Mandatory: strict matching required
- The generic payload defines an array of pointers to extensions
- The extension mechanism is standard, the extensions are not!

TLM2 b_transport interface
- Typically used in a Loosely Timed context
- The transaction is completed when the function call returns
  - No pipelining
  - No out-of-order completion
- No thread needed on the target side
- May execute wait() in the implementation

    void b_transport( TRANS&, sc_time& );

[Figure: forward path from initiator to target]
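The blocking style described above can be illustrated without the SystemC library. The following is a minimal sketch, assuming hypothetical stand-in types (`Payload`, `Time`, `Memory`, `write_then_read`) that are not the real `tlm::` classes: the target finishes the whole transaction inside one call and annotates its latency on the `delay` argument instead of waiting.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the TLM2 types; these are NOT the real tlm:: classes.
enum class Command { Read, Write };
struct Payload {
    Command  cmd;
    uint64_t address;   // word index in this sketch
    uint32_t data;
    bool     ok;        // response status
};
using Time = uint64_t;  // stand-in for sc_time, in nanoseconds

// Target: the transaction is complete when b_transport returns.
class Memory {
public:
    Memory(std::size_t words, Time latency_ns)
        : m_words(words, 0), m_latency(latency_ns) {}

    void b_transport(Payload& trans, Time& delay) {
        if (trans.address >= m_words.size()) { trans.ok = false; return; }
        if (trans.cmd == Command::Write) m_words[trans.address] = trans.data;
        else                             trans.data = m_words[trans.address];
        delay += m_latency;   // annotate timing instead of waiting
        trans.ok = true;
    }
private:
    std::vector<uint32_t> m_words;
    Time m_latency;
};

// Initiator side: write a word, read it back, return the accumulated delay.
Time write_then_read(Memory& mem, uint64_t addr, uint32_t value, uint32_t& read_back) {
    Time delay = 0;
    Payload w{Command::Write, addr, value, false};
    mem.b_transport(w, delay);
    assert(w.ok);
    Payload r{Command::Read, addr, 0, false};
    mem.b_transport(r, delay);
    read_back = r.data;
    return delay;
}
```

Because each call returns with the response already in the payload, the initiator needs no phase bookkeeping; it only accumulates the annotated delay.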

10 TLM2 nb_transport interface
- Typically used in an Approximately Timed context
- Requires phase management: BEGIN_REQ, END_REQ, BEGIN_RESP, END_RESP
- A thread is required on the target side
- wait() in the implementation is not allowed

    tlm_sync_enum nb_transport_fw( TRANS&, PHASE&, sc_time& );
    tlm_sync_enum nb_transport_bw( TRANS&, PHASE&, sc_time& );

[Figure: forward path from initiator to target, backward path from target to initiator]

Other interfaces
Direct memory access
- Backdoor access to enable simulation speed-up
- Interconnect model and sockets are bypassed
- The target is responsible for invalidating the memory pointer if needed

    bool get_direct_mem_ptr( TRANS& trans, tlm_dmi& dmi_data );
    void invalidate_direct_mem_ptr( sc_dt::uint64 start_range, sc_dt::uint64 end_range );

Debug API
- Same as regular b_transport, but for debug accesses
- No side effect in the target
- No timing

    unsigned int transport_dbg( TRANS& trans );
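The phase management required by nb_transport can be pictured as a small state machine. The sketch below is a simplification under a stated assumption: it requires all four phases, strictly in the order listed above, for a single transaction. The names `Phase`, `advance` and `is_complete_transaction` are hypothetical and do not belong to the TLM2 library.

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical phase type mirroring BEGIN_REQ, END_REQ, BEGIN_RESP, END_RESP.
enum class Phase { Idle, BeginReq, EndReq, BeginResp, EndResp };

// Returns true and updates `current` if `next` is the phase expected after it.
bool advance(Phase& current, Phase next) {
    Phase expected;
    switch (current) {
        case Phase::Idle:      expected = Phase::BeginReq;  break;
        case Phase::BeginReq:  expected = Phase::EndReq;    break;
        case Phase::EndReq:    expected = Phase::BeginResp; break;
        case Phase::BeginResp: expected = Phase::EndResp;   break;
        default:               return false;  // EndResp: the transaction is over
    }
    if (next != expected) return false;
    current = next;
    return true;
}

// Checks a whole phase trace for one transaction.
bool is_complete_transaction(std::initializer_list<Phase> trace) {
    Phase p = Phase::Idle;
    for (Phase next : trace)
        if (!advance(p, next)) return false;
    return p == Phase::EndResp;
}
```

A checker of this kind is one way to catch models that disagree on phase ordering before they are bound together.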

11 Sockets
- Can be seen as super communication ports
- Group the transport, DMI and debug transport interfaces
- Bind forward and backward paths with a single call
- Strong connection checking
- Have a bus width parameter

12 Full interoperability is still the Holy Grail
- The model ecosystem is emerging slowly
  - Issues with business models from IP/model suppliers
- TLM2-based models are not always interoperable
  - Variant interpretations of the standard
- Important items still to be addressed
  - Model configuration, control & inspection (ongoing work in the OSCI CCI working group)
  - Interrupt management
  - System address map
  - Synchronization schemes

Transaction Level Model Value Chain
[Figure: value chain. Labels: IP specification; SoC specification; SystemC (C++ class library, IEEE); TLM (TLM1.0, TLM2.0 communication APIs, incl. in IEEE revision); methodology (TLM modeling on top of standards); modeling engineers' expertise (high added value); IP TLM models and an interconnect TLM model assembled into the SoC virtual prototype.]
TLM model
- Bit accurate
- Register accurate
- Functionally correct
- Communicates through transactions
- Variable timing accuracy

13 Virtual Prototypes with TLM
Applied in the industry for complex SoCs
- System architecture exploration
- Anticipation of RTL functional verification
- Pre-silicon software development
[Figure labels: Functional Verification; Functional Embedded Software; TLM/RTL Co-simulation; LT Models (LT: Loosely Timed); Architecture Analysis; AT Models (AT: Approximately Timed); Embedded Software Optimization.]

Conclusion
- Virtual prototypes enable key activities
  - System architecture exploration
  - Anticipation of verification activities
  - Pre-silicon software development
- TLM2 is a step towards model interoperability
  - But several steps further down the road are needed
- Industry is adopting virtual prototypes
  - Proprietary developments
  - Use of standards
  - Well-defined modeling methodology
- Education is key to success!

15 Virtual prototyping of NoC based MP-SoCs
SystemC modeling in the SoCLib platform
Alain Greiner

OUTLINE
- The SoCLib Platform
- The VCI protocol
- CABA modeling
- TLM-DT modeling

Alain Greiner NOC 2010 : Virtual prototyping of NoC based MP-SoCs / The SoCLib Platform

16 Example of a shared memory MPSoC architecture
[Figure: an input engine, an output engine and two external RAM controllers are attached through VCI to a VCI/OCP network on chip. Sub-system 5 and sub-system 9 each have a VCI/OCP local interconnect with processors (Proc 5.1, Proc 5.4; Proc 9.1, Proc 9.4), a RAM (RAM 5.1; RAM 9.1) and a locks engine (Locks 5.2; Locks 9.2).]

GOALS
- Design space exploration: fast performance evaluation for the mapping of a multi-threaded software application on a multi-processor hardware architecture.
- Hardware / software codesign: provide the system designer with a reliable simulation environment to develop embedded software.
- Platform-based design: define a set of reference generic hardware/software multi-processor architectures.
- IP reuse: clear separation between computing resources and communication resources to support component reuse.

17 SoCLib Partners
The SoCLib platform has been funded by the French authorities, and developed by 10 industrial companies and 10 laboratories:
- Industrial companies: Thales, ST Micro-electronics, Thomson, Orange, Magillem Design Services, TurboConcept
- Laboratories: LIP6, INRIA, CEA-LIST, ENST, IRISA, LabSTIC, TIMA, CEA-LETI, CITI, IETR

SoCLib General Principles
- The modeling language is SystemC.
- Two simulation models for each hardware component:
  - CABA (Cycle-Accurate / Bit-Accurate)
  - TLMT (Transaction Level Model with Time)
- All hardware components respect the same (VCI/OCP) communication protocol.
- All simulation models will be available as free software.
- For each SoCLib component, there is a commercially available synthesizable RTL model.

18 Hardware Components (1/2)
COMPONENTS
- General purpose processors: MIPS32, SPARC V8, ARM-v6t, PowerPC 405, Lattice Mico32, Nios2, MicroBlaze
- Digital signal processors: ST231, TMS320 C62
- System utilities: interrupt controller, locks engine, DMA controller, MWMR controller, frame buffer controller, disk controller
- NoC interconnects: VCI Generic Micro Network, DSPIN Micro Network
ORIGIN (in slide order): LIP6, ENST, LIP6, LIP6, ENST, IRISA, TIMA, INRIA, IRISA, LIP6, LIP6, LIP6, LIP6, LIP6, LIP6, LIP6, LIP6, IRISA

Hardware Components (2/2)
COMPONENTS
- Bus & crossbar interconnects: VCI Generic System Bus, VCI / Pibus, VCI / Avalon, VCI / Token Ring, VCI / Local Crossbar
- Memory controllers: embedded SRAM, external SDRAM
- Dedicated coprocessors: turbo decoder TC1000, turbo decoder TC3000, LDPC, DWT, MOD, DEM
- Bridges: VCI / PCI bridge, VCI / USB bridge, VCI / CAN bridge, VCI / I2C bridge, VCI / Wishbone
ORIGIN (in slide order): LIP6, LIP6, IRISA, LIP6, LIP6, LIP6, Magillem, TURBOCONCEPT, TURBOCONCEPT, ENST, LESTER, LESTER, LESTER, IETR, ENST, ENST, LISIF, ENST

19 Processor Modeling
All SoCLib general purpose processors are modeled as cycle-based ISS (Instruction Set Simulators). The same ISS is used by the two simulation models.

Processor        CABA             TLM-DT
MIPS32           CABA/MIPS_ISS    TLM-DT/MIPS_ISS
SPARC V8         CABA/SPARC_ISS   TLM-DT/SPARC_ISS
PowerPC 405      CABA/PPC_ISS     TLM-DT/PPC_ISS
ARM-v6t          CABA/ARM_ISS     TLM-DT/ARM_ISS
Lattice Mico32   CABA/L32_ISS     TLM-DT/L32_ISS
MicroBlaze       CABA/MB_ISS      TLM-DT/MB_ISS
Nios2            CABA/NIOS_ISS    TLM-DT/NIOS_ISS

Operating systems
Various operating systems have been ported or developed for SoCLib-based MP-SoC architectures:
- DNA/OS: micro-kernel for shared memory MP-SoCs, providing the POSIX thread API.
- MutekH: exo-kernel-based, supporting both homogeneous and heterogeneous MP-SoC architectures, with POSIX threads and OpenMP support.
- RTEMS: real-time operating system for message passing multi-processor architectures.
- NetBSD: general purpose UNIX-like operating system with paged virtual memory support and all UNIX facilities (monoprocessor).

20 VCI Goals
- IP reuse requires a clear separation between computation and communication in designing reusable hardware components.
- Support scalable, massively parallel multi-processor architectures: we want to build hardware architectures containing several hundreds of embedded processors.
- Support the shared address space communication paradigm: each VCI initiator can send «read» or «write» commands to each VCI target existing in the system, and the selected target is identified by the MSB bits of the address.
- Simplify the access to the interconnect, by giving each initiator the illusion that it has a direct, point-to-point communication with each target in the system.
- Support various interconnect architectures: from the classical shared system bus, to distributed, hierarchical networks on chip.
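As an illustration of target selection by the MSB bits of the address, here is a minimal sketch. The 32-bit address width, the function name `decode_target` and the choice of using the raw MSB value as the target index are assumptions made for the example, not details taken from the slides.

```cpp
#include <cassert>
#include <cstdint>

// Selects the target index from the `msb_bits` most significant bits
// of a 32-bit address (address width is an assumption of this sketch).
unsigned decode_target(uint32_t address, unsigned msb_bits) {
    assert(msb_bits >= 1 && msb_bits <= 31);
    return address >> (32 - msb_bits);
}
```

With 4 MSB bits, for instance, address 0x10000000 selects target 1, and every initiator computes the same mapping, which is what gives the shared address space.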

21 Shared address space communications
[Figure: initiators I0, I1, I2, I3 reach targets T0, T1, T2, T3 through an interconnect.]

System bus
[Figure: initiators I0, I1, I2, I3 and targets T0, T1, T2, T3 on a shared bus.]
- Complexity: O(M + T), where M is the number of initiators (masters) and T the number of targets
- Bandwidth: O(1)
=> not scalable

22 Cross-bar
[Figure: initiators I0, I1, I2, I3 connected to targets T0, T1, T2, T3 through a crossbar.]
There are actually two separate crossbars, for commands & responses.
- Complexity: O(M * T)
- Bandwidth: O(max(M, T))
=> not scalable

Multi-stage network on chip
[Figure: initiators I0, I1, I2, I3 connected to targets T0, T1, T2, T3 through four routers.]
Here again there are actually two separate networks, for commands & responses.
- Complexity: O(log(M + T))
- Bandwidth: O(max(M, T))
=> scalable

23 Virtual Component Interface
[Figure: initiators I0, I1, I2 and targets T0, T1, T2, T3 are attached to the interconnect through VCI ports; commands flow from initiators to targets, responses flow back.]

VCI wrappers
[Figure: targets T0 to T3 are attached through target wrappers TW0 to TW3, and initiators I0 to I2 through initiator wrappers IW0 to IW2, to the physical interconnect; the components only see VCI.]

24 WRAPPERS VCI / PIBUS
[Figure: targets T0 to T3 behind four T-wrappers and initiators I0 to I2 behind three I-wrappers, attached to a PIBUS with its BCU.]

WRAPPERS VCI / TOKEN RING
[Figure: targets T0 to T3 behind four T-wrappers and initiators I0 to I2 behind three I-wrappers, attached to a token ring.]

25 WRAPPERS VCI / AMBA
[Figure: targets T0 to T3 behind four T-wrappers and initiators I0 to I2 behind three I-wrappers, attached to an AMBA bus (labels: DTR, AD, DTW, BCU).]

WRAPPERS VCI / DSPIN
[Figure: targets T0 to T3 behind four T-wrappers and initiators I0 to I2 behind three I-wrappers, attached to the DSPIN network on chip.]

26 VCI Principles
- All initiator & target components share the same address space.
- A VCI initiator starts a VCI transaction by sending a VCI command packet. Bursts are supported, for both read & write commands.
- The VCI command packet is routed to the selected target by decoding the MSB bits of the VCI address.
- The selected VCI target completes the transaction by sending a VCI response packet.
- Both command and response packets contain a source identifier, which is used to route the response packet to the proper initiator.
- All VCI packets must be transported through the interconnect network as atomic entities.
- A given master can initiate several simultaneous transactions: the command packet for transaction (n+1) can be sent before the response packet for transaction (n) is received. This requires a packet identifier in both command and response packets.

Flow control mechanism
[Figure: producer and consumer clocked by CK. The producer output W (1 bit) drives the consumer input ROK, the consumer output R (1 bit) drives the producer input WOK, and DOUT drives DIN (n bits).]
«FIFO» protocol: a data word (n bits) is transmitted when both the W/ROK and the R/WOK signals are true. The maximum bandwidth is one word (n bits) per cycle.
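The «FIFO» handshake rule can be checked with a few lines of simulation. This is a sketch under assumptions: the per-cycle values of W and R are supplied as input, and the names `Cycle` and `transfer` are hypothetical. One word moves in every cycle where both signals are true, and the producer keeps presenting the same word otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One entry per clock cycle: is the producer writing (W), is the consumer reading (R)?
struct Cycle { bool w; bool r; };

// Returns the words actually transferred: one per cycle where W and R are both true.
std::vector<int> transfer(const std::vector<int>& words, const std::vector<Cycle>& cycles) {
    std::vector<int> received;
    std::size_t next = 0;                 // index of the word the producer is presenting
    for (const Cycle& c : cycles) {
        bool producer_valid = c.w && next < words.size();
        if (producer_valid && c.r) {      // handshake: both sides agree in this cycle
            received.push_back(words[next]);
            ++next;
        }
        // otherwise nothing moves and the same word is presented again
    }
    return received;
}
```

With W and R both true in every cycle, the loop moves one word per cycle, which is the maximum bandwidth stated above.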

27 FIFO protocol implementation
[Figure: state diagrams of the producer and consumer FSMs, with W = 1 and R = 1 states and transitions guarded by WOK and ROK; DOUT drives DIN (n bits).]
This communication protocol is easily implemented by Moore FSMs for both the producer & the consumer, without any combinational handshaking.

VCI / requests and responses
[Figure: the command packet travels from the VCI initiator through the initiator wrapper, the network on chip and the target wrapper to the VCI target; the response packet travels the reverse path.]

28 VCI Physical Interface
Command (VCI initiator to VCI target):
- CMD_VAL, CMD_ACK, CMD_EOP
- CMD_ADDRESS (n bits), CMD_CMD (2 bits), CMD_PLEN (8 bits), CMD_WDATA (m bits)
- CMD_SRCID (p bits), CMD_TRDID (q bits), CMD_PKTID (r bits)
Response (VCI target to VCI initiator):
- RSP_VAL, RSP_ACK, RSP_EOP
- RSP_ERROR (1 bit), RSP_RDATA (m bits)
- RSP_SRCID (p bits), RSP_TRDID (q bits), RSP_PKTID (r bits)

VCI Packet length
A VCI cell (a flit) is transmitted at each cycle where both CMD_VAL & CMD_ACK (resp. RSP_VAL & RSP_ACK) are true. The burst length is defined by the PLEN field (number of bytes) and depends on the transaction type:

                  Read                    Write
Command packet    1 flit (= 1 cycle)      N flits (= N cycles)
Response packet   N flits (= N cycles)    1 flit (= 1 cycle)
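The packet length table can be turned into a small calculation. The sketch below assumes a 4-byte data field per flit (the slides only give the width as m bits), and the names `VciCmd`, `PacketFlits` and `flits` are hypothetical.

```cpp
#include <cassert>

enum class VciCmd { Read, Write };
struct PacketFlits { unsigned command; unsigned response; };

// Flit counts for one transaction carrying `plen_bytes` bytes.
// bytes_per_flit = 4 is an assumption (32-bit data field), not taken from the slides.
PacketFlits flits(VciCmd cmd, unsigned plen_bytes, unsigned bytes_per_flit = 4) {
    assert(plen_bytes > 0 && bytes_per_flit > 0);
    unsigned n = (plen_bytes + bytes_per_flit - 1) / bytes_per_flit;  // round up
    if (cmd == VciCmd::Read) return PacketFlits{1, n};   // short command, long response
    return PacketFlits{n, 1};                            // long command, short response
}
```

Under that assumption, a 32-byte read uses a 1-flit command packet and an 8-flit response packet, and a 32-byte write uses the opposite.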

29 VCI commands
Four command types are defined:
- READ: simple read
- WRITE: standard write
- LINKED LOAD: read with exclusivity
- STORE COND: conditional write

VCI versus OCP
- The OCP (Open Core Protocol) is an industrial evolution of the VCI standard.
- OCP is more flexible and defines more functionalities than the VCI specification.
- A detailed analysis demonstrates that the OCP protocol is functionally equivalent to the VCI Advanced protocol:
  - signal encoding is different
  - timing diagrams are identical.

30 CONCLUSION
- The VCI Advanced protocol is well suited for shared address space MP-SoC architectures.
- It supports all types of system interconnect:
  - shared system bus
  - token ring
  - full crossbar
  - multi-stage micro-networks
- It strongly enforces the IP reuse policy, and simplifies IP design.
- It is functionally compatible with the OCP protocol.

31 Overview
- Reliable performance evaluation requires accurate time modeling. All SoCLib hardware components provide a Cycle-Accurate / Bit-Accurate simulation model, but the simulation speed is an issue.
- The SoCLib CABA modeling is based on the CSFSM theory (Communicating Synchronous Finite State Machines).
- These models can be simulated by the standard OSCI simulation engine, but the simulation can be strongly accelerated by specialised simulation engines, using static scheduling techniques.
Topics: the CSFSM model; SystemC CABA modeling; static scheduling; SystemCASS accelerator

Communicating Synchronous Finite State Machine
[Figure: three FSMs clocked by CK, with transition functions T0, T1, T2, state registers R0, R1, R2 and generation functions G0, G1, G2.]

32 VCI-based multi-processor architecture
[Figure: eight memories (Ram 0 to Ram 7), each behind an SW wrapper, and four processors (P 0 to P 3), each behind an MW wrapper, connected through VCI to a network of four routers.]

Cycle-based Simulation Principle
No scheduler! The simulation engine is just an execution loop:

    for (cycle = 0; cycle < MAX; cycle++) {
        Transition(MODULE_0); ... Transition(MODULE_N);
        Generation(MODULE_0); ... Generation(MODULE_N);
    }

If all components behave as Moore FSMs, the evaluation order is not significant.
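The claim that the evaluation order is not significant for Moore FSMs can be tried on a toy system. The sketch below is only an illustration: the two modules, their arithmetic and the names `Signals`, `Module` and `simulate` are invented for the example. Each transition reads signals and writes only its own register; each generation reads only its own register.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

struct Signals { int a_out = 0; int b_out = 0; };

struct Module {
    std::function<void(const Signals&)> transition;  // reads signals, updates private state
    std::function<void(Signals&)>       generation;  // writes outputs from private state only
};

// Runs `cycles` cycles; `reverse` flips the module evaluation order.
Signals simulate(int cycles, bool reverse) {
    assert(cycles >= 0);
    Signals sig;
    int a = 1, b = 0;  // private registers of modules A and B
    std::vector<Module> modules;
    modules.push_back(Module{ [&](const Signals& s) { a = s.b_out + 1; },
                              [&](Signals& s) { s.a_out = a; } });
    modules.push_back(Module{ [&](const Signals& s) { b = s.a_out * 2; },
                              [&](Signals& s) { s.b_out = b; } });
    if (reverse) std::swap(modules[0], modules[1]);
    for (int cycle = 0; cycle < cycles; ++cycle) {
        for (auto& m : modules) m.transition(sig);  // all transitions first
        for (auto& m : modules) m.generation(sig);  // then all generations
    }
    return sig;
}
```

Both orders give the same signal values because no function reads a value that another function of the same phase writes.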

33 The Mealy signals issue
The most frequent case: combinational dependencies...
[Figure: a processor core (PC and IR registers, combinational logic) sends an address to an external cache, whose combinational logic returns the instruction.]

Mealy Dependency Graph
- The nodes are the signals.
- The arrows are Mealy dependencies between signals.
[Figure: dependency graphs over the signals X, Y, Z]
If the design contains Mealy signals, the evaluation order is critical.

34 CABA model general structure
[Figure: the inputs E feed the TRANSITION function, which updates the STATE register; the MOORE GENERATION function computes outputs S from the state; the MEALY GENERATION function computes outputs S from the state and the inputs.]

Cycle-based simulation with Mealy signals

    for (cycle = 0; cycle < MAX; cycle++) {
        Transition(MODULE_0); ... Transition(MODULE_N);
        GenMoore(MODULE_0); ... GenMoore(MODULE_N);
        while (unstable) {
            GenMealy(MODULE_0); ... GenMealy(MODULE_N);
        }
    }

The internal evaluation loop for the GenMealy() functions can be optimised by a static ordering respecting the Mealy dependencies.
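The effect of ordering on the inner GenMealy loop can be observed on a two-stage combinational chain. The following sketch is illustrative only; the signals `x`, `y`, `z`, the functions `gen_y` and `gen_z`, and the pass counter returned by `settle` are assumptions made for the example.

```cpp
#include <cassert>
#include <functional>
#include <vector>

struct Signals { int x = 0, y = 0, z = 0; };
using GenMealy = std::function<void(Signals&)>;

void gen_y(Signals& s) { s.y = s.x + 1; }  // y depends combinationally on x
void gen_z(Signals& s) { s.z = s.y * 2; }  // z depends combinationally on y

// Re-evaluates all Mealy functions until no signal changes; returns the pass count.
int settle(Signals& s, const std::vector<GenMealy>& gens) {
    int passes = 0;
    bool unstable = true;
    while (unstable) {
        Signals before = s;
        for (const auto& g : gens) g(s);
        ++passes;
        unstable = (before.x != s.x) || (before.y != s.y) || (before.z != s.z);
        assert(passes < 100);  // guard against a combinational loop
    }
    return passes;
}
```

Evaluating `gen_y` before `gen_z` follows the dependency x -> y -> z and settles in fewer passes than the reverse order; a static schedule computed from the dependency graph removes the need for repeated passes on acyclic graphs.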

35 Multi-FSMs component
[Figure: one component containing two FSMs that share the inputs E: (T0, R0, G0) producing output S0 and (T1, R1, G1) producing output S1.]

    Transition() {
        R0 = T0(E, R0, R1);
        R1 = T1(E, R0, R1);
    }

The actual modification of a register value must be delayed until the end of the transition function. We use the sc_signal type to model internal registers.

Cycle-Based simulation with OSCI SystemC
We use the sensitivity lists to force the cycle-based scheduling:
- The sensitivity list of the Transition() methods contains only the CK rising edge.
- The sensitivity list of the GenMoore() methods contains only the CK falling edge.
- The sensitivity list of the GenMealy() methods contains the CK falling edge, plus a subset of the input signals.
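The delayed register update can be mimicked in plain C++ with a two-value register. This is a sketch of the idea only: `Reg` and `Swapper` are hypothetical names, and `Reg` merely imitates the read-current / write-next behaviour that the slides obtain from sc_signal.

```cpp
#include <cassert>

// Register whose writes become visible only after update(),
// imitating the deferred assignment of an sc_signal (simplified stand-in).
template <typename T>
class Reg {
public:
    explicit Reg(T init) : m_cur(init), m_next(init) {}
    T    read() const { return m_cur; }
    void write(T v)   { m_next = v; }
    void update()     { m_cur = m_next; }
private:
    T m_cur;
    T m_next;
};

// Two FSMs in one component: each register takes the other's old value.
struct Swapper {
    Reg<int> r0{1};
    Reg<int> r1{2};
    void transition() {
        r0.write(r1.read());  // uses the old R1
        r1.write(r0.read());  // still sees the old R0, not the value just written
    }
    void update() { r0.update(); r1.update(); }
};
```

With plain variables the second assignment would read the already modified R0; with the deferred update the two registers exchange their values.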

36 Cycle-based simulation with SystemCASS
- SystemCASS is a static-scheduling simulation engine, optimised to exploit the communicating FSMs modeling approach (CSFSM): a static schedule, taking into account the Mealy dependencies, is computed at elaboration time.
- Contrary to OSCI SystemC, which is a «general-purpose» simulation engine, SystemCASS only accepts models respecting the CSFSM approach, but SystemCASS is about 10 times faster than OSCI SystemC.
- SystemCASS is distributed by UPMC/LIP6 as free software (GPL license), and can be downloaded from the SoCLib web server.
- Parallel versions of SystemCASS targeting multi-core SMP workstations are under development at the TIMA and LIP6 laboratories. Preliminary results demonstrated a quasi-linear speedup when there are no Mealy dependencies in the architecture.

CABA Simulation Speed
Simulation with SystemCASS, on a multi-core, 2.5 GHz, Linux PC.
A/ Simple architecture: one single MIPS32 processor (with instruction & data caches), four memory banks, one TTY controller, one graphic controller, and one system bus controller: [value missing] cycles / s
B/ Two-level multi-processor architecture: 16 MIPS32 processors in 4 clusters, with a local interconnect in each cluster, and the DSPIN micro-network as global interconnect: [value missing] cycles / s
C/ Same 16-processor architecture, but with SystemCASS SMP, using 4 POSIX threads running in parallel on 4 Intel processor cores: [value missing] cycles / s
=> The simulation time increases linearly with the number of processor cores. On a quad-core workstation, the speedup provided by CABA parallel simulation is quasi-linear.

37 CONCLUSION
- Cycle-accurate SystemC simulation can be achieved with acceptable simulation speed for reliable performance evaluation of large multi-processor systems (up to 64 processors).
- A careful modeling policy, based on the CSFSM model and supporting static scheduling techniques, provides a simulation speed of about 1 million cycles/s (1 MHz) on a 1 GHz Linux PC, for a single-processor system.
- The static scheduling approach is compatible with parallel simulation on multi-core SMP workstations.

38 OVERVIEW
- TLM-DT stands for Transaction Level Modeling with Distributed Time.
- This modeling style has been designed to support parallel simulation on SMP multi-core workstations.
- It is based on the theory of PDES (Parallel Discrete Event Simulation) developed by Chandy / Misra / Bryant.
- It can be described as an extension of TLM2.0.
Topics: PDES principles; TLM-DT over TLM2.0; component modelling; experimental results

3 keys for simulation speed
- Flow control granularity: in CABA, the transfer unit is the «flit». In TLM2.0, the transfer unit is the «packet».
- Interface Method Call: in CABA, data transfers are performed through signals. In TLM2.0, data transfers are performed through Interface Method Calls.
- Parallel simulation: standard TLM2.0 simulation relies on the SystemC global scheduler, which is hard to parallelize on an SMP workstation. The TLM-DT modeling does not use the SystemC global time, which opens the door to parallel simulation, but we must accept a shift from the global simulation time paradigm to the distributed simulation time paradigm.

39 PDES principles
- The complete simulated system is described as a set of logical processes that execute in parallel and communicate via point-to-point channels.
- There is neither a global scheduler nor a global clock.
- Each process has an associated local clock, defining a local time.
- The processes synchronize themselves by embedding timing information in the packets carried through the communication channels.

Optimistic PDES / Pessimistic PDES
[Figure: a process receiving timestamped messages T0 to T4 on its input channels.]
- In pessimistic PDES, a process is allowed to increase its local time if and only if it has the guarantee that it cannot receive on any of its input channels a message with a timestamp smaller than its local time: this is called time filtering.
- This constraint can be violated in optimistic PDES, but the roll-back mechanism needed to put a process back into a previous state is very expensive and cannot be used for MPSoC modeling.
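The pessimistic rule amounts to a minimum over the input channels. The sketch below assumes that timestamps never decrease on a given channel, so the last timestamp seen on a channel bounds everything that can still arrive on it; the names `Time`, `safe_time` and `may_advance_to` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using Time = uint64_t;

// `channel_times` holds the latest timestamp seen on each input channel.
// Assuming timestamps never decrease on a channel, the process can safely
// advance up to the smallest of them.
Time safe_time(const std::vector<Time>& channel_times) {
    assert(!channel_times.empty());
    return *std::min_element(channel_times.begin(), channel_times.end());
}

// Pessimistic rule: advance only if no channel can still deliver an earlier message.
bool may_advance_to(Time target, const std::vector<Time>& channel_times) {
    return target <= safe_time(channel_times);
}
```

A process that wants to move past `safe_time` has to wait for more timing information on the slowest channel, which is the situation that null messages address.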

40 Null-messages
- The pessimistic PDES algorithm relies on temporal filtering of the incoming messages: a PDES process that has N input channels is only allowed to proceed when it has timing information on all its N input ports.
  => For example, a NoC interconnect is allowed to let a command packet reach a given target if and only if all the initiators that can theoretically address this target have sent at least one timed message.
- To solve this issue, the PDES algorithm uses null messages. A null message contains no data, only timing information.
- Moreover, each process can be in one of two modes: active or inactive. Only active processes participate in the temporal filtering.

TLM 2.0 overview
[Figure: TLM 2.0 overview, from T. Kogel, MPSoC 08]

41 Generic payload
[Figure: TLM2.0 generic payload, highlighting the fields used in SoCLib, from T. Kogel, MPSoC 08]

soclib_payload_extension

    enum vci_command {
        VCI_READ_COMMAND,
        VCI_WRITE_COMMAND,
        VCI_LINKED_READ_COMMAND,
        VCI_STORE_CONDITIONAL_COMMAND,
        PDES_NULL_MESSAGE,
        PDES_ACTIVE,
        PDES_INACTIVE,
    };

    unsigned int     m_src_id;
    unsigned int     m_trd_id;
    unsigned int     m_pkt_id;
    enum vci_command m_vci_command;

42 Non-blocking transport payload
[Figure: arguments of the non-blocking transport call.]
- The initiator local time is an offset relative to the SystemC simulation time.
- The SystemC simulation time always has a zero value.

Initiator and Target Modeling
Each PDES process is implemented as an sc_thread. There is at least one sc_thread (and one associated local time) for each VCI initiator, and one sc_thread for each VCI target.
- Initiators: the «free running» of an initiator (such as a processor executing instructions contained in its L1 cache without a miss) is limited by a «timing quantum» (typically 100 cycles). When this boundary is reached, the sc_thread is descheduled, and a null message is sent.
- Targets: most targets have a purely «reactive» behavior: the associated sc_thread is in waiting mode until a command packet is received.
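The timing quantum can be illustrated by a free-running loop that publishes its local time at each quantum boundary. This is a sketch under assumptions: every instruction costs one cycle and always hits in the cache, and `run_initiator` simply returns the timestamps of the null messages it would send.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Time = uint64_t;

// Free-runs an initiator for `cycles` cycles and returns the local times at
// which a null message would be sent (one per elapsed quantum).
std::vector<Time> run_initiator(Time cycles, Time quantum) {
    assert(quantum > 0);
    std::vector<Time> null_messages;
    Time local_time = 0;
    Time since_sync = 0;
    while (local_time < cycles) {
        ++local_time;                 // one instruction served by the L1 cache
        ++since_sync;
        if (since_sync == quantum) {
            null_messages.push_back(local_time);  // publish local time, then yield
            since_sync = 0;
        }
    }
    return null_messages;
}
```

A smaller quantum gives the interconnect fresher timing information at the price of more synchronization points.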

43 NoC Interconnect Behavior
- In shared memory MP-SoC architectures, all arbitration between concurrent transactions is done in the network on chip. Therefore, the PDES time filtering operation is mainly implemented in the NoC model.
- There is only one thread (and one associated local time) per interconnect. In case of hierarchical interconnects (with N local interconnects and a global interconnect), there is one sc_thread per interconnect (N+1).
- In order to avoid deadlocks, the dynamic contention (and the associated time filtering mechanism) is only implemented in the command network. The dynamic contention in the response network is neglected.
- All incoming command packets are stored in a centralized buffer indexed by the initiator index: when all slots of this buffer are filled with timed messages, the arbitration can occur. Only active initiators are taken into account.

NoC Modeling
[Figure: initiators 0, 1 and 3 and targets 0 and 1, each with its own thread (T), around the interconnect, which also has a thread.]
- Commands: time filtering and routing, using the central buffer.
- Responses: routing, but no time filtering (no thread).
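The centralized buffer can be sketched as one slot per initiator. The example below makes several assumptions: the arbitration policy (serve the smallest timestamp) is inferred from the time filtering principle rather than stated on the slide, null messages and command packets are not distinguished, and the names `CentralBuffer`, `post`, `set_active` and `arbitrate` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Time = uint64_t;

class CentralBuffer {
public:
    explicit CentralBuffer(std::size_t initiators) : m_slots(initiators) {}

    void post(std::size_t src, Time t) {            // timed message from initiator `src`
        m_slots.at(src).filled = true;
        m_slots.at(src).timestamp = t;
    }
    void set_active(std::size_t src, bool active) { m_slots.at(src).active = active; }

    // Returns the initiator to serve, or -1 while an active slot is still empty.
    int arbitrate() {
        int best = -1;
        for (std::size_t i = 0; i < m_slots.size(); ++i) {
            const Slot& s = m_slots[i];
            if (!s.active) continue;                 // inactive initiators are ignored
            if (!s.filled) return -1;                // time filtering: must wait
            if (best < 0 || s.timestamp < m_slots[best].timestamp)
                best = static_cast<int>(i);
        }
        if (best >= 0) m_slots[best].filled = false; // the served slot is consumed
        return best;
    }
private:
    struct Slot { bool active = true; bool filled = false; Time timestamp = 0; };
    std::vector<Slot> m_slots;
};
```

The early return on an empty active slot is the point where a silent initiator would stall the whole interconnect, which is why initiators send null messages or declare themselves inactive.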

44 Conclusion
- Using the standard (sequential) OSCI SystemC simulation engine (SystemC 2.0), and the standard TLM2.0 package, the simulation speedup (TLM-DT versus CABA) is one order of magnitude.
- The distributed time modeling approach (TLM-DT) removes the bottleneck associated with the SystemC centralized scheduler.
- Preliminary results, obtained with a parallel SystemC SMP simulator on a quad-core workstation, demonstrate a quasi-linear speedup when simulating a large MP-SoC architecture with 160 processor cores.

46 DSX: Design space exploration and mapping tool
Nicolas Pouillon
NoC Symposium Tutorial

Parallel application modeling: Goal
We want to:
- map a given application on a Multiprocessor System-on-Chip
- have some parts of the application mapped on hardware blocks
- validate early on a workstation
- choose the actual hardware/software implementation late in time
- debug all the way
- try lots of different designs
This is Design Space Exploration.

Nicolas Pouillon (Lip6) DSX: Design space exploration and mapping tool

47 Parallel application modeling: Focus
We'll focus on data-flow, packet-based applications:
- network packet handling
- multimedia (audio, video)
Parallel application modeling: MWMR, high-level description
We use a generic communication middleware called MWMR (Multi-Writer Multi-Reader). It has the following properties:
- FIFO-like channel
- conveys packets of fixed width, called blocks
- fixed depth
- fixed access protocol
- shared-memory-based

48 Parallel application modeling: MWMR, more properties
- blocks are delivered in order
- contiguity is not asserted
- access to channels can be done from software tasks or hardware coprocessors in a uniform way
In its simple form (1 writer and 1 reader), MWMR is a bounded KPN-like channel.
Parallel application modeling: MWMR datatypes
MWMR uses 3 data buffers:
- a ringbuffer
- a status: read & write pointers, lock, usage counter
- a descriptor: the channel's properties (block width, channel depth, status & ringbuffer addresses in memory)

49 Parallel application modeling: MWMR protocol
1. Take a lock on the status
2. Read & decide what to transmit; if nothing to do, go to 5
3. Do the data transfer between local buffer and data ringbuffer
4. Update status buffer
5. Release status lock
Parallel application modeling: MWMR illustration
[Figure: one MWMR channel shared by producers P0 and P1 and consumers C0 and C1, carrying blocks A0 to A5 and B0 to B5.]
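The five protocol steps map naturally onto a lock-protected ring buffer. The sketch below is a workstation-level illustration only, not the MWMR implementation used by DSX: the class `MwmrChannel`, its method names and the "transfer as many blocks as fit" policy are assumptions, and a `threading.Lock` stands in for the lock held in the status buffer.

```python
import threading


class MwmrChannel:
    """Hypothetical MWMR-like channel: descriptor + status + ringbuffer."""

    def __init__(self, block_width, depth):
        # Descriptor: fixed channel properties.
        self.block_width = block_width
        self.depth = depth
        # Ringbuffer: storage for `depth` blocks.
        self.ring = [None] * depth
        # Status: read & write pointers, usage counter, lock.
        self.rptr = 0
        self.wptr = 0
        self.usage = 0
        self.lock = threading.Lock()

    def try_write(self, blocks):
        """Write as many blocks as fit; return how many were accepted."""
        with self.lock:                                     # steps 1 and 5
            n = min(len(blocks), self.depth - self.usage)   # step 2
            for block in blocks[:n]:                        # step 3
                assert len(block) == self.block_width
                self.ring[self.wptr] = block
                self.wptr = (self.wptr + 1) % self.depth
            self.usage += n                                 # step 4
        return n

    def try_read(self, max_blocks):
        """Read up to max_blocks blocks, in order; return them as a list."""
        with self.lock:                                     # steps 1 and 5
            n = min(max_blocks, self.usage)                 # step 2
            out = []
            for _ in range(n):                              # step 3
                out.append(self.ring[self.rptr])
                self.rptr = (self.rptr + 1) % self.depth
            self.usage -= n                                 # step 4
        return out


channel = MwmrChannel(block_width=2, depth=3)
accepted = channel.try_write([[1, 2], [3, 4], [5, 6], [7, 8]])
print(accepted, channel.try_read(2))
```

Because every access goes through the same lock and the same status words, several writers and several readers can share one channel, at the cost of the determinism a single-producer, single-consumer channel would give.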

50 Parallel application modeling: MWMR conclusion
- Each task can be either a hardware accelerator or a software task running on a programmable processor
- If you want deterministic behavior (KPN), restrict to 1 producer and 1 consumer per channel

51 DSX Goals: DSX, Design-Space Explorer
A SoC designer may want to be able to easily design a parallel application (without bugs):
- taking dedicated hardware into account
- focusing on optimization aspects one at a time
How to? DSX
DSX Goals: How to feed DSX with an application?
- Split the application into tasks (a task may be used more than once in the application)
- Describe the application in terms of tasks, connected through explicit communication resources
- Create a hardware platform to host the application
- Map the application and its resources on the hardware platform
[Figure: example application graph with tasks Demux, VLD, IQZZ, IDCT, TG, Split, Libu and Ramdac.]

52 DSX Goals: DSX, Design-Space Explorer
Everything is about choosing:
- Implementation of tasks
- Hardware/Software partitioning
- Hardware platform dimensioning and features
- Mapping of software tasks and resources
- Implementation of software communication routines
- Tweaking of all the above, keeping the application coherent
DSX Goals: The big picture
[Figure: DSX flow. Inputs: TCG, mapping, hardware platform (ad hoc or generic infrastructure). Tools: cross compilers, MutekH, SystemC, SystemCASS, SoCView. Outputs: software application for validation on workstation, firmware binary image, hardware platform simulator (CABA, TLM-T, TLM), System on Chip. One binary image, many simulation abstraction levels.]

53 DSX Application: Task and Communication Graph
With the previously-defined tasks and the available communication frameworks, the designer creates a bipartite oriented graph where:
- nodes are tasks or communication channels
- edges assign task ports to communication channels
Task and Communication Graph (TCG): this is our application.
[Figure: example TCG with tasks Demux, VLD, IQZZ, IDCT, TG, Split, Libu and Ramdac.]
DSX Application: Task modeling
A task is described by:
- its name
- some communication ports, with associated communication type
- one or more implementations, each of which can be:
  - a C program (or anything a compiler can handle)
  - an ASM program
  - a handcrafted hardware coprocessor
  - a synthesized hardware coprocessor
[Figure: the IDCT task shown as an example.]
All implementations should behave the same way, and will be considered as interchangeable.
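A bipartite task/channel graph of this kind can be checked with a few lines of plain Python. The sketch below does not use the DSX API: the function `is_valid_tcg`, the edge format `(task, port, channel)`, the task and channel names, and the rule that a port is attached to at most one channel are assumptions made for the illustration.

```python
def is_valid_tcg(tasks, channels, edges):
    """Check that every edge links a known task port to a known channel,
    and that no task port is connected twice (assumed rule)."""
    seen_ports = set()
    for task, port, channel in edges:
        if task not in tasks or channel not in channels:
            return False          # an edge must join a task and a channel
        if (task, port) in seen_ports:
            return False          # a port is assigned to one channel only
        seen_ports.add((task, port))
    return True


# Hypothetical fragment of a decoding pipeline.
tasks = {"vld", "iqzz", "idct"}
channels = {"vld_iqzz", "iqzz_idct"}
edges = [
    ("vld", "output", "vld_iqzz"),
    ("iqzz", "input", "vld_iqzz"),
    ("iqzz", "output", "iqzz_idct"),
    ("idct", "input", "iqzz_idct"),
]
print(is_valid_tcg(tasks, channels, edges))
```

Tasks never connect to each other directly; every exchange goes through a channel node, which is what lets the channel later be placed in memory or on a dedicated component independently of the tasks.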

54 DSX Application: Communication and synchronization modeling
DSX lets you define your parallel programming model; some are available out of the box:
- Multi-Writer Multi-Reader channels
- Shared-memory buffers (no built-in synchronization)
- Mutexes (locks)
- Synchronization barriers
- Events
DSX Application: Focus on application
Now that DSX knows about the application, the designer can try and test it on a classic workstation (if a software implementation exists for every task), with some instrumentation:
- debugging (gdb, ddd, valgrind, etc.)
- task profiling
- communication profiling
- communication channels logging (in MWMR at least)

55 DSX Hardware platform: Hardware definition
DSX lets designers use their own hardware components. In order to know how to use the components, DSX needs a description: the so-called Metadata.
All DSX needs is a file containing:
- entity name
- ports
- parameters
- dependencies on sub-modules
- pointers to corresponding implementation files
Fortunately, this is already done for all SoCLib modules!
DSX Hardware platform: Hardware usage
The designer can define their own hardware architecture. Several usual generic platforms are predefined and directly usable as a development base.
[Figure: example platforms built from Mips processors with Xcache, Ram, LocalCrossbar and VGMN components.]
DSX can be used as a netlister for SystemC.

56 DSX Mapping the application on the platform: Mapping application on hardware
Finally, with the application TCG and the hardware platform description, DSX lets the designer map each application entity on the hardware platform, including:
- where tasks are to be executed (CPU, coprocessor)
- placement of communication channels in memory or dedicated communication components
- placement of running software stacks in memory
- placement of binary executable code in memory
- placement of global utility objects in memory
DSX Mapping the application on the platform: Python API
Every input description for DSX is made through a Python API, thus:
- a description is in fact a script
- many close-enough descriptions can be factored into one, with parameters
- changing those parameters can be done through the script
- running the simulator and extracting performance figures can be made automatic
Design Space Exploration can be thoroughly automated!
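Since a description is a parameterized script, an exploration loop is just a sweep over parameter combinations. The sketch below shows that idea in plain Python without calling DSX: `explore` is a hypothetical helper, and `fake_simulation` is a made-up cost function standing in for "generate the platform, run the simulator, extract a cycle count"; its numbers carry no meaning.

```python
import itertools


def explore(parameter_space, evaluate):
    """Evaluate every parameter combination; lower scores are better.
    Returns (best_config, best_score, all_results)."""
    names = sorted(parameter_space)
    results = []
    for values in itertools.product(*(parameter_space[n] for n in names)):
        config = dict(zip(names, values))
        results.append((config, evaluate(config)))
    best_config, best_score = min(results, key=lambda item: item[1])
    return best_config, best_score, results


def fake_simulation(config):
    # Placeholder only: a real script would build and run a simulator here.
    return (1000 // config["n_cpus"]
            + 10 * config["n_cpus"]
            + 2 * config["cache_kib"])


space = {"n_cpus": [1, 2, 4, 8], "cache_kib": [4, 16]}
best_config, best_score, results = explore(space, fake_simulation)
print(best_config, best_score, len(results))
```

An exhaustive product is the simplest strategy; for larger spaces the same `evaluate` callback could be driven by a smarter search without touching the description script.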

57 Questions?

59 NoC Symposium 03/05/ Grenoble
Thales Communications
Porting an H264 decoder on a virtual platform
Fabien Colas-Bigey, Thales Communications
ANR SoCLib -- NoC Symposium 3rd May Grenoble
Agenda
- Origin of the need
- Switching to MP-SoC architectures
- New constraints imposed by MP-SoC
- Application: H264 decoder
- Presentation of the standard
- Main characteristics of the SoCLib implementation
- Software stack
- Hardware platforms: CABA, TLM-DT
- Experiments

60 Why do we need MP-SoC?
- Reduction of power consumption is the next challenge, even more important than processor size
- The performance increase can no longer be achieved simply by raising the frequency
- Need to increase system complexity without increasing power consumption
- Software applications and hardware platforms need to be tightly coupled: the platform only embeds the services required by the application
- Multiprocessor platforms are becoming the reference for our suppliers
MP-SoC are required to cope with the new requirements of modern systems
New constraints imposed by MP-SoC
New architectures play in favour of power reduction, but:
- How to program these new multiprocessor architectures efficiently?
  - Expression of parallelism in applications
  - Load balancing between several processing units
  - Use of specialized instructions: SIMD, etc.
- How to predict the performance of the system?
  - Load of cache coherency mechanisms
  - Memory contention
  - Synchronization overhead
  - Global system validation: scheduling, hard device accesses, etc.
- How to manage data?
  - Storage of shared data
  - Validation of memory accesses
  - Detection of cache break

61 New constraints imposed by MP-SoC
Sources of non-determinism
[Figure: performance versus years, crossing successive walls.]
- Customisation wall => Reconfiguration
- Parallelisation wall => Shared memories, cache coherency => Unpredictable performance
- Frequency and power walls => ILP, pipelines, thread parallelism, multicores => Unpredictable performance
- Memory wall => Memory hierarchies (cache) => Unpredictable performance
Application: H264 decoder
- Video compression standard developed jointly by JVT and MPEG groups
- Several profiles:
  - Baseline: intra and inter predicted images, CAVLC entropy coding, deblocking filter
  - Main
  - Extended
- Our implementation is the constrained baseline profile
- The base element is a macroblock: 16x16 pixels

62 Application: H264 decoder, predictions
- Spatial prediction: intra predicted frames
- Temporal prediction: inter predicted frames
Application: H264 decoder, predictions and dependencies
- Prediction of prediction information
- Prediction of macroblocks
[Figure: intra and inter prediction of macroblocks from neighbours A, B, C, D and E, with the labels "No spatial dependency" and "Temporal dependency".]

63 Application: H264 decoder
An implementation following the SPMD scheme:
- A common part manages the input stream and dispatches data
- Each task is in charge of managing a part of the image called a slice
Software stack
The operating system has been developed by LIP6 in the scope of the SoCLib project.
[Figure: software stack: H264 application and benchmark application; MUTEK/H; interrupt management; HEXO abstraction layer; SoCLib virtual platform.]

64 Hardware platforms: CABA
[Figure: four Xcache Wrapper (MIPS32EL) processors, ROM, RAM, XICU + Timer, TTY, Ramdisk and Frame Buffer connected to the VGMN interconnect (micro-network), with a Loader and IT (interrupt) lines.]
Hardware platforms: CABA
- The number of processors is configurable: 1 to 8
- The type of processor is configurable: MIPS, PowerPC, ARM
- The ramdisk is used to load files into memory
- The frame buffer is a video viewer to observe YUV or RGB images
- The ROM contains the boot and text sections of the executable
- The system can be used with a NoC (vgmn) or a bus (vgsb)
- No cache coherency is implemented on this platform

65 Hardware platforms: TLM-DT
[Figure: four Xcache Wrapper (MIPS32EL) processors, RAM0, RAM1, ICU, Timer, TTY, Ramdisk and Frame Buffer connected to the VGMN interconnect (micro-network), with a Loader and IT (interrupt) lines.]
Hardware platforms: TLM-DT
- Almost identical to the CABA platform
- XICU has been replaced by ICU + Timer

66 Performance evaluation
Parallelisation efficiency on a CABA platform
[Figure: chart per number of cpus (labels 2 cpu, 4 cpu and 6 cpu legible); series: dbf, overhead, process.]
Performance evaluation
Parallelisation efficiency on a TLM-DT platform
[Figure: chart per number of cpus (labels 2 cpu, 4 cpu and 6 cpu legible); series: dbf, overhead, process.]

67 Performance evaluation
Variation of decoding time on a CABA platform
[Figure: decoding time per number of cpus (labels 2 cpu, 4 cpu and 6 cpu legible).]
Performance evaluation
Variation of decoding time on a TLM-DT platform
[Figure: decoding time per number of cpus (labels 2 cpu, 4 cpu and 6 cpu legible).]

68 Comparison CABA, TLM-DT
Processor cycles:
- Constant imprecision, independent of the number of processors
- Temporal imprecision of approximately 70%
- This imprecision is mainly due to the tlmdt xcache, which still needs some fine tuning
- Both CABA and TLM-DT reflect the same behaviour when testing different configurations
Number of memory accesses:
- Difference below 10%
Simulation time:
- In TLM-DT, the acceleration factor compared to CABA is at least 10
How can we use SoCLib
Training of architects and designers:
- Modelling techniques on multiprocessor architectures
Software development:
- Development of the software stack early in the design flow
- Functional validation
Application debug:
- Check memory consistency
- Use the advanced debug tools packaged with the platform
Architecture exploration:
- Possibility to change configuration very easily
- Possibility to run different policies and study their impact on the system

69 Demonstration
- Parallel execution of a CABA and a TLM-DT platform
- Test different configurations: processor number, image format
- A few words about gdbserver

71 Virtual prototyping of NoC based MP-SoCs
Prototyping of a scalable, coherent, shared memory, multi-core architecture
Nam NGUYEN, Eric GUTHMULLER
Outline
- The TSAR project
- Soclib use case
- Performed experimentations
- RTL development perspective
- Conclusion
Eric Guthmuller NOC 2010 Tutorial / Virtual Prototyping of NoC-based MP-SoCs / CABA modeling

72 The TSAR Project
TSAR is a Medea+ project that started in June. One central goal of this project is to define and implement a shared memory, cache coherent, many-core architecture. The project leader is BULL.
Industrial partners:
- ACE (Netherlands)
- Bull S.A. (France)
- Compaan (Netherlands)
- NXP Semiconductors (Netherlands)
- Philips Medical Systems (Netherlands)
- Thales Communications (France)
Academic partners:
- Université Paris 6 (LIP6)
- Technical University Delft (CEL)
- University Leiden (LIACS)
Main Technical Choices
- Processor: TSAR is a MP2-SoC (Massively Parallel, Multi-Processor System on Chip), containing up to 4096 processor cores. The architecture is independent of the selected processor core: it could be SPARC V8, MIPS32, PPC 405, ARM, etc.
- Interconnect: The TSAR architecture is clusterized, with a two-level interconnect. It will use a proven Network on Chip implementing the VCI/OCP standard: DSPIN.
- Memory: The TSAR architecture will support a (NUMA) shared address space. The memory is logically shared, but physically distributed, with cache coherence enforced by hardware, using a directory-based protocol.

73 Clusterized Architecture
[Figure: four clusters around the DSPIN micro-network. Each cluster groups CPUs with FPU and L1 I / L1 D caches, a Mem cache, Timer, DMA and ICU on a VCI/OCP local interconnect, with a NIC towards the micro-network; I/O controllers and external RAM controllers are also attached.]
A regular 2D mesh topology
[Figure: 4x4 mesh of clusters, with one clock per cluster (ck00 to ck33), and the detail of one cluster.]
- TSAR will use the DSPIN Network on Chip technology developed by LIP6.
- Each cluster is implemented in a different clock domain.
- Inter-cluster communications use bi-synchronous (or fully asynchronous) FIFOs.
=> Clock frequency & voltage can be independently adjusted in each cluster.

74 A NUMA architecture
TSAR is a NUMA architecture (Non Uniform Memory Access).
[Figure: four clusters (CPU, CPU, MEM on a local interconnect) around the DSPIN micro-network with I/O; memory accesses are labelled local, remote and external.]
- The physical memory space is statically distributed amongst the clusters.
- The operating system must precisely control the mapping:
  - tasks on the processors
  - software objects on the memory caches.

75 Soclib Use In Tsar
- SystemC CABA models were developed on top of the existing Soclib infrastructure
- Benefits from existing models: interconnects, peripherals, processors
- Rapid prototyping => design space exploration
Debugging Facility
Software:
- GDB Server (independent from the processor ISS)
- Memory Checker: tracks potentially wrong memory accesses
  => "Valgrind-like" tool for any processor ISS
  => Needs support from the software side (OS)
Hardware:
- Vci Logger: displays all the VCI traffic on a given VCI port
- Vci Simhelper: allows the running software to stop the simulation when an error has been detected
  => To be used in automated tests

76 Operating Systems
A handful of OSes already run on Soclib and therefore run on Tsar without much work. Three of these OSes have been successfully tested on various Tsar platforms:
- NetBSD has been successfully tested on a monoprocessor Tsar platform with support for virtual memory
- MutekH has been successfully tested on Tsar platforms with up to 64 processors (16 clusters)
- AlmOs has been successfully tested on Tsar platforms with up to 16 processors (4 clusters)

77 Tsar Versions
There are 2 main versions of the Tsar architecture:
- Tsar V0: first implementation of the DHCCP protocol
  - Copies are represented by a vector of bits (one bit per processor)
  - The bit vector limits the number of processors to 32
- Tsar V1:
  - Copies are represented by a chained list stored in a single memory, which is common to all cache lines in the Memory Cache
  - Support for up to 4096 processors (VCI IDs on 12 bits)
Splash benchmarks
Splash FFT (65536 points, 256 in each dimension)
Speedup from 1 to 16 processors (Tsar V1), constant data set
[Figure: measured speedup versus ideal speedup.]
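The difference between the two copy representations can be made concrete with a small sketch. It is not the TSAR Memory Cache design: the class names, the free-list management and the error raised when the shared heap is full are assumptions for illustration; only the bit-vector width (32) and the 12-bit identifiers come from the slide.

```python
class BitVectorDirectory:
    """V0 style: one presence bit per processor for each cache line."""

    def __init__(self, n_procs=32):
        self.n_procs = n_procs
        self.bits = {}                      # line address -> bit mask

    def add_copy(self, line, proc):
        if not 0 <= proc < self.n_procs:
            raise ValueError("processor id does not fit in the bit vector")
        self.bits[line] = self.bits.get(line, 0) | (1 << proc)

    def copies(self, line):
        mask = self.bits.get(line, 0)
        return [p for p in range(self.n_procs) if (mask >> p) & 1]


class ChainedListDirectory:
    """V1 style: copies are linked lists stored in one heap shared by all lines."""

    NIL = -1

    def __init__(self, heap_size, id_bits=12):
        self.max_id = (1 << id_bits) - 1    # 12 bits -> ids 0..4095
        self.proc = [0] * heap_size
        self.next = list(range(1, heap_size)) + [self.NIL]   # free list
        self.free = 0
        self.head = {}                      # line address -> first heap cell

    def add_copy(self, line, proc):
        if not 0 <= proc <= self.max_id:
            raise ValueError("processor id does not fit in the id field")
        if self.free == self.NIL:
            raise MemoryError("shared heap is full (handling is an assumption)")
        cell, self.free = self.free, self.next[self.free]
        self.proc[cell] = proc
        self.next[cell] = self.head.get(line, self.NIL)
        self.head[line] = cell

    def copies(self, line):
        out, cell = [], self.head.get(line, self.NIL)
        while cell != self.NIL:
            out.append(self.proc[cell])
            cell = self.next[cell]
        return sorted(out)


v1 = ChainedListDirectory(heap_size=4)
v1.add_copy(0x40, 4095)
v1.add_copy(0x40, 100)
print(v1.copies(0x40))
```

The bit vector costs one bit per processor on every line, whether or not the line is shared, while the chained list only spends heap cells on copies that actually exist, which is what lets the identifier width, rather than the vector width, bound the processor count.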

78 Splash benchmarks
Splash FFT (4096 points for 1 cpu, for 4 cpus, for 16 cpus)
Speedup from 1 to 16 processors, (data set size / number of cpus) constant
[Figure: measured speedup versus ideal speedup.]
Parallelization of the simulation
- Tsar platforms are really big: on a 2.6 GHz Xeon cpu, a 16-processor platform simulates at about 30 kHz with classical SystemCASS
- Running big benchmarks can take hours
- A parallelized version of SystemCASS has been developed
  - Mealy generation functions should be prohibited because they are not run in parallel
  - Shared variables between threads have to be avoided
  => Frontiers have to be chosen with caution: 1 thread per cluster
- With 1 thread per cluster, speedup is linear: speedup of N with N clusters (tested with 4 clusters so far)
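A quick calculation shows why the 30 kHz figure matters. The helper below is a sketch: the function name and the benchmark length (108 million cycles, chosen only because it gives round numbers) are assumptions, while the 30 kHz rate and the linear speedup with one thread per cluster come from the slide.

```python
def wall_clock_hours(simulated_cycles, sim_rate_hz, n_threads=1):
    """Wall-clock time needed to simulate a number of cycles, assuming
    a linear speedup with the number of simulation threads."""
    return simulated_cycles / (sim_rate_hz * n_threads) / 3600.0


cycles = 108_000_000   # hypothetical benchmark length
print(wall_clock_hours(cycles, 30_000))               # sequential SystemCASS
print(wall_clock_hours(cycles, 30_000, n_threads=4))  # 4 clusters, 1 thread each
```

At this rate every 108 million simulated cycles cost an hour of wall-clock time, so a benchmark of a few billion cycles runs for more than a day unless the simulation itself is parallelized.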

79 Cosimulation SystemC/VHDL
- CABA models are designed to be very close to real hardware
- A simple way to develop VHDL models is to use cosimulation: you put VHDL/Verilog models in a SystemC platform
- The Soclib infrastructure is able to compile a mixed platform with SystemC and VHDL/Verilog models for the Modelsim simulator
- Each VHDL/Verilog model can be developed separately by replacing only the corresponding SystemC model in the platform

80 Fast RTL debug
- If you have already written a SystemC model, you should not have to re-debug the functionality of your model
- What you want is for your RTL model to match your SystemC model cycle-accurately
- For that purpose, wrappers have been written to debug the RTL of the Memory Cache
  => these wrappers instantiate the SystemC and the RTL models, send them the same inputs and compare the outputs of the models
- But comparing only the outputs is not sufficient (a bug may show up on the outputs very late), so we also compare the internal FSM states
  => the debug time of the RTL is considerably reduced
Conclusion
- The SoCLib virtual prototyping platform was very useful for quantitative analysis of various cache coherence protocols.
- The Cycle Accurate SystemC modeling was mandatory to design (& debug) the TSAR critical hardware components, such as the L1 cache controller and the memory cache.
- The RTL design (synthesizable VHDL models) has been strongly simplified (and accelerated) by the SystemC virtual prototyping.
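The wrapper idea described under "Fast RTL debug" (same inputs to both models, compare outputs and FSM states every cycle) can be sketched as follows. This is an illustration in Python, not the actual SystemC/RTL wrappers: `first_divergence`, the `step`/`fsm_state` interface and the `ToyCounter` model with its injected bug are all hypothetical.

```python
class ToyCounter:
    """Toy model: counts enabled cycles modulo `modulo`, outputs 1 on wrap-around."""

    def __init__(self, modulo=8, buggy=False):
        self.modulo = modulo
        self.buggy = buggy
        self.fsm_state = 0

    def step(self, enable):
        if not enable:
            return 0
        self.fsm_state = (self.fsm_state + 1) % self.modulo
        if self.buggy and self.fsm_state == 2:
            self.fsm_state = 3            # injected bug: state 2 is skipped
        return 1 if self.fsm_state == 0 else 0


def first_divergence(reference, candidate, stimuli, compare_state=True):
    """Drive both models with the same inputs, cycle by cycle.
    Return (cycle, kind) for the first mismatch, or None if they agree."""
    for cycle, inputs in enumerate(stimuli):
        ref_out = reference.step(inputs)
        cand_out = candidate.step(inputs)
        if ref_out != cand_out:
            return cycle, "output"
        if compare_state and reference.fsm_state != candidate.fsm_state:
            return cycle, "fsm_state"
    return None


stimuli = [True] * 16
print(first_divergence(ToyCounter(), ToyCounter(buggy=True), stimuli))
print(first_divergence(ToyCounter(), ToyCounter(buggy=True), stimuli,
                       compare_state=False))
```

In this toy case the state comparison flags the bug in the cycle where it happens, while the output-only comparison notices it several cycles later, once the wrong state has finally reached an output.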

82 Prototyping an hardware-based page migration mechanism on a NoC-based shared memory multicore architectures
Pierre Guironnet de Massas and Frédéric Pétrot
SLS Group, TIMA Laboratory, 46, Av Félix Viallet, Grenoble, France
SoCLib tutorial, 3/4/2010
Integrated architecture evolution
- Continuous increase in integration capabilities
- Parallelism makes it possible to exploit VLSI integration advances
- Systems have to support many standards/application classes
- More and more processors: 128 expected in ...
[Figure: Number of Processing Engines per year (source: ITRS System Drivers 2007).]

83 Integrated architecture evolution
Heterogeneous architectures:
- Complex to program
- Many different software stacks
- Not tolerant to process variability
[Figure: heterogeneous SoC with DSPs (firmware), GPPs (HAL, drivers, OS + middleware, application), an ASIC and PEs (firmware).]
Homogeneous architectures:
- Coherent shared memory architectures
- A single software stack
- Tolerant to process variability
- Many decisions taken by sw and hw to make programming easy
[Figure: homogeneous SoC with several GPPs and a HW accelerator under a single stack: Hardware Abstraction Layer, drivers, OS, application.]
Outline
1 Introduction
2 Issues & Context
3 Memory, data placement and migration
4 Data migration and ideal data placement
5 Solution and implementation
6 Conclusion

84 Data access: A key point
Efficiency means:
1 Low latency
2 High bandwidth
3 Low energy consumption
4 Not visible from software
Average Memory-Access Time: $AMAT = t_{hit} + r_{miss} \times p_{miss}$
Goal: reduce $p_{miss}$ by dynamic data placement
[Figure: GPP with cache, on-chip SRAM memories, on-board DRAM memories and a serial link, annotated with access latencies from ~5 cycles up to ~1000 cycles.]
Context
Architectural template:
- Simple pipeline processor with instruction and data caches
- Large amount of on-chip memory
- NoC: backbone for massive processor parallelism
- Directory-based memory coherency
- Logically shared and physically distributed memory
[Figure: chip made of nodes (GPP, cache, SRAM, NI) around an interconnection network.]
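The effect of the miss penalty on the AMAT formula above can be checked numerically. In the sketch below, `r_miss` is read as the miss rate and `p_miss` as the miss penalty (the usual reading of the formula); the numeric values are hypothetical and only loosely inspired by the cycle counts in the figure.

```python
def amat(t_hit, r_miss, p_miss):
    """Average memory-access time: hit time + miss rate x miss penalty."""
    return t_hit + r_miss * p_miss


# Hypothetical values: 5-cycle hit, 10% miss rate.
far = amat(t_hit=5, r_miss=0.10, p_miss=100)   # data served by a distant memory
near = amat(t_hit=5, r_miss=0.10, p_miss=20)   # same data moved to a closer memory
print(far, near)
```

With the hit time and miss rate unchanged, moving the data closer is the only lever in this formula, which is why dynamic placement targets the penalty term.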

85 Distributed Architectures
Characteristics:
- Distributed on chips, boards, machines
- NUMA
- Distributed shared memory
[Figure: chip with GPP, cache, SRAM and NI.]
Distributed Architectures: Existing solutions
Data placement and migration:
- Intelligent memory allocator
- OS-based page migration using the MMU and ad-hoc hardware
- First-touch (or second-touch) placement
Data placement problems:
- Where to place the shared data?
- What happens to data when tasks migrate?
- Non-optimized data placement
- Migration cost seldom profitable
- Low system reactivity

86 Integrated Architectures: Existing solutions
Characteristics:
- Integrated within a SoC
- L2 caches
- No embedded memory
[Figure: chip with GPP, L1 cache, L2 and NI.]
Integrated Architectures: Existing solutions
L2 cache management
Shared L2:
- Cache capacity maximisation
- Minimizes number of off-chip misses
- High access latency
Distributed L2:
- Suboptimal cache capacity (copies)
- Increases number of off-chip misses
- Lower latency of accesses (hit)

87 Integrated Architectures: Existing solutions
Cooperative L2 caches: steal blocks in the neighbouring caches
Advantages:
- Reactivity
- Transparency of use
Drawbacks:
- Data movement
- Data duplication
Proposal
Addressable memory instead of L2 caches
Combines the advantages of the other approaches:
- Not visible to software at all
- Hardware implementation: efficiency, reactivity
- Addressable memory: maximal on-chip capacity

88 Data sharing and placement
- Solution independent of the type of accessed data: shared memory, local stacks, tasks, OS code
- Used through all software layers
[Figure: GPPs with caches running tasks Ti that call printf and write to a shared array A[0..3]; code addresses 0xbfc02000 <printf>, 0xbfc0320c <scanf>, 0xbfc04284 <fprint>.]
Data sharing and placement: Data placement and CPI
- CPI: measure of execution speed
- $CPI = f(AMAT) = f(t_{hit} + r_{miss} \times p_{miss})$
- Minimizing the miss penalty by moving data close to where it is used
- The access cost depends on the number of accesses and on the latency
[Figure: processors P0, P1, P2 and Pi accessing data D, with an alternative location D'.]
The optimal placement of D minimizes the sum of the access costs
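The statement that the optimal placement of D minimizes the sum of the access costs can be illustrated on a small mesh. The sketch assumes that latency is proportional to the Manhattan hop distance between the accessing processor and the memory node; that cost model, the function names and the access counts are assumptions, not values from the talk.

```python
def access_cost(node, accesses):
    """Sum over processors of (number of accesses) x (hop distance to node)."""
    x, y = node
    return sum(count * (abs(x - px) + abs(y - py))
               for (px, py), count in accesses.items())


def best_placement(nodes, accesses):
    """Memory node with the lowest total access cost (ties: smallest coordinates)."""
    return min(nodes, key=lambda node: (access_cost(node, accesses), node))


# 3x3 mesh; keys are processor coordinates, values are access counts to D.
mesh = [(x, y) for x in range(3) for y in range(3)]
skewed = {(0, 0): 10, (2, 2): 1, (2, 0): 1}    # one heavy user
balanced = {(0, 0): 1, (2, 0): 1, (1, 2): 1}   # three equal users

print(best_placement(mesh, skewed), access_cost((0, 0), skewed))
print(best_placement(mesh, balanced), access_cost((1, 0), balanced))
```

With one dominant user the data belongs next to that processor, while with balanced users it settles at a median position; this pure-distance optimum ignores congestion at the chosen node.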

89 Moving data around
Suboptimal data placement
System state:
- High access cost
- Distributed accesses
- No congestion points
[Figure: mesh showing data access requests and active tasks; access cost scale from 0 to MAX; congestion scale from 0 to 1 with threshold M_c.]
Problems: suboptimal data placement means
- High access cost (distance)
- High power consumption (distance)
- Low overall system performance
Moving data around
Optimal data placement
System state:
- Low access cost thanks to data locality
- Access congestion is happening on this zone
[Figure: same mesh after placement; access cost scale from 0 to MAX; congestion scale from 0 to 1 with threshold M_c.]
Change in situation:
- Data are closer
- A hot spot occurs
- Performance drops drastically

90 Moving data around (slide 17 / 27)
[Figure: placement of data D and D' before and after migration, with labels M_e and M_c, marking the best possible placement that does not incur congestion and the theoretical optimum for data placement; congestion gauge (0 to 1) with threshold M_c.]
Solution: stay under the congestion threshold.
1. Place data in a conservative manner.
2. Remove hot spots by migrating the often-accessed data.

Proposed Architecture (slide 18 / 27)
[Figure: 2D-mesh NoC whose nodes each contain a network interface (NI), a crossbar, an L1 cache, a processor P and a memory M; the memories are also connected by a ring (SRAM port, Ring In, Ring Out). Numbered elements 1 to 8 include the initiator NI, target NI, counters, node distance table, migration trigger, page tables and page table directory.]
Hardware add-ons: another interconnection network (ring); memory controller with access counters, page tables, etc.; cache controller with a small TLB.

91 How to access the moved data (slide 19 / 27, animation steps 1 and 2)
Addressing: physical address; hardware address translation; small TLB as translation cache; physical address cache; hardware page table.
[Figure: four memory banks (RAM 0: 0x100 -> 0x130, RAM 10: 0x200 -> 0x230, RAM 20: 0x300 -> 0x330, RAM 30: 0x400 -> 0x430), each with a directory (DIR), data and a table of @src entries, plus the caches (Icache/Dcache). Step 1: Processor 1 accesses 0x104 (miss/unc) and its entry for 0x100 is marked INVAL. Step 2: a response (rsp 0x320) fills the entry for 0x100 with 0x320.]
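
The lookup path shown on this slide (small TLB in front of a hardware page table) can be sketched as follows. This is a minimal software model under assumptions, not the SoCLib component: the page size of 0x10 is chosen only so that the example addresses of the figure work, the LRU replacement policy and the identity mapping for pages that never migrated are assumptions, and the names SmallTLB, translate and invalidate are hypothetical.

```python
from collections import OrderedDict

PAGE_SIZE = 0x10  # assumed page size, picked to match the figure's small addresses


class SmallTLB:
    """Small translation cache backed by a page table (software sketch)."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table   # stands in for the hardware page table
        self.capacity = capacity       # number of TLB entries (assumed)
        self.entries = OrderedDict()   # original page -> current page, LRU order
        self.misses = 0

    def translate(self, addr):
        offset = addr % PAGE_SIZE
        page = addr - offset
        if page in self.entries:
            self.entries.move_to_end(page)          # refresh LRU position
        else:
            self.misses += 1                        # TLB miss: ask the page table
            # Assumption: a page that never migrated maps to itself.
            self.entries[page] = self.page_table.get(page, page)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)    # evict least recently used
        return self.entries[page] + offset

    def invalidate(self, page):
        """Drop a stale entry, for example after the page migrated again."""
        self.entries.pop(page, None)


tlb = SmallTLB({0x100: 0x320})
print(hex(tlb.translate(0x104)))
```

With the single mapping 0x100 -> 0x320 taken from the figure, an access to 0x104 is redirected to 0x324. The invalidate method hints at the TLB coherency issue: every cached copy of a translation has to be dropped when a page moves.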

92 How to access the moved data (slide 19 / 27, animation step 3)
[Figure: same diagram as in the previous steps. Step 3 adds P1's access to 0x104 being issued as 0x324, consistent with the 0x100 -> 0x320 entry returned by rsp 0x320.]

Migration decision (slide 20 / 27)
Based on per-processor and per-page access counters.
Periodic migration ❶: a processor access counter reaches 128.
Congestion ❷: the most accessed page migrates.
[Figure: n x p counters (n nodes, p pages); page counters with a max id pointer to the most accessed page; a timer for periodic counter reset; node access cost (=1); comparisons against M_pe (leading to migration ❶) and against M_c (leading to migration ❷).]
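
The two triggers listed on the migration decision slide can be sketched in software as below. The slide only gives the ingredients (n x p counters, a pointer to the most accessed page, a periodic reset, the value 128 and a congestion threshold M_c), so the way they are combined here is an assumption, as are the class name MigrationTrigger, its methods and the default congestion threshold.

```python
class MigrationTrigger:
    """Per-node, per-page access counters with two migration triggers (sketch)."""

    def __init__(self, n_nodes, n_pages, access_threshold=128,
                 congestion_threshold=1000):
        self.access_threshold = access_threshold          # 128 on the slide
        self.congestion_threshold = congestion_threshold  # plays the role of M_c (value assumed)
        self.counters = [[0] * n_pages for _ in range(n_nodes)]
        self.page_totals = [0] * n_pages
        self.total = 0
        self.max_page = 0  # "max id": index of the most accessed page

    def record_access(self, node, page):
        """Count one access; return (kind, page, node) if a migration is due."""
        self.counters[node][page] += 1
        self.page_totals[page] += 1
        self.total += 1
        if self.page_totals[page] > self.page_totals[self.max_page]:
            self.max_page = page
        # Trigger 1: one processor has accessed this page often enough.
        if self.counters[node][page] >= self.access_threshold:
            return ("periodic", page, node)
        # Trigger 2: the bank is congested, so move its most accessed page.
        if self.total > self.congestion_threshold:
            return ("congestion", self.max_page, None)
        return None

    def reset(self):
        """Periodic counter reset, driven by a timer in the slide's figure."""
        for row in self.counters:
            for i in range(len(row)):
                row[i] = 0
        self.page_totals = [0] * len(self.page_totals)
        self.total = 0
        self.max_page = 0


trigger = MigrationTrigger(n_nodes=16, n_pages=8)
for _ in range(128):
    decision = trigger.record_access(node=3, page=5)
print(decision)
```

Choosing the destination bank (for instance with the node distance table of the proposed architecture) is left out of this sketch.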

93 Data transfer and target memory choice (slide 21 / 27)
Transfer mechanism characteristics: concurrent transfer of 2 pages on the ring; takes 2000 cycles (4 KB pages); still possible to access the other pages of the banks.
[Figure: simultaneous transfer of two pages with arbitration of accesses; port with Ring In and Ring Out, NI master and NI target connected to the NoC.]
Challenges: 1) TLB coherency; 2) deadlock avoidance.

Experiments and results (slide 22 / 27)
Platform: SoCLib CABA simulation; 16 MIPS processors; DSPIN NoC, 4×4 2D mesh.
New SoCLib components: memory with 8 concurrent finite state machines; caches with 5 concurrent finite state machines; synchronous ring interfaces.
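
As a rough sanity check of the transfer figure on slide 21, the throughput implied by 2000 cycles per page can be worked out as below. The numbers rest on assumptions: a page of 4096 bytes, a 32-bit (4-byte) ring word, which the slides do not state, and the reading that the 2000 cycles apply to one page. The function name transfer_stats is hypothetical.

```python
def transfer_stats(page_bytes=4096, cycles=2000, word_bytes=4):
    """Return (bytes per cycle, cycles per word) for one page transfer."""
    words = page_bytes // word_bytes      # 1024 words for the assumed sizes
    return page_bytes / cycles, cycles / words


bytes_per_cycle, cycles_per_word = transfer_stats()
print(f"{bytes_per_cycle:.3f} bytes/cycle, {cycles_per_word:.3f} cycles/word")
```

Under these assumptions a page moves at about 2 bytes per cycle, that is roughly one 32-bit word every two cycles.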

94 Experiments and results (slide 23 / 27)
Applications: MJPEG, 6 tasks (4 processors); SPLASH-2: water nsquared, ocean (contiguous), fft.
Operating system: DNA (in-house), POSIX threads, Red Hat newlib (libc), 100 KB.
Two possible configurations:
1. AD: dynamic task migration, data allocated sequentially.
2. AS: statically pinned task and task's stack, shared data placed as efficiently as possible.

Experiments and results: Total time (slide 24 / 27)
[Figure: normalized execution time (0% to 100%), 512-byte pages, for fft, ocean_c, water_ns and mjpeg; series AD + P_STD, AD + P_MIG, AS + P_STD, AS + P_MIG.]
Analysis: AD: 40% to 70% gain in execution time. AS: -5% to 15% gain in execution time. Migration makes AD similar to AS without any software work.

95 Experiments and results: Total hop count (slide 25 / 27)
[Figure: normalized accumulated hop count (0% to 100%), 512-byte pages, for fft, ocean_c, water_ns and mjpeg; series AD + P_STD, AD + P_MIG, AS + P_STD, AS + P_MIG.]
Analysis (hop count is somewhat related to power consumption): AD: 90% to 75% gain in hop count. AS: -20% to 15% gain in hop count. Migration makes AD similar to AS without any software work.

Summary and Conclusion (slide 26 / 27)
Cycle-accurate simulation is a necessity: to validate complex hardware concepts; to obtain trustworthy results; and it is easier/faster than HDLs.
SoCLib advantages: first available library of interoperable modules; used in academia for research and teaching; provides many existing components with sources; a nice start for trying new ideas at IP/system level.
SoCLib drawbacks: still requires a lot of work, as designs are complex; learning curve; speed of simulation may be (is?) a problem.

96 Summary and Conclusion (slide 27 / 27)
However, the balance is largely positive!