Virtual Prototyping of NoC-based MPSoC : Fundamentals & Case Studies

Virtual Prototyping principles
- Virtual prototyping with TLM2.0, Laurent Maillet-Contoz, STMicro
- SystemC, and the various abstraction levels for NoC-based MPSoC modelling, the corresponding simulation algorithms and the VCI/OCP communication standards, Alain Greiner, LIP6
- Design space exploration & mapping tool, Nicolas Pouillon, LIP6

Use cases
- Modelling an H264 decoder, Fabien Colas-Bigey, Thales
- Performance evaluation of a multi-core architecture, Nguyen Nam, Bull
- Prototyping a hardware-based page migration mechanism on NoC-based shared-memory multicore architectures, Frederic Pétrot, TIMA

ASYNC 2010 NOCS rd 6th May 2010, CEA-LETI, MINATEC, Grenoble, France

Virtual prototyping with TLM2.0
Laurent Maillet-Contoz

Agenda
- Motivations for virtual prototyping
- ESL standards: rationale and evolution
- TLM2 at a glance
- Comments on virtual prototyping experience
- Conclusion

Motivations for Virtual Prototyping
- SoC architecture exploration
- Functional validation environment
- Pre-silicon software development

[Figure: HDL simulation and FPGA prototyping are both accurate but come too late; one is also too slow and the other too costly. A SoC virtual prototype runs the software (SW) on TLM models of the hardware blocks, connected to input and output devices.]

Rationale to support standards
- Model interoperability
  - Integrate models coming from different IP suppliers
  - Deliver subsystems and/or virtual platforms to customers
- CAD tools support
  - Benefit from CAD tools support
  - Benefit from best-in-class tools from various providers without migration campaigns
- SystemC and IP-XACT standards are required and complementary

Adopting standards
- OSCI/SystemC
  - A single language for modeling hardware/software systems
  - Supports multiple abstraction levels
  - An object-oriented approach built on top of C++ as a set of classes
- SPIRIT/IP-XACT
  - Covers HW IP interfaces, register banks and configurations
  - Supports RTL and TLM abstractions
  - Based on XML
- Benefits of standards
  - Enable competition between suppliers
  - Avoid dependency on suppliers' proprietary formats
  - Enable adoption of new approaches inside the company

SystemC standards evolution
- OSCI TLM1: PV, PVT; core TLM interfaces transport, put, get; payload REQ, RESP
- OSCI TLM2: LT, AT; TLM interfaces b_transport, nb_transport; payload tlm_transaction; extension
- IEEE: TLM complements the IEEE standard; the IEEE revision includes TLM1 and TLM2

TLM2 content
- Targets interoperability of TLM models
- Dedicated to memory-mapped bus communication
- 2 modeling styles
  - Loosely Timed (LT)
  - Approximately Timed (AT)
- Transaction-level communication APIs
  - Generic transaction payload
  - Transport
  - Backdoor and debug
- Sockets
  - Expose all interfaces
  - Simplify binding

Modeling styles
- Loosely Timed
  - Sufficient to develop embedded software and functional verification test suites
  - Able to boot an OS and run multi-core systems
  - System synchronization scheme to be modelled
  - Easy and fast to set up
- Approximately Timed
  - Cycle-approximate or cycle-count-accurate
  - Suitable for architectural exploration
  - Supports pipelining and out-of-order completion
  - Requires significant modeling effort

TLM2 payload
- Typical attributes of memory-mapped busses
  - Command, address, data, byte enables, single word transfers, burst transfers, streaming, response status
- Standard-defined extension mechanism
- Generic payload
  - Can be used as is for LT style (abstract bus)
  - Define ignorable extensions for simulation artefacts
- A starting point to model specific bus protocols
  - Define mandatory extensions to cover missing attributes
  - Compile-time type checking to avoid incompatibility

Extension mechanism
- A way to extend the generic payload definition
  - Ignorable: binding of heterogeneous models allowed
  - Mandatory: strict matching required
- The generic payload defines an array of pointers to extensions
- The extension mechanism is standard, extensions are not!

TLM2 b_transport interface
- Typically used in a Loosely Timed context
- The transaction is completed when the function call returns
  - No pipelining
  - No out-of-order completion
- No thread needed on the target side
- May execute wait() in the implementation

    void b_transport( TRANS&, sc_time& );

[Figure: forward path from initiator to target]

TLM2 nb_transport interface
- Typically used in an Approximately Timed context
- Requires phase management
  - BEGIN_REQ, END_REQ, BEGIN_RESP, END_RESP
- A thread is required on the target side
- wait() in the implementation is not allowed

    tlm_sync_enum nb_transport_fw( TRANS&, PHASE&, sc_time& );
    tlm_sync_enum nb_transport_bw( TRANS&, PHASE&, sc_time& );

[Figure: forward path from initiator to target, backward path from target to initiator]

Other interfaces
- Direct memory access
  - Backdoor access to enable simulation speed-up
  - Interconnect model and sockets are bypassed
  - Target responsible for invalidating the memory pointer if needed

    bool get_direct_mem_ptr( TRANS& trans, tlm_dmi& dmi_data );
    void invalidate_direct_mem_ptr( sc_dt::uint64 start_range, sc_dt::uint64 end_range );

- Debug API
  - Same as regular b_transport but for debug accesses
  - No side effect in target
  - No timing

    unsigned int transport_dbg( TRANS& trans );

Sockets
- Can be seen as super communication ports
- Group the transport, DMI and debug transport interfaces
- Bind forward and backward paths with a single call
- Strong connection checking
- Have a bus width parameter

Full interoperability is still the Holy Grail
- The models ecosystem is emerging slowly
  - Issues with business models from IP/model suppliers
- TLM2-based models are not always interoperable
  - Variant interpretations of the standard
- Important items still to be addressed
  - Model configuration, control & inspection (ongoing work in the OSCI CCI working group)
  - Interrupt management
  - System address map
  - Synchronization schemes

Transaction Level Model Value Chain
[Figure: from IP specification and SoC specification, modeling engineers add high-expertise value by building TLM models on top of standards: a methodology, the TLM communication APIs (TLM1.0, TLM 2.0, included in the IEEE revision), and the SystemC C++ class library (IEEE). IP TLM models are assembled with an interconnect TLM model into a SoC virtual prototype.]
- A TLM model is:
  - Bit accurate
  - Register accurate
  - Functionally correct
  - Communicating through transactions
  - Of variable timing accuracy

Virtual Prototypes with TLM
- Applied in the industry for complex SoCs
  - System architecture exploration
  - Anticipation of RTL functional verification
  - Pre-silicon software development

[Figure: uses of LT (Loosely Timed) and AT (Approximately Timed) models: functional verification, functional embedded software, TLM/RTL co-simulation, architecture analysis, embedded software optimization.]

Conclusion
- Virtual prototypes enable key activities
  - System architecture exploration
  - Anticipation of verification activities
  - Pre-silicon software development
- TLM2 is a step towards model interoperability
  - But several steps further down the road are needed
- Industry is adopting virtual prototypes
  - Proprietary developments
  - Use of standards
  - Well-defined modeling methodology
- Education is key to success!

Virtual prototyping of NoC based MP-SoCs
SystemC modeling in the SoCLib platform
Alain Greiner

OUTLINE
- The SoCLib Platform
- The VCI protocol
- CABA modeling
- TLM-DT modeling

Alain Greiner NOC 2010 : Virtual prototyping of NoC based MP-SoCs / The SoCLib Platform

Example of a shared memory MPSoC architecture

[Figure: an input engine, two external RAM controllers and an output engine are attached through VCI interfaces to a VCI/OCP network on chip. The network on chip also connects sub-systems (sub-system 5 and sub-system 9 are shown), each built around a VCI/OCP local interconnect with processors (Proc 5.1 ... Proc 5.4, Proc 9.1 ... Proc 9.4), a RAM (RAM 5.1, RAM 9.1) and a locks engine (Locks 5.2, Locks 9.2).]

GOALS
- Design space exploration: fast performance evaluation for the mapping of a multi-threaded software application on a multi-processor hardware architecture.
- Hardware/software codesign: provide the system designer with a reliable simulation environment to develop embedded software.
- Platform-based design: define a set of reference generic hardware/software multi-processor architectures.
- IP reuse: clear separation between computing resources and communication resources to support component reuse.

SoCLib Partners
The SoCLib platform has been funded by the French authorities, and developed by 10 industrial companies and 10 laboratories:
- Industrial companies: Thales, ST Micro-electronics, Thomson, Orange, Magillem Design Services, TurboConcept
- Laboratories: LIP6, INRIA, CEA-LIST, ENST, IRISA, LabSTIC, TIMA, CEA-LETI, CITI, IETR

SoCLib General Principles
- The modeling language is SystemC.
- Two simulation models for each hardware component:
  - CABA (Cycle-Accurate / Bit-Accurate)
  - TLMT (Transaction Level Model with Time)
- All hardware components respect the same (VCI/OCP) communication protocol.
- All simulation models will be available as free software.
- For each SoCLib component, there is a (commercially available) synthesizable RTL model.

Hardware Components (1/2)
- General purpose processors: MIPS32, SPARC V8, ARM-v6t, PowerPC 405, Lattice Mico32, Nios2, MicroBlaze
- Digital signal processors: ST231, TMS320 C62
- System utilities: interrupt controller, locks engine, DMA controller, MWMR controller, frame buffer controller, disk controller
- NoC interconnects: VCI Generic Micro Network, DSPIN Micro Network
- ORIGIN column, in slide order: LIP6, ENST, LIP6, LIP6, ENST, IRISA, TIMA, INRIA, IRISA, LIP6, LIP6, LIP6, LIP6, LIP6, LIP6, LIP6, LIP6, IRISA

Hardware Components (2/2)
- Bus & crossbar interconnects
  - VCI Generic System Bus: LIP6
  - VCI / Pibus: LIP6
  - VCI / Avalon: IRISA
  - VCI / Token Ring: LIP6
  - VCI / Local Crossbar: LIP6
- Memory controllers
  - Embedded SRAM: LIP6
  - External SDRAM: Magillem
- Dedicated coprocessors
  - Turbo decoder TC1000: TURBOCONCEPT
  - Turbo decoder TC3000: TURBOCONCEPT
  - LDPC: ENST
  - DWT: LESTER
  - MOD: LESTER
  - DEM: LESTER
- Bridges
  - VCI / PCI bridge: IETR
  - VCI / USB bridge: ENST
  - VCI / CAN bridge: ENST
  - VCI / I2C bridge: LISIF
  - VCI / Wishbone: ENST

Processor Modeling
All SoCLib general purpose processors are modeled as cycle-based ISS (Instruction Set Simulators). The same ISS is used by the two simulation models.
- MIPS32: CABA/MIPS_ISS, TLM-DT/MIPS_ISS
- SPARC V8: CABA/SPARC_ISS, TLM-DT/SPARC_ISS
- PowerPC 405: CABA/PPC_ISS, TLM-DT/PPC_ISS
- ARM-v6t: CABA/ARM_ISS, TLM-DT/ARM_ISS
- Lattice Mico32: CABA/L32_ISS, TLM-DT/L32_ISS
- MicroBlaze: CABA/MB_ISS, TLM-DT/MB_ISS
- Nios2: CABA/NIOS_ISS, TLM-DT/NIOS_ISS

Operating systems
Various operating systems have been ported or developed for SoCLib-based MP-SoC architectures:
- DNA/OS: micro-kernel for shared memory MP-SoCs, providing the POSIX thread API.
- MutekH: exo-kernel-based, supporting both homogeneous and heterogeneous MP-SoC architectures, with POSIX threads and OpenMP support.
- RTEMS: real-time operating system for message-passing multi-processor architectures.
- NetBSD: general purpose UNIX-like operating system with paged virtual memory support, and all UNIX facilities (monoprocessor).

VCI Goals
- IP reuse requires a clear separation between computation and communication in designing reusable hardware components.
- Support scalable, massively parallel multi-processor architectures: we want to build hardware architectures containing several hundreds of embedded processors.
- Support the shared address space communication paradigm: each VCI initiator can send «read» or «write» commands to each VCI target existing in the system, and the selected target is identified by the MSB bits of the address.
- Simplify the access to the interconnect, by giving each initiator the illusion that it has a direct, point-to-point communication with each target in the system.
- Support various interconnect architectures: from the classical shared system bus to distributed, hierarchical networks on chip.

Shared address space communications
[Figure: initiators I0 to I3 connected to targets T0 to T3 through an interconnect]

System bus
[Figure: initiators I0 to I3 and targets T0 to T3 on a shared bus]
- Complexity: O(M + T)
- Bandwidth: O(1)
- => non scalable

Crossbar
[Figure: initiators I0 to I3 connected to targets T0 to T3 through a crossbar]
- There are actually two separate crossbars, for commands & responses
- Complexity: O(M * T)
- Bandwidth: O(max(M, T))
- => non scalable

Multi-stage network on chip
[Figure: initiators I0 to I3 connected to targets T0 to T3 through four routers]
- Here again, there are actually two separate networks, for commands & responses
- Complexity: O(log(M + T))
- Bandwidth: O(max(M, T))
- => scalable

Virtual Component Interface
[Figure: initiators I0 to I2 and targets T0 to T3 each connect through a VCI interface to the interconnect, which carries commands from initiators to targets and responses from targets to initiators]

VCI wrappers
[Figure: initiator wrappers IW0 to IW2 and target wrappers TW0 to TW3 adapt the VCI interfaces of initiators I0 to I2 and targets T0 to T3 to the physical interconnect]

WRAPPERS VCI / PIBUS
[Figure: targets T0 to T3 (through T-wrappers) and initiators I0 to I2 (through I-wrappers) attached to a PIBUS with its BCU]

WRAPPERS VCI / TOKEN RING
[Figure: targets T0 to T3 (through T-wrappers) and initiators I0 to I2 (through I-wrappers) attached to a token ring]

WRAPPERS VCI / AMBA
[Figure: targets T0 to T3 (through T-wrappers) and initiators I0 to I2 (through I-wrappers) attached to an AMBA bus (DTR, AD, DTW lines) with its BCU]

WRAPPERS VCI / DSPIN
[Figure: targets T0 to T3 (through T-wrappers) and initiators I0 to I2 (through I-wrappers) attached to the DSPIN network on chip]

VCI Principles
- All initiator & target components share the same address space.
- A VCI initiator starts a VCI transaction by sending a VCI command packet. Bursts are supported, for both read & write commands.
- The VCI command packet is routed to the selected target by decoding the MSB bits of the VCI address.
- The selected VCI target completes the transaction by sending a VCI response packet.
- Both command and response packets contain a source identifier, which is used to route the response packet to the proper initiator.
- All VCI packets must be transported through the interconnect network as atomic entities.
- A given master can initiate several simultaneous transactions: the command packet for transaction (n+1) can be sent before the response packet for transaction (n) is received. This requires a packet identifier in both command and response packets.

Flow control mechanism
[Figure: a producer and a consumer, both clocked by CK, linked by the 1-bit signals W/ROK and R/WOK and an n-bit data bus DOUT/DIN]
- «FIFO» protocol: a data word (n bits) is transmitted when both the W/ROK and the R/WOK signals are true.
- The maximum bandwidth is one word (n bits) per cycle.
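The routing rules in the VCI principles above (target selected by the address MSBs, response routed back by the source identifier) can be sketched in a few lines of plain C++. This is only an illustration, not SoCLib code: the 32-bit address width, the number of MSB bits, and the names `target_index`, `Command`, `Response` and `make_response` are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>

// Assumption for this sketch: VCI addresses are 32 bits wide.
constexpr unsigned kAddressBits = 32;

// Returns the index of the target selected by the `msb_bits` most
// significant bits of the address.
inline unsigned target_index(std::uint32_t address, unsigned msb_bits) {
    assert(msb_bits > 0 && msb_bits < kAddressBits);
    return address >> (kAddressBits - msb_bits);
}

// Hypothetical, reduced command and response packets.
struct Command  { std::uint32_t address; unsigned srcid; unsigned pktid; };
struct Response { unsigned srcid; unsigned pktid; };

// The target copies the source and packet identifiers into the response,
// so the interconnect can route it back and the initiator can match it
// with the pending transaction.
inline Response make_response(const Command& cmd) {
    return Response{cmd.srcid, cmd.pktid};
}
```

With 4 MSB bits, for instance, the address 0x40000010 selects target 4, whatever the initiator that issued the command.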

FIFO protocol implementation
[Figure: state diagrams of the producer and consumer FSMs, driving W/ROK and R/WOK, with the n-bit data path from Dout to Din]
This communication protocol is easily implemented by Moore FSMs for both the producer & the consumer, without any combinational handshaking.

VCI / requests and responses
[Figure: the command packet travels from the VCI initiator through the initiator wrapper, the network on chip and the target wrapper to the VCI target; the response packet travels the reverse path]

VCI Physical Interface
[Figure: signals between a VCI initiator and a VCI target. Bus widths in the figure are labelled 1, 2, 8, n, m, p, q and r.]
- Command: CMD_VAL, CMD_ACK, CMD_EOP, CMD_ADDRESS, CMD_CMD, CMD_PLEN, CMD_WDATA, CMD_SRCID, CMD_TRDID, CMD_PKTID
- Response: RSP_VAL, RSP_ACK, RSP_EOP, RSP_ERROR, RSP_RDATA, RSP_SRCID, RSP_TRDID, RSP_PKTID

VCI Packet length
A VCI cell (a flit) is transmitted at each cycle where both CMDVAL & CMDACK (resp. RSPVAL & RSPACK) are true.
The burst length is defined by the PLEN field (number of bytes) and depends on the transaction type:
- Read: command packet 1 flit (= 1 cycle), response packet N flits (= N cycles)
- Write: command packet N flits (= N cycles), response packet 1 flit (= 1 cycle)
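The packet length rule above can be written as a small function. This is a hedged sketch: the slides do not state the data width, so the 4-byte cell size (`kBytesPerFlit`) is an assumption, and `VciCommand`, `PacketLengths` and `packet_lengths` are names invented for the example. Only plain reads and writes are covered.

```cpp
#include <cassert>
#include <cstddef>

enum class VciCommand { Read, Write };

struct PacketLengths {
    std::size_t command_flits;
    std::size_t response_flits;
};

// Assumption for this sketch: one flit carries 4 bytes of data.
constexpr std::size_t kBytesPerFlit = 4;

// Number of flits (hence minimum number of cycles) of the command and
// response packets for a burst of `plen_bytes` bytes.
inline PacketLengths packet_lengths(VciCommand cmd, std::size_t plen_bytes) {
    assert(plen_bytes > 0);
    // Round up so that a partial last word still takes one flit.
    const std::size_t n = (plen_bytes + kBytesPerFlit - 1) / kBytesPerFlit;
    if (cmd == VciCommand::Read) {
        return {1, n};   // short command, N-flit response
    }
    return {n, 1};       // N-flit command, short response
}
```

Under these assumptions, reading a 32-byte cache line costs a 1-flit command and an 8-flit response.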

VCI commands
Four command types are defined:
- READ: simple read
- WRITE: standard write
- LINKED LOAD: read with exclusivity
- STORE COND: conditional write

VCI versus OCP
- The OCP (Open Core Protocol) is an industrial evolution of the VCI standard.
- OCP is more flexible and defines more functionalities than the VCI specification.
- A detailed analysis demonstrates that the OCP protocol is functionally equivalent to the VCI Advanced protocol:
  - signal encoding is different
  - chronograms are identical.

CONCLUSION
- The VCI Advanced protocol is well suited for shared address space MP-SoC architectures.
- It supports all types of system interconnect:
  - shared system bus
  - token ring
  - full crossbar
  - multi-stage micro-networks
- It strongly enforces the IP reuse policy, and simplifies IP design.
- It is functionally compatible with the OCP protocol.

CABA modeling: Overview
- Reliable performance evaluation requires accurate time modeling. All SoCLib hardware components provide a Cycle-Accurate / Bit-Accurate simulation model, but the simulation speed is an issue.
- The SoCLib CABA modeling is based on the CSFSM theory (Communicating Synchronous Finite State Machines). These models can be simulated by the standard OSCI simulation engine, but the simulation can be strongly accelerated by specialised simulation engines, using static scheduling techniques.
- Topics:
  - The CSFSM model
  - SystemC CABA modeling
  - Static scheduling
  - SystemCASS accelerator

Communicating Synchronous Finite State Machine
[Figure: three FSMs sharing the clock CK, each made of a transition function (T0, T1, T2), a state register (R0, R1, R2) and a generation function (G0, G1, G2)]

VCI-based multi-processor architecture
[Figure: processors P0 to P3, each behind a master wrapper (MW), and memories Ram 0 to Ram 7, each behind a slave wrapper (SW), connected through VCI interfaces to a network of four routers]

Cycle-based Simulation Principle
No scheduler! The simulation engine is just an execution loop:

    for (cycle = 0; cycle < MAX; cycle++) {
        Transition(MODULE_0);
        /* ... */
        Transition(MODULE_N);
        Generation(MODULE_0);
        /* ... */
        Generation(MODULE_N);
    }

If all components behave as Moore FSMs, the evaluation order is not significant.
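The execution loop above can be tried out in plain C++. The sketch below is not SoCLib or SystemCASS code: `Module`, `simulate`, `Example` and `run_example` are hypothetical names, and the two modules (a counter A driving a wire, an accumulator B reading it) are made up to show that, with Moore outputs only, swapping the evaluation order does not change the result.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// A module is reduced to its two functions: transition() computes the next
// state from the wires, generation() drives the wires from the state only
// (Moore behaviour).
struct Module {
    std::function<void()> transition;
    std::function<void()> generation;
};

// The whole "engine": all transitions, then all generations, every cycle.
inline void simulate(std::vector<Module>& modules, unsigned max_cycles) {
    for (unsigned cycle = 0; cycle < max_cycles; ++cycle) {
        for (auto& m : modules) m.transition();
        for (auto& m : modules) m.generation();
    }
}

// Hypothetical two-module system used for illustration.
struct Example {
    int wire_a = 0;  // output of A, input of B
    int count = 0;   // state register of A
    int sum = 0;     // state register of B

    std::vector<Module> modules(bool reversed) {
        Module a{[this] { count += 1; }, [this] { wire_a = count; }};
        Module b{[this] { sum += wire_a; }, [] {}};
        return reversed ? std::vector<Module>{b, a}
                        : std::vector<Module>{a, b};
    }
};

// Runs the example for `cycles` cycles and returns B's accumulated sum.
inline int run_example(bool reversed, unsigned cycles) {
    Example e;
    auto mods = e.modules(reversed);
    simulate(mods, cycles);
    return e.sum;
}
```

Because B only reads `wire_a` during the transition phase and A only writes it during the generation phase, `run_example(false, n)` and `run_example(true, n)` return the same value.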

The Mealy signals issue
The most frequent case: combinational dependencies.
[Figure: a processor core (registers PC and IR, combinational logic) exchanging an address and an instruction with an external cache through combinational paths]

Mealy Dependency Graph
- The nodes are the signals.
- The arrows are Mealy dependencies between signals.
[Figure: dependency graph between signals X, Y and Z]
If the design contains Mealy signals, the evaluation ordering is critical.

CABA model general structure
[Figure: inputs E feed the TRANSITION function, which updates the STATE register; the MEALY GENERATION function (fed by state and inputs) and the MOORE GENERATION function (fed by state only) produce the outputs S]

Cycle-based simulation with Mealy signals

    for (cycle = 0; cycle < MAX; cycle++) {
        Transition(MODULE_0);
        /* ... */
        Transition(MODULE_N);
        GenMoore(MODULE_0);
        /* ... */
        GenMoore(MODULE_N);
        while (unstable) {
            GenMealy(MODULE_0);
            /* ... */
            GenMealy(MODULE_N);
        }
    }

The internal evaluation loop for the GenMealy() functions can be optimised by a static ordering respecting the Mealy dependencies.
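The inner `while (unstable)` loop and the benefit of a static ordering can be illustrated with a small fixed-point iteration. This is a sketch under assumptions: `Wires`, `settle`, `gen_y` and `gen_z` are invented for the example, and the two Mealy functions form a made-up chain x -> y -> z.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical combinational wires: y depends on x, z depends on y.
struct Wires { int x = 0; int y = 0; int z = 0; };

inline bool same(const Wires& a, const Wires& b) {
    return a.x == b.x && a.y == b.y && a.z == b.z;
}

using GenMealy = std::function<void(Wires&)>;

// Re-evaluates every Mealy function until no wire changes any more.
// Returns the number of passes, including the final pass that detects
// stability. `max_passes` guards against combinational loops.
inline unsigned settle(Wires& w, const std::vector<GenMealy>& gens,
                       unsigned max_passes = 100) {
    unsigned passes = 0;
    bool unstable = true;
    while (unstable && passes < max_passes) {
        const Wires before = w;
        for (const auto& g : gens) g(w);
        unstable = !same(before, w);
        ++passes;
    }
    return passes;
}

inline GenMealy gen_y() { return [](Wires& w) { w.y = w.x + 1; }; }
inline GenMealy gen_z() { return [](Wires& w) { w.z = 2 * w.y; }; }
```

Evaluating `gen_y` before `gen_z` follows the dependency order and settles one pass earlier than the reverse order; a statically computed order removes the need for the stability test altogether when the dependency graph is acyclic.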

Multi-FSMs component
[Figure: a component with inputs E containing two FSMs (T0, R0, G0 and T1, R1, G1) producing outputs S0 and S1]

    Transition() {
        R0 = T0(E, R0, R1);
        R1 = T1(E, R0, R1);
    }

The actual modification of a register value must be delayed until the end of the transition function. We use the sc_signal type to model internal registers.

Cycle-Based simulation with OSCI SystemC
We use the sensitivity lists to force the cycle-based scheduling:
- The sensitivity list of the Transition() methods contains only the CK rising edge.
- The sensitivity list of the GenMoore() methods contains only the CK falling edge.
- The sensitivity list of the GenMealy() methods contains the CK falling edge, plus a subset of the input signals.
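The delayed register update required for multi-FSM components can be mimicked without SystemC by a register that separates the value being read from the value being written. The `Register` class below is a minimal stand-in written for this example, not the sc_signal API; `SwapComponent` is a hypothetical component whose two FSMs exchange their register values.

```cpp
#include <cassert>

// Two-phase register: write() only records the next value, which becomes
// visible to read() after update().
template <typename T>
class Register {
public:
    explicit Register(T init) : current_(init), next_(init) {}
    T read() const { return current_; }
    void write(T value) { next_ = value; }
    void update() { current_ = next_; }
private:
    T current_;
    T next_;
};

// Hypothetical component with two FSMs reading each other's register.
struct SwapComponent {
    Register<int> r0{1};
    Register<int> r1{2};

    void transition() {
        r0.write(r1.read());
        r1.write(r0.read());  // still sees the old value of r0
    }
    void commit() {           // end of the transition phase
        r0.update();
        r1.update();
    }
};
```

With plain variables, the second assignment would read the already modified `r0` and both registers would end up holding 2; with the delayed update, the values are correctly swapped.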

Cycle-based simulation with SystemCASS
- SystemCASS is a static-scheduling simulation engine, optimised to exploit the communicating FSMs modeling approach (CSFSM): a static scheduling, taking into account the Mealy dependencies, is computed at elaboration time.
- Contrary to OSCI SystemC, which is a «general-purpose» simulation engine, SystemCASS only accepts models respecting the CSFSM approach, but SystemCASS is about 10 times faster than OSCI SystemC.
- SystemCASS is distributed by UPMC/LIP6 as free software (GPL license), and can be downloaded from the SoCLib web server.
- Parallel versions of SystemCASS targeting multi-core SMP workstations are under development at the TIMA and LIP6 laboratories. Preliminary results demonstrated a quasi-linear speedup when there are no Mealy dependencies in the architecture.

CABA Simulation Speed
Simulation with SystemCASS, on a multi-core, 2.5 GHz, Linux PC.
- A/ Simple architecture: one single MIPS32 processor (with instruction & data caches), four memory banks, one TTY controller, one graphic controller, and one system bus controller: [...] cycles / s
- B/ Two-level multi-processor architecture: 16 MIPS32 processors in 4 clusters, with a local interconnect in each cluster, and the DSPIN micro-network as global interconnect: [...] cycles / s
- C/ Same 16-processor architecture, but with SystemCASS SMP, using 4 POSIX threads running in parallel on 4 Intel processor cores: [...] cycles / s
=> The simulation time increases linearly with the number of processor cores. On a quad-core workstation, the speedup provided by CABA parallel simulation is quasi-linear.

CONCLUSION
- Cycle-accurate SystemC simulation can be achieved with acceptable simulation speed for reliable performance evaluation of large multi-processor systems (up to 64 processors).
- A careful modeling policy, based on the CSFSM model and supporting static scheduling techniques, provides a simulation speed of about 1 million cycles/s (1 MHz) on a 1 GHz Linux PC, for a single-processor system.
- The static scheduling approach is compatible with parallel simulation on multi-core SMP workstations.

TLM-DT modeling: OVERVIEW
- TLM-DT stands for Transaction Level Modeling with Distributed Time.
- This modeling style has been designed to support parallel simulation on SMP multi-core workstations.
- It is based on the theory of PDES (Parallel Discrete Event Simulation) developed by Chandy / Misra / Bryant.
- It can be described as an extension of TLM2.0.
- Topics:
  - PDES principles
  - TLM-DT over TLM2.0
  - Component modelling
  - Experimental results

3 keys for simulation speed
- Flow control granularity
  - In CABA, the transfer unit is the «flit».
  - In TLM2.0, the transfer unit is the «packet».
- Interface Method Call
  - In CABA, data transfers are performed through signals.
  - In TLM2.0, data transfers are performed through Interface Method Calls.
- Parallel simulation
  - Standard TLM2.0 simulation relies on the SystemC global scheduler, which is hard to parallelize on an SMP workstation.
  - TLM-DT modeling does not use the SystemC global time, which opens the door to parallel simulation, but we must accept a shift from the global simulation time paradigm to the distributed simulation time paradigm.

PDES principles
- The complete simulated system is described as a set of logical processes that execute in parallel and communicate via point-to-point channels.
- There is neither a global scheduler nor a global clock.
- Each process has an associated local clock, defining a local time.
- The processes synchronize themselves by embedding timing information in the packets carried through the communication channels.

Optimistic PDES / Pessimistic PDES
[Figure: a process with local time T0 receiving messages from processes with local times T1, T2, T3 and T4]
- In pessimistic PDES, a process is allowed to increase its local time if and only if it has the guarantee that it cannot receive on any of its input channels a message with a timestamp smaller than its local time: this is called time filtering.
- This constraint can be violated in optimistic PDES, but the roll-back mechanism needed to put a process back into a previous state is very expensive and cannot be used for MPSoC modeling.

Null-messages
- The pessimistic PDES algorithm relies on temporal filtering of the incoming messages: a PDES process that has N input channels is only allowed to proceed when it has timing information on all its N input ports.
  => For example, a NoC interconnect is allowed to let a command packet reach a given target if and only if all the initiators that can theoretically address this target have sent at least one timed message.
- To solve this issue, the PDES algorithm uses null messages. A null message contains no data, only timing information.
- Moreover, each process can be in one of two modes: active or inactive. Only active processes participate in the temporal filtering.

TLM 2.0 overview
[Figure: TLM 2.0 overview, from T. Kogel, MPSoC 08]
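The temporal filtering rule with null messages and active/inactive channels can be sketched as follows. This is an illustration in plain C++17, not the SoCLib implementation: `Channel`, `safe_time` and `receive_null_message` are hypothetical names, and time is represented as an unsigned cycle count.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

using Time = std::uint64_t;

// One input channel of a PDES process.
struct Channel {
    bool active = true;                  // inactive channels are ignored
    std::optional<Time> last_timestamp;  // latest message time (data or null)
};

// Returns the time up to which the process may safely advance: the smallest
// timestamp seen on its active inputs. Returns nullopt while some active
// input has not delivered any timed message yet. With no active input at
// all, the bound is the maximum representable time.
inline std::optional<Time> safe_time(const std::vector<Channel>& inputs) {
    Time bound = std::numeric_limits<Time>::max();
    for (const auto& ch : inputs) {
        if (!ch.active) continue;
        if (!ch.last_timestamp) return std::nullopt;
        if (*ch.last_timestamp < bound) bound = *ch.last_timestamp;
    }
    return bound;
}

// A null message carries no data: it only refreshes the channel time.
inline void receive_null_message(Channel& ch, Time t) {
    ch.last_timestamp = t;
}
```

A silent initiator would block `safe_time` forever; a null message from it, or switching it to inactive mode, is what lets the receiving process make progress.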

Generic payload used in SoCLib
[Figure: TLM 2.0 generic payload, from T. Kogel, MPSoC 08]

soclib_payload_extension

    enum vci_command {
        VCI_READ_COMMAND,
        VCI_WRITE_COMMAND,
        VCI_LINKED_READ_COMMAND,
        VCI_STORE_CONDITIONAL_COMMAND,
        PDES_NULL_MESSAGE,
        PDES_ACTIVE,
        PDES_INACTIVE,
    };

    unsigned int     m_src_id;
    unsigned int     m_trd_id;
    unsigned int     m_pkt_id;
    enum vci_command m_vci_command;

Non-blocking transport payload
- The initiator local time is an offset relative to the SystemC simulation time.
- The SystemC simulation time always has a zero value.

Initiator and Target Modeling
- Each PDES process is implemented as an sc_thread. There is at least one sc_thread (and one associated local time) for each VCI initiator, and one sc_thread for each VCI target.
- Initiators: the «free running» of an initiator (such as a processor executing instructions contained in its L1 cache without a miss) is limited by a «timing quantum» (typically 100 cycles). When this boundary is reached, the sc_thread is descheduled, and a null message is sent.
- Targets: most targets have a purely «reactive» behavior: the associated sc_thread is in waiting mode until a command packet is received.
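The timing quantum rule for initiators can be sketched without SystemC. In the illustration below, `Initiator` and `run_free` are hypothetical names; recording the timestamp in a vector stands in for sending a null message, and the descheduling of the sc_thread is only indicated by a comment.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Time = std::uint64_t;

struct Initiator {
    Time local_time = 0;
    Time last_sync = 0;
    Time quantum = 100;               // "typically 100 cycles" in the text
    std::vector<Time> null_messages;  // timestamps of the null messages sent

    // Executes `cycles` cycles without any outgoing transaction (all cache
    // hits). Each time a full quantum has elapsed since the last
    // synchronization, a null message carrying the local time is emitted.
    void run_free(Time cycles) {
        for (Time i = 0; i < cycles; ++i) {
            ++local_time;
            if (local_time - last_sync >= quantum) {
                null_messages.push_back(local_time);
                last_sync = local_time;
                // A real model would deschedule its sc_thread here.
            }
        }
    }
};
```

The quantum bounds how far an initiator can run ahead of the rest of the system: a smaller quantum gives tighter synchronization at the price of more null messages.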

NoC Interconnect Behavior
- In shared memory MP-SoC architectures, all arbitration between concurrent transactions is done in the network on chip. Therefore, the PDES time filtering operation is mainly implemented in the NoC model.
- There is only one thread (and one associated local time) per interconnect. In the case of hierarchical interconnects (with N local interconnects and a global interconnect), there is one sc_thread per interconnect (N+1 in total).
- In order to avoid deadlocks, the dynamic contention (and the associated time filtering mechanism) is only implemented in the command network. The dynamic contention in the response network is neglected.
- All incoming command packets are stored in a centralized buffer indexed by the initiator index: when all slots of this buffer are filled with timed messages, the arbitration can occur. Only active initiators are taken into account.

NoC Modeling
[Figure: initiators 0, 1 and 3 and targets 0 and 1, each with its own thread (T), connected through the NoC model]
- Commands: time filtering and routing, using the central buffer (one thread)
- Responses: routing, but no time filtering (no thread)

44 Conclusion
Using the standard (sequential) OSCI SystemC simulation engine (SystemC 2.0) and the standard TLM2.0 package, the simulation speedup (TLM-DT versus CABA) is one order of magnitude.
The distributed time modeling approach (TLM-DT) removes the bottleneck associated with the SystemC centralized scheduler.
Preliminary results, obtained with a parallel SystemC SMP simulator on a quad-core workstation, demonstrate a quasi-linear speedup when simulating a large MP-SoC architecture with 160 processor cores.


46 DSX: Design space exploration and mapping tool
Nicolas Pouillon (LIP6), NoC Symposium Tutorial

Parallel application modeling: Goal
We want to: map a given application on a Multiprocessor System-on-Chip; have some parts of the application mapped on hardware blocks; validate early on a workstation; choose the actual hardware/software implementation late; debug all the way; try lots of different designs. This is Design Space Exploration.

47 Parallel application modeling: Focus
We'll focus on data-flow, packet-based applications: network packet handling; multimedia (audio, video).

Parallel application modeling: MWMR, high-level description
We use a generic communication middleware called MWMR (Multi-Writer Multi-Reader). It has the following properties: FIFO-like channel; conveys packets of fixed width called blocks; fixed depth; fixed access protocol; shared-memory-based.

48 Parallel application modeling: MWMR, more properties
Blocks are delivered in order; contiguity is not guaranteed; channels can be accessed from software tasks or hardware coprocessors in a uniform way. In its simple form (1 writer and 1 reader), MWMR is a bounded KPN-like channel.

Parallel application modeling: MWMR datatypes
MWMR uses 3 data buffers: a ring buffer; a status (read and write pointers, lock, usage counter); a descriptor holding the channel's properties (block width, channel depth, status and ring buffer addresses in memory).

49 Parallel application modeling: MWMR protocol
1. Take a lock on the status.
2. Read the status and decide what to transmit; if there is nothing to do, go to 5.
3. Do the data transfer between the local buffer and the data ring buffer.
4. Update the status buffer.
5. Release the status lock.

Parallel application modeling: MWMR illustration
[Figure: two producers (P0, P1) and two consumers (C0, C1) exchanging blocks A0 to A5 and B0 to B5 through an MWMR channel.]
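The five protocol steps can be sketched in Python as below. This is an illustration under stated assumptions, not the SoCLib MWMR implementation: the class `MwmrChannel`, its method names, and the use of `threading.Lock` in place of the in-memory status lock are all hypothetical. The status is reduced to the read pointer, write pointer and usage counter listed in the MWMR datatypes.

```python
import threading


class MwmrChannel:
    """Hypothetical software model of an MWMR channel."""

    def __init__(self, block_width: int, depth: int):
        self.block_width = block_width   # descriptor: size of one block
        self.depth = depth               # descriptor: number of blocks
        self.ring = [None] * depth       # ring buffer
        self.lock = threading.Lock()     # status: lock
        self.read_ptr = 0                # status: read pointer
        self.write_ptr = 0               # status: write pointer
        self.usage = 0                   # status: number of stored blocks

    def write(self, blocks) -> int:
        """Write as many blocks as fit; return how many were accepted."""
        if any(len(b) != self.block_width for b in blocks):
            raise ValueError("every block must have the channel block width")
        with self.lock:                                      # step 1
            n = min(len(blocks), self.depth - self.usage)    # step 2
            for block in blocks[:n]:                         # step 3
                self.ring[self.write_ptr] = block
                self.write_ptr = (self.write_ptr + 1) % self.depth
            self.usage += n                                  # step 4
        return n                                             # step 5: lock released

    def read(self, max_blocks: int) -> list:
        """Read up to max_blocks blocks, in FIFO order."""
        with self.lock:                                      # step 1
            n = min(max_blocks, self.usage)                  # step 2
            out = []
            for _ in range(n):                               # step 3
                out.append(self.ring[self.read_ptr])
                self.read_ptr = (self.read_ptr + 1) % self.depth
            self.usage -= n                                  # step 4
        return out                                           # step 5: lock released


channel = MwmrChannel(block_width=4, depth=2)
accepted = channel.write([b"A0A0", b"A1A1", b"A2A2"])   # only 2 blocks fit
```

When nothing can be transferred, `n` is 0 and the call falls through to the lock release, which corresponds to the "go to 5" branch of step 2.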

50 Parallel application modeling: MWMR conclusion
Each task can be either a hardware accelerator or a software task running on a programmable processor. If you want deterministic behavior (KPN), restrict each channel to 1 producer and 1 consumer.

51 DSX Goals: DSX, Design-Space Explorer
A SoC designer may want to easily design a parallel application (without bugs), taking dedicated hardware into account and focusing on optimization aspects one at a time. How to? DSX.

DSX Goals: How to feed DSX with an application?
Split the application into tasks (a task may be used more than once in the application). Describe the application in terms of tasks connected through explicit communication resources. Create a hardware platform to host the application. Map the application and its resources on the hardware platform.
[Figure: example task graph with tasks TG, Demux, VLD, IQZZ, IDCT, Libu, Split and Ramdac.]

52 DSX Goals: DSX, Design-Space Explorer
Everything is about choosing: implementation of tasks; hardware/software partitioning; hardware platform dimensioning and features; mapping of software tasks and resources; implementation of software communication routines. Tweaking of all the above must keep the application coherent.

DSX Goals: The big picture
[Figure: DSX flow. Inputs: TCG, mapping, hardware platform (ad hoc or generic infrastructure). Tools: cross compilers, MutekH, SystemC, SystemCASS, SoCView. Outputs: software application for validation on workstation, firmware binary image, hardware platform simulator (CABA, TLM-T, TLM), System on Chip.]
One binary image, many simulation abstraction levels.

53 DSX Application: Task and Communication Graph
With the previously defined tasks and the available communication frameworks, the designer creates a bipartite oriented graph where nodes are tasks or communication channels, and edges assign task ports to communication channels. This Task and Communication Graph (TCG) is our application.
[Figure: example task graph with tasks TG, Demux, VLD, IQZZ, IDCT, Libu, Split and Ramdac.]

DSX Application: Task modeling
A task is described by: its name; some communication ports, with associated communication type; one or more implementations, each of which can be a C program (or anything a compiler can handle), an ASM program, a handcrafted hardware coprocessor, or a synthesized hardware coprocessor. All implementations should behave the same way and are considered interchangeable.
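The bipartite structure of a TCG can be made concrete with a small consistency check. The sketch below is plain Python and does not use the DSX API; the function `unbound_ports`, the data layout and the channel names are hypothetical. Task names are borrowed from the example graph (Demux, VLD, IQZZ, IDCT). Each edge binds one task port to one channel, so edges never connect two tasks or two channels directly.

```python
def unbound_ports(tasks, channels, edges):
    """Validate a task/channel graph and return the ports left unconnected.

    tasks:    dict mapping a task name to the set of its port names
    channels: set of channel names
    edges:    list of (task, port, channel) bindings
    """
    bound = set()
    for task, port, channel in edges:
        if task not in tasks or port not in tasks[task]:
            raise ValueError(f"unknown port {task}.{port}")
        if channel not in channels:
            raise ValueError(f"unknown channel {channel}")
        if (task, port) in bound:
            raise ValueError(f"port {task}.{port} is bound twice")
        bound.add((task, port))
    all_ports = {(t, p) for t, ports in tasks.items() for p in ports}
    return sorted(all_ports - bound)


tasks = {
    "demux": {"out"},
    "vld": {"in", "out"},
    "iqzz": {"in", "out"},
    "idct": {"in", "out"},
}
channels = {"demux_vld", "vld_iqzz", "iqzz_idct"}
edges = [
    ("demux", "out", "demux_vld"), ("vld", "in", "demux_vld"),
    ("vld", "out", "vld_iqzz"), ("iqzz", "in", "vld_iqzz"),
    ("iqzz", "out", "iqzz_idct"), ("idct", "in", "iqzz_idct"),
]
missing = unbound_ports(tasks, channels, edges)
```

Here the output port of the last task is reported as unbound, which is the kind of error such a check can catch before any mapping is attempted.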

54 DSX Application: Communication and synchronization modeling
DSX lets you define your parallel programming model; some are available out of the box: Multi-Writer Multi-Reader channels; shared-memory buffers (no built-in synchronization); mutexes (locks); synchronization barriers; events.

DSX Application: Focus on application
Now that DSX knows about the application, the designer can try and test it on a classic workstation (if a software implementation exists for every task), with some instrumentation: debugging (gdb, ddd, valgrind, etc.); task profiling; communication profiling; communication channel logging (in MWMR at least).

55 DSX Hardware platform: Hardware definition
DSX lets designers use their own hardware components. To know how to use the components, DSX needs a description: the so-called Metadata. All DSX needs is a file containing: entity name; ports; parameters; dependencies on sub-modules; pointers to the corresponding implementation files. Fortunately, this is already done for all SoCLib modules!

DSX Hardware platform: Hardware usage
Designers can define their own hardware architecture. Several usual generic platforms are predefined and directly usable as a development base.
[Figure: example platforms built from Mips processors with Xcache, Ram, LocalCrossbar and VGMN interconnects.]
DSX can be used as a netlister for SystemC.

56 DSX Mapping the application on the platform: Mapping Application on Hardware
Finally, with the application TCG and the hardware platform description, DSX lets the designer map each application entity on the hardware platform, including: where tasks are to be executed (CPU, coprocessor); placement of communication channels in memory or in dedicated communication components; placement of software stacks in memory; placement of binary executable code in memory; placement of global utility objects in memory.

DSX Mapping the application on the platform: Python API
Every input description for DSX is made through a Python API, thus: a description is in fact a script; many close-enough descriptions can be factored into one, with parameters; changing those parameters can be done through the script; running the simulator and extracting performance figures can be made automatic. Design Space Exploration can be thoroughly automated!
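Because a description is a script, a parameter sweep is an ordinary loop. The sketch below shows the idea in plain Python without calling DSX: `explore` and `toy_cost` are hypothetical names, and `toy_cost` is a made-up formula standing in for "generate the platform, run the simulator, read back a cycle count". Its numbers carry no meaning beyond making the example runnable.

```python
import itertools


def explore(param_space, evaluate):
    """Evaluate every parameter combination and return (best_cost, best_config).

    param_space: dict mapping a parameter name to its candidate values
    evaluate:    function taking a config dict and returning a cost
    """
    names = sorted(param_space)
    results = []
    for values in itertools.product(*(param_space[name] for name in names)):
        config = dict(zip(names, values))
        results.append((evaluate(config), config))
    return min(results, key=lambda result: result[0])


def toy_cost(config):
    """Placeholder for a simulator run (illustrative formula only)."""
    compute = 1000 / config["n_cpus"]          # work shared between CPUs
    contention = 20 * config["n_cpus"]         # cost growing with CPU count
    misses = 4096 / config["cache_lines"]      # fewer misses with more lines
    return compute + contention + misses


best_cost, best_config = explore(
    {"n_cpus": [1, 2, 4, 8], "cache_lines": [64, 256]},
    toy_cost,
)
```

In a real flow the `evaluate` function would build the platform description from the parameters, launch the simulator and parse its output; only that function changes, not the exploration loop.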

57 Questions?


59 Porting an H264 decoder on a virtual platform
Fabien Colas-Bigey, Thales Communications. ANR SoCLib, NoC Symposium, 03/05/, Grenoble.

Agenda
Origin of the need; switching to MP-SoC architectures; new constraints imposed by MP-SoC; application: H264 decoder; presentation of the standard; main characteristics of the SoCLib implementation; software stack; hardware platforms (CABA, TLM-DT); experiments.

60 Why do we need MP-SoC?
Reduction of power consumption is the next challenge, even more important than processor size. Performance can no longer be increased simply by raising the frequency. We need to increase system complexity without increasing power consumption. Software applications and hardware platforms need to be tightly coupled: the platform only embeds the services required by the application. Multiprocessor platforms are becoming the reference for our suppliers. MP-SoCs are required to cope with the new requirements of modern systems.

New constraints imposed by MP-SoC
New architectures play in favour of power reduction, but:
How to program these new multiprocessor architectures efficiently? Expression of parallelism in applications; load balancing between several processing units; use of specialized instructions (SIMD, etc.).
How to predict the performance of the system? Load of cache coherency mechanisms; memory contention; synchronization overhead; global system validation (scheduling, hard device accesses, etc.).
How to manage data? Storage of shared data; validation of memory accesses; detection of cache break.

61 New constraints imposed by MP-SoC
[Figure: performance versus years, showing successive walls as ways of indeterminism.]
Frequency and power walls => ILP, pipelines, thread parallelism, multicores => unpredictable performance. Memory wall => memory hierarchies (cache) => unpredictable performance. Parallelisation wall => shared memories, cache coherency => unpredictable performance. Customisation wall => reconfiguration.

Application: H264 decoder
Video compression standard developed by the Joint Video Team (JVT) of the ITU-T VCEG and ISO/IEC MPEG groups. Several profiles: Baseline (intra and inter predicted images, CAVLC entropy coding, deblocking filter); Main; Extended.
Our implementation is the constrained baseline profile. The base element is a macroblock: 16x16 pixels.

62 Application: H264 decoder, predictions
Spatial prediction: intra predicted frames. Temporal prediction: inter predicted frames.

Application: H264 decoder, predictions and dependencies
Prediction of prediction information; prediction of macroblocks.
[Figure: macroblock E and its neighbours A, B, C and D in INTRA and INTER frames, with the labels "No spatial dependency" and "Temporal dependency".]

63 Application: H264 decoder
An implementation following the SPMD scheme. A common part manages the input stream and dispatches data. Each task is in charge of a part of the image called a slice.

Software stack
The operating system was developed by LIP6 within the SoCLib project.
[Figure: software stack, from top to bottom: H264 application and benchmark application; MUTEK/H with interrupt management; HEXO abstraction layer; SoCLib virtual platform.]

64 Hardware platforms: CABA
[Figure: four Xcache wrappers (MIPS32EL), a loader, ROM and RAM, connected through the VGMN interconnect (micro-network) to XICU + Timer, TTY, Ramdisk and Frame Buffer.]

Hardware platforms: CABA
The number of processors is configurable: 1 to 8. The type of processor is configurable: MIPS, PowerPC, ARM. The ramdisk is used to load files into memory. The frame buffer is a video viewer used to observe YUV or RGB images. The ROM contains the boot and text sections of the executable. The system can be used with a NoC (vgmn) or a bus (vgsb). No cache coherency is implemented on this platform.

65 Hardware platforms: TLM-DT
[Figure: four Xcache wrappers (MIPS32EL), a loader, RAM0 and RAM1, connected through the VGMN interconnect (micro-network) to ICU, Timer, TTY, Ramdisk and Frame Buffer.]

Hardware platforms: TLM-DT
Almost identical to the CABA platform. The XICU has been replaced by ICU + Timer.

66 Performance evaluation: parallelisation efficiency on a CABA platform
[Figure: shares of dbf, overhead and process for each CPU count.]

Performance evaluation: parallelisation efficiency on a TLM-DT platform
[Figure: shares of dbf, overhead and process for each CPU count.]

67 Performance evaluation: variation of decoding time on a CABA platform
[Figure: decoding time for each CPU count.]

Performance evaluation: variation of decoding time on a TLM-DT platform
[Figure: decoding time for each CPU count.]

68 Comparison CABA, TLM-DT
Processor cycles: constant imprecision, independent of the number of processors; temporal imprecision of approximately 70%. This imprecision is mainly due to the tlmdt xcache, which still needs some fine tuning. Both CABA and TLM-DT reflect the same behaviour when testing different configurations.
Number of memory accesses: difference below 10%.
Simulation time: in TLM-DT, the acceleration factor compared to CABA is at least 10.

How can we use SoCLib
Training of architects and designers: modelling techniques on multiprocessor architectures.
Software development: development of the software stack early in the design flow; functional validation.
Application debug: check memory consistency; use the advanced debug tool packages provided with the platform.
Architecture exploration: possibility to change the configuration very easily; possibility to run different policies and study their impact on the system.

69 Demonstration
Parallel execution of a CABA and a TLM-DT platform. Test different configurations: processor number, image format. A few words about gdbserver.


71 Virtual prototyping of NoC based MP-SoCs: Prototyping of a scalable, coherent, shared memory, multi-core architecture
Nam NGUYEN, Eric GUTHMULLER
Eric Guthmuller, NOC 2010 Tutorial / Virtual Prototyping of NoC-based MP-SoCs / CABA modeling

Outline
The TSAR project; SoCLib use case; performed experiments; RTL development perspective; conclusion.

72 The TSAR Project
TSAR is a Medea+ project that started in June. One central goal of this project is to define and implement a shared-memory, cache-coherent, many-core architecture. The project leader is Bull.
Industrial partners: ACE (Netherlands); Bull S.A. (France); Compaan (Netherlands); NXP Semiconductors (Netherlands); Philips Medical Systems (Netherlands); Thales Communications (France).
Academic partners: Université Paris 6 (LIP6); Technical University Delft (CEL); University Leiden (LIACS).

Main Technical Choices
Processor: TSAR is an MP2-SoC (Massively Parallel, Multi-Processor System on Chip), containing up to 4096 processor cores. The architecture is independent of the selected processor core: it could be SPARC V8, MIPS32, PPC 405, ARM, etc.
Interconnect: the TSAR architecture is clusterized, with a two-level interconnect. It will use a proven Network on Chip implementing the VCI/OCP standard: DSPIN.
Memory: the TSAR architecture will support a (NUMA) shared address space. The memory is logically shared but physically distributed, with cache coherence enforced by hardware, using a directory-based protocol.

73 Clusterized Architecture
[Figure: four clusters, each containing CPUs (with FPU, L1 I and L1 D caches), a memory cache, Timer, DMA and ICU on a VCI/OCP local interconnect. Each cluster is connected through a NIC to the DSPIN micro-network, together with I/O controllers and external RAM controllers.]

A regular 2D mesh topology
[Figure: 4x4 mesh of clusters, with one clock per cluster (ck00 to ck33).]
TSAR will use the DSPIN Network on Chip technology developed by LIP6. Each cluster is implemented in a different clock domain. Inter-cluster communications use bi-synchronous (or fully asynchronous) FIFOs. => Clock frequency and voltage can be independently adjusted in each cluster.

74 A NUMA architecture
TSAR is a NUMA (Non Uniform Memory Access) architecture.
[Figure: clusters of CPUs and memory (MEM) on local interconnects, linked by the DSPIN micro-network and I/O; memory accesses are local, remote or external.]
The physical memory space is statically distributed amongst the clusters. The operating system must precisely control the mapping of tasks on the processors and of software objects on the memory caches.

75 SoCLib Use In TSAR
SystemC CABA models were developed on top of the existing SoCLib infrastructure. Benefits from existing models: interconnects; peripherals; processors. Rapid prototyping => design space exploration.

Debugging Facility
Software: GDB server (independent of the processor ISS). Memory checker: tracks potentially wrong memory accesses => a "Valgrind-like" tool for any processor ISS => needs support from the software side (OS).
Hardware: Vci Logger: displays all the VCI traffic on a given VCI port. Vci Simhelper: allows the running software to stop the simulation when an error has been detected => to be used in automated tests.

76 Operating Systems
A handful of OSes already run on SoCLib and therefore run on TSAR without much work. Three of these OSes have been successfully tested on various TSAR platforms:
NetBSD has been successfully tested on a monoprocessor TSAR platform with support for virtual memory.
MutekH has been successfully tested on TSAR platforms with up to 64 processors (16 clusters).
AlmOs has been successfully tested on TSAR platforms with up to 16 processors (4 clusters).

77 TSAR Versions
There are 2 main versions of the TSAR architecture.
TSAR V0: first implementation of the DHCCP protocol. Copies are represented by a vector of bits (one bit per processor). The bit vector limits the number of processors to 32.
TSAR V1: copies are represented by a chained list stored in a single memory, which is common to all cache lines in the Memory Cache. Support for up to 4096 processors (VCI IDs on 12 bits).

Splash benchmarks
Splash FFT (65536 points, 256 in each dimension). Speedup from 1 to 16 processors (TSAR V1), constant data set.
[Figure: measured speedup versus ideal speedup.]
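The two ways of recording which processors hold a copy of a cache line, as described for TSAR V0 and V1, can be compared with a small sketch. It is a Python illustration only: both classes and their methods are hypothetical, the DHCCP protocol itself is not modeled, and the heap of list entries is a simplified stand-in for the single memory shared by all cache lines.

```python
class BitVectorDirectory:
    """V0-style record: one bit per processor, hence at most 32 processors."""
    MAX_CPUS = 32

    def __init__(self):
        self.bits = {}                       # cache line -> 32-bit mask

    def add_copy(self, line: int, cpu: int) -> None:
        if not 0 <= cpu < self.MAX_CPUS:
            raise ValueError("bit vector supports processors 0..31 only")
        self.bits[line] = self.bits.get(line, 0) | (1 << cpu)

    def copies(self, line: int) -> list:
        mask = self.bits.get(line, 0)
        return [cpu for cpu in range(self.MAX_CPUS) if (mask >> cpu) & 1]


class ChainedListDirectory:
    """V1-style record: linked lists stored in one heap shared by all lines."""
    MAX_CPUS = 4096                          # 12-bit identifiers

    def __init__(self, heap_size: int):
        self.heap = [None] * heap_size       # entries are (cpu, next_index)
        self.free = list(range(heap_size))   # indices of unused entries
        self.head = {}                       # cache line -> first entry index

    def add_copy(self, line: int, cpu: int) -> None:
        if not 0 <= cpu < self.MAX_CPUS:
            raise ValueError("processor identifier must fit in 12 bits")
        if not self.free:
            raise MemoryError("shared heap is full")
        index = self.free.pop()
        self.heap[index] = (cpu, self.head.get(line))   # push at list head
        self.head[line] = index

    def copies(self, line: int) -> list:
        found, index = [], self.head.get(line)
        while index is not None:
            cpu, index = self.heap[index]
            found.append(cpu)
        return sorted(found)


v0 = BitVectorDirectory()
v0.add_copy(0x40, 3)
v0.add_copy(0x40, 31)
v1 = ChainedListDirectory(heap_size=8)
v1.add_copy(0x40, 5)
v1.add_copy(0x40, 4095)
```

The trade-off visible here is storage: the bit vector costs a fixed number of bits per line regardless of sharing, while the chained list only consumes heap entries for copies that actually exist, at the price of a bounded shared heap.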

78 Splash benchmarks
Splash FFT (4096 points for 1 cpu, for 4 cpus, for 16 cpus). Speedup from 1 to 16 processors, with (data set size / number of CPUs) constant.
[Figure: measured speedup versus ideal speedup.]

Parallelization of the simulation
TSAR platforms are really big: on a 2.6 GHz Xeon CPU, a 16-processor platform simulates at about 30 kHz with classical SystemCASS. Running big benchmarks can take hours.
A parallelized version of SystemCASS has been developed. Mealy generation functions should be prohibited because they are not run in parallel. Shared variables between threads have to be avoided => frontiers have to be chosen with caution: 1 thread per cluster.
With 1 thread per cluster, the speedup is linear: a speedup of N with N clusters (tested with 4 clusters so far).

79 Cosimulation SystemC/VHDL
CABA models are designed to be very close to real hardware. A simple way to develop VHDL models is to use cosimulation: you put VHDL/Verilog models in a SystemC platform. The SoCLib infrastructure is able to compile a mixed platform with SystemC and VHDL/Verilog models for the ModelSim simulator. Each VHDL/Verilog model can be developed separately by replacing only the corresponding SystemC model in the platform.

80 Fast RTL debug
If you have already written a SystemC model, you should not have to re-debug the functionality of your model. What you want is for your RTL model to match your SystemC model cycle-accurately.
For that purpose, wrappers have been written to debug the RTL of the Memory Cache => these wrappers instantiate the SystemC and the RTL models, send them the same inputs and compare their outputs.
But comparing only the outputs is not sufficient (a bug can show up on the outputs very late), so we also compare the internal FSM states => the debug time of the RTL is considerably reduced.

Conclusion
The SoCLib virtual prototyping platform was very useful for quantitative analysis of various cache coherence protocols.
The cycle-accurate SystemC modeling was mandatory to design (and debug) the TSAR critical hardware components, such as the L1 cache controller and the memory cache.
The RTL design (synthesizable VHDL models) has been strongly simplified (and accelerated) by the SystemC virtual prototyping.
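The comparison wrapper described under "Fast RTL debug" can be sketched as a lockstep loop. The code below is a Python illustration, not the actual SystemC/VHDL wrapper: `lockstep_compare` and the `ToyModel` state machine are hypothetical, and the "RTL" side is simply a second Python object with an injected bug (it skips the WAIT state). It shows why comparing FSM states localizes a bug earlier than comparing outputs only.

```python
class ToyModel:
    """Hypothetical 3-state machine: IDLE -> WAIT -> BUSY -> IDLE (ack)."""

    def __init__(self, skip_wait: bool = False):
        self.state = "IDLE"
        self.skip_wait = skip_wait     # injected bug: jump straight to BUSY

    def step(self, req: bool):
        """Advance one cycle; return (ack output, internal FSM state)."""
        ack = False
        if self.state == "IDLE":
            if req:
                self.state = "BUSY" if self.skip_wait else "WAIT"
        elif self.state == "WAIT":
            self.state = "BUSY"
        else:                          # BUSY
            self.state = "IDLE"
            ack = True
        return ack, self.state


def lockstep_compare(reference, candidate, stimuli, check_state=True):
    """Feed both models the same inputs; return the first mismatching cycle,
    or None if they agree on every cycle."""
    for cycle, req in enumerate(stimuli):
        ref_out, ref_state = reference.step(req)
        dut_out, dut_state = candidate.step(req)
        if ref_out != dut_out:
            return cycle
        if check_state and ref_state != dut_state:
            return cycle
    return None


stimuli = [True, False, False, False]
with_states = lockstep_compare(ToyModel(), ToyModel(skip_wait=True), stimuli)
outputs_only = lockstep_compare(ToyModel(), ToyModel(skip_wait=True), stimuli,
                                check_state=False)
```

In this toy case the state comparison flags cycle 0, where the bug is actually introduced, while the output-only comparison reports cycle 1, where the bug first becomes visible on `ack`.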


82 Prototyping a hardware-based page migration mechanism on NoC-based shared memory multicore architectures
Pierre Guironnet de Massas and Frédéric Pétrot, SLS Group, TIMA Laboratory, 46, Av Félix Viallet, Grenoble, France. SoCLib tutorial, 3/4/2010.

Integrated architecture evolution
Continuous increase in integration capabilities. Parallelism makes it possible to exploit VLSI integration advances. Systems have to support many standards and application classes. More and more processors (128 expected).
[Figure: number of processing engines per year (source: ITRS System Drivers 2007).]

83 Integrated architecture evolution
Heterogeneous architectures: complex to program; many different software stacks; not tolerant to process variability.
[Figure: heterogeneous SoC with DSPs running firmware, GPPs running HAL, drivers, OS, middleware and application, an ASIC, and PEs running firmware.]
Homogeneous architectures: coherent shared memory architectures; a single software stack; tolerant to process variability; many decisions taken by software and hardware to make programming easy.
[Figure: homogeneous SoC with GPPs and a hardware accelerator under a single stack: Hardware Abstraction Layer, drivers, OS, application.]

Outline
1 Introduction; 2 Issues & Context; 3 Memory, data placement and migration; 4 Data migration and ideal data placement; 5 Solution and implementation; 6 Conclusion.

84 Data access: A key point
Efficiency means: 1. low latency; 2. high bandwidth; 3. low energy consumption; 4. not visible from software.
Average Memory-Access Time: $\mathrm{AMAT} = t_{hit} + r_{miss} \times p_{miss}$, with $t_{hit}$ the hit time, $r_{miss}$ the miss rate and $p_{miss}$ the miss penalty.
Goal: reduce $p_{miss}$ by dynamic data placement.
[Figure: a GPP and its cache with on-chip SRAM, on-board DRAM and memories behind a serial link; the indicated latencies are ~5, ~20, ~100 and ~1000 cycles.]

Context: Architectural template
Simple pipelined processor with instruction and data caches. Large amount of on-chip memory. NoC: backbone for massive processor parallelism. Directory-based memory coherency. Logically shared and physically distributed memory.
[Figure: chip with GPP, cache, SRAM and NI nodes on an interconnection network.]

85 Distributed Architectures
Characteristics: distributed on chips, boards, machines; NUMA; distributed shared memory.
[Figure: chip with GPP, cache, SRAM and NI.]

Distributed Architectures: Existing solutions
Data placement and migration: intelligent memory allocator; OS-based page migration using the MMU and ad hoc hardware; first-touch (or second-touch) placement.
Data placement problems: where to place the shared data? What happens to data when tasks migrate? Non-optimized data placement; migration cost seldom profitable; low system reactivity.

86 Integrated Architectures: Existing solutions
Characteristics: integrated within a SoC; L2 caches; no embedded memory.
[Figure: chip made of nodes (GPP, L1 cache, L2, NI).]
Integrated Architectures: Existing solutions
L2 cache management
Shared L2: cache capacity maximisation; minimizes number of off-chip misses; high access latency.
Distributed L2: suboptimal cache capacity (copies); increases number of off-chip misses; lower latency of accesses (hit).

87 Integrated Architectures: Existing solutions
Cooperative L2 caches: steal blocks in the neighbouring caches.
Advantages: reactivity; transparency of use.
Drawbacks: data movement; data duplication.
Proposal
Addressable memory instead of L2 caches. Combines the advantages of the other approaches:
Not visible to software at all.
Hardware implementation: efficiency, reactivity.
Addressable memory: maximal on-chip capacity.

88 Data sharing and placement
Solution independent of the type of accessed data: shared memory, local stacks; tasks, OS code. Used through all software layers.
[Figure: GPPs with caches running tasks Ti that share code (0xbfc02000 <printf>, 0xbfc0320c <scanf>, 0xbfc04284 <fprint>) and data (array elements A[0] to A[3]); one task executes printf("hello\n"); A[0] = 1; another executes A[1] = 5; printf("task1\n");]
Data sharing and placement
Data placement and CPI
CPI: measure of execution speed. CPI = f(AMAT) = f(t_hit + r_miss × p_miss)
Minimizing the miss penalty by moving data close to its usage.
The access cost depends on the number of accesses and on the latency.
The optimal placement of D minimizes the sum of the access costs.
[Figure: processors P0, P1, P2, ..., Pi accessing a datum D, with a candidate new location D'.]
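The statement that the optimal placement of D minimizes the sum of the access costs can be sketched as a small search over memory banks. This is only an illustration: the 4 x 4 mesh coordinates, the use of Manhattan hop distance as a stand-in for latency, and the access counts are all assumptions, and the function names are hypothetical.

```python
from itertools import product

def hops(a, b):
    """Manhattan distance between two mesh nodes (assumed latency proxy)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def placement_cost(bank, accesses):
    """Sum over processors of (number of accesses) x (distance to the bank)."""
    return sum(count * hops(proc, bank) for proc, count in accesses.items())

def best_bank(banks, accesses):
    """Bank that minimizes the total access cost."""
    return min(banks, key=lambda bank: placement_cost(bank, accesses))

# Hypothetical access counts to datum D, keyed by processor coordinates.
accesses = {(0, 0): 10, (3, 3): 90, (0, 3): 5}
banks = list(product(range(4), range(4)))   # one bank per node of a 4 x 4 mesh

target = best_bank(banks, accesses)
print(target, placement_cost(target, accesses))
```

In this example the datum ends up next to the processor that issues most of the accesses, which is the locality effect the slide describes.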

89 Moving data around
Suboptimal data placement. System state: high access cost; distributed accesses; no congestion points.
[Figure: mesh showing data access requests and active tasks, with an access cost scale (0 to MAX) and a congestion scale (0 to 1) marked with the threshold M_c.]
Problems. Suboptimal data placement means: high access cost (distance); high power consumption (distance); low overall system performance.
Moving data around
Optimal data placement. System state: low access cost thanks to data locality; access congestion appears in this zone.
[Figure: same mesh and scales (access cost 0 to MAX, congestion 0 to 1, threshold M_c) after the data has moved.]
Change in situation: data are closer; a hot spot occurs; performance drops drastically.

90 Moving data around
[Figure: placement of D before and after migration. Labels: "Theoretic optimal for data placement", "Best possible placement that does not incur congestion", locations D and D', markers M_e and M_c, congestion scale 0 to 1 with threshold M_c.]
Solution: stay under the congestion threshold. 1 Place data in a conservative manner. 2 Remove hot spots by migrating the often accessed data.
Proposed Architecture
[Figure: 2D-mesh of nodes, each with NI, crossbar, L1, processor P and memory M, plus a ring. Numbered elements 1 to 8 include the initiator NI, target NI, ring in and ring out ports, SRAM port, counters, node distance table, page tables, page table directory and migration trigger.]
Hardware add-ons: another interconnection network (ring); memory controller: access counters, page tables, etc.; cache controller: small TLB.
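The rule "best possible placement that does not incur congestion" can be sketched as the cost minimization restricted to banks that stay under the threshold M_c. Everything below is a hypothetical illustration: the normalized bank loads, the load added by the page, the hop-distance cost model and the function names are assumptions, not the mechanism implemented in the talk's hardware.

```python
def hops(a, b):
    """Manhattan distance between two mesh nodes (assumed cost proxy)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def placement_cost(bank, accesses):
    return sum(count * hops(proc, bank) for proc, count in accesses.items())

def place_under_threshold(bank_load, accesses, page_load, m_c=1.0):
    """Cheapest bank whose load stays at or below m_c once the page is added.

    bank_load maps bank coordinates to a normalized load in [0, 1].
    Returns None when every bank would exceed the threshold.
    """
    candidates = [b for b, load in bank_load.items() if load + page_load <= m_c]
    if not candidates:
        return None
    return min(candidates, key=lambda b: placement_cost(b, accesses))

# Hypothetical state: the bank closest to the main user is already busy.
accesses = {(0, 0): 10, (3, 3): 90, (0, 3): 5}
bank_load = {(3, 3): 0.9, (2, 3): 0.3, (0, 0): 0.1}

choice = place_under_threshold(bank_load, accesses, page_load=0.2)
print(choice)
```

Here the theoretically optimal bank (3, 3) is rejected because it would exceed the threshold, so the page goes to the nearest bank that remains below it.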

91 How to access the moved data
Addressing: physical address; hardware address translation; small TLB as translation cache; physical address cache; hardware page table.
[Figure, first animation step: four memory banks (RAM 0: 0x100 -> 0x130, RAM 10: 0x200 -> 0x230, RAM 20: 0x300 -> 0x330, RAM 30: 0x400 -> 0x430), each with a directory, data and an @src table of page addresses; Processor 1 with its Icache/Dcache issues a miss/uncached access to 0x104, and the cache line entry for 0x100 is marked INVAL.]
[Figure, second animation step: same platform; a response (rsp 0x320) comes back and the entry for 0x100 now holds 0x320.]
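The "small TLB as translation cache" idea can be sketched in software as a tiny cache in front of a page table. This is only a model under assumptions: the 0x10 page granularity and the 0x100 -> 0x320 mapping are read off the figure labels, the FIFO replacement policy is arbitrary, and the class and method names are hypothetical rather than SoCLib components.

```python
PAGE_SIZE = 0x10  # assumed granularity, matching the addresses in the figure

class SmallTLB:
    """Translation cache in front of a page table (original -> current page)."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table   # stands in for the hardware page table
        self.capacity = capacity
        self.entries = {}              # dicts keep insertion order (FIFO here)
        self.misses = 0

    def translate(self, addr):
        page = addr & ~(PAGE_SIZE - 1)
        offset = addr & (PAGE_SIZE - 1)
        if page not in self.entries:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                self.entries.pop(next(iter(self.entries)))  # evict oldest entry
            # Pages that never moved translate to themselves.
            self.entries[page] = self.page_table.get(page, page)
        return self.entries[page] + offset

    def invalidate(self, page):
        """Drop an entry, e.g. when the page migrates again."""
        self.entries.pop(page, None)

# Page 0x100 has migrated to 0x320, as the figure labels suggest.
tlb = SmallTLB({0x100: 0x320})
print(hex(tlb.translate(0x104)))   # first access misses in the TLB
print(hex(tlb.translate(0x108)))   # second access hits the cached translation
```

The `invalidate` method hints at why the talk lists TLB coherency as a challenge: every cached translation must be dropped or updated when its page moves.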

92 How to access the moved data (continued)
[Figure, third animation step: same platform as the previous steps; the entry for 0x100 holds 0x320, and the access by P1 to 0x104 is issued to 0x324.]
Migration decision
Based on per processor and per page access counters.
Periodic migration ❶: processor access counter reaches 128.
Congestion ❷: most accessed page migrates.
[Figure: n x p counters (n nodes, p pages), page counters, a "max id" pointer to the most accessed page, and a timer for the periodic counter reset; the node access cost (=1) is accumulated and compared with M_pe to trigger migration ❶, and with M_c to trigger migration ❷.]
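The two migration triggers can be sketched as a small software model of the per-node, per-page counters. The slide gives only the outline, so the exact semantics below are an interpretation: a per-node counter reaching the threshold (128 on the slide, called `m_pe` here) requests a migration toward that node, and a total access count above `m_c` requests migration of the most accessed page. The class, its methods and the value of `m_c` are hypothetical.

```python
class MigrationCounters:
    """n x p access counters for one memory bank (n nodes, p pages)."""

    def __init__(self, n_nodes, n_pages, m_pe=128, m_c=1000):
        self.counts = [[0] * n_pages for _ in range(n_nodes)]
        self.m_pe = m_pe      # per-node threshold (128 on the slide)
        self.m_c = m_c        # congestion threshold (assumed value)
        self.total = 0

    def hottest_page(self):
        """Index of the most accessed page over all nodes."""
        totals = [sum(column) for column in zip(*self.counts)]
        return max(range(len(totals)), key=totals.__getitem__)

    def record(self, node, page):
        """Count one access; return a migration request or None."""
        self.counts[node][page] += 1
        self.total += 1
        if self.counts[node][page] >= self.m_pe:
            return ("periodic", page, node)        # move page toward this node
        if self.total > self.m_c:
            return ("congestion", self.hottest_page(), None)
        return None

    def reset(self):
        """Periodic counter reset, driven by a timer in the slide's figure."""
        for row in self.counts:
            for page in range(len(row)):
                row[page] = 0
        self.total = 0

# Small thresholds so the example triggers quickly.
bank = MigrationCounters(n_nodes=2, n_pages=4, m_pe=3, m_c=100)
decisions = [bank.record(1, 2) for _ in range(3)]
print(decisions[-1])
```

A real controller would also have to choose the target bank for a congestion migration, which this sketch leaves open.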

93 Data transfer and target memory choice
Transfer mechanism characteristics: concurrent transfer of 2 pages on the ring; takes 2000 cycles (4 Kb pages); still possible to access the other pages of the banks; simultaneous transfer of two pages; arbitration of accesses.
[Figure: memory node with its port, ring in and ring out interfaces, and NI master and NI target connected to the NoC.]
Challenges: 1) TLB coherency; 2) deadlock avoidance.
Experiments and results
Platform: SoCLib CABA simulation; 16 MIPS processors; DSPIN NoC, 4 × 4 2D mesh.
New SoCLib components: memory: 8 concurrent finite state machines; caches: 5 concurrent finite state machines; synchronous ring interfaces.

94 Experiments and results
Applications: MJPEG: 6 tasks (4 processors); SPLASH-2: water nsquared, ocean (contiguous), fft.
Operating system: DNA: in-house, Posix threads, Redhat newlib (libc), 100Kb.
Two possible configurations: 1 AD: dynamic task migration, data allocated sequentially. 2 AS: statically pinned task and task's stack, shared data placed as efficiently as possible.
Experiments and results: Total time
[Figure: bar chart "Normalized execution time, 512 byte pages" (0% to 100%) for fft, ocean_c, water_ns and mjpeg; series AD + P_STD, AD + P_MIG, AS + P_STD, AS + P_MIG.]
Analysis: AD: 40% to 70% gain in execution time. AS: -5% to 15% gain in execution time. Makes AD similar to AS without any software work.

95 Experiments and results: Total hop count
[Figure: bar chart "Normalized accumulated hop count, 512 byte pages" (0% to 100%) for fft, ocean_c, water_ns and mjpeg; series AD + P_STD, AD + P_MIG, AS + P_STD, AS + P_MIG.]
Analysis (hop count is somewhat related to power consumption): AD: 90% to 75% gain in hop count. AS: -20% to 15% gain in hop count. Makes AD similar to AS without any software work.
Summary and Conclusion
Cycle accurate simulation is a necessity: to validate complex hardware concepts; to obtain trustworthy results; and it is easier/faster than HDLs.
SoCLib advantages: first available library of interoperable modules; used in academia for research and teaching; provides many existing components with sources; nice start for trying new stuff at IP/system level.
SoCLib drawbacks: still requires a lot of work, as designs are complex; learning curve; speed of simulation may be (is?) a problem.

96 Summary and Conclusion
However, the balance is largely positive!


More information

Tensilica Software Development Toolkit (SDK)

Tensilica Software Development Toolkit (SDK) Tensilica Datasheet Tensilica Software Development Toolkit (SDK) Quickly develop application code Features Cadence Tensilica Xtensa Xplorer Integrated Development Environment (IDE) with full graphical

More information

ZigBee Technology Overview

ZigBee Technology Overview ZigBee Technology Overview Presented by Silicon Laboratories Shaoxian Luo 1 EM351 & EM357 introduction EM358x Family introduction 2 EM351 & EM357 3 Ember ZigBee Platform Complete, ready for certification

More information

Electronic system-level development: Finding the right mix of solutions for the right mix of engineers.

Electronic system-level development: Finding the right mix of solutions for the right mix of engineers. Electronic system-level development: Finding the right mix of solutions for the right mix of engineers. Nowadays, System Engineers are placed in the centre of two antagonist flows: microelectronic systems

More information

Router Architectures

Router Architectures Router Architectures An overview of router architectures. Introduction What is a Packet Switch? Basic Architectural Components Some Example Packet Switches The Evolution of IP Routers 2 1 Router Components

More information

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit.

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit. Objectives The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Identify the components of the central processing unit and how they work together and interact with memory Describe how

More information

Networking Remote-Controlled Moving Image Monitoring System

Networking Remote-Controlled Moving Image Monitoring System Networking Remote-Controlled Moving Image Monitoring System First Prize Networking Remote-Controlled Moving Image Monitoring System Institution: Participants: Instructor: National Chung Hsing University

More information

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007 Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

More information

Multiprocessor System-on-Chip

Multiprocessor System-on-Chip http://www.artistembedded.org/fp6/ ARTIST Workshop at DATE 06 W4: Design Issues in Distributed, CommunicationCentric Systems Modelling Networked Embedded Systems: From MPSoC to Sensor Networks Jan Madsen

More information

Design and Implementation of the Heterogeneous Multikernel Operating System

Design and Implementation of the Heterogeneous Multikernel Operating System 223 Design and Implementation of the Heterogeneous Multikernel Operating System Yauhen KLIMIANKOU Department of Computer Systems and Networks, Belarusian State University of Informatics and Radioelectronics,

More information

BDTI Solution Certification TM : Benchmarking H.264 Video Decoder Hardware/Software Solutions

BDTI Solution Certification TM : Benchmarking H.264 Video Decoder Hardware/Software Solutions Insight, Analysis, and Advice on Signal Processing Technology BDTI Solution Certification TM : Benchmarking H.264 Video Decoder Hardware/Software Solutions Steve Ammon Berkeley Design Technology, Inc.

More information

What is LOG Storm and what is it useful for?

What is LOG Storm and what is it useful for? What is LOG Storm and what is it useful for? LOG Storm is a high-speed digital data logger used for recording and analyzing the activity from embedded electronic systems digital bus and data lines. It

More information

Multi-objective Design Space Exploration based on UML

Multi-objective Design Space Exploration based on UML Multi-objective Design Space Exploration based on UML Marcio F. da S. Oliveira, Eduardo W. Brião, Francisco A. Nascimento, Instituto de Informática, Universidade Federal do Rio Grande do Sul (UFRGS), Brazil

More information

Example of Standard API

Example of Standard API 16 Example of Standard API System Call Implementation Typically, a number associated with each system call System call interface maintains a table indexed according to these numbers The system call interface

More information

- Nishad Nerurkar. - Aniket Mhatre

- Nishad Nerurkar. - Aniket Mhatre - Nishad Nerurkar - Aniket Mhatre Single Chip Cloud Computer is a project developed by Intel. It was developed by Intel Lab Bangalore, Intel Lab America and Intel Lab Germany. It is part of a larger project,

More information

SystemC - NS-2 Co-simulation using HSN

SystemC - NS-2 Co-simulation using HSN Verona, 12/09/2005 SystemC - NS-2 Co-simulation using HSN Giovanni Perbellini 1 REQUIRED BACKGROUND 2 2 GOAL 2 3 WHAT IS SYSTEMC? 2 4 WHAT IS NS-2? 2 5 WHAT IS HSN? 3 6 HARDWARE-NETWORK COSIMULATION 3

More information

AMD Opteron Quad-Core

AMD Opteron Quad-Core AMD Opteron Quad-Core a brief overview Daniele Magliozzi Politecnico di Milano Opteron Memory Architecture native quad-core design (four cores on a single die for more efficient data sharing) enhanced

More information

Architectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng

Architectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng Architectural Level Power Consumption of Network Presenter: YUAN Zheng Why Architectural Low Power Design? High-speed and large volume communication among different parts on a chip Problem: Power consumption

More information

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Seeking Opportunities for Hardware Acceleration in Big Data Analytics Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who

More information

Chapter 11 I/O Management and Disk Scheduling

Chapter 11 I/O Management and Disk Scheduling Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 11 I/O Management and Disk Scheduling Dave Bremer Otago Polytechnic, NZ 2008, Prentice Hall I/O Devices Roadmap Organization

More information

Operating Systems: Basic Concepts and History

Operating Systems: Basic Concepts and History Introduction to Operating Systems Operating Systems: Basic Concepts and History An operating system is the interface between the user and the architecture. User Applications Operating System Hardware Virtual

More information

Resource Utilization of Middleware Components in Embedded Systems

Resource Utilization of Middleware Components in Embedded Systems Resource Utilization of Middleware Components in Embedded Systems 3 Introduction System memory, CPU, and network resources are critical to the operation and performance of any software system. These system

More information

Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck

Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck Sockets vs. RDMA Interface over 1-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck Pavan Balaji Hemal V. Shah D. K. Panda Network Based Computing Lab Computer Science and Engineering

More information

Motivation: Smartphone Market

Motivation: Smartphone Market Motivation: Smartphone Market Smartphone Systems External Display Device Display Smartphone Systems Smartphone-like system Main Camera Front-facing Camera Central Processing Unit Device Display Graphics

More information

Performance Comparison of RTOS

Performance Comparison of RTOS Performance Comparison of RTOS Shahmil Merchant, Kalpen Dedhia Dept Of Computer Science. Columbia University Abstract: Embedded systems are becoming an integral part of commercial products today. Mobile

More information

Review from last time. CS 537 Lecture 3 OS Structure. OS structure. What you should learn from this lecture

Review from last time. CS 537 Lecture 3 OS Structure. OS structure. What you should learn from this lecture Review from last time CS 537 Lecture 3 OS Structure What HW structures are used by the OS? What is a system call? Michael Swift Remzi Arpaci-Dussea, Michael Swift 1 Remzi Arpaci-Dussea, Michael Swift 2

More information