Hybrid Simulation Framework for Virtual Prototyping Using OVP, SystemC & SCML


Hybrid Simulation Framework for Virtual Prototyping Using OVP, SystemC & SCML: A Feasibility Study
PRIYA AGRAWAL
VLSI DESIGN TOOLS & TECHNOLOGY
INDIAN INSTITUTE OF TECHNOLOGY, DELHI
2009

Hybrid Simulation Framework for Virtual Prototyping Using OVP, SystemC & SCML: A Feasibility Study
A thesis submitted in partial fulfilment of requirements for the degree of MASTER OF TECHNOLOGY in VLSI DESIGN TOOLS & TECHNOLOGY
by Priya Agrawal (2007JVL2170)
Under the guidance of Prof. Anshul Kumar and Mr. Desingh Devibalan B (NXP Semiconductors)
VLSI DESIGN TOOLS & TECHNOLOGY
Indian Institute of Technology, Delhi
2009

CERTIFICATE

This is to certify that the thesis titled "Hybrid Simulation Framework for Virtual Prototyping Using OVP, SystemC and SCML: A Feasibility Study", being submitted by Priya Agrawal to the Indian Institute of Technology, Delhi for the award of the degree of Master of Technology in VLSI Design Tools and Technology, is a bona fide work carried out by her under our supervision and guidance. The research reports and the results presented in the thesis have not been submitted in part or in full to any other University or Institute for the award of any degree or diploma.

Dr. Anshul Kumar, Professor, Department of Computer Science & Engg., Indian Institute of Technology Delhi
Desingh Devibalan B, Technical Leader, CTO & IC Design Cluster, NXP Semiconductors, Bangalore

ACKNOWLEDGEMENT

I would like to express my heartfelt thanks to Professor Anshul Kumar, Department of Computer Science and Engineering, IIT Delhi, my academic guide, for overall motivation, support and guidance during this project. I would like to sincerely thank my supervisor Desingh Devibalan B for providing me with such a challenging project to work on. His constant guidance and invaluable suggestions throughout the project and his critical approach to problems have led to the successful completion of this project. Furthermore, I am thankful to Duncan Graham, Lee Moore and Larry Lapides (Imperas), and to Raghunandan Balasubramaniam, Chandrashekhar and Mischa Jonker (NXP Semiconductors) for their support and encouragement during this project. It has been a very enlightening and enjoyable experience to work with them. I wish to express my great thanks to my family members, who supported me in all my endeavors during the thesis work. Finally, I owe many thanks to my colleagues and friends for making my stay at IIT Delhi and NXP Semiconductors, Bangalore memorable.

Priya Agrawal
M. Tech (VLSI Design Tools and Technology), IIT Delhi

ABSTRACT

The increasing cost and effort of software development and the shrinking turnaround time for multiprocessor SoCs have made designers strive for fast virtual prototyping solutions capable of simulating the system at speeds of several hundred MIPS. Several fast prototyping solutions are offered by ESL vendors worldwide. Open Virtual Platforms (OVP) enables simulation of embedded systems running real application code. This project explores this new technology and its interoperability with existing TLM based SystemC platforms. The present work describes the technology and the experiments carried out to check its simulation performance and the possibility of hybrid simulation with SCML. The experiments comparing the simulation speed of OVP with an existing proprietary prototyping solution, and the hybrid simulation experiments, reveal some important observations, which are also reported.

Table of Contents

1. INTRODUCTION
   Overview
   Motivation
   Organization
2. BACKGROUND
   Need of Prototyping
   SystemC / Transaction Level Modeling
   System Simulators
3. OPEN VIRTUAL PLATFORMS
   Introduction
   OVP APIs
      Innovative CPU Manager (ICM)
      Virtual Machine Interface (VMI)
      Behavioral Hardware Modeling (BHM) and Peripheral Programming Model (PPM)
   OVP models
   OVPSim The OVP Simulator
   Additional Features of OVP
   WHY is OVP fast?
   Hybrid Simulation support for OVP
   Approach to TLM2.0
   Details of OVP inside TLM2.0
4. SIMULATION PERFORMANCE EXPERIMENTS
   Application under test - JPEG Decoder
   Application Task Graph Mapping for Dual Core Platform
   Single Core Platform
   Backdoor Mode Simulations
   Dual Core Platform
5. RESULTS AND ANALYSIS
   Single Core Platform
   Dual Core Platform
6. EXPERIMENTATION FOR HYBRID SIMULATION WITH SCML
   Proposed Wrapper for Hybrid Simulation of OVP, SystemC and SCML
   Initial Experimentation
   Integrating SCML modeled SystemC TLM peripheral
   Important Observations
   Proposed Solutions
7. CONCLUSIONS
   Summary
   Future Scope
REFERENCES

List of Figures

- OVP Interfaces
- Wrapper for Processor Model
- Processor Wrapper Implementation
- JPEG Decoder
- Dual Core System Architecture
- Partitioning for JPEG Decoder
- Single Core Platform
- Backdoor Memory Access
- Dual Core Platform
- Input Image
- Time variation with Nominal MIPs for single core platform
- Time variation with Quantum size for single core platform
- Time variation with Nominal MIPs for dual core platform
- Time variation with Quantum size for dual core platform
- Hybrid OVP/SCML Simulation
- Inter Processor Communication Block
- Interrupt Driven Dual Core System
- Dual core inter processor communication flow

List of Tables

- System Load distribution for JPEG Decoder
- Speed Comparison for Single Core System
- Speed Comparison for Dual Core System
- Simulation Statistics for IPC Based System

Chapter 1
INTRODUCTION

1.1 Overview

Today's embedded systems require verification that the combination of hardware and software matches the expected functionality and performance. The turnaround time available for a design project decreases every year. To design and verify prototypes of large systems, fast simulation is essential. This project investigates the feasibility of adopting a virtual prototyping technology based on binary translation to improve the simulation speed of software verification. On March 07, 2008, Imperas announced the release of a virtual platform and modeling technology that enables simulation of embedded systems running real application code. This technology is called Open Virtual Platforms (OVP). This project explores the technology provided by Imperas.

1.2 Motivation

Virtual platforms (VP) have been used for some years to develop, analyze, optimize and validate system-level hardware architecture. Today's prototype offerings are architected for single-core SoCs and do not scale to a large number of embedded processors, specifically in terms of simulation speed and debugging usability. Imperas provides multi-processor (MP) virtual prototyping, simulation, and debugging. Virtual prototypes built with Imperas tools are reported to simulate at speeds of hundreds of MIPS on desktop PCs. They are instruction accurate and model the whole system. OVP and its APIs help foster model interoperability, which is vitally needed in electronic system level (ESL) design. [1][2][3] As the virtual platform solutions offered by Imperas seem promising, we analyze the feasibility of stitching OVP processor models together with other peripherals modeled in SystemC/Open SCML. The key objective is to create a proof-of-concept platform to demonstrate this hybrid simulation framework for virtual platforms.
We have also benchmarked the simulation performance of OVP for single-core and multi-core platforms against the simulation framework provided by one of the industry-leading ESL vendors. This hybrid simulation framework should open new avenues in the simulation of complex SoC platforms built from IPs supplied by various ESL/IP vendors in SystemC/SCML (e.g. CoWare, ARM), testing the true interoperability defined by the TLM2.0 standard. It should thus reduce the engineering effort in creating high-speed multi-core virtual platforms for early software development to meet tight time-to-market windows.

1.3 Organization

The work is organized as follows. Chapter 2 presents the necessary background and discusses some existing prototyping solutions. Chapter 3 describes the details and components of the Open Virtual Platforms technology, including the hybrid simulation support provided to integrate OVP models in a SystemC environment. Chapter 4 contains a detailed description of the platforms constructed in OVP and in the proprietary modeling environment; the corresponding simulation statistics are presented in Chapter 5. The proposed wrapper for hybrid simulation of OVP with SCML, the simulation experiments and the resulting observations are discussed in Chapter 6. Finally, Chapter 7 summarizes the experimentation and gives suggestions for future exploration.

Chapter 2
BACKGROUND

2.1 Need of Prototyping

With the increasing complexity and integration in SoCs, software development costs are rising sharply. A simulation environment is necessary to simulate the system under design so that software developers can test the software and hardware developers can investigate design alternatives. Traditionally, techniques like FPGA prototyping and emulation have been proposed for software validation. [4] However, these solutions become available too late (only once the RTL is available) and significantly impact the design cycle. With software development determining project success or failure, modularity and fast prototyping have become important aspects of a simulation framework. The new SystemC and TLM based approaches to system level modeling help provide fast prototyping solutions.

2.2 SystemC / Transaction Level Modeling

SystemC supports modeling of complex hardware systems at different abstraction levels, with hierarchical components, as it is built on C++. The achievable simulation speed depends on the level of model abstraction, which also determines the platform's accuracy. SystemC has always been intended to support actual embedded software development, but it has not possessed all the technology components necessary to fully enable it. Within the Transaction-Level Modeling (TLM) working group of OSCI, several different abstraction levels have been introduced which enable faster simulations. [5][6] The transaction mechanism allows a process of an initiator module to call methods exported by a target module. TLM modules can thus communicate with very little synchronization code, which significantly reduces the communication overhead in modeling SoCs. The draft of the TLM2.0 standard introduces new transaction abstractions so that platform components can communicate and be interoperable [7]. The use of the default tlm_generic_payload transaction type enables this.

The further improvements provided by TLM2.0 that result in faster simulation of models are as follows:

1. Direct Memory Interface (DMI): This allows direct backdoor access to memory and thus uninhibited instruction set simulator execution, as the transport call does not actually go over the bus, avoiding any bus conflicts.
2. Loosely Timed modeling: The model carries little or no timing annotation. This is a speed-accuracy tradeoff.
3. Temporal Decoupling: The models can have their own local clock, which synchronizes with the SystemC global clock only at suitable synchronization points. This speeds up simulation of multi-core platforms.

Much work has been done on embedded software generation from transaction level descriptions. Some examples are discussed in the following section.

2.3 System Simulators

Full system simulation makes it possible to run the exact binary embedded software, including the operating system, on a totally simulated hardware platform. Simulation environments thus need to support full system simulation and must use appropriate hardware modeling techniques. Moreover, the simulations should be fast to enable early software development. The most challenging part of enhancing simulation speed is the simulation of the processors. Processor simulation is achieved with Instruction Set Simulation (ISS) [8]. Instruction set simulators can be:

- Interpretive ISS
- Static compiled ISS
- Dynamic compiled ISS

In the past decade, many ISSs have adopted dynamic translation technology [9]. The binary target code to be executed is dynamically translated into an executable representation. There are typically two variants of dynamic translation technology:

1. The target code is translated directly into machine code for the simulation host.
2. The target code is translated into an intermediate representation that can be executed at high speed.
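The re-use of translated code that makes dynamic translation attractive can be illustrated with a small, self-contained C++ sketch. It is not taken from any of the simulators discussed in this chapter: the toy instruction set (`TargetInsn`), the `TranslationCache` class and the fixed block length are assumptions made purely for illustration, and the "translation" produces a C++ closure rather than host machine code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical toy target ISA: every instruction adds an immediate to an accumulator.
struct TargetInsn { int32_t addImmediate; };

// A "translated block" is modelled as a host-callable closure.
// A real dynamic translator would emit host machine code or an IR instead.
using HostBlock = std::function<void(int64_t&)>;

class TranslationCache {
public:
    TranslationCache(std::vector<TargetInsn> program, std::size_t blockLen)
        : program_(std::move(program)), blockLen_(blockLen) {
        assert(blockLen_ > 0);
    }

    // Execute the block starting at target address 'pc'.
    // The block is translated only the first time it is encountered.
    void run(std::size_t pc, int64_t& acc) {
        auto it = cache_.find(pc);
        if (it == cache_.end()) {
            it = cache_.emplace(pc, translate(pc)).first;
            ++translations_;
        }
        it->second(acc);
        ++executions_;
    }

    std::size_t translations() const { return translations_; }
    std::size_t executions() const { return executions_; }

private:
    // "Translate" a block: do the decoding work once, up front,
    // and return a closure that only performs the remaining run-time work.
    HostBlock translate(std::size_t pc) const {
        int64_t sum = 0;
        for (std::size_t i = pc; i < pc + blockLen_ && i < program_.size(); ++i)
            sum += program_[i].addImmediate;
        return [sum](int64_t& acc) { acc += sum; };
    }

    std::vector<TargetInsn> program_;
    std::size_t blockLen_;
    std::unordered_map<std::size_t, HostBlock> cache_;
    std::size_t translations_ = 0;
    std::size_t executions_ = 0;
};
```

Calling `run` repeatedly on the same target address translates the block once and then only re-executes the cached closure, so the translation cost is spread over all later executions of that block.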

Dynamic translation introduces a compile phase as part of the overall simulation time. However, as the resulting code is re-used, the compilation time is amortized over the run. Several simulators have been designed based on dynamic translation. SimSoC demonstrates an integrated simulation framework relying only upon SystemC and transaction-level modeling [10]. Its ISS uses dynamic binary translation of the second kind described above. The speeds reported are lower than those achieved using binary translation to host machine code. Moreover, the solution does not scale well to multi-core platforms, as it makes heavy use of the time-costly wait() call for simulating cores executing in parallel. If wait() is instead called only after a large number of instructions to avoid simulation overhead, the simulation is no longer faithful enough. Virtual machines such as QEMU and GXEMUL [11] also emulate, to a large extent, the behavior of a particular hardware platform. QEMU uses a form of dynamic translation based on technique 1. Although QEMU and GXEMUL include many device models as open-source C code, these models lack interoperability. Besides, QEMU only supports simulation of fixed, predefined single-processor platforms. Several providers of virtual platform technology have also come up with their own platform-driven electronic system-level (ESL) design solutions which promise high simulation speed and accuracy. Technologies such as those from Virtio, Simics from Virtutech, Platform Architect from CoWare and DesignWare from Synopsys are a few examples. [12] However, all of them are proprietary modeling solutions. Imperas, on the other hand, provides Open Virtual Platforms, an infrastructure technology which is open source and free, focused on multi-core platform development and high simulation speed for embedded software development.
The following chapter discusses the basics of the OVP technology, its core components, its significant features and the extensions that enable it to work in SystemC platforms with TLM2.0.

Chapter 3
OPEN VIRTUAL PLATFORMS

3.1 Introduction

Imperas announced the formation of the Open Virtual Platforms alliance, or OVP, and seeded it with some of its technology serving the market requirements. This includes programming models, verification/debug/analysis tools, and simulation platforms. The interfaces provided by Imperas address the model interoperability problem. The key point is that Imperas has made its technology public. OVP has three main components [13]:

1. The OVP APIs that enable C models to be written.
2. A collection of open source processor and peripheral models.
3. OVPsim, a simulator that executes these models.

3.2 OVP APIs

To model an embedded system, several main items have to be modeled: platforms, processors, peripherals and the environment. The platform connects and configures the behavioral components. The processors fetch and execute object code instructions from the memories, and the peripherals model the components and environment that the operating system and application software interact with. OVP is thus made of four interfaces:

- Innovative CPU Manager
- Virtual Machine Interface
- Behavioral Hardware Modeling Interface
- Peripheral Programming Model Interface

The combination of these interfaces makes up the complete platform. The interaction between these interfaces is shown in figure 3.1.

Innovative CPU Manager (ICM)

The ICM is a C API used to create the platform netlist of the design/system for use with the OVPsim simulator. It allows instantiation of multiple processors, buses, memories and peripherals, which can then be connected together, and application program executables can be loaded into the simulated memories. [14]

Figure 3.1 OVP Interfaces

Virtual Machine Interface (VMI)

VMI is the C based processor interface, allowing the processor model to communicate with the simulation kernel and the other components of the system. VMI is at the core of the high performance execution provided by OVP. Processors in OVP use a code morphing approach coupled with a just-in-time (JIT) compiler to map the processor instructions onto those provided by the host machine. In between is a set of optimized opcodes onto which the processor operations are mapped, and OVPsim provides fast and efficient interpretation or compilation into the native machine capabilities. Some of the capabilities of VMI are listed below [15]:

1. VMI allows a form of virtualization for capabilities such as file I/O. This allows direct execution on the host using the standard libraries provided.
2. VMI makes it possible to encapsulate existing ISS models within OVPsim, provided that they export some basic features (for example, the existing ISS model should be available as a shared object, provide an API to allow it to be run instruction-by-instruction or for a number of instructions, and provide an API allowing memory to be modeled externally).
3. VMI enables modeling of the mode dependent behavior (kernel/user mode) of an instruction. Using the VMI, OVPsim can implement arbitrary multiprocessor systems.
4. The VMI can be used for both RISC and CISC processors. Any instruction format can be supported.

5. VMI also allows modeling of L2 caches and other extensions around the processor.

Behavioral Hardware Modeling (BHM) and Peripheral Programming Model (PPM)

These APIs are used to write behavioral models of hardware/software systems that are peripheral to the processors in the platform being developed. Each instance of a peripheral model runs on its own virtual machine with an address space large enough for the model. This virtual processor and its memory are separate from any processors, memories and buses in the platform being simulated; they exist only to execute the code of the peripheral model. This processor is called a Peripheral Simulation Engine, or PSE for short [16]. The difference between BHM and PPM is as follows.

BHM: This API gives access to
- Behavioral modeling processes (threads)
- Simulated delays
- Events
- Diagnostic control and the simulator message stream

This API can support more general forms of communication and provides the piece that TLM is missing.

PPM: This API gives access to
- Connectivity of peripherals in platforms
- Creation and control of:
   o Ports and nets
   o Address spaces
   o Windows into memory address space
- Creation of behavior on memory region accesses:
   o Install callbacks

This API thus understands buses and networks and is similar in functionality to the OSCI TLM interface proposal. BHM/PPM has concepts similar to SystemC, but each instance of each model exists in its own private address space. It is normally simple to wrap existing C functions in a BHM/PPM peripheral model.

3.3 OVP models

OVP provides processor models such as ARM7, several MIPS processors, Tensilica and the OpenRISC OR1K.

A number of standard embedded devices are also modeled to allow assembly of a complete platform, including various types of memories, traps, bridges, DMA engines and UARTs, to name a few. OVP processor models are instruction accurate. In the realm of ISS models, instruction accurate models are usually approximately timed, in that they claim, on most occasions, to execute each instruction using the correct number of clock cycles and to perform their I/O operations at roughly the right place within the instruction [17]. OVP processor models, however, are instruction accurate in purely functional space and not in behavioral space. To make this point clear, functional models and behavioral models are strictly defined to be different. A functional model does not include timing, although it may include sequence. A behavioral model includes timing, although the level of detail of the timing is not defined. Both kinds of model can exist at any level of abstraction. Thus the ISS models generally used by prototyping environments (e.g. at the PV abstraction level of TLM compliant SystemC) are behavioral models, whereas OVP models are functional models. Instruction accuracy in OVP terms means that the registers hold the correct values at the end of each instruction and that executing the instruction creates the right side effects. The models progress one instruction at a time and know nothing about multi-execution pipelines, out-of-order execution or anything of that sort.

3.4 OVPsim: The OVP Simulator

OVPsim provides the infrastructure for describing multicore platforms. The OVPsim simulator can simulate arbitrary multiprocessor shared memory configurations and heterogeneous multiprocessor platforms. OVPsim is a very fast simulator.
The performance of OVPsim depends on several factors (for example, the processor variants used in the platform and the exact nature of the application itself), but speeds of hundreds of millions of simulated instructions per second can typically be expected. The simulation experiments conducted in this work for similar platforms in OVP and in one of the proprietary industry-standard modeling environments support the claim of higher simulation speed for OVP. Since OVPsim platform models can be compiled as shared objects, they can be encapsulated in any simulation environment that is able to load shared objects. This includes C, C++ and SystemC simulation environments. The commercial Linux based Imperas simulator supports multiprocessor debugging, which is not provided in the free Windows based simulator, and offers even higher simulation performance.

3.5 Additional Features of OVP

1. Semihosting: Semihosting is the ability to provide host functionality to the simulated processor or peripheral. The semihosting library has full access to the simulated processor's registers, stack and memory space. Far more complex scenarios can be envisaged, including, for example, using the host network interface or host USB port to give the simulated platform connectivity to the outside world. The capabilities of the semihosting library interface provided by OVP can be used to model a huge range of system functions.

2. Mapping the processor address region to external memory: The processor address space can be explicitly specified to contain separate RAMs and ROMs. It is also possible to specify that certain address ranges will be modeled by callback functions in the ICM platform itself, which is useful for modeling memory-mapped devices. This capability is exploited in the current work to make the OVP processor work with SystemC/TLM based models, thereby establishing a hybrid simulation framework.

3. Integration with other environments: Simulators normally want to be masters that call into other models or simulators. This creates a conflict when two simulators need to be bolted together, because neither of them wants to relinquish control. The Imperas OVP simulator is built as a slave and is thus callable from other environments such as SystemC. The reverse is, however, not true: OVPsim cannot call a SystemC model. This is quite natural, since calling SystemC would bring the entire simulator performance back down to the level of the very thing it is trying to replace. On the other hand, substituting part of a SystemC based platform with an OVP model may bring about a large performance gain in relative terms.
However, Amdahl's Law tells us that the returns diminish and are dominated by the slowest running piece of the entire system; even one slow SystemC model can make the entire system crawl along at the slow rate. Putting OVP models in a SystemC environment therefore requires careful scheduling. OVP models and subsystems can be encapsulated in SystemC platforms and harnessed using:

- sc_clock(), i.e. at the detailed instruction or clock level
- TLM 2.0, i.e. the new OSCI transaction level approach
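The Amdahl's Law argument above can be made concrete with a short worked equation. The symbols and the numbers are illustrative assumptions, not measurements from this work: p is the fraction of the original simulation run time spent in the part that is replaced by an OVP model, and s is the speed-up of that part alone.

```latex
S_{\text{overall}} = \frac{1}{(1 - p) + \dfrac{p}{s}}
\qquad\text{e.g. } p = 0.9,\; s = 100:\quad
S_{\text{overall}} = \frac{1}{0.1 + 0.009} \approx 9.2
```

Under these assumed values, a hundredfold faster processor model yields less than a tenfold faster platform, because the remaining 10% of the run time, spent in the slower SystemC models, dominates.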

Both cases of integration have been tested in this work. Since modeling in pure SystemC brings down the simulation speed and is not desirable for the software development use case, we emphasize integrating OVP models at the transaction level.

3.6 Why is OVP fast?

The OVP technology from Imperas makes it possible to create faster virtual platforms for software development. As a result of the key technologies incorporated in OVP, virtual platforms are able to run at several hundred MIPS. Some of these features are described below.

1. Just-in-Time Code Morphing: Conventional processor models written in an HDL or a similar modeling language might be implemented by a loop that is activated by a clock signal. On activation of the system clock, the model might fetch the next instruction, decode it, and call specific functions to execute the instruction. Certain optimizations may be performed to speed up execution, but although models written in this conventional style can be accurate and straightforward in structure, they are not fast. Processor models designed for the Imperas tool set instead use just-in-time (JIT) code morphing technology. [15] The technique is quite similar to a dynamically compiled ISS. It works as follows:

   1. As each new processor instruction is encountered during program execution, the instruction is translated (morphed) into equivalent native machine code. The exact translations to be made are specified by the processor modeler using the Imperas Virtual Machine Interface (VMI) API.
   2. Contiguous sections of translated processor instructions are gathered into code blocks, which are held in a dictionary for the processor. Separate dictionaries are held for supervisor mode code fragments and user mode code fragments.
   3. If a processor performs a jump to a simulated address that has already been translated to a code block held in the dictionary, there is no need to perform the translation again: the simulator simply re-executes the existing code block.

Imperas technology handles the generation of native machine code and the efficient management of code blocks and dictionaries to give extremely fast simulation. This is possible because, as simulation proceeds, run time (execution of translated code blocks) dominates morph time (JIT compilation). High-performance processor models are created by doing as much work as possible at morph time and as little as possible at run time. Not all instructions may map closely to the just-in-time code morphing opcode set. Those that do not can be implemented using function calls from the morphed code at run time. Such a simulation method provides speed improvements if the application under test has portions of code that are executed repeatedly, which real-time applications in general do.

2. Program Counter Modeling: The simulator always knows the address of the current instruction. Instead of maintaining the program counter value in the processor model on every instruction, it is fetched directly from the simulator when required. Thus the processor models do not explicitly model register values that are infrequently referenced and can easily be created on demand. The same is very often the case for processor status registers. This makes processor models execute faster.

3. Simulation Performance Options:

ICM_ATTR_RELAXED_SCHED: The standard multiprocessor scheduling algorithm built in to the simulator normally simulates each processor for exactly the number of instructions implied by the processor MIPS rate and the time slice before moving on to the next processor in that time slice. Using the instance attribute ICM_ATTR_RELAXED_SCHED indicates to the scheduling algorithm that a closely-approximate number of instructions can be used for that instance. This makes simulations much faster, as explained below. The exact number of instructions for which the processor needs to execute in a time slice is

$N_{\text{instructions}} = \text{Processor Nominal MIPS} \times 10^{6} \times \text{time slice duration (in seconds)}$

Consider an example of a single code block containing native code implementing four simulated arithmetic instructions and one simulated jump instruction, so five simulated instructions in total.
Suppose that relaxed scheduling is not enabled and the simulation is reaching the end of a time slice, with just three instructions left to perform in that time slice, and that the next block to run is the one described above, which contains five instructions. In this case, the simulator cannot use the code block as it stands, as that would result in the execution of too many instructions in this time slice. It therefore has to discard that code block and generate a new one containing only three instructions, so that the instruction count is exactly correct at the end of the time slice. This incurs significant overhead. In relaxed scheduling mode, the simulator does not execute the code block in this time slice and does not discard it either. This is much more efficient, but it means that not quite enough instructions have been executed in the time slice (in this example, three fewer than the exact count). The simulator will attempt to make up the difference in the next time slice (i.e. it will try to execute correspondingly more instructions next time round), so errors do not build up over time.

ICM_ATTR_APPROX_TIMER: Processor models often contain countdown timers that expire after a certain number of instructions, causing an exception. Once again, modeling these timers to the exact instruction imposes a significant simulation overhead. If a closely-approximate number will do (as is usually the case, since instruction countdown timers are themselves often approximations of cycle countdown timers), simulation is much faster when the countdown counter expires frequently. Using the instance attribute ICM_ATTR_APPROX_TIMER indicates to the scheduling algorithm that a closely-approximate number of instructions can be used for countdown timer expiry.

Besides the key features mentioned above which enable fast system simulation, OVP can also be integrated with existing SystemC based platforms. To achieve this, a wrapper needs to be put around the OVP models. The next section describes the methodology which enables hybrid virtual prototyping using OVP models.
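The per-slice instruction count and the difference between exact and relaxed scheduling described in Section 3.6 can be sketched in a few lines of C++. This is an illustrative model only, not OVPsim code: the function names, the `SliceResult` structure and the simplifying assumptions (all code blocks have the same size, and every slice starts on a block boundary) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Exact instruction count per time slice:
// nominal MIPS x 10^6 x slice duration in seconds (rounded to nearest).
inline int64_t instructionsPerSlice(double nominalMips, double sliceSeconds) {
    return static_cast<int64_t>(nominalMips * 1e6 * sliceSeconds + 0.5);
}

struct SliceResult {
    int64_t executed;        // instructions actually run in this slice
    int64_t retranslations;  // blocks that had to be regenerated in a shorter form
};

// Simulate one time slice for one processor.
// 'carry' holds instructions owed from earlier slices (used in relaxed mode only).
// Assumption: all cached code blocks contain 'blockSize' instructions.
inline SliceResult runSlice(int64_t quota, int64_t blockSize, bool relaxed,
                            int64_t& carry) {
    assert(quota > 0 && blockSize > 0);
    const int64_t target = quota + (relaxed ? carry : 0);
    int64_t executed = (target / blockSize) * blockSize;  // whole cached blocks
    int64_t leftover = target - executed;
    int64_t retranslations = 0;
    if (!relaxed && leftover > 0) {
        // Exact mode: a shorter block is generated so the count matches exactly.
        executed += leftover;
        retranslations = 1;
        leftover = 0;
    }
    // Relaxed mode: the shortfall is carried into the next slice instead.
    carry = relaxed ? leftover : 0;
    return {executed, retranslations};
}
```

With a quota of 13 instructions and five-instruction blocks, exact mode has to regenerate one block to run the last three instructions, whereas relaxed mode stops after ten instructions and carries the remaining three into the next slice. The carried amount always stays below one block size, which is why the error does not accumulate.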
3.7 Hybrid Simulation Support for OVP

SoCs make intensive use of various IPs. Component reuse becomes a necessity to reduce the design challenge. This requires design methodologies for inter-IP communication and implementation. This flattening of the design process can best be managed through platform based design at the transaction level. TLM2.0 provides a new level of performance and interoperability. With TLM2.0 it is possible to make models from different vendors work together in a virtual platform. OVP provides a C++ interface to encapsulate OVP models in the SystemC environment. New developments have been made to make OVP models work in TLM2.0 compliant SystemC platforms. The availability of SystemC TLM2.0 technology for use with OVP CPU models allows OVP models to be encapsulated in existing TLM2.0 compliant SystemC platforms, thereby addressing the model interoperability issue and enabling fast deployment of virtual platforms through hybrid simulation of OVP and SystemC.

3.8 Approach to TLM2.0

To integrate already existing OVP models, wrappers are written and put around the existing code to make it compatible with the OSCI TLM APIs. The conventional APIs in OVP are built in C. To make a TLM2.0 compliant SystemC wrapper, several new classes are constructed in which the conventional C routines of the models are called. These classes build the wrapper around the binaries of the OVP processor, peripheral, memory and bus models, allowing them to be exported to a simulation environment other than OVP. Once exported to the SystemC environment, these models can be controlled through the SystemC interfaces. Of the various abstraction levels provided by TLM2.0, loosely timed modeling gives the highest performance. It allows processes to run ahead of simulation time (temporal decoupling) and uses a quantum keeper. The wrappers have been built at this abstraction level so that the models can run as fast as possible. Features like the Direct Memory Interface are used to provide a direct pointer to memory in the target, bypassing the sockets used in transport calls and enabling the faster simulation needed for the software development use case. The processor has the option to invalidate DMI, in which case the transport calls go over the bus. The wrappers support the TLM2.0 blocking transport interface with timing annotation.

3.9 Details of OVP inside TLM2.0

The wrapper that puts OVP processor models in the TLM2.0 environment is a generic wrapper that can be extended further according to the processor in use. The wrapper allows each processor to run freely for a large number of instructions rather than advancing all processors in lock-step. [18] The generic wrapper for the processor model is described in the form of a class derived from SC_MODULE.
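The effect of temporal decoupling with a quantum, as used by the loosely timed wrappers of Section 3.8, can be sketched without SystemC as follows. This is a simplified stand-in and not the TLM2.0 API (the standard utility for this purpose is `tlm_utils::tlm_quantumkeeper`); the function `runDecoupled`, the `RunStats` structure and the fixed time per instruction are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

struct RunStats {
    int64_t simulatedNs;  // global simulated time after the run
    int64_t syncs;        // number of synchronisations with the global clock
};

// Run 'instructions' instructions on one processor model.
// The model accumulates a local time offset and only synchronises with the
// global clock when that offset reaches the quantum.
inline RunStats runDecoupled(int64_t instructions, int64_t nsPerInsn,
                             int64_t quantumNs) {
    assert(instructions >= 0 && nsPerInsn > 0 && quantumNs > 0);
    int64_t globalNs = 0;
    int64_t localNs = 0;  // how far the model has run ahead of global time
    int64_t syncs = 0;
    for (int64_t i = 0; i < instructions; ++i) {
        localNs += nsPerInsn;       // run ahead without yielding to the kernel
        if (localNs >= quantumNs) { // quantum used up: synchronise
            globalNs += localNs;
            localNs = 0;
            ++syncs;
        }
    }
    if (localNs > 0) {              // final synchronisation at end of run
        globalNs += localNs;
        ++syncs;
    }
    return {globalNs, syncs};
}
```

In this model, each synchronisation corresponds to a costly context switch into the simulation kernel. A larger quantum therefore reduces the number of switches for the same simulated time, at the price of coarser interleaving between processors.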
The details of the wrapper are shown in figure 3.2, and the implementation methodology in figure 3.3. To enable encapsulation at TLM level, very basic C++ wrappers are first built that put every instance of a processor, bus, etc. inside separate classes (Processor/Bus object shown in figure 3.2). These classes access the core OVP functionality of the respective model. The outer SC_MODULE then instantiates objects of these processor and bus classes (CPU object shown in figure 3.2). A specific processor, e.g. MIPS or ARM, can then be derived from this basic processor, providing a third layer for the wrapper.

Based on this hierarchy of wrappers, the processor module shown in figure 3.3 has objects of the bus instantiated inside it. This allows mapping of the OVP processor address space to a local OVP memory/peripheral (through the OVP Bus) as well as to an external memory or peripheral with a TLM2.0 target socket. When the processor is connected to an internal OVP memory or peripheral, the connection is made directly from the OVP bus shown. To connect to an external memory/peripheral, a portion of the address space of the local OVP bus, which is directly connected to the OVP processor, is bridged to another bus (TLM Bus in the figure) on which read/write callbacks are registered. Initiator sockets are opened on the processor model. Any access to the part of this TLM bus address space that is mapped to an external memory/peripheral triggers these read/write callback functions on the TLM bus, which is indirectly connected to the processor. The callback functions then create the appropriate transaction request and forward the transport call with its generic payload over the initiator sockets.

Fig 3.2 Wrapper for Processor Model (MIPS/ARM Processor Object, Layer 3; CPU Object, Layer 2; Processor Object, Layer 1, around the OVP Processor Model; Bus Object, Layer 1, around the OVP Bus Model)

This generic wrapper is put around CPU models and is used in a processor-configuration-specific layer to create specific processor wrappers, such as those for ARM and MIPS, which are then instantiated in the SystemC platform. On encountering an instruction that performs a load or store to a memory location on the bus, the processor calls a function in the wrapper code, which in turn issues the necessary blocking transactions on the bus. Wrappers for the peripheral model are constructed in a similar fashion, using read/write callbacks registered on the bus connected to the peripheral model within an SC_MODULE.
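The callback path described above can be sketched in plain C++ without the SystemC or OVP libraries. Everything here is a hypothetical stand-in: `Payload` only mimics the role of the TLM2.0 generic payload, `BlockingTransport` plays the part of a blocking transport call on an initiator socket, and `ProcessorWrapperSketch` and `MemoryTargetSketch` are illustrative names, not classes from the OVP wrapper.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a generic payload (not the real tlm:: class).
enum class Command { Read, Write };

struct Payload {
    Command command;
    std::uint64_t address;
    std::uint8_t* data;   // points at the caller's buffer
    std::size_t length;   // transfer size in bytes
    bool ok;              // response status, filled in by the target
};

// Stand-in for a blocking transport call made over an initiator socket.
using BlockingTransport = std::function<void(Payload&)>;

// Assumed shape of the wrapper: the bridged bus invokes these callbacks
// whenever the processor touches externally mapped address space.
class ProcessorWrapperSketch {
public:
    explicit ProcessorWrapperSketch(BlockingTransport transport)
        : transport_(std::move(transport)) {}

    // Write callback: build a write request and forward it.
    bool on_write(std::uint64_t addr, std::uint32_t value) {
        Payload p{Command::Write, addr,
                  reinterpret_cast<std::uint8_t*>(&value), sizeof(value), false};
        transport_(p);
        return p.ok;
    }

    // Read callback: build a read request; the target fills 'value'.
    bool on_read(std::uint64_t addr, std::uint32_t& value) {
        Payload p{Command::Read, addr,
                  reinterpret_cast<std::uint8_t*>(&value), sizeof(value), false};
        transport_(p);
        return p.ok;
    }

private:
    BlockingTransport transport_;
};

// Minimal external memory acting as the target of the transport call.
class MemoryTargetSketch {
public:
    explicit MemoryTargetSketch(std::size_t size) : bytes_(size, 0) {}

    void b_transport(Payload& p) {
        if (p.address + p.length > bytes_.size()) {  // out of range
            p.ok = false;
            return;
        }
        if (p.command == Command::Write)
            std::memcpy(&bytes_[p.address], p.data, p.length);
        else
            std::memcpy(p.data, &bytes_[p.address], p.length);
        p.ok = true;
    }

private:
    std::vector<std::uint8_t> bytes_;
};
```

A platform would bind the two with something like `ProcessorWrapperSketch cpu([&mem](Payload& p) { mem.b_transport(p); });`, after which every `on_read` or `on_write` becomes one blocking request to the memory.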
The TLM2.0 wrapper also provides a bus decoder with a configurable number of initiator and target sockets, which forwards a transaction arriving on one of its target ports to the proper initiator port based on the bus address map.
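The address-map lookup performed by such a bus decoder can be illustrated with a small, library-free C++ sketch. The names `BusDecoderSketch`, `AddressRange` and `Route` are hypothetical, and passing the target an offset relative to the range base is an assumption of this sketch rather than a documented property of the OVP decoder.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One entry of the bus address map: [base, base + size) goes to 'port'.
struct AddressRange {
    std::uint64_t base;
    std::uint64_t size;
    std::size_t port;   // index of the initiator port to forward on
};

// Result of decoding an incoming address.
struct Route {
    std::size_t port;
    std::uint64_t offset;   // address relative to the range base (assumption)
};

class BusDecoderSketch {
public:
    void map(std::uint64_t base, std::uint64_t size, std::size_t port) {
        ranges_.push_back({base, size, port});
    }

    // Returns true and fills 'out' when the address hits a mapped range.
    bool decode(std::uint64_t address, Route& out) const {
        for (const AddressRange& r : ranges_) {
            if (address >= r.base && address - r.base < r.size) {
                out = Route{r.port, address - r.base};
                return true;
            }
        }
        return false;   // unmapped address: no port to forward to
    }

private:
    std::vector<AddressRange> ranges_;
};
```

With two ranges mapped to ports 0 and 1, a call such as `bus.decode(0x10000010, r)` selects the port whose range contains the address and leaves unmapped addresses to be reported as errors by the caller.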

Fig 3.3 Processor Wrapper Implementation (an SC_MODULE containing the OVP Processor, OVP Bus, Bus Bridge and TLM Bus, with a TLM2 Initiator Socket)

The SystemC environment thus calls the OVP simulator through this wrapper. Proper synchronization between the two simulators must be maintained for the models in the platform to work correctly. As the simulation starts, each processor runs from a SystemC thread. The thread executes IPQ instructions on the processor without advancing SystemC time, where:

IPQ = Processor Nominal MIPS × Quantum Size

The function call asking the processor to simulate IPQ instructions is an OVP call made through the wrapper. When the allotted instructions have completed, the thread calls SystemC wait() to advance time. The OVP simulator thus synchronizes with the SystemC simulation kernel each time a quantum is over, and each processor executes a number of instructions at a time in a round-robin schedule.

Based on this background, a wrapper is prepared to enable OVP models to communicate with Open SCML based models. The details of the wrapper and the experiments done with it are presented in chapter 6. The following chapter presents the simulation performance experiments done with OVP.
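As a worked illustration of the instructions-per-quantum relation in Section 3.9, the sketch below computes IPQ and mimics the round-robin quantum loop in plain C++. It assumes the quantum is given in microseconds, so that MIPS (millions of instructions per second) times microseconds yields an instruction count directly; `instructions_per_quantum`, `run_round_robin` and `SimStateSketch` are hypothetical names, and the time increment merely stands in for the SystemC wait() call.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// IPQ = nominal MIPS x quantum size.
// (1e6 instructions/s) x (1e-6 s) cancels, so MIPS x microseconds = instructions.
std::uint64_t instructions_per_quantum(std::uint64_t nominal_mips,
                                       std::uint64_t quantum_us) {
    return nominal_mips * quantum_us;
}

struct SimStateSketch {
    std::uint64_t time_us = 0;             // simulated time
    std::vector<std::uint64_t> executed;   // instructions run per processor
};

// Each processor runs IPQ instructions without advancing time; time then
// advances by one quantum (the role played by wait() in the real wrapper).
SimStateSketch run_round_robin(std::size_t processors,
                               std::uint64_t nominal_mips,
                               std::uint64_t quantum_us,
                               unsigned quanta) {
    SimStateSketch s;
    s.executed.assign(processors, 0);
    const std::uint64_t ipq = instructions_per_quantum(nominal_mips, quantum_us);
    for (unsigned q = 0; q < quanta; ++q) {
        for (std::uint64_t& count : s.executed)
            count += ipq;          // processor free-runs for one quantum
        s.time_us += quantum_us;   // synchronize with the kernel
    }
    return s;
}
```

For example, a processor with a nominal rate of 100 MIPS and a 1 ms quantum executes 100 × 1000 = 100000 instructions between synchronization points.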

CHAPTER 4

SIMULATION PERFORMANCE EXPERIMENTS

Imperas claims that its solutions simulate platforms consisting of one or more processors running real-time applications at speeds of hundreds of MIPS, which is needed for today's embedded software development environments. To validate this claim, we have compared the simulation statistics for similar platforms constructed using OVP and another modeling technology. The proprietary virtual prototyping solutions provided by a leading ESL vendor are chosen for comparison against Open Virtual Platforms technology. Similar single and dual core platforms are constructed in the different environments and their simulation statistics are compared.

4.1 Application under test - JPEG Decoder

To simulate the platforms, a suitable application must be chosen for execution on the processor, one that places a high workload on it. A baseline JPEG decoder is chosen as the benchmark application for our current simulation framework. JPEG (Joint Photographic Experts Group) is a widely used image compression technique, used in image processing systems such as copiers, scanners and digital cameras. A JPEG decoder reconstructs image data from a stream of compressed image data by applying a series of transformations to it. The fact that the baseline method forms the basic coding method for all DCT-based JPEG decoders makes it an interesting decoding method, and for that reason it was selected for implementation in this project. The JPEG decoder is a streaming multimedia application which has a degree of parallelism and consists of 5-6 tasks. The JPEG decoding process is depicted in Figure 4.1. Before the operations performed by the decoder are explained, we look at the encoder. The JPEG encoder divides an image into blocks of 8 by 8 pixels.
The encoder then has a number of blocks which, when placed in the right order, form the original image. The encoder applies a number of operations to each of these blocks: a discrete cosine transform, quantization, zigzag scan and variable length encoding. The result of these operations, and of the encoder, is a compressed image.

Fig 4.1 JPEG Decoder (Compressed Image Data, VLD, ZZ, DQ, IDCT, Color Conversion, Re-order, Reconstructed Image)

The decoder reverses the transformations applied by the encoder to the image data. It takes the compressed image data as its input and applies the following operations to it in sequence:

1. Variable Length Decoding (VLD)
2. Zigzag scan (ZZ)
3. De-quantization (DQ)
4. Inverse Discrete Cosine Transform (IDCT)
5. Color Conversion
6. Reordering

The decoder thereby obtains the reconstructed bitmap image. The compressed image data forms a byte stream input for the decoder. This byte stream contains so-called markers. A marker is a two-byte combination which identifies a structural part of the compressed image data. The incoming bit stream is parsed, based on the markers, to obtain header information and image data, and the various transformations are then applied. Details about the different transformations and markers can be found in [19].

4.2 Application Task Graph Mapping for Dual Core Platform

Besides single core platforms, homogeneous multi-core platforms using the MIPS processor are also constructed, on which the same JPEG decoder application is tested. The study is limited to dual core systems but could be extended to more cores depending on the workload of the application. To execute the same application on two processors, the tasks must be partitioned between the two processors in such a way that each processor has almost equal computation and communication load [20].

As seen from figure 4.1, the various tasks in the decoder are performed one after the other. The processor cores in the platform are therefore active one after the other, and the architecture of the dual core system looks as shown in Figure 4.2.

Fig 4.2 Dual Core System Architecture (connection 1, Core 1, connection 2, Core 2, connection 3)

To select a proper task partitioning for the application under experimentation, the application is studied carefully to find the match between the JPEG decoder and the dual core platform. The compressed image data enters our twin-processor system over connection 1. This data is the input of the VLD in the JPEG decoder, so the VLD must be placed on the first processor. As seen in Figure 4.1, re-ordering is connected to the output, which is connection 3 in the system, so it is mapped to core 2. To divide the ZZ, DQ, IDCT and color conversion over the two cores, the data consumption and production rates of the various parts of the system are examined. The VLD consumes data from the outside world and produces data in blocks. The zigzag scan, de-quantization and IDCT also consume and produce one block at a time. The color conversion and re-ordering require one or more (up to 10) blocks before they can run. The color conversion, however, produces data on a block-by-block basis and sends it to the re-ordering unit, which then produces output data. This implies that the communication over connection 2 of our dual core system is always in blocks. Thus every division of the JPEG decoder over two cores requires the same data rate, and the subdivision of the JPEG decoder does not influence the communication load of the system. Still, for proper partitioning of the decoder between the two cores, the computation load on the two cores must be more or less the same. This enables core 2 to start as soon as core 1 has produced one block. The survey result of the system load for the various parts of the JPEG decoder is shown in Table 4.1.
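The load-balancing argument can be checked with a short C++ sketch that tries every cut point in the pipeline and keeps the most balanced one. The load percentages are those reported in Table 4.1; the function `best_split`, the `Split` structure and the exhaustive search are illustrative assumptions, not the procedure used in the thesis.

```cpp
#include <array>
#include <cstddef>
#include <cstdlib>

// Computation load (% of total) per task, in pipeline order:
// VLD, ZZ, DQ, IDCT, Color Conversion, Re-ordering (values from Table 4.1).
constexpr std::array<int, 6> kLoadPercent{35, 5, 10, 20, 15, 15};

struct Split {
    std::size_t first_task_on_core2;  // index of the first task mapped to core 2
    int core1_load;
    int core2_load;
};

// Tasks run in sequence, so a partition is a single cut in the pipeline.
// VLD stays on core 1 and re-ordering on core 2, hence cuts 1..5.
Split best_split(const std::array<int, 6>& load) {
    int total = 0;
    for (int l : load) total += l;

    Split best{1, load[0], total - load[0]};
    int core1 = 0;
    for (std::size_t cut = 1; cut < load.size(); ++cut) {
        core1 += load[cut - 1];
        const int core2 = total - core1;
        if (std::abs(core1 - core2) <
            std::abs(best.core1_load - best.core2_load)) {
            best = Split{cut, core1, core2};
        }
    }
    return best;
}
```

With the Table 4.1 figures, the cut falls in front of task index 3 (the IDCT), giving 50% of the load to each core.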
Table 4.1 shows that partitioning at the IDCT function, with VLD, ZZ and DQ on core 1 and IDCT, color conversion and re-ordering on core 2, is the easiest to realize. This choice gives an almost 50-50% load sharing between the two cores. It also has the advantage that the Huffman decoding and de-quantization tables required by the VLD and DQ units respectively do not need to be shared by both processors. Based on this task partitioning, the data flow between the two processors in the system is shown in Figure 4.3.

Part      Task in JPEG decoder    Computation Load (% of total load)
CORE 1    VLD                     35
          ZZ                      5
          DQ                      10
CORE 2    IDCT                    20
          Color Conversion        15
          Re-ordering             15

Table 4.1 System Load distribution for JPEG Decoder

Fig 4.3 Partitioning for JPEG Decoder (Input: JPEG image; Proc 1: VLD, ZZ, DQ; image properties and blocks passed from Proc 1 to Proc 2; Proc 2: IDCT, Color Conv., Re-order; Output: BITMAP image)

4.3 Single Core Platform

To study the simulation performance, platforms having the same configuration are constructed for the following cases:

- Pure OVP simulation framework
- ESL vendor supplied virtual prototyping framework
- Hybrid OVP + SystemC in OVP simulation framework
- Hybrid OVP + Open SCML in OVP simulation framework

For simplicity of experimentation, dedicated peripherals are not added to the platform. Also, to keep the comparison fair, the same variant of the processor model must be used in all cases. We have chosen the Instruction Accurate (IA) MIPS32_24Kc processor model. The details of the platform can be seen in Figure 4.4. The program memory shown in the figure stores the executable binary of the application, which is in the standard Executable and Linkable Format (.elf). The input and output image memories are used, respectively, to read the compressed image data and to store the final reconstructed image data after decoding. The loading and storing of images in memory is automated in the platform. Once the image is loaded, the processor executes the application, which reads the data from the input image memory. The quantization and Huffman tables needed for the various transformations during decoding are present in the image. As the image is read byte by byte from the memory, these tables are read, based on the markers found, and stored in storage local to the processor. They are then used to decode the image pixel data, and the final reconstructed image data is obtained in Sun raster image format. This image is stored in the output image memory.

The hybrid single core platform, in which the OVP processor model is placed inside a TLM2.0 compliant SystemC wrapper, is also tested. In such platforms, the OVP processor with its TLM2.0 compliant SystemC wrapper interacts with simple TLM2.0 target memories. This is done by connecting the processor model to a SystemC based bus decoder which has TLM2.0 target and initiator sockets. The bus decodes the incoming address and, based on it, forwards the transaction to one of its initiator ports, which is connected to the TLM2.0 target socket of the memory.
Fig 4.4 Single Core Platform (components: MIPS32_24Kc Processor Model (IA), Bus Decoder, Program Memory, Input Image Memory, Output Image Memory)

4.4 Backdoor Mode simulations

In all cases, simulations have been carried out in backdoor mode. This is a way of accessing a memory/peripheral in which the transaction request does not actually go over the bus. In the pure OVP environment, the processor accesses the memory through a pointer. The proprietary solutions also provide options for simulation in backdoor mode. For the hybrid simulation case, it is the Direct Memory Interface (DMI) support provided by TLM2.0 that enables backdoor mode simulation. The DMI provides a means by which an initiator can get direct access to an area of memory owned by a target, thereafter accessing that memory using a direct pointer rather than through the transport interface. This offers a potential increase in simulation speed for memory accesses between initiator and target models. Figure 4.5 gives a representation of how DMI works. Once established, DMI bypasses the normal path of multiple transport calls (blocking transport in the current framework) from the initiator through interconnect components to the target.

Fig 4.5 Backdoor Memory Access (components: Wrapper, OVP MIPS32_24K CPU, Memory, Direct Memory Interface)

4.5 Dual Core Platform

Based on the task partitioning explained in section 4.2, the application was split into two parts, each executed on a separate processor. The platform constructed to simulate such a system is illustrated in Figure 4.6. Each processor has its own program memory, which contains the executable in .elf format. In the current framework, the cores communicate with each other via shared memory. This shared memory is used to transfer the necessary information from core 1 to core 2: the image properties, consisting of parameters such as image size, number of components and sampling rate, and the blocks needed for IDCT computation.
To maintain correctness of the application, the two processors synchronize via a polling mechanism in which a semaphore in shared memory is constantly polled by both processors. Processor 1 reads the input image from the memory. Each time processor 1 generates an 8x8 block after de-quantization, it places the block into the shared memory and sets the semaphore high. It then waits for the semaphore to return to a low value, which is done by processor 2. Processor 1 writes a block to the shared memory only when the semaphore is low. Processor 2, on the other hand, waits for the semaphore to go high. When it finds the semaphore high, it reads the block from the shared memory and resets the semaphore to low so that the next block can be written by processor 1. Processor 2 then performs IDCT on the block. When a sufficient number of blocks has been obtained, color conversion and re-ordering are performed and the reconstructed data is stored in the output memory.

Fig 4.6 Dual Core Platform (components: MIPS32_24K Core 1 (IA), MIPS32_24K Core 2 (IA), Bus 1, Bus 2, Bus 3, two Program Memories, Input Memory, Output Memory, Shared Memory)

Based on this description, the next chapter presents the results of our experimentation.
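The semaphore handshake of Section 4.5 can be sketched as two polling routines over a shared structure. This is a single-threaded C++ illustration in which the two "processors" are polled alternately; `SharedMemorySketch`, `producer_poll`, `consumer_poll` and `run_handshake` are hypothetical names, and a single `int` stands in for one 8x8 block.

```cpp
#include <cstddef>
#include <vector>

struct SharedMemorySketch {
    bool semaphore = false;  // low: slot free, high: block available
    int block = 0;           // stand-in for one de-quantized 8x8 block
};

// One polling step of processor 1: write only while the semaphore is low.
bool producer_poll(SharedMemorySketch& shm, const std::vector<int>& blocks,
                   std::size_t& next) {
    if (shm.semaphore || next >= blocks.size())
        return false;          // previous block not yet consumed, or done
    shm.block = blocks[next++];
    shm.semaphore = true;      // signal processor 2
    return true;
}

// One polling step of processor 2: read only while the semaphore is high.
bool consumer_poll(SharedMemorySketch& shm, std::vector<int>& received) {
    if (!shm.semaphore)
        return false;          // nothing to read yet
    received.push_back(shm.block);
    shm.semaphore = false;     // release the slot for the next block
    return true;
}

// Alternate the two polling loops until every block has crossed over.
std::vector<int> run_handshake(const std::vector<int>& blocks) {
    SharedMemorySketch shm;
    std::vector<int> received;
    std::size_t next = 0;
    while (received.size() < blocks.size()) {
        producer_poll(shm, blocks, next);
        producer_poll(shm, blocks, next);  // finds semaphore high, writes nothing
        consumer_poll(shm, received);
    }
    return received;
}
```

Because the producer refuses to write while the semaphore is high, a block is never overwritten before processor 2 has read it, and the blocks arrive in the order they were produced.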