Developing Applications on GPUs: Discrete-Event Simulator Implementation


A thesis submitted to The University of Manchester for the degree of Master of Science in the Faculty of Engineering and Physical Sciences

2011

By Farhad Kajabadi
School of Computer Science

Table of Contents

1. Introduction
   1.1. Aims and Objectives
   1.2. Structure of the Report
2. GPU Computing
   2.1. Heterogeneous Massively Parallel Architectures and Data Parallel Tasks
   2.2. The History of GPU Computing
   2.3. CUDA Architecture
   2.4. General Purpose GPU Computing, Offerings, and Trade-offs
3. Discrete Event Simulation
   3.1. DES Architecture
   3.2. Parallel Discrete Event Simulation
      3.2.1. Conservative Approach
      3.2.2. Optimistic Approach
4. PDES Implementation on GP-GPU
   4.1. Needed Infrastructures of PDES against GP-GPUs' Capabilities
      4.1.1. Computational Capabilities
      4.1.2. Synchronization and Communication Methods
      4.1.3. Data Types
   4.2. Performance Trade-offs
   4.3. Expectations of the Results
5. Research Method
   5.1. Programming Model
   5.2. Development Platform
   5.3. Correctness Verification
   5.4. Performance Evaluation
   5.5. Project Plan
6. Summary
Appendix A
Appendix B
References

Table of Figures

Figure 2.1: NVIDIA GeForce 8800 Hardware Architecture
Figure 3.1: The flow chart of the three phase method for discrete event simulation

List of Abbreviations

ASIC    Application Specific Integrated Circuit
API     Application Programming Interface
CPU     Central Processing Unit
CUDA    Compute Unified Device Architecture
DES     Discrete Event Simulator
DSP     Digital Signal Processor
DX      DirectX
FPGA    Field Programmable Gate Array
GPU     Graphics Processing Unit
GP-GPU  General Purpose Graphics Processing Unit
HPC     High Performance Computing
ILP     Instruction Level Parallelism
PDES    Parallel Discrete Event Simulator
PRNG    Pseudo Random Number Generator
ROP     Raster Operation Processor
SFU     Special Function Unit
SIMT    Single Instruction, Multiple Threads
SM      Streaming Multiprocessor
SP      Streaming Processor
TPC     Texture/Processor Cluster

Abstract

With the development of the general purpose graphics processing unit (GP-GPU) in 2006, GPUs have evolved into flexible and powerful parallel processors, delivering higher levels of parallel performance than conventional state-of-the-art multi-core processors at lower cost. GP-GPUs are widely used for general purpose and scientific computations; however, given their synchronous organization and their single-instruction, multiple-thread execution model, they are not the primary target for task parallel applications.

Given the growing demand for system simulation in recent years, more efficient simulation methods such as discrete event simulation (DES) have become the target of much research. Discrete event simulation is widely used in science, the military, and industry [5] as a powerful method of determining a system's characteristics and the resources it needs. The method is based on processing randomly occurring, independent, instantaneous, asynchronous events that change the system state.

Most of the research on carrying out simulations on GPUs has been based on system models that use regular time increments. These increments can be accomplished either by adding constant time deltas, as in numerical integration, or by event scheduling based on time deltas, as used in discrete event approximations of continuous-time systems [14]. However, due to its asynchronous nature, DES does not map efficiently onto the GP-GPU architecture; as a result, a new strategy must be devised to map the irregular time advances of discrete event models onto the GPU's many-core architecture efficiently. To achieve this goal, we aim to develop a new time approximation mechanism using NVIDIA's compute unified device architecture (CUDA) platform.

The evaluation stage accounts for verifying the code's correctness and measuring its performance. After the code is developed, its correctness and performance will be tested against several independent parallel implementations of discrete event simulators. Also, a CPU-only version of the code will be developed and compared to the GPU version. Both the correctness verification tests and the performance benchmarks will be carried out on different scenarios based on different system models, so that the obtained results and drawn conclusions are reliable. Although GP-GPUs are not naturally compatible with the asynchronous nature of DES computations, we expect to be able to exploit their massively parallel architecture to achieve satisfactory speed-ups; however, a very substantial level of performance is not expected.

Declaration

No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright

I. The author of this thesis (including any appendices and/or schedules to this thesis) owns any copyright in it (the "Copyright"), and s/he has given The University of Manchester the right to use such Copyright for any administrative, promotional, educational and/or teaching purposes.

II. Copies of this thesis, either in full or in extracts, may be made only in accordance with the regulations of the John Rylands University Library of Manchester. Details of these regulations may be obtained from the Librarian. This page must form part of any such copies made.

III. The ownership of any patents, designs, trademarks and any and all other intellectual property rights except for the Copyright (the "Intellectual Property Rights") and any reproductions of copyright works, for example graphs and tables ("Reproductions"), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property Rights and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property Rights and/or Reproductions.

IV. Further information on the conditions under which disclosure, publication and exploitation of this thesis, the Copyright and any Intellectual Property Rights and/or Reproductions described in it may take place is available from the Head of the School of Computer Science (or the Vice-President).

Acknowledgements

I would like to take this opportunity to thank Dr. Christopher Kirkham for supervising and guiding me, and for providing substantial and valuable information without which this project could not have been accomplished. I would like to thank all those who contributed to this research. Special thanks to Dr. Mikel Lujan for providing me with the needed information and additional useful references. Many thanks go to the various people I met at the University of Manchester who helped me develop my skills. I gratefully thank my parents for their unconditional support and love throughout my life.

1. Introduction

As soon as processor performance started to be gained by adding multiple cores to a single chip, a new trend toward developing simpler, yet more powerful, parallel architectures started to emerge. After the Microsoft DirectX 10 specification forced graphics processing unit (GPU) developers to build chips with unified graphics processors capable of both low-latency, high-precision and high-latency, lower-precision integer and floating point operations, manufacturers came up with the idea of using the newly developed chips for general computations [3]. To make their unified architecture exploitable by general programmers, they needed a programming model that hides the graphics processing concepts; hence, they started to develop a new platform accessible through the well-known C and C++ languages [1, 2]. The simplicity of the new programming model and its familiar C syntax helped the platform find its position in the competition of parallel architectures very quickly. The increasing number of companies and applications making use of the new platform caused major hardware manufacturers to invest more in the field; as a consequence, the general code execution capabilities of GPUs grew quickly, to the point that new architectures allow the GPU and CPU to be integrated on a single chip. Integration of the GPU-CPU heterogeneous parallel architecture on a single chip allows the graphics processor to access main memory directly through DMA without CPU intervention, solving one of the main deficiencies of GPU computing, namely indirect memory access.

The direction of development of general purpose graphics processing units (GP-GPUs) and their widespread use in different software domains show that they are moving toward unification with central processors. Furthermore, next generation operating systems are trying to exploit their power for heavy data parallel tasks, and in the domain of scientific computing they have become significantly successful and widely used [2]. All these attributes lead GPU computing to be referred to as one of the most significant building blocks of the future of parallel computing.

Nearly five decades ago, computer assisted design and modeling techniques started to appear; since then, such techniques have significantly changed the process of designing and testing new products before even the first prototype is made. Given the heavy costs of producing and testing a large number of real-world models, for most products, such as electronics, it is not efficient to carry out the entire verification process using prototypes. Also, in some fields, like the aerospace industry, it is not feasible to produce real-world prototypes for different tests, as the real-world tests might be very dangerous, sometimes even costing human lives. To design efficient systems and complex structures, besides verifying the produced design, computer simulation is widely trusted and used. Its broad application in different industries makes discrete event simulation one of the most widely used types of computer simulation and puts it at the centre of attention in much research. However, as the systems to be modeled get more complex, their simulation needs more time and memory to complete. Because the resources in sequential computers are physically limited [12], parallelizing the simulator is the main way to reduce simulation time.

1.1. Aims and Objectives

In spite of the widespread use of GP-GPUs in different scientific problem domains, not much work has been done on implementing parallel discrete event simulators (PDES) on this platform; hence, this project aims to fill part of the gap by implementing PDES on NVIDIA's widely used GP-GPU platform, namely the Compute Unified Device Architecture (CUDA). The project starts with understanding the CUDA architecture, its programming model, PDES architecture, and possible implementation approaches; the research then continues with the development of the fundamental structures and data types that a PDES needs. In the next stage, the PDES will be developed and tested. Finally, a parallel CPU version of the PDES will be developed and compared with the GPU version to verify the performance and correctness achievements. Besides the developed CPU version of the simulator, it will also be compared to a number of other implementations so that more reliable conclusions can be drawn.

1.2. Structure of the Report

Chapter 2 reviews different aspects of GPU computing as the underlying architecture on which the project is going to be developed. Initially, it describes the movement toward parallel computing and the reasons behind it. It continues by depicting heterogeneous parallel architectures, especially the CPU-GPU hybrid platform, and its potential to carry out heavy data parallel tasks efficiently. A short history of GPU development and the technologies GPUs have exploited is surveyed to give an overview of their evolution to the current state. Finally, their strengths and weaknesses in performing general purpose computations are reviewed.

Chapter 3 is composed of general background information related to the structure and implementation of DES and PDES. To illustrate the different aspects of DES implementation and its purposes in science and industry, the chapter starts with a short review. The next section describes the parallelization of DES and the different possible approaches. The last section reviews those parallelization approaches in terms of their main limitations and trade-offs, and discusses some of the surveyed solutions to the main problems.

Chapter 4 creates a logical relation between the two previous chapters so as to draw conclusions about the needed infrastructure, the possible trade-offs and bottlenecks, and the expected results. The main difference from the previous chapter is that it reviews the implementation of PDES specifically on the GPU; hence, there are different limitations, trade-offs, and solutions in comparison with the previous chapter.

Chapter 5 defines the research methods that are going to be adopted for the project so as to derive road-map strategies for the tasks to be carried out. The main tasks are as follows: (a) choosing the suitable development platform, (b) developing the code, (c) correctness verification, and (d) performance evaluation. This chapter also describes the plan on which the tasks are scheduled.

2. GPU Computing

For more than a decade, from the start of the 1990s until the mid-2000s, the trend was to increase a processor's performance by raising its clock rate and reaching a higher level of instruction level parallelism in the processor design; however, this trend stopped after the market-leading processor manufacturers faced several serious physical limitations. As the main problem, the power consumption, and as a result the heat dissipation, of a chip was shown to grow super-linearly as the clock rate increased beyond some point around 4 GHz; furthermore, the complexity and cost of designing chips that achieve higher levels of instruction level parallelism on a single core started to grow exponentially. From another point of view, memory bandwidth, one of the most important factors affecting a computer's performance, was limited; as a consequence, memories were not able to keep up with processor performance growth. All the mentioned barriers, also referred to as the power wall, the ILP wall, and the memory wall, forced computer manufacturers to move to multi-core processor designs to be able to keep increasing their products' performance efficiently.

In the multi-core architecture, the increasing number of transistors available on a given area of silicon, as described by Moore's law, is used to add more cores to a single chip. In this approach each core has a less complex architecture, as opposed to the previous trend of putting a single core with a much more complex design on the same area. Benefiting from the new multi-core architecture, and using multi-channel memories, computer manufacturers are able to pass all the mentioned barriers; however, the new parallel architecture introduces many new problems and bottlenecks for software developers. Unlike sequential execution, parallel execution of code does not natively guarantee the correct order of execution of the instructions, as the instructions run on several independent parallel cores simultaneously. To solve this problem, parallel units of execution must be synchronized at certain points of execution to make sure that the code runs correctly. Synchronization can be significantly costly in terms of execution time; thus, it is a big trade-off to develop correct code with the least possible synchronization overhead. Also, by moving toward parallel computing, not all the needed data is local anymore; hence, it is very important to design the code so that it keeps data accesses as local as possible. Apart from these trade-offs there are a number of other concerns with parallel code performance that are outside the borders of this discussion; among them are load imbalance, deadlock and livelock, and scheduling overheads. In addition, it is important to choose the correct sequential version of an algorithm to parallelize, as the best sequential version is not always the best candidate for parallelization.

The rise of the multi-core era has made an application's performance more dependent on the code's design and its level of success in processor utilization; consequently, the performance, which comes from the underlying hardware, is now more determined by the software.

2.1. Heterogeneous Massively Parallel Architectures and Data Parallel Tasks

Facing the mentioned physical barriers to increasing single-core processor performance, the high performance computing (HPC) community developed new strategies to augment Moore's law and started exploring new techniques to tackle the problems of conventional systems. Besides the innovation of multi-core processors, computer developers also invested in the development of specialized processors aiming to give more performance where conventional general purpose multi-core processors had been shown to perform poorly. The development of the new specialized processors allowed the HPC community to design new heterogeneous computer architectures, in which conventional and specialized processors are integrated to work cooperatively [12]. In other words, the need for heterogeneous platforms comes from the demand for high performance, highly responsive systems capable of interacting with other environments more efficiently. Systems based on heterogeneous architectures have been shown to achieve better performance characteristics while in many cases consuming less power and having lower total development costs. These attributes attracted a lot of scientific interest; in fact, in the long term, "heterogeneous computing holds tremendous potential for accelerating applications beyond what one would expect from Moore's Law, while overcoming many of the barriers that can limit conventional architectures" [12].

In general, heterogeneous architectures are composed of processors with different instruction set architectures, different application binary interfaces, different APIs, or different memory interfaces. Normally, these processors include general purpose processors, covering single-core and multi-core processors; special purpose processors such as DSPs and GPUs; custom logic accelerators such as ASICs and FPGAs; and specialized on-chip computational units such as encryption co-processors and on-chip network processors.

Since the development of the first GPU, the GPU-CPU heterogeneous architecture has been widely used to calculate motion and render high resolution graphics. By their nature, GPUs have a large number of computational cores processing different types of graphics data at lower clock rates than conventional commodity multi-core processors; hence, they expose a higher level of parallel performance while consuming less power due to their lower core clock rate. Nevertheless, their potential as general-purpose massively parallel processors had never been widely exploited until the development of general purpose GPUs, as doing general computations with traditional GPUs required comprehensive knowledge of at least one of the graphical programming languages [2, 3]. However, with the development of GP-GPUs, a new programming model emerged that allows the programmer to write code performing general computations on GPUs using extensions to conventional C and C++ compilers. The new programming model brought power and simplicity to GPU programming and allowed software developers to build complex applications on GPUs without knowledge of OpenGL or DirectX programming.
Given the proven capability of GP-GPUs in providing high levels of parallel performance, their application in science and industry for solving complex problems has grown significantly in recent years. In spite of the fact that GP-GPUs are also capable of processing task parallel jobs, the synchronous nature of their underlying

hardware makes them good targets for data parallel tasks involving the same computations on large sets of different data. In other words, GP-GPUs are natively designed and optimized for heavy data parallel tasks, such as rendering high resolution graphics. However, they are also capable of running multi-stream parallel tasks and task parallel applications, albeit with less potential performance.

2.2. The History of GPU Computing

The first GPU, named the GeForce 256, was developed by NVIDIA in 1999. In fact, the term GPU was defined by NVIDIA for the architecture that integrates transforming, lighting, triangle clipping, rendering, and other graphics processing engines on a single chip [4]. Prior to the development of the first GPU, for more than a decade, the term graphics accelerator was used for 2D/3D graphics processing chipsets. Graphics accelerators used several different task-specific graphics processing chips together to render the final image. The separation of the different graphics processing pipelines made it necessary to go off-chip after each stage of the process, feeding the processed data into the next chip. The need to go off-chip several times to finish rendering a single frame was one of the main drawbacks of the old graphics accelerators; hence, NVIDIA improved the design by integrating all the major graphics processing hardware and components on a single chip called the GPU.

With the start of the GPU era, many companies started to invest in graphics processors, leading to an explosion in the development of new technologies and standards in graphics processing. Modern GPUs started utilizing a number of specialized vertex and pixel-fragment processors named shaders. Vertex processors carry out the needed computations on primitives such as points, lines, and triangles; their operations are mainly based on transforming coordinates into screen space. Pixel-fragment processors typically operate on filled primitives and interpolated parameters fed to them by a rasterizer. Because GPUs typically need to process a larger number of pixels than vertices, pixel-fragment processors normally outnumber vertex processors by a ratio of about three to one [4]. Yet graphics workloads are not always well balanced, leading to inefficiency in hardware utilization and performance. To root out these inefficiencies, the NVIDIA GeForce 8800 GPU was composed of several unified processors capable of operating on both pixels and vertices. Traditionally, vertex processors were designed for low-latency, high-precision math operations, whereas pixel-fragment processors were optimized for high-latency, lower-precision texture filtering [4]. The new unified processors were able to carry out both tasks efficiently; thus, NVIDIA's developers decided to turn the new GPU architecture into a general purpose parallel processor capable of carrying out many different general and scientific types of computation. To achieve this goal, a new programming model was also needed that would allow efficient parallel application development without any knowledge of graphics processing languages; hence, NVIDIA developed its general purpose GPU programming platform, named CUDA, integrated with the C and C++ languages. Nowadays, CUDA's power is accessible even from other programming languages such as FORTRAN and Python.

2.3. CUDA Architecture

All CUDA-enabled GPUs typically follow the same fundamental architecture; however, based on their feature sets and optimizations, such as the compute capability version, some architectural differences are observed among different generations of NVIDIA GP-GPUs. To give a general overview, the basic CUDA architecture of the GeForce 8800 GPU is reviewed here as a reference; newer generations of CUDA-enabled chips may have additional features and hardware resources.

Figure 2.1 [3]: NVIDIA GeForce 8800 Hardware Architecture

As figure 2.1 shows, a CUDA-enabled GPU contains a large number of streaming processors (SPs), also known as CUDA cores. A group of eight joint SPs, together with instruction and constant caches, two special function units (SFUs), and a dedicated shared memory, forms a streaming multiprocessor (SM). A texture/processor cluster (TPC) consists of two streaming multiprocessors in addition to geometry and SM controllers, and a dedicated texture unit containing the texture cache. All the TPCs are connected to an interconnection network linking them together and to other resources such as the raster operation processors (ROPs) and the DRAM memory banks [3]. Unlike the cores in state-of-the-art multi-core processors, all the SPs in a single SM can only process the same code at a time, making the GP-GPU architecture a promising target for data parallel tasks.
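To make this execution style concrete, the following is a minimal sketch of a data parallel CUDA kernel; the kernel name, sizes, and host code are illustrative assumptions, not code from this project. Every thread runs the same instruction stream on a different array element, which is exactly the pattern the SPs of an SM are built for.

#include <cuda_runtime.h>

// Every thread executes the same instructions on a different element,
// matching the execution model described above.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard threads past the end
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 256 threads per block (eight 32-thread warps); enough blocks to cover n.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}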

2.4. General Purpose GPU Computing, Offerings, and Trade-offs

Given their special architecture, GP-GPUs offer different opportunities and potentials in comparison with conventional commodity multi-core processors. This section aims to give an overview of the different aspects of GPU computing; before starting, though, some terms should be defined. The term host refers to the hardware on the CPU side of the system, which runs the more critical parts of the code and controls the CPU-GPU interactions. The term device refers to the GPU that is going to carry out part of the computations. CUDA-enabled GPUs use the single-instruction, multiple-threads (SIMT) architecture in scheduling and processing threads. A warp is the smallest group of threads that can be issued to a streaming multiprocessor; it groups threads that execute the same instructions, and its size is fixed at 32 threads. If a warp contains fewer than 32 threads, part of the computational resource assigned to the code remains idle; when switching threads, a complete warp of 32 threads is switched. Warps together create larger groups of threads named blocks. Unlike the warp size, the block size is arbitrary: although the block size limits are hardware dependent, CUDA-enabled GPUs support three-dimensional blocks of, typically, up to 512 threads. The largest group of computational threads in the CUDA architecture is called a grid; it can accommodate up to 65,535 blocks in each of its two dimensions, that is, over four billion blocks of threads in total.

Starting with their capabilities, it should be mentioned that GP-GPUs typically spend a small amount of die area on caches and branch predictors. The reason lies in their zero-overhead hardware-level support for thread scheduling. General purpose GPUs typically use warp switching to hide the effect of high-latency operations such as global memory accesses. In other words, whenever a group of threads executing the same instructions reaches a point where it needs to perform a high-latency operation, the scheduler automatically switches to the next available warp; hence, while one group of threads is waiting for a costly memory operation to finish, another warp takes its place and keeps the SM from going idle. This approach removes the need for large cache memories or complex branch predictors to avoid wasted processor clocks; nevertheless, it is the programmer's duty to write the code so that it always offers at least one additional ready-to-run warp to each of the used SMs, to keep the streaming multiprocessors busy while a warp performs a high-latency operation.

Besides the many-core architecture of GP-GPUs and their hardware-level support for thread scheduling, which are the most prominent architectural achievements of such chips, a number of additional architectural features allow GPUs to perform general purpose integer and floating point operations more efficiently. These facilities are: shared constant memory with a dedicated cache on each SM, 16 KB of dedicated shared memory on each SM, a dedicated texture unit including the L1 texture cache on each SM, and a dedicated instruction cache on each SM.
These features allow faster data access by decreasing the number of off-chip global memory accesses, making the code run more efficiently and with a higher level of performance. To make coding solutions to special multi-dimensional problem domains easier, GP-GPU platforms allow thread blocks of up to three dimensions to be defined; grids, in turn, can be defined as one- or two-dimensional matrices of blocks;

hence, the programmer has control over the block and grid structure of the code, based on the domain of the problem to be solved. In addition to all the mentioned structural strengths of general purpose graphics processors, they also allow graphics interoperation with CUDA functions, called kernels; thus, the result of CUDA computations can be rendered and drawn on the screen directly, without host interference.

Despite the fact that GP-GPUs adopt many exclusive facilities to achieve higher levels of performance in carrying out general computations, they lack many of the synchronization, control, and communication capabilities that state-of-the-art multi-core processors have. In fact, the only synchronization facility natively implemented in the CUDA architecture is the barrier. CUDA barriers have a simple format and are invoked by calling the function __syncthreads() within the device code; however, they do not give any options to choose the threads that are going to be synchronized: each call of the barrier synchronizes all the threads in the block. Although the barrier operations are implemented at the hardware level, synchronizing a large number of threads may put a significant amount of overhead on the system; to avoid this inefficiency, the programmer should use synchronization carefully.

As another downside of GPU computing, there is no way for threads running on different SMs to communicate efficiently, though they can use the off-chip, high-latency global memory to exchange data. In some applications, the threads need to repeatedly check one or more shared state variables. In such cases, an attempt should be made to simulate cache behavior by means of the on-chip shared memory: all the threads check their copy of the state variable located in the SM's local shared memory, and the shared memories are updated with the global memory values periodically. After each update operation, if any change in the value of the global memory is observed, all the threads checking that shared memory should roll back and reprocess that iteration of the code. State variable updates should be written directly back to the global memory using atomics.

Lack of support for recursion is another noticeable drawback of GPU computing; however, the heterogeneity of the architecture allows parts of the code, such as recursive parts, to be run sequentially on the CPU, at the cost of performance. Another limitation of the CUDA architecture is that it does not allow any type of memory allocation to be done within the device code; hence, the memory allocated for a device kernel is fixed unless it returns and asks the CPU to allocate memory of a different size. Even such a strategy has to be implemented manually by the developer; there is no language-intrinsic method for a kernel to ask for more memory.
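As a sketch of the shared-memory mirroring pattern just described: each block keeps an on-chip copy of a global state variable, refreshes it periodically, and publishes its own updates with atomics. The kernel and variable names (process_with_shared_state, work) are assumptions for illustration, not code from the project; it compiles with nvcc and is launched like the earlier example.

#include <cuda_runtime.h>

// Hedged sketch: each block mirrors a global state variable in shared
// memory, polls the cheap on-chip copy, and writes changes back atomically.
__global__ void process_with_shared_state(int *g_state, const int *work, int n)
{
    __shared__ int s_state;              // block-local mirror of the state variable
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    for (int iter = 0; iter < 100; ++iter) {
        if (threadIdx.x == 0)
            s_state = *g_state;          // periodic refresh from global memory
        __syncthreads();                 // whole block sees the refreshed value

        int observed = s_state;          // every thread reads the on-chip copy

        if (i < n && work[i] == observed)
            atomicAdd(g_state, 1);       // state change goes straight to global
                                         // memory through an atomic
        __syncthreads();                 // keep the block's iterations in step
    }
}

A thread that later observes a changed value would re-run its iteration, as described above; detecting that condition is omitted here for brevity.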

3. Discrete Event Simulation

Having a long history in gambling, Monte Carlo, a city in Monaco, has always offered ambitious gamblers the chance to show off their talents by developing new strategies for predicting the random results of the games and earning large sums of money. More experienced gamblers usually followed their instincts, while some more talented beginners tried to use statistical strategies to estimate when and how to play. Their approach to predicting the values of unknown variables inspired statisticians to name a set of computational algorithms Monte Carlo methods after the city. These methods are especially useful in the computer simulation of complex systems when the application of deterministic algorithms is infeasible [6].

According to Theodore Allen's book (2011) [5], as a derivative of Monte Carlo simulation, discrete event simulation tries to simulate the system model based on a number of independent events changing the system state at different points of simulation time. In other words, discrete event simulation uses the event-based model of the system, together with the system's initial state, to predict the system's new state at any point of time; its main goal, though, is typically to simulate the system's behavior not only at specific independent points of time but also over periods, to help administrators make more rational decisions about the system. Unlike most of the more complicated simulations, for less complex systems discrete event simulation can even be carried out manually on a number of spreadsheets [5]. This feature comes from DES's natural approach to modeling system behavior, and it has inspired the widespread use of the method in science and industry. The main application domains using DES are manufacturing, health care, call centre support services, the military, and logistics [5]. This list clearly shows the broad applicability and considerable significance of DES in different fields.

3.1. DES Architecture

There are several proposed approaches to the implementation of the simulator; reviewing the complete list of these strategies is beyond the boundaries of this discussion, so only the most widely trusted and used one, the three-phase method, is described in this section. Michael Pidd, in his book (1998) [10], describes this powerful method for discrete event simulation. As figure 3.1 shows, he suggests that after initializing the simulation with the initial state and the list of initial events, phase A should find the time of the next event and advance the simulated clock to that time. In the next phase, called phase B, all bounded (or booked) events that are scheduled unconditionally are executed, and the simulator advances to the next phase. In phase C, conditional events, which involve state changes dependent on the conditions in the model, are run. During the execution of phase C, system conditions may change; hence, if any changes are observed in this phase, the simulator should iterate and check whether any of the conditions of the remaining conditional events can be met after the changes. This loop is repeated until no change to the system state is made in the last iteration of phase C. As soon as the simulator observes no new event calls in the previous

iteration of phase C, it checks the termination conditions; if the conditions are not met, it jumps back to phase A and fetches the next event from the queue; otherwise the simulation terminates and the results are returned.

Figure 3.1 [10]: The flow chart of the three phase method for discrete event simulation
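The loop structure of the three-phase method can be summarized in a short host-side sketch. Everything here is a placeholder under assumed names (Event, State, run_bounded, run_conditionals); it illustrates the A/B/C control flow of figure 3.1, not this project's actual implementation.

#include <queue>
#include <vector>

struct Event { double time; int type; };        // placeholder bounded event
struct State { /* model variables */ };         // placeholder system state

struct Later {
    bool operator()(const Event &a, const Event &b) const { return a.time > b.time; }
};
using EventQueue = std::priority_queue<Event, std::vector<Event>, Later>;

// Stub: apply a bounded event's state changes; may book new events.
static void run_bounded(const Event &e, State &s, EventQueue &q) {}

// Stub: fire every conditional event whose condition holds; returns
// true if any state change was made in this pass.
static bool run_conditionals(double clock, State &s, EventQueue &q) { return false; }

void three_phase(EventQueue &q, State &s, double end_time)
{
    while (!q.empty()) {
        double clock = q.top().time;            // phase A: advance the clock
        if (clock > end_time) break;            // termination condition

        while (!q.empty() && q.top().time == clock) {
            Event e = q.top(); q.pop();         // phase B: run all booked events
            run_bounded(e, s, q);               // due at the current clock
        }
        while (run_conditionals(clock, s, q)) {}  // phase C: repeat until no
    }                                             // further state change occurs
}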

Regardless of the execution model that the simulator follows, to achieve more reliable results it needs to schedule random events based on the statistical distribution observed in the original system. Research shows that a large number of real-world events follow one of the main well-studied continuous distributions, namely the uniform, triangular, exponential, and normal distributions [5]. To mimic the real-world random distribution of the events based on these patterns, a sequence of uniformly distributed random numbers, as long as the number of events to be distributed, is needed. The sequence of random numbers is then passed to the distribution function to be mapped into the desired distribution of the events in the system. Even in cases where the event distribution does not satisfy any of the mentioned patterns, it can be modeled using hybrid patterns derived from the main distribution functions; hence, regardless of the event distribution pattern, the simulation always needs to be able to generate sequences of random, or pseudo-random, numbers to reproduce a list of events resembling the original system's event distribution. For this purpose, pseudo random number generators (PRNGs) are widely used in simulators; yet, as there are a number of different types of PRNGs with varying levels of performance, choosing the suitable generator algorithm has always been a trade-off in implementing such simulators.

From another point of view, to be able to serve the events separately in chronological order, the implementation of a discrete event simulator relies strongly on the implementation of ordered queues. Different types of queues have been proposed for simulation purposes; however, based on the specific structural details of the simulator and the targeted underlying hardware, the suitable queue implementation varies. A poor decision about the queue implementation method may seriously affect the simulator's performance, making the simulator impractical; consequently, the implementation of the underlying queue is one of the most challenging parts of the simulator implementation. Some of the most widely adopted approaches to queue implementation for simulation purposes are: implicit lists, linear lists, leftist trees, two lists, binomial queues, Henriksen's, pagodas, skew heaps, splay trees, and pairing heaps [7]. There are also more recent implementations of DES aiming at higher levels of performance using highly optimized versions of calendar queues and skip lists [8].

In addition to the mentioned infrastructure, an event server is needed to fetch the events from the queue and process them, applying the needed state changes to the system model after processing each event. As the core of the simulator, the event server is also in charge of managing the queue and controlling the generation of random numbers based on the simulation's needs. In fact, it is a supervisor, controlling all the simulator resources along with the simulation's flow. Having the initial state of the system and its statistical model, the server asks for the generation of an initial list of events; then it starts the simulation by picking the first event from the queue and processing it. During the processing of each event two main changes may happen: the system state may change, and new events may be created. The event server should be able to deal effectively with both so as to reflect the observed changes in the simulated system model. The server is also in charge of detecting the termination of the simulation. Naturally, discrete event simulations have the potential to run infinitely; thus, the termination conditions should be defined based on the value of simulation variables such as the simulation time, or on some specific statistics of the model. Termination can also be scheduled based on the occurrence of specific events or on system state conditions.
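As an aside on the random-number machinery mentioned above, the following sketch books an initial list of arrival events by mapping uniform draws through the inverse of the exponential CDF (inverse-transform sampling). It reuses the Event and EventQueue placeholders from the three-phase sketch; rand() merely stands in for whichever PRNG the simulator ultimately adopts.

#include <cmath>
#include <cstdlib>

// Inverse-transform sampling: a uniform draw u in [0, 1) becomes an
// exponentially distributed inter-arrival gap with the given mean rate.
static double next_arrival_gap(double rate)
{
    double u = (double)rand() / ((double)RAND_MAX + 1.0);  // uniform in [0, 1)
    return -std::log(1.0 - u) / rate;
}

// Book `count` bounded arrival events in chronological order.
static void book_initial_arrivals(EventQueue &q, int count, double rate)
{
    double t = 0.0;
    for (int i = 0; i < count; ++i) {
        t += next_arrival_gap(rate);
        q.push(Event{t, /*type=*/0});
    }
}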
As the last vital part of the DES architecture, an automated reporting mechanism is needed to gather the system model's statistics during the simulation. At the end of the simulation it should also be able to sum up the gathered statistics and derive the overall simulation statistics, forming the final simulation outcomes. There are different approaches to gathering this information; however, most simulators gather the statistics either after the occurrence of each event or after specific time intervals. Some implementations integrate this mechanism into the event

server; this integration allows seamless operation of event processing and reporting, leading to a higher level of performance.

To clarify the defined concepts, a simple customer service desk can be described as an event-based model. Assume a service desk serving a single queue of customers. Customers arrive at random points in time, though their average rate of arrival over a specified period is constant. The arrival of a customer can be treated as a bounded event, as its occurrence does not depend on any external condition. The service desk, in turn, calls and serves the clients one by one based on their order in the queue; serving a customer is a conditional event, as it can happen only if there is a customer waiting in the queue. The main possible events in a service desk model are: customer arrival, customer departure, starting the service, and finishing the service. A minimal encoding of this model is sketched below.
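The sketch below is an illustrative (not thesis-supplied) encoding of the service desk model; the bounded/conditional split mirrors the B and C phases of the three-phase method.

// Event types of the service desk model described above.
enum class DeskEvent { CustomerArrival, CustomerDeparture, StartService, FinishService };

struct DeskState {
    int waiting = 0;          // customers currently queued
    bool desk_busy = false;   // whether a customer is being served
};

// Condition test for the conditional event "start service": it can fire
// only when a customer is waiting and the desk is free. Arrivals, by
// contrast, are bounded events and are booked unconditionally.
static bool can_start_service(const DeskState &s)
{
    return s.waiting > 0 && !s.desk_busy;
}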

3.2. Parallel Discrete Event Simulation

As physical systems get bigger and more complex, it becomes substantially harder to simulate them in a reasonable amount of time with the level of resources available in sequential systems; processor clock and system memory are two major bottlenecks of sequential simulation, making the simulation of large systems nearly impossible. To avoid this problem and ensure the future of discrete event simulation, it should be parallelized. From another standpoint, parallel discrete event simulation is academically important due to the considerable amount of parallelism that its underlying concept represents; yet it is one of the most challenging tasks to parallelize a simulator so that it achieves substantial levels of performance while keeping its original level of accuracy and correctness [11]. As Fujimoto defines it, "Parallel Discrete Event Simulation (PDES) refers to the execution of a single discrete event simulation on a parallel computer" [11]. A parallel version of the simulator demands simultaneous execution of the events during the simulation; however, this parallel execution leads to a number of very challenging problems. Although it creates a strong initial impression of being an ideal target for parallelization, discrete event simulation has, paradoxically, been shown to be substantially hard to parallelize in a way that achieves remarkable speedups by effectively utilizing the parallel processors. In this section an attempt is made to define the main problems in parallelizing the simulation and to review some of the proposed solutions.

In a sequential execution of the simulation, it is naturally guaranteed that the events will be executed in chronological order based on their timestamps, while in a parallel simulation several parallel processors execute events simultaneously; hence, there is no guarantee that a state change to the system caused by event B, having a bigger timestamp than event A, does not affect the execution of event A. If this happens, a future event has affected the system's past, which is not acceptable. This group of problems is categorized as causality problems by Fujimoto in his paper (1989) [11]. Running events with different timestamps in parallel does not always lead directly to causality problems; in some cases, while two events are executing in parallel, the event with the smaller timestamp creates a new event with a timestamp bigger than that of the parent event, yet smaller than the timestamp of the other parallel event. In such a case, to guarantee the correct execution of the simulation, it must be ensured that the newly created event is run before the other parallel event; however, as the other event with the higher timestamp is already executing, it is really challenging to devise a mechanism that postpones its execution without making its dedicated computational resources idle. From another standpoint, the difficulty in dealing with causality problems is not only to solve them without decreasing processor utilization, but also to minimize the synchronization overheads. Parallel discrete event simulators normally divide the job among several processes using shared queues. To avoid causality problems, most of the existing parallelization strategies provide a mechanism to prevent the processes from having direct access to the shared state variables [13]. In general, solutions to parallelizing the simulation fall into two main categories: conservative approaches and optimistic approaches.

3.2.1. Conservative Approach

The first implementations of parallel discrete event simulation were based on the conservative approach. Conservative techniques strictly avoid the possibility of causality errors by using some mechanism to find safe points in time at which an event can be executed without facing causality problems. For these strategies to work correctly, the safe-time detection mechanism should verify that all the events with the potential to affect the current event's execution have already been executed. The main strategy of conservative approaches can be clarified by a simple example: assume that a process contains an event, say E1, with timestamp T1 being the smallest timestamp among all the events contained in the process. In the conservative approach, the process can start executing the event only if it can determine that it is impossible for it to later receive another event with a timestamp smaller than T1. Following this technique, local causality is preserved and E1 can safely be executed. In conservative strategies, processes with no safe event must block until one is available; this may lead to a deadlock, in which a circular chain of processes is formed, each waiting for the others to execute their events first. There are two main proposed solutions to the deadlock problem in the conservative approach: deadlock avoidance, and deadlock detection and recovery. Although dealing with the possibility of deadlocks is a substantial problem in conservative methods, it is not the main drawback of such strategies. While guaranteeing the correct execution of the simulation, conservative strategies may significantly affect the code's performance by causing a great portion of the total processor time to be spent waiting for possibly dependent events on other processes to finish execution. Research shows that in some cases optimizing the simulator based on application-specific knowledge may lead to a substantial performance gain [11]. In spite of their poor general performance, conservative approaches have been shown to give the same level of accuracy as the equivalent sequential methods. Such approaches involve several concepts, some of which are listed by Fujimoto as follows: deadlock avoidance, deadlock detection and recovery, synchronous operation, conservative time windows, improving lookahead, and conditional knowledge [11].
For each of these concepts he refers to one or more prior works describing their progress in practical use; nevertheless, a full description of the concepts and their significance is beyond the scope of this discussion.
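A hedged sketch of the safety test underlying conservative execution follows; Channel, the lookahead values, and the reuse of the earlier Event placeholder are assumptions for illustration. A process may execute its earliest pending event only when no input channel could still deliver anything older.

#include <algorithm>
#include <limits>
#include <vector>

// Placeholder input channel: its clock is the timestamp of the last event
// received on it, and lookahead is the minimum increment it guarantees.
struct Channel { double clock; double lookahead; };

// The process is safe up to the minimum of (clock + lookahead) over all
// input channels: nothing older than this can still arrive.
static double safe_time(const std::vector<Channel> &inputs)
{
    double t = std::numeric_limits<double>::infinity();
    for (const Channel &c : inputs)
        t = std::min(t, c.clock + c.lookahead);
    return t;
}

// E1 (the process's earliest pending event) may run only if its timestamp
// does not exceed the safe time; otherwise the process must block.
static bool can_execute(const Event &e, const std::vector<Channel> &inputs)
{
    return e.time <= safe_time(inputs);
}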

3.2.2. Optimistic Approach

Unlike conservative strategies, optimistic techniques do not impair the simulator's performance by preventing all potentially dependent events from executing in parallel. They reach this goal by employing mechanisms that detect the effect of causality conflicts on the execution of events. One of the most effective proposed methods for doing this is time warping. In this method all the processes start executing their statically assigned events in parallel, marking the state changes with the relevant event's timestamp. If an event is later observed with a timestamp smaller than that of the last state change of the system, the process in charge of executing the previous event should roll back, restore the system's state to its state before the last change, and restart processing from the conflicting event. Execution of a conflicting event may make two types of change in the system: it may change the system state, and it may also send wrong messages to other processes. To deal with the first problem, optimistic methods such as Time Warp must periodically back up the system state to be able to roll it back when necessary. In the case of wrong messages, the process should send a counteracting message to tell the recipient process to ignore the previous message; however, if the message has already been processed, the recipient process must also roll back. This chain must continue recursively until all the changes related to the execution of the conflicting event are completely rolled back.

From the performance standpoint, optimistic methods have great potential to achieve higher levels of performance than the conventional conservative strategies. Fujimoto reports that in many prior works researchers were able to achieve significant speedups, for example 56 using 64 processors in one of the cases [11]. However, optimistic approaches generally consume more memory to save the system state, leading to a trade-off between performance and memory consumption. It needs to be mentioned that, although much work has been done on optimistic methods, the space-time trade-offs in these methods are not fully understood yet [11]. From another perspective, unlike conservative approaches, optimistic strategies must be able to recover from arbitrary errors that may arise during execution; otherwise, the errors may be erased by roll-backs, causing the computations to be trapped in infinite loops. In such cases the Time Warp executive must intervene and recover the computation. It has been proposed that, to get the best possible results, the optimistic approach should have dedicated hardware support allowing it to do the roll-backs and error recoveries more efficiently.
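To make the roll-back mechanism concrete, here is a small illustrative sketch of per-process state saving and rewind in the Time Warp style; Snapshot and Process are assumed names, State and Event are the placeholders from the earlier sketches, and the sending of counteracting messages is reduced to a comment.

#include <vector>

// Snapshot of the placeholder State, taken before each executed event.
struct Snapshot { double time; State state; };

struct Process {
    State state;
    std::vector<Snapshot> history;   // periodic backups for roll-back

    void execute(const Event &e)
    {
        history.push_back({e.time, state});   // save state before the change
        // ... apply e's state changes; possibly send messages ...
    }

    // A straggler is an event whose timestamp is older than state changes
    // already made: rewind to the last state that precedes it.
    void rollback(double straggler_time)
    {
        while (!history.empty() && history.back().time >= straggler_time) {
            state = history.back().state;     // undo one event's state change
            history.pop_back();
            // counteracting messages cancelling wrongly sent ones would go here
        }
    }
};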


More information

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

NVIDIA GeForce GTX 580 GPU Datasheet

NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas

More information

GPU Computing - CUDA

GPU Computing - CUDA GPU Computing - CUDA A short overview of hardware and programing model Pierre Kestener 1 1 CEA Saclay, DSM, Maison de la Simulation Saclay, June 12, 2012 Atelier AO and GPU 1 / 37 Content Historical perspective

More information

Radeon HD 2900 and Geometry Generation. Michael Doggett

Radeon HD 2900 and Geometry Generation. Michael Doggett Radeon HD 2900 and Geometry Generation Michael Doggett September 11, 2007 Overview Introduction to 3D Graphics Radeon 2900 Starting Point Requirements Top level Pipeline Blocks from top to bottom Command

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics 22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC

More information

AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL. AMD Embedded Solutions 1

AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL. AMD Embedded Solutions 1 AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL AMD Embedded Solutions 1 Optimizing Parallel Processing Performance and Coding Efficiency with AMD APUs and Texas Multicore Technologies SequenceL Auto-parallelizing

More information

GPU Parallel Computing Architecture and CUDA Programming Model

GPU Parallel Computing Architecture and CUDA Programming Model GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel

More information

Spring 2011 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim Today, we will study typical patterns of parallel programming This is just one of the ways. Materials are based on a book by Timothy. Decompose Into tasks Original Problem

More information

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

More information

Real-Time Realistic Rendering. Michael Doggett Docent Department of Computer Science Lund university

Real-Time Realistic Rendering. Michael Doggett Docent Department of Computer Science Lund university Real-Time Realistic Rendering Michael Doggett Docent Department of Computer Science Lund university 30-5-2011 Visually realistic goal force[d] us to completely rethink the entire rendering process. Cook

More information

high-performance computing so you can move your enterprise forward

high-performance computing so you can move your enterprise forward Whether targeted to HPC or embedded applications, Pico Computing s modular and highly-scalable architecture, based on Field Programmable Gate Array (FPGA) technologies, brings orders-of-magnitude performance

More information

Recent Advances and Future Trends in Graphics Hardware. Michael Doggett Architect November 23, 2005

Recent Advances and Future Trends in Graphics Hardware. Michael Doggett Architect November 23, 2005 Recent Advances and Future Trends in Graphics Hardware Michael Doggett Architect November 23, 2005 Overview XBOX360 GPU : Xenos Rendering performance GPU architecture Unified shader Memory Export Texture/Vertex

More information

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

More information

Switched Interconnect for System-on-a-Chip Designs

Switched Interconnect for System-on-a-Chip Designs witched Interconnect for ystem-on-a-chip Designs Abstract Daniel iklund and Dake Liu Dept. of Physics and Measurement Technology Linköping University -581 83 Linköping {danwi,dake}@ifm.liu.se ith the increased

More information

There are a number of factors that increase the risk of performance problems in complex computer and software systems, such as e-commerce systems.

There are a number of factors that increase the risk of performance problems in complex computer and software systems, such as e-commerce systems. ASSURING PERFORMANCE IN E-COMMERCE SYSTEMS Dr. John Murphy Abstract Performance Assurance is a methodology that, when applied during the design and development cycle, will greatly increase the chances

More information

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.

More information

A Pattern-Based Approach to. Automated Application Performance Analysis

A Pattern-Based Approach to. Automated Application Performance Analysis A Pattern-Based Approach to Automated Application Performance Analysis Nikhil Bhatia, Shirley Moore, Felix Wolf, and Jack Dongarra Innovative Computing Laboratory University of Tennessee (bhatia, shirley,

More information

15-418 Final Project Report. Trading Platform Server

15-418 Final Project Report. Trading Platform Server 15-418 Final Project Report Yinghao Wang yinghaow@andrew.cmu.edu May 8, 214 Trading Platform Server Executive Summary The final project will implement a trading platform server that provides back-end support

More information

OpenCL Programming for the CUDA Architecture. Version 2.3

OpenCL Programming for the CUDA Architecture. Version 2.3 OpenCL Programming for the CUDA Architecture Version 2.3 8/31/2009 In general, there are multiple ways of implementing a given algorithm in OpenCL and these multiple implementations can have vastly different

More information

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology, Computer Engineering by Amol

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides

More information

evm Virtualization Platform for Windows

evm Virtualization Platform for Windows B A C K G R O U N D E R evm Virtualization Platform for Windows Host your Embedded OS and Windows on a Single Hardware Platform using Intel Virtualization Technology April, 2008 TenAsys Corporation 1400

More information

Teaching Methodology for 3D Animation

Teaching Methodology for 3D Animation Abstract The field of 3d animation has addressed design processes and work practices in the design disciplines for in recent years. There are good reasons for considering the development of systematic

More information

Architectures and Platforms

Architectures and Platforms Hardware/Software Codesign Arch&Platf. - 1 Architectures and Platforms 1. Architecture Selection: The Basic Trade-Offs 2. General Purpose vs. Application-Specific Processors 3. Processor Specialisation

More information

Hardware design for ray tracing

Hardware design for ray tracing Hardware design for ray tracing Jae-sung Yoon Introduction Realtime ray tracing performance has recently been achieved even on single CPU. [Wald et al. 2001, 2002, 2004] However, higher resolutions, complex

More information

Guided Performance Analysis with the NVIDIA Visual Profiler

Guided Performance Analysis with the NVIDIA Visual Profiler Guided Performance Analysis with the NVIDIA Visual Profiler Identifying Performance Opportunities NVIDIA Nsight Eclipse Edition (nsight) NVIDIA Visual Profiler (nvvp) nvprof command-line profiler Guided

More information

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Chapter 12: Multiprocessor Architectures Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Objective Be familiar with basic multiprocessor architectures and be able to

More information

Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture

Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture White Paper Intel Xeon processor E5 v3 family Intel Xeon Phi coprocessor family Digital Design and Engineering Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture Executive

More information

Informatica Ultra Messaging SMX Shared-Memory Transport

Informatica Ultra Messaging SMX Shared-Memory Transport White Paper Informatica Ultra Messaging SMX Shared-Memory Transport Breaking the 100-Nanosecond Latency Barrier with Benchmark-Proven Performance This document contains Confidential, Proprietary and Trade

More information

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),

More information

GPUs for Scientific Computing

GPUs for Scientific Computing GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware

More information

Design and Implementation of the Heterogeneous Multikernel Operating System

Design and Implementation of the Heterogeneous Multikernel Operating System 223 Design and Implementation of the Heterogeneous Multikernel Operating System Yauhen KLIMIANKOU Department of Computer Systems and Networks, Belarusian State University of Informatics and Radioelectronics,

More information

The Methodology of Application Development for Hybrid Architectures

The Methodology of Application Development for Hybrid Architectures Computer Technology and Application 4 (2013) 543-547 D DAVID PUBLISHING The Methodology of Application Development for Hybrid Architectures Vladimir Orekhov, Alexander Bogdanov and Vladimir Gaiduchok Department

More information

GPGPU accelerated Computational Fluid Dynamics

GPGPU accelerated Computational Fluid Dynamics t e c h n i s c h e u n i v e r s i t ä t b r a u n s c h w e i g Carl-Friedrich Gauß Faculty GPGPU accelerated Computational Fluid Dynamics 5th GACM Colloquium on Computational Mechanics Hamburg Institute

More information

The Fastest, Most Efficient HPC Architecture Ever Built

The Fastest, Most Efficient HPC Architecture Ever Built Whitepaper NVIDIA s Next Generation TM CUDA Compute Architecture: TM Kepler GK110 The Fastest, Most Efficient HPC Architecture Ever Built V1.0 Table of Contents Kepler GK110 The Next Generation GPU Computing

More information

Driving force. What future software needs. Potential research topics

Driving force. What future software needs. Potential research topics Improving Software Robustness and Efficiency Driving force Processor core clock speed reach practical limit ~4GHz (power issue) Percentage of sustainable # of active transistors decrease; Increase in #

More information

Evaluation of CUDA Fortran for the CFD code Strukti

Evaluation of CUDA Fortran for the CFD code Strukti Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center

More information

Writing Applications for the GPU Using the RapidMind Development Platform

Writing Applications for the GPU Using the RapidMind Development Platform Writing Applications for the GPU Using the RapidMind Development Platform Contents Introduction... 1 Graphics Processing Units... 1 RapidMind Development Platform... 2 Writing RapidMind Enabled Applications...

More information

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION CHAPTER 1 INTRODUCTION 1.1 Background The command over cloud computing infrastructure is increasing with the growing demands of IT infrastructure during the changed business scenario of the 21 st Century.

More information

Supercomputing applied to Parallel Network Simulation

Supercomputing applied to Parallel Network Simulation Supercomputing applied to Parallel Network Simulation David Cortés-Polo Research, Technological Innovation and Supercomputing Centre of Extremadura, CenitS. Trujillo, Spain david.cortes@cenits.es Summary

More information

VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU

VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU Martin Straka Doctoral Degree Programme (1), FIT BUT E-mail: strakam@fit.vutbr.cz Supervised by: Zdeněk Kotásek E-mail: kotasek@fit.vutbr.cz

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

ultra fast SOM using CUDA

ultra fast SOM using CUDA ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A

More information

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department

More information

Interactive Level-Set Deformation On the GPU

Interactive Level-Set Deformation On the GPU Interactive Level-Set Deformation On the GPU Institute for Data Analysis and Visualization University of California, Davis Problem Statement Goal Interactive system for deformable surface manipulation

More information

Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures

Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures Chapter 18: Database System Architectures Centralized Systems! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types! Run on a single computer system and do

More information

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007 Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

More information

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS Performance 1(6) HIGH PERFORMANCE CONSULTING COURSE OFFERINGS LEARN TO TAKE ADVANTAGE OF POWERFUL GPU BASED ACCELERATOR TECHNOLOGY TODAY 2006 2013 Nvidia GPUs Intel CPUs CONTENTS Acronyms and Terminology...

More information

Using Power to Improve C Programming Education

Using Power to Improve C Programming Education Using Power to Improve C Programming Education Jonas Skeppstedt Department of Computer Science Lund University Lund, Sweden jonas.skeppstedt@cs.lth.se jonasskeppstedt.net jonasskeppstedt.net jonas.skeppstedt@cs.lth.se

More information

Silverlight for Windows Embedded Graphics and Rendering Pipeline 1

Silverlight for Windows Embedded Graphics and Rendering Pipeline 1 Silverlight for Windows Embedded Graphics and Rendering Pipeline 1 Silverlight for Windows Embedded Graphics and Rendering Pipeline Windows Embedded Compact 7 Technical Article Writers: David Franklin,

More information

GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications

GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications Harris Z. Zebrowitz Lockheed Martin Advanced Technology Laboratories 1 Federal Street Camden, NJ 08102

More information

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute

More information

Data Center and Cloud Computing Market Landscape and Challenges

Data Center and Cloud Computing Market Landscape and Challenges Data Center and Cloud Computing Market Landscape and Challenges Manoj Roge, Director Wired & Data Center Solutions Xilinx Inc. #OpenPOWERSummit 1 Outline Data Center Trends Technology Challenges Solution

More information

Parallel Computing with MATLAB

Parallel Computing with MATLAB Parallel Computing with MATLAB Scott Benway Senior Account Manager Jiro Doke, Ph.D. Senior Application Engineer 2013 The MathWorks, Inc. 1 Acceleration Strategies Applied in MATLAB Approach Options Best

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

More information

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,

More information

Introduction to GPU Architecture

Introduction to GPU Architecture Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three

More information

An Introduction to Parallel Computing/ Programming

An Introduction to Parallel Computing/ Programming An Introduction to Parallel Computing/ Programming Vicky Papadopoulou Lesta Astrophysics and High Performance Computing Research Group (http://ahpc.euc.ac.cy) Dep. of Computer Science and Engineering European

More information

Efficient Parallel Graph Exploration on Multi-Core CPU and GPU

Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Pervasive Parallelism Laboratory Stanford University Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun Graph and its Applications Graph Fundamental

More information

EMC Business Continuity for Microsoft SQL Server Enabled by SQL DB Mirroring Celerra Unified Storage Platforms Using iscsi

EMC Business Continuity for Microsoft SQL Server Enabled by SQL DB Mirroring Celerra Unified Storage Platforms Using iscsi EMC Business Continuity for Microsoft SQL Server Enabled by SQL DB Mirroring Applied Technology Abstract Microsoft SQL Server includes a powerful capability to protect active databases by using either

More information

Parallel Firewalls on General-Purpose Graphics Processing Units

Parallel Firewalls on General-Purpose Graphics Processing Units Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering

More information

CUBE-MAP DATA STRUCTURE FOR INTERACTIVE GLOBAL ILLUMINATION COMPUTATION IN DYNAMIC DIFFUSE ENVIRONMENTS

CUBE-MAP DATA STRUCTURE FOR INTERACTIVE GLOBAL ILLUMINATION COMPUTATION IN DYNAMIC DIFFUSE ENVIRONMENTS ICCVG 2002 Zakopane, 25-29 Sept. 2002 Rafal Mantiuk (1,2), Sumanta Pattanaik (1), Karol Myszkowski (3) (1) University of Central Florida, USA, (2) Technical University of Szczecin, Poland, (3) Max- Planck-Institut

More information

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008 Radeon GPU Architecture and the series Michael Doggett Graphics Architecture Group June 27, 2008 Graphics Processing Units Introduction GPU research 2 GPU Evolution GPU started as a triangle rasterizer

More information

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING Sonam Mahajan 1 and Maninder Singh 2 1 Department of Computer Science Engineering, Thapar University, Patiala, India 2 Department of Computer Science Engineering,

More information

Software Synthesis from Dataflow Models for G and LabVIEW

Software Synthesis from Dataflow Models for G and LabVIEW Presented at the Thirty-second Annual Asilomar Conference on Signals, Systems, and Computers. Pacific Grove, California, U.S.A., November 1998 Software Synthesis from Dataflow Models for G and LabVIEW

More information

PARALLELS CLOUD STORAGE

PARALLELS CLOUD STORAGE PARALLELS CLOUD STORAGE Performance Benchmark Results 1 Table of Contents Executive Summary... Error! Bookmark not defined. Architecture Overview... 3 Key Features... 5 No Special Hardware Requirements...

More information