Developing Applications on GPUs: Discrete-Event Simulator Implementation


A thesis submitted to The University of Manchester for the degree of Master of Science in the Faculty of Engineering and Physical Sciences

2011

By Farhad Kajabadi
School of Computer Science

Table of Contents

1. Introduction
   1.1. Aims and Objectives
   1.2. Structure of the Report
2. GPU Computing
   2.1. Heterogeneous Massively Parallel Architectures and Data Parallel Tasks
   2.2. The History of GPU Computing
   2.3. CUDA Architecture
   2.4. General Purpose GPU Computing, Offerings, and Trade-offs
3. Discrete Event Simulation
   3.1. DES Architecture
   3.2. Parallel Discrete Event Simulation
      3.2.1. Conservative Approach
      3.2.2. Optimistic Approach
4. PDES Implementation on GP-GPU
   4.1. Needed Infrastructures of PDES against GP-GPUs' Capabilities
      4.1.1. Computational Capabilities
      4.1.2. Synchronization and Communication Methods
      4.1.3. Data Types
   4.2. Performance Trade-offs
   4.3. Expectations of the Results
5. Research Method
   5.1. Programming Model
   5.2. Development Platform
   5.3. Correctness Verification
   5.4. Performance Evaluation
   5.5. Project Plan
6. Summary
Appendix A
Appendix B
References

Table of Figures

Figure 2.1: NVIDIA GeForce 8800 Hardware Architecture
Figure 3.1: The flow chart of the three phase method for discrete event simulation

List of Abbreviations

ASIC    Application Specific Integrated Circuit
API     Application Programming Interface
CPU     Central Processing Unit
CUDA    Compute Unified Device Architecture
DES     Discrete Event Simulator
DSP     Digital Signal Processor
DX      DirectX
FPGA    Field Programmable Gate Array
GPU     Graphics Processing Unit
GP-GPU  General Purpose Graphics Processing Unit
HPC     High Performance Computing
ILP     Instruction Level Parallelism
PDES    Parallel Discrete Event Simulator
PRNG    Pseudo Random Number Generator
ROP     Raster Operation Processor
SFU     Special Function Unit
SIMT    Single Instruction, Multiple Threads
SM      Streaming Multiprocessor
SP      Streaming Processor
TPC     Texture/Processor Cluster

Abstract

With the development of the general purpose graphics processing unit (GP-GPU) in 2006, GPUs have evolved into flexible and powerful parallel processors, delivering higher levels of parallel performance than conventional state-of-the-art multi-core processors at lower cost. GP-GPUs are widely used for general purpose and scientific computations; however, given their synchronous organization and their single-instruction, multiple-thread execution model, they are not the primary target for task parallel applications.

Given the growing demand for system simulation in recent years, more efficient simulation methods such as discrete event simulation (DES) have become the target of much research. Discrete event simulation is widely used in science, the military, and industry [5] as a powerful method of determining a system's characteristics and the resources it needs. The method is based on processing randomly occurring, independent, instantaneous, asynchronous events that change the system state.

Most of the research on carrying out simulations on GPUs has been based on system models that use regular time increments. These increments can be accomplished either by adding constant time deltas, as in numerical integration, or by event scheduling based on time deltas, as used in discrete event approximations of continuous-time systems [14]. However, due to its asynchronous nature, DES does not map efficiently onto the GP-GPU architecture; as a result, a new strategy must be devised to map the irregular time advances of discrete event models onto the GPU's many-core architecture efficiently. To achieve this goal, we aim to develop a new time approximation mechanism using NVIDIA's compute unified device architecture (CUDA) platform.

The evaluation stage accounts for verifying the code's correctness and measuring its performance. After the code is developed, its correctness and performance will be tested against several independent parallel implementations of discrete event simulators. Also, a CPU-only version of the code will be developed and compared to the GPU version. Both the correctness verification tests and the performance benchmarks will be carried out on different scenarios based on different system models, so that the obtained results and drawn conclusions are reliable. Although GP-GPUs are not naturally compatible with the asynchronous nature of DES computations, we expect to be able to exploit their massively parallel architecture to achieve satisfactory speed-ups; however, a very substantial level of performance is not expected.

Declaration

No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright

I. The author of this thesis (including any appendices and/or schedules to this thesis) owns any copyright in it (the "Copyright"), and s/he has given The University of Manchester the right to use such Copyright for any administrative, promotional, educational and/or teaching purposes.

II. Copies of this thesis, either in full or in extracts, may be made only in accordance with the regulations of the John Rylands University Library of Manchester. Details of these regulations may be obtained from the Librarian. This page must form part of any such copies made.

III. The ownership of any patents, designs, trademarks and any and all other intellectual property rights except for the Copyright (the "Intellectual Property Rights") and any reproductions of copyright works, for example graphs and tables ("Reproductions"), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property Rights and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property Rights and/or Reproductions.

IV. Further information on the conditions under which disclosure, publication and exploitation of this thesis, the Copyright and any Intellectual Property Rights and/or Reproductions described in it may take place is available from the Head of the School of Computer Science (or the Vice-President).

Acknowledgements

I would like to take this opportunity to thank Dr. Christopher Kirkham for supervising and guiding me, and for providing substantial and valuable information without which this project could not have been accomplished. I would like to thank all those who contributed to this research. Special thanks to Dr. Mikel Lujan for providing me with the needed information and additional useful references. Many thanks go to the various people I met at the University of Manchester who helped me develop my skills. I gratefully thank my parents for their unconditional support and love throughout my life.

1. Introduction

As soon as processor performance started to be gained by adding multiple cores to a single chip, a new trend toward developing simpler, yet more powerful, parallel architectures started to emerge. After the Microsoft DirectX 10 specification forced graphics processing unit (GPU) developers to build chips with unified graphics processors capable of both low-latency, high-precision and high-latency, lower-precision integer and floating point operations, manufacturers came up with the idea of using the newly developed chips for general computations [3]. To make their unified architecture exploitable by general programmers, they needed a programming model that hides the graphics processing concepts; hence, they started to develop a new platform accessible through the well-known C and C++ languages [1, 2]. The simplicity of the new programming model and its familiar C syntax helped the platform find its position in the competition of parallel architectures very quickly. The increasing number of companies and applications making use of the new platform caused major hardware manufacturers to invest more in the field; as a consequence, the general code execution capabilities of GPUs grew quickly, to the point that new architectures allow the GPU and CPU to be integrated on a single chip. Integration of the GPU-CPU heterogeneous parallel architecture on a single chip allows the graphics processor to access main memory directly through DMA without CPU intervention, solving one of the main deficiencies of GPU computing, namely indirect memory access.

The direction of development of general purpose graphics processing units (GP-GPUs) and their widespread use in different software domains show that they are moving toward unification with central processors. Furthermore, next generation operating systems are trying to exploit their power for heavy data parallel tasks, and in the domain of scientific computing they have become significantly successful and widely used [2]. All these attributes lead GPU computing to be referred to as one of the most significant building blocks of the future of parallel computing.

Nearly five decades ago, computer assisted design and modeling techniques started to appear; since then, such techniques have significantly changed the process of designing and testing new products before even the first prototype is made. Given the heavy costs of producing and testing a large number of real-world models, for most products, such as electronics, it is not efficient to carry out the entire verification process using prototypes. Also, in some fields, like the aerospace industry, it is not feasible to produce real-world prototypes for different tests, as the real-world tests might be very dangerous, sometimes even costing human lives. To design efficient systems and complex structures, besides verifying the produced design, computer simulation is widely trusted and used. Its broad application in different industries makes discrete event simulation one of the most widely used types of computer simulation and puts it at the centre of attention in much research. However, as the systems to be modeled get more complex, their simulation needs more time and memory to complete. Because the resources in sequential computers are physically limited [12], parallelizing the simulator is the main way to reduce simulation time.

1.1. Aims and Objectives

In spite of the widespread use of GP-GPUs in different scientific problem domains, not much work has been done on implementing parallel discrete event simulators (PDES) on this platform; hence, this project aims to fill part of the gap by implementing PDES on NVIDIA's widely used GP-GPU platform, namely the Compute Unified Device Architecture (CUDA). The project starts with understanding the CUDA architecture, its programming model, PDES architecture, and possible implementation approaches; the research then continues with the development of the fundamental structures and data types that a PDES needs. In the next stage, the PDES will be developed and tested. Finally, a parallel CPU version of the PDES will be developed and compared with the GPU version to verify the performance and correctness achievements. Besides the developed CPU version of the simulator, it will also be compared to a number of other implementations so that more reliable conclusions can be drawn.

1.2. Structure of the Report

Chapter 2 reviews different aspects of GPU computing as the underlying architecture on which the project is going to be developed. Initially, it describes the movement toward parallel computing and the reasons behind it. It continues by depicting heterogeneous parallel architectures, especially the CPU-GPU hybrid platform, and its potential to carry out heavy data parallel tasks efficiently. A short history of GPU development and the technologies GPUs have exploited is surveyed to give an overview of their evolution to the current state. Finally, their strengths and weaknesses in performing general purpose computations are reviewed.

Chapter 3 is composed of general background information related to the structure and implementation of DES and PDES. To illustrate the different aspects of DES implementation and its purposes in science and industry, the chapter starts with a short review. The next section describes the parallelization of DES and the different possible approaches. The last section reviews those parallelization approaches in terms of their main limitations and trade-offs, and discusses some of the surveyed solutions to the main problems.

Chapter 4 creates a logical relation between the two previous chapters so as to draw conclusions about the needed infrastructure, the possible trade-offs and bottlenecks, and the expected results. The main difference from the previous chapter is that it reviews the implementation of PDES specifically on the GPU; hence, there are different limitations, trade-offs, and solutions in comparison with the previous chapter.

Chapter 5 defines the research methods that are going to be adopted for the project so as to derive road-map strategies for the tasks to be carried out. The main tasks are as follows: (a) choosing the suitable development platform, (b) developing the code, (c) correctness verification, and (d) performance evaluation. This chapter also describes the plan on which the tasks are scheduled.

2. GPU Computing

For more than a decade, from the start of the 1990s until the mid-2000s, the trend was to increase a processor's performance by raising its clock rate and reaching a higher level of instruction level parallelism in the processor design; however, this trend stopped after the market-leading processor manufacturers faced several serious physical limitations. As the main problem, the power consumption, and as a result the heat dissipation, of a chip was shown to grow super-linearly as the clock rate increased beyond some point around 4 GHz; furthermore, the complexity and cost of designing chips that achieve higher levels of instruction level parallelism on a single core started to grow exponentially. From another point of view, memory bandwidth, one of the most important factors affecting a computer's performance, was limited; as a consequence, memories were not able to keep up with processor performance growth. All the mentioned barriers, also referred to as the power wall, the ILP wall, and the memory wall, forced computer manufacturers to move to multi-core processor designs to be able to keep increasing their products' performance efficiently.

In the multi-core architecture, the increasing number of transistors available on a given area of silicon, as described by Moore's law, is used to add more cores to a single chip. In this approach each core has a less complex architecture, as opposed to the previous trend of putting a single core with a much more complex design on the same area. Benefiting from the new multi-core architecture, and using multi-channel memories, computer manufacturers are able to pass all the mentioned barriers; however, the new parallel architecture introduces many new problems and bottlenecks for software developers. Unlike sequential execution, parallel execution of code does not natively guarantee the correct order of execution of the instructions, as the instructions run on several independent parallel cores simultaneously. To solve this problem, parallel units of execution must be synchronized at certain points of execution to make sure that the code runs correctly. Synchronization can be significantly costly in terms of execution time; thus, it is a big trade-off to develop correct code with the least possible synchronization overhead. Also, by moving toward parallel computing, not all the needed data is local anymore; hence, it is very important to design the code so that it keeps data accesses as local as possible. Apart from these trade-offs there are a number of other concerns with parallel code performance that are outside the borders of this discussion; among them are load imbalance, deadlock and livelock, and scheduling overheads. In addition, it is important to choose the correct sequential version of an algorithm to parallelize, as the best sequential version is not always the best candidate for parallelization.

The rise of the multi-core era has made an application's performance more dependent on the code's design and its level of success in processor utilization; consequently, the performance, which comes from the underlying hardware, is now more determined by the software.

2.1. Heterogeneous Massively Parallel Architectures and Data Parallel Tasks

Facing the mentioned physical barriers to increasing single-core processor performance, the high performance computing (HPC) community developed new strategies to augment Moore's law and started exploring new techniques to tackle the problems of conventional systems. Besides the innovation of multi-core processors, computer developers also invested in the development of specialized processors aiming to give more performance where conventional general purpose multi-core processors had been shown to perform poorly. The development of the new specialized processors allowed the HPC community to design new heterogeneous computer architectures, in which conventional and specialized processors are integrated to work cooperatively [12]. In other words, the need for heterogeneous platforms comes from the demand for high performance, highly responsive systems capable of interacting with other environments more efficiently. Systems based on heterogeneous architectures have been shown to achieve better performance characteristics while in many cases consuming less power and having lower total development costs. These attributes attracted a lot of scientific interest; in fact, in the long term, "heterogeneous computing holds tremendous potential for accelerating applications beyond what one would expect from Moore's Law, while overcoming many of the barriers that can limit conventional architectures" [12].

In general, heterogeneous architectures are composed of processors with different instruction set architectures, different application binary interfaces, different APIs, or different memory interfaces. Normally, these processors include general purpose processors, covering single-core and multi-core processors; special purpose processors such as DSPs and GPUs; custom logic accelerators such as ASICs and FPGAs; and specialized on-chip computational units such as encryption co-processors and on-chip network processors.

Since the development of the first GPU, the GPU-CPU heterogeneous architecture has been widely used to calculate motion and render high resolution graphics. By their nature, GPUs have a large number of computational cores processing different types of graphics data at lower clock rates than conventional commodity multi-core processors; hence, they expose a higher level of parallel performance while consuming less power due to their lower core clock rate. Nevertheless, their potential as general-purpose massively parallel processors had never been widely exploited until the development of general purpose GPUs, as doing general computations with traditional GPUs required comprehensive knowledge of at least one of the graphical programming languages [2, 3]. However, with the development of GP-GPUs, a new programming model emerged that allows the programmer to write code performing general computations on GPUs using extensions to conventional C and C++ compilers. The new programming model brought power and simplicity to GPU programming and allowed software developers to build complex applications on GPUs without knowledge of OpenGL or DirectX programming.
Given the proven capability of GP-GPUs in providing high levels of parallel performance, their application in science and industry for solving complex problems has grown significantly in recent years. In spite of the fact that GP-GPUs are also capable of processing task parallel jobs, the synchronous nature of their underlying

hardware makes them good targets for data parallel tasks involving the same computations on large sets of different data. In other words, GP-GPUs are natively designed and optimized for heavy data parallel tasks, such as rendering high resolution graphics. However, they are also capable of running multi-stream parallel tasks and task parallel applications, albeit with less potential performance.

2.2. The History of GPU Computing

The first GPU, named the GeForce 256, was developed by NVIDIA in 1999. In fact, the term GPU was defined by NVIDIA for the architecture that integrates transforming, lighting, triangle clipping, rendering, and other graphics processing engines on a single chip [4]. Prior to the development of the first GPU, for more than a decade, the term graphics accelerator was used for 2D/3D graphics processing chipsets. Graphics accelerators used several different task-specific graphics processing chips together to render the final image. The separation of the different graphics processing pipelines made it necessary to go off-chip after each stage of the process, feeding the processed data into the next chip. The need to go off-chip several times to finish rendering a single frame was one of the main drawbacks of the old graphics accelerators; hence, NVIDIA improved the design by integrating all the major graphics processing hardware and components on a single chip called the GPU.

With the start of the GPU era, many companies started to invest in graphics processors, leading to an explosion in the development of new technologies and standards in graphics processing. Modern GPUs started utilizing a number of specialized vertex and pixel-fragment processors named shaders. Vertex processors carry out the needed computations on primitives such as points, lines, and triangles; their operations are mainly based on transforming coordinates into screen space. Pixel-fragment processors typically operate on filled primitives and interpolated parameters fed to them by a rasterizer. Because GPUs typically need to process a larger number of pixels than vertices, pixel-fragment processors normally outnumber vertex processors by a ratio of about three to one [4]. Yet graphics workloads are not always well balanced, leading to inefficiency in hardware utilization and performance. To root out these inefficiencies, the NVIDIA GeForce 8800 GPU was composed of several unified processors capable of operating on both pixels and vertices. Traditionally, vertex processors were designed for low-latency, high-precision math operations, whereas pixel-fragment processors were optimized for high-latency, lower-precision texture filtering [4]. The new unified processors were able to carry out both tasks efficiently; thus, NVIDIA's developers decided to turn the new GPU architecture into a general purpose parallel processor capable of carrying out many different general and scientific types of computation. To achieve this goal, a new programming model was also needed that would allow efficient parallel application development without any knowledge of graphics processing languages; hence, NVIDIA developed its general purpose GPU programming platform, named CUDA, integrated with the C and C++ languages. Nowadays, CUDA's power is accessible even from other programming languages such as FORTRAN and Python.

2.3. CUDA Architecture

All CUDA-enabled GPUs typically follow the same fundamental architecture; however, based on their feature sets and optimizations, such as the compute capability version, some architectural differences are observed among different generations of NVIDIA GP-GPUs. To give a general overview, the basic CUDA architecture of the GeForce 8800 GPU is reviewed here as a reference; newer generations of CUDA-enabled chips may have additional features and hardware resources.

Figure 2.1 [3]: NVIDIA GeForce 8800 Hardware Architecture

As figure 2.1 shows, a CUDA-enabled GPU contains a large number of streaming processors (SPs), also known as CUDA cores. A group of eight joint SPs, together with instruction and constant caches, two special function units (SFUs), and a dedicated shared memory, forms a streaming multiprocessor (SM). A texture/processor cluster (TPC) consists of two streaming multiprocessors in addition to geometry and SM controllers, and a dedicated texture unit containing the texture cache. All the TPCs are connected to an interconnection network linking them together and to other resources such as the raster operation processors (ROPs) and the DRAM memory banks [3]. Unlike the cores in state-of-the-art multi-core processors, all the SPs in a single SM can only process the same code at a time, making the GP-GPU architecture a promising target for data parallel tasks.
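To make this execution style concrete, the following is a minimal sketch of a data parallel CUDA kernel; the kernel name, sizes, and host code are illustrative assumptions, not code from this project. Every thread runs the same instruction stream on a different array element, which is exactly the pattern the SPs of an SM are built for.

#include <cuda_runtime.h>

// Every thread executes the same instructions on a different element,
// matching the execution model described above.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard threads past the end
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 256 threads per block (eight 32-thread warps); enough blocks to cover n.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}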

2.4. General Purpose GPU Computing, Offerings, and Trade-offs

Given their special architecture, GP-GPUs offer different opportunities and potentials in comparison with conventional commodity multi-core processors. This section aims to give an overview of the different aspects of GPU computing; before starting, though, some terms should be defined. The term host refers to the hardware on the CPU side of the system, which runs the more critical parts of the code and controls the CPU-GPU interactions. The term device refers to the GPU that is going to carry out part of the computations. CUDA-enabled GPUs use the single-instruction, multiple-threads (SIMT) architecture in scheduling and processing threads. A warp is the smallest group of threads that can be issued to a streaming multiprocessor; it groups threads that execute the same instructions, and its size is fixed at 32 threads. If a warp contains fewer than 32 threads, part of the computational resource assigned to the code remains idle; when switching threads, a complete warp of 32 threads is switched. Warps together create larger groups of threads named blocks. Unlike the warp size, the block size is arbitrary: although the block size limits are hardware dependent, CUDA-enabled GPUs support three-dimensional blocks of, typically, up to 512 threads. The largest group of computational threads in the CUDA architecture is called a grid; it can accommodate up to 65,535 blocks in each of its two dimensions, that is, over four billion blocks of threads in total.

Starting with their capabilities, it should be mentioned that GP-GPUs typically spend a small amount of die area on caches and branch predictors. The reason lies in their zero-overhead hardware-level support for thread scheduling. General purpose GPUs typically use warp switching to hide the effect of high-latency operations such as global memory accesses. In other words, whenever a group of threads executing the same instructions reaches a point where it needs to perform a high-latency operation, the scheduler automatically switches to the next available warp; hence, while one group of threads is waiting for a costly memory operation to finish, another warp takes its place and keeps the SM from going idle. This approach removes the need for large cache memories or complex branch predictors to avoid wasted processor clocks; nevertheless, it is the programmer's duty to write the code so that it always offers at least one additional ready-to-run warp to each of the used SMs, to keep the streaming multiprocessors busy while a warp performs a high-latency operation.

Besides the many-core architecture of GP-GPUs and their hardware-level support for thread scheduling, which are the most prominent architectural achievements of such chips, a number of additional architectural features allow GPUs to perform general purpose integer and floating point operations more efficiently. These facilities are: shared constant memory with a dedicated cache on each SM, 16 KB of dedicated shared memory on each SM, a dedicated texture unit including the L1 texture cache on each SM, and a dedicated instruction cache on each SM.
These features allow faster data access by decreasing the number of off-chip global memory accesses, making the code run more efficiently and with a higher level of performance. To make coding solutions to special multi-dimensional problem domains easier, GP-GPU platforms allow thread blocks of up to three dimensions to be defined; grids, in turn, can be defined as one- or two-dimensional matrices of blocks;

hence, the programmer has control over the block and grid structure of the code, based on the domain of the problem to be solved. In addition to all the mentioned structural strengths of general purpose graphics processors, they also allow graphics interoperation with CUDA functions, called kernels; thus, the result of CUDA computations can be rendered and drawn on the screen directly, without host interference.

Despite the fact that GP-GPUs adopt many exclusive facilities to achieve higher levels of performance in carrying out general computations, they lack many of the synchronization, control, and communication capabilities that state-of-the-art multi-core processors have. In fact, the only synchronization facility natively implemented in the CUDA architecture is the barrier. CUDA barriers have a simple format and are invoked by calling the function __syncthreads() within the device code; however, they do not give any options to choose the threads that are going to be synchronized: each call of the barrier synchronizes all the threads in the block. Although the barrier operations are implemented at the hardware level, synchronizing a large number of threads may put a significant amount of overhead on the system; to avoid this inefficiency, the programmer should use synchronization carefully.

As another downside of GPU computing, there is no way for threads running on different SMs to communicate efficiently, though they can use the off-chip, high-latency global memory to exchange data. In some applications, the threads need to repeatedly check one or more shared state variables. In such cases, an attempt should be made to simulate cache behavior by means of the on-chip shared memory: all the threads check their copy of the state variable located in the SM's local shared memory, and the shared memories are updated with the global memory values periodically. After each update operation, if any change in the value of the global memory is observed, all the threads checking that shared memory should roll back and reprocess that iteration of the code. State variable updates should be written directly back to the global memory using atomics.

Lack of support for recursion is another noticeable drawback of GPU computing; however, the heterogeneity of the architecture allows parts of the code, such as recursive parts, to be run sequentially on the CPU, at the cost of performance. Another limitation of the CUDA architecture is that it does not allow any type of memory allocation to be done within the device code; hence, the memory allocated for a device kernel is fixed unless it returns and asks the CPU to allocate memory of a different size. Even such a strategy has to be implemented manually by the developer; there is no language-intrinsic method for a kernel to ask for more memory.
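As a sketch of the shared-memory mirroring pattern just described: each block keeps an on-chip copy of a global state variable, refreshes it periodically, and publishes its own updates with atomics. The kernel and variable names (process_with_shared_state, work) are assumptions for illustration, not code from the project; it compiles with nvcc and is launched like the earlier example.

#include <cuda_runtime.h>

// Hedged sketch: each block mirrors a global state variable in shared
// memory, polls the cheap on-chip copy, and writes changes back atomically.
__global__ void process_with_shared_state(int *g_state, const int *work, int n)
{
    __shared__ int s_state;              // block-local mirror of the state variable
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    for (int iter = 0; iter < 100; ++iter) {
        if (threadIdx.x == 0)
            s_state = *g_state;          // periodic refresh from global memory
        __syncthreads();                 // whole block sees the refreshed value

        int observed = s_state;          // every thread reads the on-chip copy

        if (i < n && work[i] == observed)
            atomicAdd(g_state, 1);       // state change goes straight to global
                                         // memory through an atomic
        __syncthreads();                 // keep the block's iterations in step
    }
}

A thread that later observes a changed value would re-run its iteration, as described above; detecting that condition is omitted here for brevity.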

3. Discrete Event Simulation

Having a long history in gambling, Monte Carlo, a city in Monaco, has always offered ambitious gamblers the chance to show off their talents by developing new strategies for predicting the random results of the games and earning large sums of money. More experienced gamblers usually followed their instincts, while some more talented beginners tried to use statistical strategies to estimate when and how to play. Their approach to predicting the values of unknown variables inspired statisticians to name a set of computational algorithms Monte Carlo methods after the city. These methods are especially useful in the computer simulation of complex systems when the application of deterministic algorithms is infeasible [6].

According to Theodore Allen's book (2011) [5], as a derivative of Monte Carlo simulation, discrete event simulation tries to simulate the system model based on a number of independent events changing the system state at different points of simulation time. In other words, discrete event simulation uses the event-based model of the system, together with the system's initial state, to predict the system's new state at any point of time; its main goal, though, is typically to simulate the system's behavior not only at specific independent points of time but also over periods, to help administrators make more rational decisions about the system. Unlike most of the more complicated simulations, for less complex systems discrete event simulation can even be carried out manually on a number of spreadsheets [5]. This feature comes from DES's natural approach to modeling system behavior, and it has inspired the widespread use of the method in science and industry. The main application domains using DES are manufacturing, health care, call centre support services, the military, and logistics [5]. This list clearly shows the broad applicability and considerable significance of DES in different fields.

3.1. DES Architecture

There are several proposed approaches to the implementation of the simulator; reviewing the complete list of these strategies is beyond the boundaries of this discussion, so only the most widely trusted and used one, the three-phase method, is described in this section. Michael Pidd, in his book (1998) [10], describes this powerful method for discrete event simulation. As figure 3.1 shows, he suggests that after initializing the simulation with the initial state and the list of initial events, phase A should find the time of the next event and advance the simulated clock to that time. In the next phase, called phase B, all bounded (or booked) events that are scheduled unconditionally are executed, and the simulator advances to the next phase. In phase C, conditional events, which involve state changes dependent on the conditions in the model, are run. During the execution of phase C, system conditions may change; hence, if any changes are observed in this phase, the simulator should iterate and check whether any of the conditions of the remaining conditional events can be met after the changes. This loop is repeated until no change to the system state is made in the last iteration of phase C. As soon as the simulator observes no new event calls in the previous

iteration of phase C, it checks the termination conditions; if the conditions are not met, it jumps back to phase A and fetches the next event from the queue; otherwise the simulation terminates and the results are returned.

Figure 3.1 [10]: The flow chart of the three phase method for discrete event simulation
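The loop structure of the three-phase method can be summarized in a short host-side sketch. Everything here is a placeholder under assumed names (Event, State, run_bounded, run_conditionals); it illustrates the A/B/C control flow of figure 3.1, not this project's actual implementation.

#include <queue>
#include <vector>

struct Event { double time; int type; };        // placeholder bounded event
struct State { /* model variables */ };         // placeholder system state

struct Later {
    bool operator()(const Event &a, const Event &b) const { return a.time > b.time; }
};
using EventQueue = std::priority_queue<Event, std::vector<Event>, Later>;

// Stub: apply a bounded event's state changes; may book new events.
static void run_bounded(const Event &e, State &s, EventQueue &q) {}

// Stub: fire every conditional event whose condition holds; returns
// true if any state change was made in this pass.
static bool run_conditionals(double clock, State &s, EventQueue &q) { return false; }

void three_phase(EventQueue &q, State &s, double end_time)
{
    while (!q.empty()) {
        double clock = q.top().time;            // phase A: advance the clock
        if (clock > end_time) break;            // termination condition

        while (!q.empty() && q.top().time == clock) {
            Event e = q.top(); q.pop();         // phase B: run all booked events
            run_bounded(e, s, q);               // due at the current clock
        }
        while (run_conditionals(clock, s, q)) {}  // phase C: repeat until no
    }                                             // further state change occurs
}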

Regardless of the execution model that the simulator follows, to achieve more reliable results it needs to schedule random events based on the statistical distribution observed in the original system. Research shows that a large number of real-world events follow one of the main well-studied continuous distributions, namely the uniform, triangular, exponential, and normal distributions [5]. To mimic the real-world random distribution of the events based on these patterns, a sequence of uniformly distributed random numbers, as long as the number of events to be distributed, is needed. The sequence of random numbers is then passed to the distribution function to be mapped into the desired distribution of the events in the system. Even in cases where the event distribution does not satisfy any of the mentioned patterns, it can be modeled using hybrid patterns derived from the main distribution functions; hence, regardless of the event distribution pattern, the simulation always needs to be able to generate sequences of random, or pseudo-random, numbers to reproduce a list of events resembling the original system's event distribution. For this purpose, pseudo random number generators (PRNGs) are widely used in simulators; yet, as there are a number of different types of PRNGs with varying levels of performance, choosing the suitable generator algorithm has always been a trade-off in implementing such simulators.

From another point of view, to be able to serve the events separately in chronological order, the implementation of a discrete event simulator relies strongly on the implementation of ordered queues. Different types of queues have been proposed for simulation purposes; however, based on the specific structural details of the simulator and the targeted underlying hardware, the suitable queue implementation varies. A poor decision about the queue implementation method may seriously affect the simulator's performance, making the simulator impractical; consequently, the implementation of the underlying queue is one of the most challenging parts of the simulator implementation. Some of the most widely adopted approaches to queue implementation for simulation purposes are: implicit lists, linear lists, leftist trees, two lists, binomial queues, Henriksen's, pagodas, skew heaps, splay trees, and pairing heaps [7]. There are also more recent implementations of DES aiming at higher levels of performance using highly optimized versions of calendar queues and skip lists [8].

In addition to the mentioned infrastructure, an event server is needed to fetch the events from the queue and process them, applying the needed state changes to the system model after processing each event. As the core of the simulator, the event server is also in charge of managing the queue and controlling the generation of random numbers based on the simulation's needs. In fact, it is a supervisor, controlling all the simulator resources along with the simulation's flow. Having the initial state of the system and its statistical model, the server asks for the generation of an initial list of events; then it starts the simulation by picking the first event from the queue and processing it. During the processing of each event two main changes may happen: the system state may change, and new events may be created. The event server should be able to deal effectively with both so as to reflect the observed changes in the simulated system model. The server is also in charge of detecting the termination of the simulation. Naturally, discrete event simulations have the potential to run infinitely; thus, the termination conditions should be defined based on the value of simulation variables such as the simulation time, or on some specific statistics of the model. Termination can also be scheduled based on the occurrence of specific events or on system state conditions.
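As an aside on the random-number machinery mentioned above, the following sketch books an initial list of arrival events by mapping uniform draws through the inverse of the exponential CDF (inverse-transform sampling). It reuses the Event and EventQueue placeholders from the three-phase sketch; rand() merely stands in for whichever PRNG the simulator ultimately adopts.

#include <cmath>
#include <cstdlib>

// Inverse-transform sampling: a uniform draw u in [0, 1) becomes an
// exponentially distributed inter-arrival gap with the given mean rate.
static double next_arrival_gap(double rate)
{
    double u = (double)rand() / ((double)RAND_MAX + 1.0);  // uniform in [0, 1)
    return -std::log(1.0 - u) / rate;
}

// Book `count` bounded arrival events in chronological order.
static void book_initial_arrivals(EventQueue &q, int count, double rate)
{
    double t = 0.0;
    for (int i = 0; i < count; ++i) {
        t += next_arrival_gap(rate);
        q.push(Event{t, /*type=*/0});
    }
}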
As the last vital part of the DES architecture, an automated reporting mechanism is needed to gather the system model's statistics during the simulation. At the end of the simulation it should also be able to sum up the gathered statistics and derive the overall simulation statistics, forming the final simulation outcomes. There are different approaches to gathering this information; however, most simulators gather the statistics either after the occurrence of each event or after specific time intervals. Some implementations integrate this mechanism into the event

server; this integration allows seamless operation of event processing and reporting, leading to a higher level of performance.

To clarify the defined concepts, a simple customer service desk can be described as an event-based model. Assume a service desk serving a single queue of customers. Customers arrive at random points in time, though their average rate of arrival over a specified period is constant. The arrival of a customer can be treated as a bounded event, as its occurrence does not depend on any external condition. The service desk, in turn, calls and serves the clients one by one based on their order in the queue; serving a customer is a conditional event, as it can happen only if there is a customer waiting in the queue. The main possible events in a service desk model are: customer arrival, customer departure, starting the service, and finishing the service. A minimal encoding of this model is sketched below.
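The sketch below is an illustrative (not thesis-supplied) encoding of the service desk model; the bounded/conditional split mirrors the B and C phases of the three-phase method.

// Event types of the service desk model described above.
enum class DeskEvent { CustomerArrival, CustomerDeparture, StartService, FinishService };

struct DeskState {
    int waiting = 0;          // customers currently queued
    bool desk_busy = false;   // whether a customer is being served
};

// Condition test for the conditional event "start service": it can fire
// only when a customer is waiting and the desk is free. Arrivals, by
// contrast, are bounded events and are booked unconditionally.
static bool can_start_service(const DeskState &s)
{
    return s.waiting > 0 && !s.desk_busy;
}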

3.2. Parallel Discrete Event Simulation

As physical systems get bigger and more complex, it becomes substantially harder to simulate them in a reasonable amount of time with the level of resources available in sequential systems; processor clock and system memory are two major bottlenecks of sequential simulation, making the simulation of large systems nearly impossible. To avoid this problem and ensure the future of discrete event simulation, it should be parallelized. From another standpoint, parallel discrete event simulation is academically important due to the considerable amount of parallelism that its underlying concept represents; yet it is one of the most challenging tasks to parallelize a simulator so that it achieves substantial levels of performance while keeping its original level of accuracy and correctness [11]. As Fujimoto defines it, "Parallel Discrete Event Simulation (PDES) refers to the execution of a single discrete event simulation on a parallel computer" [11]. A parallel version of the simulator demands simultaneous execution of the events during the simulation; however, this parallel execution leads to a number of very challenging problems. Although it creates a strong initial impression of being an ideal target for parallelization, discrete event simulation has, paradoxically, been shown to be substantially hard to parallelize in a way that achieves remarkable speedups by effectively utilizing the parallel processors. In this section an attempt is made to define the main problems in parallelizing the simulation and to review some of the proposed solutions.

In a sequential execution of the simulation, it is naturally guaranteed that the events will be executed in chronological order based on their timestamps, while in a parallel simulation several parallel processors execute events simultaneously; hence, there is no guarantee that a state change to the system caused by event B, having a bigger timestamp than event A, does not affect the execution of event A. If this happens, a future event has affected the system's past, which is not acceptable. This group of problems is categorized as causality problems by Fujimoto in his paper (1989) [11]. Running events with different timestamps in parallel does not always lead directly to causality problems; in some cases, while two events are executing in parallel, the event with the smaller timestamp creates a new event with a timestamp bigger than that of the parent event, yet smaller than the timestamp of the other parallel event. In such a case, to guarantee the correct execution of the simulation, it must be ensured that the newly created event is run before the other parallel event; however, as the other event with the higher timestamp is already executing, it is really challenging to devise a mechanism that postpones its execution without making its dedicated computational resources idle. From another standpoint, the difficulty in dealing with causality problems is not only to solve them without decreasing processor utilization, but also to minimize the synchronization overheads. Parallel discrete event simulators normally divide the job among several processes using shared queues. To avoid causality problems, most of the existing parallelization strategies provide a mechanism to prevent the processes from having direct access to the shared state variables [13]. In general, solutions to parallelizing the simulation fall into two main categories: conservative approaches and optimistic approaches.

3.2.1. Conservative Approach

The first implementations of parallel discrete event simulation were based on the conservative approach. Conservative techniques strictly avoid the possibility of causality errors by using some mechanism to find safe points in time at which an event can be executed without facing causality problems. For these strategies to work correctly, the safe-time detection mechanism should verify that all the events with the potential to affect the current event's execution have already been executed. The main strategy of conservative approaches can be clarified by a simple example: assume that a process contains an event, say E1, with timestamp T1 being the smallest timestamp among all the events contained in the process. In the conservative approach, the process can start executing the event only if it can determine that it is impossible for it to later receive another event with a timestamp smaller than T1. Following this technique, local causality is preserved and E1 can safely be executed. In conservative strategies, processes with no safe event must block until one is available; this may lead to a deadlock, in which a circular chain of processes is formed, each waiting for the others to execute their events first. There are two main proposed solutions to the deadlock problem in the conservative approach: deadlock avoidance, and deadlock detection and recovery. Although dealing with the possibility of deadlocks is a substantial problem in conservative methods, it is not the main drawback of such strategies. While guaranteeing the correct execution of the simulation, conservative strategies may significantly affect the code's performance by causing a great portion of the total processor time to be spent waiting for possibly dependent events on other processes to finish execution. Research shows that in some cases optimizing the simulator based on application-specific knowledge may lead to a substantial performance gain [11]. In spite of their poor general performance, conservative approaches have been shown to give the same level of accuracy as the equivalent sequential methods. Such approaches involve several concepts, some of which are listed by Fujimoto as follows: deadlock avoidance, deadlock detection and recovery, synchronous operation, conservative time windows, improving lookahead, and conditional knowledge [11].
For each of these concepts he refers to one or more prior works describing their progress in practical use; nevertheless, a full description of the concepts and their significance is beyond the scope of this discussion.
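A hedged sketch of the safety test underlying conservative execution follows; Channel, the lookahead values, and the reuse of the earlier Event placeholder are assumptions for illustration. A process may execute its earliest pending event only when no input channel could still deliver anything older.

#include <algorithm>
#include <limits>
#include <vector>

// Placeholder input channel: its clock is the timestamp of the last event
// received on it, and lookahead is the minimum increment it guarantees.
struct Channel { double clock; double lookahead; };

// The process is safe up to the minimum of (clock + lookahead) over all
// input channels: nothing older than this can still arrive.
static double safe_time(const std::vector<Channel> &inputs)
{
    double t = std::numeric_limits<double>::infinity();
    for (const Channel &c : inputs)
        t = std::min(t, c.clock + c.lookahead);
    return t;
}

// E1 (the process's earliest pending event) may run only if its timestamp
// does not exceed the safe time; otherwise the process must block.
static bool can_execute(const Event &e, const std::vector<Channel> &inputs)
{
    return e.time <= safe_time(inputs);
}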

3.2.2. Optimistic Approach

Unlike conservative strategies, optimistic techniques do not impair the simulator's performance by preventing all potentially dependent events from executing in parallel. They reach this goal by employing mechanisms that detect the effect of causality conflicts on the execution of events. One of the most effective proposed methods for doing this is time warping. In this method all the processes start executing their statically assigned events in parallel, marking the state changes with the relevant event's timestamp. If an event is later observed with a timestamp smaller than that of the last state change of the system, the process in charge of executing the previous event should roll back, restore the system's state to its state before the last change, and restart processing from the conflicting event. Execution of a conflicting event may make two types of change in the system: it may change the system state, and it may also send wrong messages to other processes. To deal with the first problem, optimistic methods such as Time Warp must periodically back up the system state to be able to roll it back when necessary. In the case of wrong messages, the process should send a counteracting message to tell the recipient process to ignore the previous message; however, if the message has already been processed, the recipient process must also roll back. This chain must continue recursively until all the changes related to the execution of the conflicting event are completely rolled back.

From the performance standpoint, optimistic methods have great potential to achieve higher levels of performance than the conventional conservative strategies. Fujimoto reports that in many prior works researchers were able to achieve significant speedups, for example 56 using 64 processors in one of the cases [11]. However, optimistic approaches generally consume more memory to save the system state, leading to a trade-off between performance and memory consumption. It needs to be mentioned that, although much work has been done on optimistic methods, the space-time trade-offs in these methods are not fully understood yet [11]. From another perspective, unlike conservative approaches, optimistic strategies must be able to recover from arbitrary errors that may arise during execution; otherwise, the errors may be erased by roll-backs, causing the computations to be trapped in infinite loops. In such cases the Time Warp executive must intervene and recover the computation. It has been proposed that, to get the best possible results, the optimistic approach should have dedicated hardware support allowing it to do the roll-backs and error recoveries more efficiently.
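To make the roll-back mechanism concrete, here is a small illustrative sketch of per-process state saving and rewind in the Time Warp style; Snapshot and Process are assumed names, State and Event are the placeholders from the earlier sketches, and the sending of counteracting messages is reduced to a comment.

#include <vector>

// Snapshot of the placeholder State, taken before each executed event.
struct Snapshot { double time; State state; };

struct Process {
    State state;
    std::vector<Snapshot> history;   // periodic backups for roll-back

    void execute(const Event &e)
    {
        history.push_back({e.time, state});   // save state before the change
        // ... apply e's state changes; possibly send messages ...
    }

    // A straggler is an event whose timestamp is older than state changes
    // already made: rewind to the last state that precedes it.
    void rollback(double straggler_time)
    {
        while (!history.empty() && history.back().time >= straggler_time) {
            state = history.back().state;     // undo one event's state change
            history.pop_back();
            // counteracting messages cancelling wrongly sent ones would go here
        }
    }
};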


More information

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

NVIDIA GeForce GTX 580 GPU Datasheet

NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas

More information

GPU Computing - CUDA

GPU Computing - CUDA GPU Computing - CUDA A short overview of hardware and programing model Pierre Kestener 1 1 CEA Saclay, DSM, Maison de la Simulation Saclay, June 12, 2012 Atelier AO and GPU 1 / 37 Content Historical perspective

More information

Radeon HD 2900 and Geometry Generation. Michael Doggett

Radeon HD 2900 and Geometry Generation. Michael Doggett Radeon HD 2900 and Geometry Generation Michael Doggett September 11, 2007 Overview Introduction to 3D Graphics Radeon 2900 Starting Point Requirements Top level Pipeline Blocks from top to bottom Command

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics 22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC

More information

AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL. AMD Embedded Solutions 1

AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL. AMD Embedded Solutions 1 AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL AMD Embedded Solutions 1 Optimizing Parallel Processing Performance and Coding Efficiency with AMD APUs and Texas Multicore Technologies SequenceL Auto-parallelizing

More information

GPU Parallel Computing Architecture and CUDA Programming Model

GPU Parallel Computing Architecture and CUDA Programming Model GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel

More information

Spring 2011 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim Today, we will study typical patterns of parallel programming This is just one of the ways. Materials are based on a book by Timothy. Decompose Into tasks Original Problem

More information

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

More information

Real-Time Realistic Rendering. Michael Doggett Docent Department of Computer Science Lund university

Real-Time Realistic Rendering. Michael Doggett Docent Department of Computer Science Lund university Real-Time Realistic Rendering Michael Doggett Docent Department of Computer Science Lund university 30-5-2011 Visually realistic goal force[d] us to completely rethink the entire rendering process. Cook

More information

high-performance computing so you can move your enterprise forward

high-performance computing so you can move your enterprise forward Whether targeted to HPC or embedded applications, Pico Computing s modular and highly-scalable architecture, based on Field Programmable Gate Array (FPGA) technologies, brings orders-of-magnitude performance

More information

Recent Advances and Future Trends in Graphics Hardware. Michael Doggett Architect November 23, 2005

Recent Advances and Future Trends in Graphics Hardware. Michael Doggett Architect November 23, 2005 Recent Advances and Future Trends in Graphics Hardware Michael Doggett Architect November 23, 2005 Overview XBOX360 GPU : Xenos Rendering performance GPU architecture Unified shader Memory Export Texture/Vertex

More information

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

More information

Switched Interconnect for System-on-a-Chip Designs

Switched Interconnect for System-on-a-Chip Designs witched Interconnect for ystem-on-a-chip Designs Abstract Daniel iklund and Dake Liu Dept. of Physics and Measurement Technology Linköping University -581 83 Linköping {danwi,dake}@ifm.liu.se ith the increased

More information

There are a number of factors that increase the risk of performance problems in complex computer and software systems, such as e-commerce systems.

There are a number of factors that increase the risk of performance problems in complex computer and software systems, such as e-commerce systems. ASSURING PERFORMANCE IN E-COMMERCE SYSTEMS Dr. John Murphy Abstract Performance Assurance is a methodology that, when applied during the design and development cycle, will greatly increase the chances

More information

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.

More information

A Pattern-Based Approach to. Automated Application Performance Analysis

A Pattern-Based Approach to. Automated Application Performance Analysis A Pattern-Based Approach to Automated Application Performance Analysis Nikhil Bhatia, Shirley Moore, Felix Wolf, and Jack Dongarra Innovative Computing Laboratory University of Tennessee (bhatia, shirley,

More information

15-418 Final Project Report. Trading Platform Server

15-418 Final Project Report. Trading Platform Server 15-418 Final Project Report Yinghao Wang yinghaow@andrew.cmu.edu May 8, 214 Trading Platform Server Executive Summary The final project will implement a trading platform server that provides back-end support

More information

OpenCL Programming for the CUDA Architecture. Version 2.3

OpenCL Programming for the CUDA Architecture. Version 2.3 OpenCL Programming for the CUDA Architecture Version 2.3 8/31/2009 In general, there are multiple ways of implementing a given algorithm in OpenCL and these multiple implementations can have vastly different

More information

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology, Computer Engineering by Amol

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides

More information

evm Virtualization Platform for Windows

evm Virtualization Platform for Windows B A C K G R O U N D E R evm Virtualization Platform for Windows Host your Embedded OS and Windows on a Single Hardware Platform using Intel Virtualization Technology April, 2008 TenAsys Corporation 1400

More information

Teaching Methodology for 3D Animation

Teaching Methodology for 3D Animation Abstract The field of 3d animation has addressed design processes and work practices in the design disciplines for in recent years. There are good reasons for considering the development of systematic

More information

Architectures and Platforms

Architectures and Platforms Hardware/Software Codesign Arch&Platf. - 1 Architectures and Platforms 1. Architecture Selection: The Basic Trade-Offs 2. General Purpose vs. Application-Specific Processors 3. Processor Specialisation

More information

Hardware design for ray tracing

Hardware design for ray tracing Hardware design for ray tracing Jae-sung Yoon Introduction Realtime ray tracing performance has recently been achieved even on single CPU. [Wald et al. 2001, 2002, 2004] However, higher resolutions, complex

More information

Guided Performance Analysis with the NVIDIA Visual Profiler

Guided Performance Analysis with the NVIDIA Visual Profiler Guided Performance Analysis with the NVIDIA Visual Profiler Identifying Performance Opportunities NVIDIA Nsight Eclipse Edition (nsight) NVIDIA Visual Profiler (nvvp) nvprof command-line profiler Guided

More information

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Chapter 12: Multiprocessor Architectures Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Objective Be familiar with basic multiprocessor architectures and be able to

More information

Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture

Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture White Paper Intel Xeon processor E5 v3 family Intel Xeon Phi coprocessor family Digital Design and Engineering Three Paths to Faster Simulations Using ANSYS Mechanical 16.0 and Intel Architecture Executive

More information

Informatica Ultra Messaging SMX Shared-Memory Transport

Informatica Ultra Messaging SMX Shared-Memory Transport White Paper Informatica Ultra Messaging SMX Shared-Memory Transport Breaking the 100-Nanosecond Latency Barrier with Benchmark-Proven Performance This document contains Confidential, Proprietary and Trade

More information

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),

More information

GPUs for Scientific Computing

GPUs for Scientific Computing GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware

More information

Design and Implementation of the Heterogeneous Multikernel Operating System

Design and Implementation of the Heterogeneous Multikernel Operating System 223 Design and Implementation of the Heterogeneous Multikernel Operating System Yauhen KLIMIANKOU Department of Computer Systems and Networks, Belarusian State University of Informatics and Radioelectronics,

More information

The Methodology of Application Development for Hybrid Architectures

The Methodology of Application Development for Hybrid Architectures Computer Technology and Application 4 (2013) 543-547 D DAVID PUBLISHING The Methodology of Application Development for Hybrid Architectures Vladimir Orekhov, Alexander Bogdanov and Vladimir Gaiduchok Department

More information

GPGPU accelerated Computational Fluid Dynamics

GPGPU accelerated Computational Fluid Dynamics t e c h n i s c h e u n i v e r s i t ä t b r a u n s c h w e i g Carl-Friedrich Gauß Faculty GPGPU accelerated Computational Fluid Dynamics 5th GACM Colloquium on Computational Mechanics Hamburg Institute

More information

The Fastest, Most Efficient HPC Architecture Ever Built

The Fastest, Most Efficient HPC Architecture Ever Built Whitepaper NVIDIA s Next Generation TM CUDA Compute Architecture: TM Kepler GK110 The Fastest, Most Efficient HPC Architecture Ever Built V1.0 Table of Contents Kepler GK110 The Next Generation GPU Computing

More information

Driving force. What future software needs. Potential research topics

Driving force. What future software needs. Potential research topics Improving Software Robustness and Efficiency Driving force Processor core clock speed reach practical limit ~4GHz (power issue) Percentage of sustainable # of active transistors decrease; Increase in #

More information

Evaluation of CUDA Fortran for the CFD code Strukti

Evaluation of CUDA Fortran for the CFD code Strukti Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center

More information

Writing Applications for the GPU Using the RapidMind Development Platform

Writing Applications for the GPU Using the RapidMind Development Platform Writing Applications for the GPU Using the RapidMind Development Platform Contents Introduction... 1 Graphics Processing Units... 1 RapidMind Development Platform... 2 Writing RapidMind Enabled Applications...

More information

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION CHAPTER 1 INTRODUCTION 1.1 Background The command over cloud computing infrastructure is increasing with the growing demands of IT infrastructure during the changed business scenario of the 21 st Century.

More information

Supercomputing applied to Parallel Network Simulation

Supercomputing applied to Parallel Network Simulation Supercomputing applied to Parallel Network Simulation David Cortés-Polo Research, Technological Innovation and Supercomputing Centre of Extremadura, CenitS. Trujillo, Spain david.cortes@cenits.es Summary

More information

VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU

VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU Martin Straka Doctoral Degree Programme (1), FIT BUT E-mail: strakam@fit.vutbr.cz Supervised by: Zdeněk Kotásek E-mail: kotasek@fit.vutbr.cz

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

ultra fast SOM using CUDA

ultra fast SOM using CUDA ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A

More information

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department

More information

Interactive Level-Set Deformation On the GPU

Interactive Level-Set Deformation On the GPU Interactive Level-Set Deformation On the GPU Institute for Data Analysis and Visualization University of California, Davis Problem Statement Goal Interactive system for deformable surface manipulation

More information

Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures

Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures Chapter 18: Database System Architectures Centralized Systems! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types! Run on a single computer system and do

More information

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007 Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

More information

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS

HIGH PERFORMANCE CONSULTING COURSE OFFERINGS Performance 1(6) HIGH PERFORMANCE CONSULTING COURSE OFFERINGS LEARN TO TAKE ADVANTAGE OF POWERFUL GPU BASED ACCELERATOR TECHNOLOGY TODAY 2006 2013 Nvidia GPUs Intel CPUs CONTENTS Acronyms and Terminology...

More information

Using Power to Improve C Programming Education

Using Power to Improve C Programming Education Using Power to Improve C Programming Education Jonas Skeppstedt Department of Computer Science Lund University Lund, Sweden jonas.skeppstedt@cs.lth.se jonasskeppstedt.net jonasskeppstedt.net jonas.skeppstedt@cs.lth.se

More information

Silverlight for Windows Embedded Graphics and Rendering Pipeline 1

Silverlight for Windows Embedded Graphics and Rendering Pipeline 1 Silverlight for Windows Embedded Graphics and Rendering Pipeline 1 Silverlight for Windows Embedded Graphics and Rendering Pipeline Windows Embedded Compact 7 Technical Article Writers: David Franklin,

More information

GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications

GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications Harris Z. Zebrowitz Lockheed Martin Advanced Technology Laboratories 1 Federal Street Camden, NJ 08102

More information

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute

More information

Data Center and Cloud Computing Market Landscape and Challenges

Data Center and Cloud Computing Market Landscape and Challenges Data Center and Cloud Computing Market Landscape and Challenges Manoj Roge, Director Wired & Data Center Solutions Xilinx Inc. #OpenPOWERSummit 1 Outline Data Center Trends Technology Challenges Solution

More information

Parallel Computing with MATLAB

Parallel Computing with MATLAB Parallel Computing with MATLAB Scott Benway Senior Account Manager Jiro Doke, Ph.D. Senior Application Engineer 2013 The MathWorks, Inc. 1 Acceleration Strategies Applied in MATLAB Approach Options Best

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

More information

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,

More information

Introduction to GPU Architecture

Introduction to GPU Architecture Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three

More information

An Introduction to Parallel Computing/ Programming

An Introduction to Parallel Computing/ Programming An Introduction to Parallel Computing/ Programming Vicky Papadopoulou Lesta Astrophysics and High Performance Computing Research Group (http://ahpc.euc.ac.cy) Dep. of Computer Science and Engineering European

More information

Efficient Parallel Graph Exploration on Multi-Core CPU and GPU

Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Pervasive Parallelism Laboratory Stanford University Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun Graph and its Applications Graph Fundamental

More information

EMC Business Continuity for Microsoft SQL Server Enabled by SQL DB Mirroring Celerra Unified Storage Platforms Using iscsi

EMC Business Continuity for Microsoft SQL Server Enabled by SQL DB Mirroring Celerra Unified Storage Platforms Using iscsi EMC Business Continuity for Microsoft SQL Server Enabled by SQL DB Mirroring Applied Technology Abstract Microsoft SQL Server includes a powerful capability to protect active databases by using either

More information

Parallel Firewalls on General-Purpose Graphics Processing Units

Parallel Firewalls on General-Purpose Graphics Processing Units Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering

More information

CUBE-MAP DATA STRUCTURE FOR INTERACTIVE GLOBAL ILLUMINATION COMPUTATION IN DYNAMIC DIFFUSE ENVIRONMENTS

CUBE-MAP DATA STRUCTURE FOR INTERACTIVE GLOBAL ILLUMINATION COMPUTATION IN DYNAMIC DIFFUSE ENVIRONMENTS ICCVG 2002 Zakopane, 25-29 Sept. 2002 Rafal Mantiuk (1,2), Sumanta Pattanaik (1), Karol Myszkowski (3) (1) University of Central Florida, USA, (2) Technical University of Szczecin, Poland, (3) Max- Planck-Institut

More information

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008 Radeon GPU Architecture and the series Michael Doggett Graphics Architecture Group June 27, 2008 Graphics Processing Units Introduction GPU research 2 GPU Evolution GPU started as a triangle rasterizer

More information

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING Sonam Mahajan 1 and Maninder Singh 2 1 Department of Computer Science Engineering, Thapar University, Patiala, India 2 Department of Computer Science Engineering,

More information

Software Synthesis from Dataflow Models for G and LabVIEW

Software Synthesis from Dataflow Models for G and LabVIEW Presented at the Thirty-second Annual Asilomar Conference on Signals, Systems, and Computers. Pacific Grove, California, U.S.A., November 1998 Software Synthesis from Dataflow Models for G and LabVIEW

More information

PARALLELS CLOUD STORAGE

PARALLELS CLOUD STORAGE PARALLELS CLOUD STORAGE Performance Benchmark Results 1 Table of Contents Executive Summary... Error! Bookmark not defined. Architecture Overview... 3 Key Features... 5 No Special Hardware Requirements...

More information