ACCELERATION OF SPIKING NEURAL NETWORK ON GENERAL PURPOSE GRAPHICS PROCESSORS


ACCELERATION OF SPIKING NEURAL NETWORK ON GENERAL PURPOSE GRAPHICS PROCESSORS

Thesis Submitted to The School of Engineering of the UNIVERSITY OF DAYTON

In Partial Fulfillment of the Requirements for The Degree Master of Science in Electrical Engineering

by Bing Han

UNIVERSITY OF DAYTON, Dayton, Ohio, May 2010

ACCELERATION OF SPIKING NEURAL NETWORK ON GENERAL PURPOSE GRAPHICS PROCESSORS

APPROVED BY:

Tarek M. Taha, Ph.D., Advisory Committee Chairman, Department of Electrical and Computer Engineering
John Loomis, Ph.D., Committee Member, Department of Electrical and Computer Engineering
Eric Balster, Ph.D., Committee Member, Department of Electrical and Computer Engineering
Malcolm W. Daniels, Ph.D., Associate Dean, School of Engineering
Tony E. Saliba, Ph.D., Dean, School of Engineering

Copyright by Bing Han. All rights reserved. 2010

ABSTRACT

ACCELERATION OF SPIKING NEURAL NETWORK ON GENERAL PURPOSE GRAPHICS PROCESSORS

Name: Han, Bing
University of Dayton
Advisor: Dr. Tarek M. Taha

There is currently a strong push in the research community to develop biological scale implementations of neuron based vision models. Systems at this scale are computationally demanding and have generally utilized the more accurate neuron models, such as the Izhikevich and Hodgkin-Huxley models, rather than the more popular integrate and fire model. This thesis examines the feasibility of using GPGPUs to accelerate a spiking neural network based character recognition network in order to enable large scale neural systems. Two versions of the network, utilizing the Izhikevich and Hodgkin-Huxley models, are implemented. Three NVIDIA GPGPU platforms and one GPGPU cluster were examined. These include the GeForce 9800 GX2, the Tesla C1060, and the Tesla S1070 platforms, and the 32-node Tesla S1070 GPGPU cluster. Our results show that GPGPUs can provide significant speedups over conventional processors. In particular, the fastest GPGPU utilized, the Tesla S1070, provided speedups of 5.6 and 84.4 times over highly optimized implementations on the fastest CPU tested, a quad core 2.67 GHz Xeon processor, for the Izhikevich and Hodgkin-Huxley models respectively. The CPU implementation utilized all four cores and the vector data parallelism offered by the processor. The results indicate that GPGPUs are well suited for this application domain.

A large portion of the results presented in this thesis have been published in the April 2010 issue of Applied Optics [1].

ACKNOWLEDGEMENTS

My special thanks are in order to Dr. Tarek M. Taha, my advisor, for providing the time and equipment necessary for the work contained herein, and for directing this thesis and bringing it to its conclusion with patience and expertise.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES

CHAPTERS:

I. Introduction
II. Background
   2.1 Hodgkin-Huxley Model
   2.2 Izhikevich Model
   2.3 GPGPU Architecture
   2.4 Related work
III. Network Design
IV. GPGPU Mapping
   4.1 Neuronal Mapping
   4.2 Synapses Mapping
   4.3 MPI Mapping
   4.4 Experimental Setup
   4.5 GPGPU cluster configurations

V. Results
   5.1 Single GPGPU card performances
   5.2 MPI performances
   5.3 Memory analysis
   5.4 Firing rate analysis
VI. Conclusion
Bibliography
Appendix

LIST OF FIGURES

2.1 Spikes produced with the Hodgkin-Huxley model
2.2 Spikes produced with the Izhikevich model
2.3 Floating point operations per second for the Intel CPUs and NVIDIA GPUs
2.4 Memory bandwidth for the Intel CPUs and NVIDIA GPUs
2.5 Processing flow on CUDA
2.6 A simplified CUDA GPGPU architecture
3.1 Network used for testing spiking models
4.1 Algorithm flowchart
4.2 GPGPU cluster mapping
4.3 Architecture of the GPGPU cluster
4.4 Training images
5.1 Additional test images
5.2 Neurons/second throughput of GPGPU and CPU processors for (a) the Izhikevich model and (b) the Hodgkin-Huxley model
5.3 Timing breakdown by percentage of overall runtime for processing the 768x768 image using (a) the Izhikevich model and (b) the Hodgkin-Huxley model
5.4 Neurons/second throughput of GPGPU cluster
5.5 MPI communication time
5.6 Memory requirement for different configurations
5.7 Variation in speedup of GPGPU implementations on the 9800 GX2 platform with firing rate in the level 1 neurons for: (a) the Izhikevich model, and (b) the Hodgkin-Huxley model

LIST OF TABLES

4.1 Threads hierarchy and warps
4.2 Spiking network configurations evaluated
4.3 GPGPU cluster configuration
5.1 Speedups achieved on four GPGPU platforms for the Izhikevich model for different image sizes
5.2 Speedups achieved on four GPGPU platforms for the Hodgkin-Huxley model for different image sizes
5.3 Bytes/Flop requirement of the models and capabilities of the GPGPU cards
5.4 GFLOPs achieved by different GPGPU cards
5.5 GPGPU cluster run time
5.6 MPI timing breakdown
5.7 Memory required by different network components

CHAPTER I

Introduction

At present there is a strong interest in the research community to model biological scale image recognition systems such as the visual cortex of a rat or a human. Most small scale neural network image recognition systems generally utilize the integrate and fire spiking neuron model. Izhikevich points out, however, that this model is one of the least biologically accurate models. He states that the integrate and fire model cannot exhibit even the most fundamental properties of cortical spiking neurons, and for this reason it should be avoided by all means [2]. Izhikevich compares a set of 11 spiking neuron models [2] in terms of their biological accuracy and computational load. He shows that the five most biologically accurate models are (in order of biological accuracy) the: 1) Hodgkin-Huxley, 2) Izhikevich, 3) Wilson, 4) Hindmarsh-Rose, and 5) Morris-Lecar models. Of these, the Hodgkin-Huxley model is the most compute intensive, while the Izhikevich model is the most computationally efficient. As a result, current studies of large scale models of the cortex are generally using either the Izhikevich [3] or the Hodgkin-Huxley [4] spiking neuron models. One of the main problems with using the Hodgkin-Huxley model for large scale implementations is that it is computationally intensive. The recently published Izhikevich model is attractive

for large scale implementations as it is close to the Hodgkin-Huxley model in biological accuracy, but is similar to the integrate and fire model in computational intensity. Given that the visual cortex contains about 10^9 neurons, biological scale vision systems based on spiking neural networks would require high performance computation capabilities. General purpose graphics processing units (GPGPUs) have attracted significant attention recently for their low cost and high performance capabilities on scientific applications. These architectures are well suited for applications with high levels of data parallelism, such as biological scale cortical vision models. In this thesis we examine the performance of a spiking neural network based character recognition model on three NVIDIA GPGPU platforms and one NVIDIA GPGPU cluster: the GeForce 9800 GX2, the Tesla C1060, and the Tesla S1070 platforms, and the 32-node Tesla S1070 GPGPU cluster. The recognition phase of two versions of this pattern recognition model, based on the Izhikevich and Hodgkin-Huxley spiking neuron models, is examined. The model utilizes pulse coding to mimic the high speed recognition taking place in the mammalian brain [5] and spike timing dependent plasticity (STDP) for training [6]. It was trained to recognize a set of character images, scaled up to larger image sizes, and was able to recognize noisy versions of these images. The largest networks tested had over 84 million neurons and 4 billion synapses. An initial version of this model was presented in [7]. A large portion of the results presented in this thesis have been published in the April 2010 issue of Applied Optics [1]. The GPGPUs examined in this study utilize the Compute Unified Device Architecture (CUDA) from NVIDIA. This architecture allows code compatibility

between different generations of NVIDIA GPGPUs. All the GPGPU platforms utilized the same CUDA code. For each GPGPU, we also examined the performance of the general purpose processor on that platform using a highly optimized C version of the code. This C code utilized both data and thread parallelism (through explicit parallel coding in C) to ensure that all cores and the vector parallelism within the cores were utilized. The processors utilized were a 2.59 GHz dual core AMD Opteron processor, a 2.67 GHz quad core Intel Xeon processor, and a 2.4 GHz dual core AMD Opteron processor. The fastest GPGPU utilized, the Tesla S1070, provided speedups of 5.6 and 84.4 for the Izhikevich and Hodgkin-Huxley models respectively, when compared to highly optimized code implementations on the fastest CPU tested, the quad core 2.67 GHz Xeon processor. Based on a thorough review of the literature, this is the first study examining the application of GPGPUs for the acceleration of the more biologically accurate Izhikevich and Hodgkin-Huxley models applied to image recognition. Our results indicate that these models are well suited to GPGPU platforms. Chapter II of this thesis discusses background material, including a brief introduction to the two spiking neuron models considered for this study and a discussion of related work. Chapter III presents the character recognition model. Chapter IV examines its mapping to the GPGPU platforms and describes the experimental setup, Chapter V presents the results, and Chapter VI concludes the thesis.

CHAPTER II

Background

2.1 Hodgkin-Huxley Model

The Hodgkin-Huxley model [8] is considered to be one of the most biologically accurate spiking neuron models. It consists of four differential equations (eq. 1-4) and a large number of parameters. The differential equations describe the neuron membrane potential, the activation of Na and K currents, and the inactivation of Na currents. The model can exhibit almost all types of neuronal behavior if its parameters are properly tuned. This model is very important to the study of neuronal behavior and dynamics as its parameters are biophysically meaningful and measurable. A time step of 0.01 ms was utilized to update the four differential equations as this is the most commonly used value. Fig. 2.1 shows the spikes produced with this model in our study (using the parameters in Appendix A).

\frac{dV}{dt} = \frac{1}{C}\left\{ I - g_K n^4 (V - E_K) - g_{Na} m^3 h (V - E_{Na}) - g_L (V - E_L) \right\}   (1)

\frac{dn}{dt} = (n_\infty(V) - n) / \tau_n(V)   (2)

\frac{dm}{dt} = (m_\infty(V) - m) / \tau_m(V)   (3)

\frac{dh}{dt} = (h_\infty(V) - h) / \tau_h(V)   (4)
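For concreteness, a minimal C sketch of one forward-Euler update step of eqs. (1)-(4) is given below. It is only an illustration, not the thesis code: the rate functions and constants are the common textbook Hodgkin-Huxley values (which may differ from the exact parameters in Appendix A), and eqs. (2)-(4) are written in the equivalent alpha/beta form.

```c
#include <math.h>

/* Textbook Hodgkin-Huxley rate functions (assumed values; the thesis uses the
 * parameters listed in Appendix A, which may differ slightly). */
static float an(float V){ return 0.01f*(V + 55.0f)/(1.0f - expf(-(V + 55.0f)/10.0f)); }
static float bn(float V){ return 0.125f*expf(-(V + 65.0f)/80.0f); }
static float am(float V){ return 0.1f*(V + 40.0f)/(1.0f - expf(-(V + 40.0f)/10.0f)); }
static float bm(float V){ return 4.0f*expf(-(V + 65.0f)/18.0f); }
static float ah(float V){ return 0.07f*expf(-(V + 65.0f)/20.0f); }
static float bh(float V){ return 1.0f/(1.0f + expf(-(V + 35.0f)/10.0f)); }

typedef struct { float V, n, m, h; } HHNeuron;

/* One forward-Euler step of eqs. (1)-(4); dt = 0.01 ms as in the thesis. */
void hh_step(HHNeuron *nr, float I, float dt)
{
    const float C = 1.0f, gK = 36.0f, EK = -77.0f, gNa = 120.0f,
                ENa = 50.0f, gL = 0.3f, EL = -54.4f;       /* textbook constants */
    float V = nr->V, n = nr->n, m = nr->m, h = nr->h;
    float dV = (I - gK*n*n*n*n*(V - EK)
                  - gNa*m*m*m*h*(V - ENa)
                  - gL*(V - EL)) / C;                       /* eq. (1) */
    nr->V += dt * dV;
    nr->n += dt * (an(V)*(1.0f - n) - bn(V)*n);             /* eq. (2), alpha/beta form */
    nr->m += dt * (am(V)*(1.0f - m) - bm(V)*m);             /* eq. (3) */
    nr->h += dt * (ah(V)*(1.0f - h) - bh(V)*h);             /* eq. (4) */
}
```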

Fig 2.1. Spikes produced with the Hodgkin-Huxley model (membrane voltage in mV versus time in ms).

2.2 Izhikevich Model

Izhikevich proposed a new spiking neuron model in 2003 [9] that is based on only two differential equations (eq. 5-6) and four parameters. In a study comparing various spiking neuron models [2], Izhikevich found that the Izhikevich model required 13 flops to simulate one millisecond of a neuron's activity, while the Hodgkin-Huxley model required 1200 flops. The former was still able to reproduce almost all types of neural responses seen in biological experiments, in spite of the low flop count. In our study, a time step of 1 ms was utilized (as was done by Izhikevich in [9]). Fig. 2.2 shows the spikes produced with this model.

\frac{dv}{dt} = 0.04v^2 + 5v + 140 - u + I   (5)

\frac{du}{dt} = a(bv - u)   (6)

with the reset rule: if v \geq 30 mV, then v \leftarrow c and u \leftarrow u + d.
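A minimal C sketch of one 1 ms update of eqs. (5)-(6), including the reset rule, is shown below. The parameters a, b, c, and d are the four Izhikevich parameters (the values used in the thesis are listed in Appendix A); the struct and function names are illustrative.

```c
typedef struct { float v, u; } IzhNeuron;

/* One 1 ms update of eqs. (5)-(6) followed by the spike-and-reset rule. */
void izh_step(IzhNeuron *nr, float I, float a, float b, float c, float d)
{
    nr->v += 0.04f*nr->v*nr->v + 5.0f*nr->v + 140.0f - nr->u + I;   /* eq. (5) */
    nr->u += a * (b*nr->v - nr->u);                                  /* eq. (6) */
    if (nr->v >= 30.0f) {            /* spike: reset v and increment u */
        nr->v = c;
        nr->u += d;
    }
}
```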

Fig 2.2. Spikes produced with the Izhikevich model (membrane voltage in mV versus time in ms).

2.3 GPGPU Architecture

Graphics processing units (GPUs) were originally designed to relieve the CPU of computational tasks related to drawing onto the external display by means of a number of graphics primitive operations. They were dedicated to floating point calculations, and for many years GPUs were utilized only for graphics pipeline acceleration. The concept of the general purpose GPU (GPGPU) was not feasible until recent years, since GPUs previously did not have high enough computational capability. In recent years, strong market demand for real-time, high-definition 3D graphics has helped GPUs evolve into very powerful many-core processors with highly parallel, multithreaded architectures utilizing very high bandwidth memories [10]. Using the massive floating-point computational power of a modern graphics accelerator's shader pipeline, these powerful GPUs can now be utilized for general purpose computations (as opposed to graphics operations). GPGPUs provide a new low cost but high performance platform for many scientific studies and applications, such as physics-based simulation, CT reconstruction, and GPGPU based pattern recognition.

Fig 2.3. Floating point operations per second for Intel CPUs and NVIDIA GPUs (NV30 through GT200 versus Intel CPUs up to a 3.2 GHz Harpertown, Sep-02 to Jul-09; GT200 = GeForce GTX 280, G92 = GeForce 9800 GTX, G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800). Data from [10].

Fig 2.4. Memory bandwidth (GB/s) for Intel CPUs (Northwood, Prescott EE, Woodcrest, Harpertown) and NVIDIA GPUs (NV30 through G80 Ultra). Data from [10].

Fig 2.3 shows the discrepancy in floating point capability between different generations of Intel CPUs and NVIDIA GPUs. As shown in Fig 2.3, there is an increasing discrepancy in floating point performance between CPUs and GPGPUs. This is mainly because GPGPUs are specifically designed for compute-intensive and highly parallel applications. Unlike conventional CPUs, GPGPUs devote most of their transistors to data processing as opposed to data caching and flow control. Modern mainstream processors, such as multi-core CPUs and many-core GPGPUs, are parallel systems that continue to scale with Moore's law [10]. The main challenge for these systems is the development of application software that scales its parallelism transparently with the increasing number of processor cores. The Compute Unified Device Architecture (CUDA) is a general purpose parallel computing architecture that was first released by NVIDIA in late 2006. It has a software environment that allows developers to use C (with NVIDIA extensions) as the high level programming language to code algorithms for execution on their GPGPUs. According to NVIDIA documents [10], CUDA is a parallel programming language designed to make parallel programming easy, with a low learning curve. Programs developed using this language are intended to be portable across multiple NVIDIA GPGPU platforms. CUDA essentially hides the underlying hardware architecture of the GPU from the programmer. To do this, CUDA utilizes a structure that has three basic features: a hierarchy of threads, shared memories, and barrier synchronization [10]. These features are all accessible to the programmer via a set of language extensions. As a result, the programmer can easily partition a problem into coarse sub-problems that can be solved independently in parallel, and then these sub-problems are split into finer pieces to be

solved cooperatively in parallel. Such a decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables transparent scalability, since each sub-problem can be scheduled to be solved on any of the available processor cores. A compiled CUDA program can therefore execute on any number of processor cores, and only the runtime system needs to know the physical processor count. The CUDA processing flow is shown in Fig. 2.5: 1) copy the data to be processed from main memory to GPU memory, 2) the CPU instructs the GPU to begin processing, 3) the GPU executes the work in parallel in each core, and 4) copy the result from GPU memory back to main memory.

Fig 2.5. Processing flow on CUDA. Based on figure in [11].
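As an illustration of this flow, a minimal CUDA C sketch is shown below. The kernel, its contents, and all names are illustrative only (they are not the thesis code); the kernel simply scales an array to show the four steps.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel for step 3: one thread processes one array element.
__global__ void scale_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) data[i] *= 2.0f;
}

void run_on_gpu(float *host_data, int n)
{
    float *dev_data;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&dev_data, bytes);
    cudaMemcpy(dev_data, host_data, bytes, cudaMemcpyHostToDevice);  // 1. copy data in
    int threads = 256;                                               // 2./3. launch the kernel
    int blocks  = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(dev_data, n);
    cudaMemcpy(host_data, dev_data, bytes, cudaMemcpyDeviceToHost);  // 4. copy result back
    cudaFree(dev_data);
}
```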

NVIDIA's CUDA GPGPU architecture consists of a set of scalar processors (SPs) operating in parallel. Eight scalar processors are organized into a streaming multiprocessor (SM), with several SMs on a GPGPU (see Fig. 2.6). Each SM has an internal shared memory along with caches (a constant and a texture cache), a multithreaded instruction (MTI) unit, and special functional units. The SMs share a global memory. A large number of threads are typically assigned to each SM. The MTI unit can switch threads frequently to hide long global memory accesses from any of the threads. Both integer and floating point data formats are supported in the SPs. The NVIDIA GeForce 9800 GX2 GPGPU card utilized in this study has two 1.5 GHz G92 graphics processing chips, with each G92 GPGPU containing 128 CUDA scalar processors [12]. The two G92 GPGPUs are placed on two separate printed circuit boards with their own power circuitry and memory, while still requiring only a single PCI-Express 16x slot on the motherboard (a dual-PCB configuration). Each GPGPU has access to a 256-bit GDDR3 memory interface with 512 MB of memory. This amounts to 1 GB of memory for the full card with a peak theoretical bandwidth of 128 GB/s (64 GB/s per GPGPU) [12]. The peak performance of the NVIDIA 9800 GX2 is approximately 1152 GFLOPs (576 GFLOPs per GPGPU) [12].
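These peak figures follow from the usual counting of three floating point operations (a dual-issued multiply-add plus a multiply) per scalar processor per shader clock: 128 SPs x 1.5 GHz x 3 flops is roughly 576 GFLOPs per G92, or about 1152 GFLOPs for the two GPGPUs on the card.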

Fig 2.6. A simplified CUDA GPGPU architecture. Here SP = Scalar Processor, SM = Streaming Multiprocessor, SFU = Special Functional Unit, and MTI = Multi-Threaded Instruction unit. (Each SM contains eight SPs, two SFUs, an MTI unit, caches, and a shared memory; the SMs share a global memory.)

The Tesla C1060 GPGPU has a single graphics processing chip (based on the NVIDIA Tesla T10 GPGPU) with 240 processing cores working at 1.3 GHz [13]. The Tesla C1060 is capable of 933 GFLOPs of processing performance and comes standard with 4 GB of 512-bit GDDR3 memory at 102 GB/s bandwidth [13]. The Tesla S1070 GPGPU has four 1.3 GHz T10 GPGPU cores, each with access to 4 GB of GDDR3 memory (totaling 16 GB) [14]. The card is capable of 3.73 TFLOPs and has a combined memory bandwidth of 408 GB/s [14].

2.4 Related work

Several groups have studied image recognition using spiking neural networks. In general, these studies utilize the integrate and fire model. Thorpe developed SpikeNET [5], a large spiking neural network simulation software package. The system can be used for several image recognition applications, including identification of faces, fingerprints, and video images. Johansson and Lansner developed a large cluster based spiking network simulator of a rodent sized cortex [15]. They tested a small scale version of the system to

identify small pixel images. Baig [16] developed a temporal spiking network model based on integrate and fire neurons and applied it to identify online cursive handwritten characters. Gupta and Long [17] investigated the application of spiking networks for the recognition of simple characters. Other applications of spiking networks include instructing robots in navigation and grasping tasks [18], recognizing temporal sequences [19][20], and the robotic modeling of mouse whiskers [21]. At present several groups are developing biological scale implementations of spiking networks, but they are generally not examining the applications of these systems (primarily because they are modeling large scale neuronal dynamics seen in the brain). The Swiss institution EPFL and IBM are developing a highly biologically accurate brain simulation [4] at the sub-neuron level. They have utilized the Hodgkin-Huxley and the Wilfred Rall [22] models to simulate up to 100,000 neurons on an IBM BlueGene/L supercomputer. At the IBM Almaden Research Center, Ananthanarayanan and Modha [3] utilized the Izhikevich spiking neuron model to simulate 55 million randomly connected neurons (equivalent to a rat-scale cortical model) on a 32,768 processor IBM BlueGene/L supercomputer. Johansson et al. simulated a randomly connected model of 22 million neurons and 11 billion synapses using an 8,192 processor IBM BlueGene/L supercomputer [23]. A few studies have recently examined the performance of spiking neural network models on GPGPUs. Tiesel and Maida [24] evaluated a grid of 400x400 integrate and fire neurons on an NVIDIA 9800 GX2 GPGPU. In their single layer network, each neuron was connected to its four closest neighbors. This network was not trained to recognize any image. They compared a serial implementation on an Intel Core 2 Quad 2.4 GHz

processor to a parallel implementation on an NVIDIA 9800 GX2 GPGPU and achieved a speedup of 14.8 times. Nageswaran et al. [25] examined the implementation of 100,000 randomly connected neurons based on the more biologically accurate Izhikevich model. They compared a serial implementation of the model on a 2.13 GHz Intel Core 2 Duo against a parallel implementation on an NVIDIA GTX280 GPGPU and achieved a speedup of about 24 times. Nageswaran et al. [26] also developed a GPGPU based acceleration of spike based convolutions; a kernel speedup of 35 times was achieved by a single NVIDIA GTX 280 GPGPU over a CPU-only implementation. Fidjeland et al. [27] presented a GPGPU implementation (on a platform named NeMo) of the Izhikevich model and achieved a throughput of 400 million spikes per second. Nere and Lipasti [28] developed a GPGPU implementation of their own neocortex inspired pattern recognition model. A speedup of 373 was achieved over their unoptimized C++ serial implementation. In contrast to these implementations, we implement two spiking neural network models (Izhikevich and Hodgkin-Huxley) capable of image recognition on three different GPGPU platforms as well as a GPGPU cluster, and compare the performance of the three GPGPUs against their corresponding CPUs. Both data and thread level parallelism were utilized in the parallel implementations on those CPUs to ensure a fair comparison.

CHAPTER III

Network Design

A two layer network structure was utilized in this study, with the first layer acting as input neurons and the second layer as output neurons (see Fig. 3.1). Input images were presented to the first layer of neurons, with each image pixel corresponding to a separate input neuron. Thus the number of neurons in the first layer was equal to the number of pixels in the input image. Binary inputs were utilized in this study. If a pixel was on, a constant current was supplied to the corresponding input neuron, while no current was supplied if a pixel was off. The number of output neurons was equal to the number of training images. Each input neuron was connected to all the output neurons.

Fig 3.1. Network used for testing spiking models (level 1 input neurons fully connected to level 2 output neurons).

In the training process, images from the training set were presented sequentially to the input neurons. The weight matrix of each output neuron was updated using the STDP rule each cycle. In the testing phase, an input image is presented to the input neurons and, after a certain number of cycles, one output neuron fires, thus identifying the

input image. During each cycle, the level 1 neurons are first evaluated based on the input image, and a firing vector is updated to indicate which of the level 1 neurons fired that cycle. In the same cycle, the firing vector generated in the previous cycle is used to calculate the input current to each level 2 neuron. The level 2 neuron membrane potentials are then calculated based on their input current. A level 2 neuron's overall input current is the sum of all the individual currents received from the level 1 neurons connected to it. This input current for a level 2 neuron j is given by eq. 7 below:

I_j = \sum_i w(i,j) \, f(i)   (7)

where w is a weight matrix in which w(i,j) is the input weight from level 1 neuron i to level 2 neuron j, and f is a firing vector in which f(i) is 0 if level 1 neuron i does not fire and 1 if it does fire. Two versions of the model were developed, where the main difference was the set of equations utilized to update the potential of the neurons. In one case, the Hodgkin-Huxley equations presented in section 2.1 were used, while in the second case, the Izhikevich equations in section 2.2 were used. The parameters utilized in each case are specified in Appendix I.
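As a reference point for the GPGPU mapping discussed in Chapter IV, eq. 7 can be computed serially as in the C sketch below. The array names and memory layout are illustrative assumptions, not the thesis code; note that a serial (CPU) implementation can skip level 1 neurons that did not fire, a point that becomes relevant in section 5.4.

```c
/* Serial reference for eq. (7): I[j] = sum over i of w(i,j) * f(i).
 * w is assumed stored row-major as w[i*Ne + j]; f[i] is 0 or 1. */
void level2_input_currents(const float *w, const unsigned char *f,
                           float *I, int Ni, int Ne)
{
    for (int j = 0; j < Ne; ++j) I[j] = 0.0f;
    for (int i = 0; i < Ni; ++i) {
        if (!f[i]) continue;               /* skip level 1 neurons that did not fire */
        for (int j = 0; j < Ne; ++j)
            I[j] += w[(size_t)i * Ne + j];
    }
}
```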

CHAPTER IV

GPGPU Mapping

In order to achieve high performance from the GPGPU, all the processing cores need to be utilized at the same time. Each streaming multiprocessor can execute a group of threads (called a warp) concurrently on its scalar processors and can easily switch between warps [29]. To hide long memory latencies, it is recommended that hundreds to thousands of threads be assigned to the GPGPU. In case one warp is waiting on a memory access, the streaming multiprocessor can easily switch to other warps to hide this latency. Fig. 4.1 shows a flowchart of the main components of the algorithm. Three kernels are executed on the GPGPU to update the level 1 and level 2 neurons and their corresponding synaptic currents (the level 2 neurons' input weights). To achieve this high parallelism, the two mapping strategies implemented are described in the following sections.

Fig 4.1. Algorithm flowchart. (Read new image (CPU); update level 1 neurons (kernel 1); update synaptic currents for level 2 neurons (kernel 2); update level 2 neurons (kernel 3); if no level 2 neuron fired (CPU check), repeat from kernel 1, otherwise read the next image.)
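As a preview of the neuronal mapping described in section 4.1 below, kernel 1 of Fig. 4.1 can be sketched as a CUDA kernel in which one thread updates one level 1 neuron (the Izhikevich version is shown; all names and the exact parameter handling are illustrative assumptions, not the thesis code).

```cuda
// One thread updates one level 1 Izhikevich neuron and records whether it fired.
__global__ void update_level1(float *v, float *u, const float *I_in,
                              unsigned char *fired, int Ni,
                              float a, float b, float c, float d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per neuron
    if (i >= Ni) return;
    float vi = v[i], ui = u[i];
    vi += 0.04f*vi*vi + 5.0f*vi + 140.0f - ui + I_in[i];   // eq. (5)
    ui += a * (b*vi - ui);                                  // eq. (6)
    fired[i] = 0;
    if (vi >= 30.0f) { vi = c; ui += d; fired[i] = 1; }     // spike and reset
    v[i] = vi; u[i] = ui;
}
```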

4.1 Neuronal Mapping

The networks utilized in this study have high amounts of data parallelism. For example, the network contains up to 589,824 neurons in level 1, 48 neurons in level 2, and over 28 million synapses. All the level 1 neuron evaluations are carried out by kernel 1 (shown in Fig. 4.1). Each neuron is evaluated by a separate thread on the GPGPU, thus making the total number of threads available equal to the total number of neurons in this layer. This amounts to a very large number of threads (up to 589,824) available for execution. Before being mapped to GPGPU streaming multiprocessors, threads are arranged into a two level hierarchy: grid level and block level. All the threads of the same kernel (Fig. 4.1) belong to the same grid. Multiple kernels can be called from the host. Each grid can contain up to 65,535 thread blocks per grid dimension, while each thread block can contain up to 512 threads. To achieve high performance, no fewer than 128 threads should be assigned to each thread block. Each thread block is distributed to a different streaming multiprocessor. All the threads of the same thread block are always executed on the same streaming multiprocessor. Threads of a thread block are split into multiple warps, with 32 threads in a warp. One streaming multiprocessor can run 32 threads (1 warp) on its eight scalar processors concurrently. Streaming multiprocessors can efficiently switch among warps to hide long memory accesses. Unlike vector machines, which typically utilize SIMD (single instruction multiple data), operations in the CUDA architecture are carried out using SIMT (single instruction multiple thread). Each CUDA thread has its own program counter and register state, and thus can run independently of the others. In a SIMD processor, all elements of a vector are

operated on in lock step. To achieve high performance, it is necessary for all threads in a warp to follow the same execution path. Therefore, branches and divergence within a warp should be avoided as far as possible.

Table 4.1. Threads hierarchy and warps. (For each image size and for both the Izhikevich and Hodgkin-Huxley models, the table lists the number of grids, number of threads, threads per block, number of thread blocks, warp size, and number of warps.)

4.2 Synapses Mapping

The summation of the neuron weights described in eq. 7 is carried out for one level 2 neuron at a time (kernel 2 in Fig. 4.1). Here the neuronal mapping described in the previous section can no longer provide mass parallelism or high performance, because only the postsynaptic currents of the 48 level 2 neurons need to be calculated, so that mapping would provide only 48 threads. In this case, a much larger workload would be assigned to a much smaller number of threads, leading to very poor performance. So instead of mapping each neuron to a thread, each synaptic weight is mapped to one thread. Since there are over 28 million synapses in total, mass parallelism is again achieved by this mapping method. For a particular level 2 neuron, j, all the elements in the masked weight vector w(i,j)f(i) have to be added together to generate the final input current. This is done by splitting w(i,j)f(i) into multiple sub-vectors, with the sum of all the sub-vectors accumulated in GPGPU shared memory. When all the sub-vectors

have been added, a parallel reduction [30] is used to generate the input current by adding all the elements of the accumulated vector (a hedged sketch of such a kernel is shown below). Three kernels are executed on the GPGPU to update the level 1 and level 2 neurons and their corresponding synaptic currents (the level 2 neurons' input weights). Only the check of whether a level 2 neuron fired (and thus categorized an input image) is carried out on the CPU, as this is a serial operation. All the neuron parameters are implemented using single precision floating point variables on the GPGPU.

4.3 MPI Mapping

The limited memory associated with each GPGPU core (at most 4 GB in the architectures considered) sets a limit on the largest network size that can be evaluated in parallel. Thus, to evaluate a large neural network on a single GPGPU, it would be necessary to split the network into multiple parts and evaluate each part separately. This would make the evaluation slow. A cluster of GPGPUs offers a faster alternative, where multiple components of a large network can be evaluated in parallel on separate GPGPU cores. The Message Passing Interface (MPI) provides a good way for the different GPGPU cores to communicate. By using MPI, networks larger than what fits on a single GPGPU core can be implemented while still achieving very high performance. The architecture of the GPGPU cluster mapping is shown in Fig 4.2. In order to achieve good performance in the GPGPU cluster, the GPGPU to CPU ratio should be equal to one, so that each GPGPU core can be attached to a unique CPU and work as its private co-processor.
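Returning briefly to the synaptic current summation of section 4.2, a hedged sketch of kernel 2 is given below: each thread accumulates masked weights w(i,j)f(i) for a strided subset of the level 1 neurons, the per-thread sums are combined by a shared memory tree reduction, and one partial sum per block is written out for a final, much smaller summation (on the GPGPU or the host). The kernel assumes 256 threads per block; all names are illustrative assumptions, not the thesis code.

```cuda
// Partial sums of the masked weight vector w(i,j)*f(i) for one level 2 neuron j.
// Launch with 256 threads per block; block_sums has one entry per block.
__global__ void synapse_partial_sums(const float *w, const unsigned char *fired,
                                     float *block_sums, int Ni, int Ne, int j)
{
    __shared__ float partial[256];
    int tid = threadIdx.x;
    float sum = 0.0f;
    // Each thread accumulates a strided subset of the Ni synapses of neuron j.
    for (int i = blockIdx.x * blockDim.x + tid; i < Ni; i += blockDim.x * gridDim.x)
        sum += w[(size_t)i * Ne + j] * (float)fired[i];
    partial[tid] = sum;
    __syncthreads();
    // Tree reduction in shared memory (blockDim.x is a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = partial[0];
}
```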

Fig 4.2. GPGPU cluster mapping. (A master CPU, CPU0, coordinates worker CPUs 1 through N, each of which is paired with its own GPU, GPU1 through GPUN.)

From the MPI point of view, one Tesla S1070 card is treated as four individual T10 GPGPUs with 4 GB of memory each. Here the phrase "one node" does not mean a physical computer node (see section 4.4) but a virtual node consisting of a CPU core with DRAM and a T10 GPGPU core together with its 4 GB of on-board memory. In each node, both the update of the subset of level 1 neurons and the corresponding weight summation are carried out in the same manner as in the non-MPI GPGPU implementation described in the earlier sections. The 48 level 2 neurons are instead updated by the master CPU. To make full use of each node, larger images are segmented into multiple sub-images whose size is equal to the largest size feasible in a single GPGPU core (described in detail in section 5.3). These sub-images are processed individually and concurrently using multiple GPGPU cores. In the N-node GPGPU cluster based MPI implementation (as shown in Fig 4.2), the algorithm is mapped to the GPGPU cluster as follows:

1. Each node reads its sub-image from the hard drive and transmits it to its GPGPU's on-board memory.

2. The GPGPU core on each node evaluates the level 1 neurons of its sub-image each cycle to determine which of them fired; after calculating the post-synaptic current for this sub-image, a 48-element vector (the post-synaptic current contribution of the sub-image) is transmitted to the master CPU.

3. The master CPU then adds the 48-element vectors from the N nodes to get the total post-synaptic current. The master CPU then updates the level 2 neurons until one of them fires, indicating the final recognition result.

The number of nodes used in this thesis ranges from 4 to 9. The performance and timing breakdown of this MPI implementation are shown in the results chapter.

4.4 Experimental Setup

The models were implemented on three GPGPU platforms and one GPGPU cluster. These are:

1) An Ubuntu Linux based PC with two 2.59 GHz dual core AMD Opteron 285 processors, 4 GB of DRAM, and an NVIDIA GeForce 9800 GX2 graphics card with two G92 GPGPUs.

2) A Red Hat Enterprise Linux based PC with one 2.67 GHz quad core Xeon 5550 processor and 12 GB of DRAM. Two versions of this system were utilized: one had a single NVIDIA Tesla C1060 GPGPU, and the other had two NVIDIA C1060 GPGPUs, both placed on the same motherboard but in two different PCI-Express slots.

3) A Fedora 10 Linux based PC with two 2.4 GHz dual core AMD Opteron 2216 processors, 8 GB of DRAM, and an NVIDIA Tesla S1070 GPGPU.

4) A 32-node Fedora 10 Linux based cluster with an InfiniBand interconnect; each node is equipped with two 2.4 GHz dual core AMD Opteron 2216 processors, 8 GB of DRAM, and an NVIDIA Tesla S1070 GPGPU.

A C implementation of the models was run on one of the AMD Opteron processors for systems 1 and 3 above, and on the Intel Xeon processor. The C code utilized the SSE3 SIMD extensions for data parallelism. It also utilized the POSIX threads library to run on all available cores (two on the AMD processors and four on the Xeon processor). The programs were compiled with -O3 optimizations using gcc. The GPGPU implementation of the model was compiled with the CUDA 2.3 nvcc compiler. Four networks with varying input image sizes were developed to examine the two spiking neural network models. The overall network structure was kept similar to the design shown in Fig. 3.1, with two layers of nodes per network. Each of the level 2 neurons was connected to all of the neurons from level 1. Table 4.2 shows the number of neurons per layer for all the network sizes implemented. In this study, all the spiking networks of different sizes were trained to recognize the same number of images. A set of images was generated initially and then scaled linearly to the different network input sizes. Fig. 4.4 shows the training images used. The images represented the 26 upper case letters (A-Z), 10 numerals (0-9), 8 Greek letters, and 4 symbols.

Table 4.2. Spiking network configurations evaluated.
Input image size: 384x384 | 768x768 | 1536x1536 | 3072x3072
Total neurons: 147,504 | 589,872 | 2,359,344 | 9,437,232
Level 2 neurons: 48 | 48 | 48 | 48
Level 1 neurons: 147,456 | 589,824 | 2,359,296 | 9,437,184

4.5 GPGPU cluster configurations

As shown in Fig 4.2, the three basic components of the GPGPU cluster utilized in this study are the host nodes, the GPGPUs, and the interconnect between the host nodes. The GPGPU cluster used in this study is the Accelerator Cluster (AC) [31] at the National Center for Supercomputing Applications (NCSA) [32]. The structure of the AC cluster is shown in Fig 4.3. It has 32 nodes connected by InfiniBand. Each node is an HP xw9400 workstation equipped with two 2.4 GHz AMD Opteron dual-core 2216 processors with 1 MB of L2 cache and a 1 GHz HyperTransport link, and 8 GB (4x2 GB) of DDR2-667 memory. Each node has two PCIe Gen1 x16 slots and one PCIe x8 slot. The two x16 slots are used to connect to a single Tesla S1070 Computing System (4 GPGPUs) and the x8 slot is used to connect to an InfiniBand QDR adapter. Each node of the cluster runs Fedora 10 Linux. Detailed configurations and the architecture are presented in Table 4.3 and Fig 4.3. This cluster is well balanced because:

1. Node interconnection: An InfiniBand QDR interconnect is used to match the GPGPU/host bandwidth. In contrast to slower Ethernet interconnections (1-2 GB/s) through a network card, the InfiniBand QDR interconnect (4 GB/s) connects the different nodes via PCIe slots and QDR adapters [31].

2. CPU/GPGPU ratio: This is equal to one in order to simplify the MPI based implementations [31].

3. Host memory: This is larger than the on-board memory of a GPGPU core in order to enable full utilization of the GPGPU core on that node [31].

Fig 4.3. Architecture of the GPGPU cluster: each HP xw9400 workstation node holds two dual core CPUs with DRAM, connected over PCIe x16 interfaces to the four T10 GPGPUs (each with its own DRAM) of a Tesla S1070, and the nodes are connected by QDR InfiniBand. (Although all 32 nodes have the same configuration, only node 0 is shown in detail.) Based on figure in [31].

Table 4.3. GPGPU cluster configuration (based on table in [31]).
CPU host: HP xw9400
CPU: Dual-core 2216 AMD Opteron
CPU frequency (GHz): 2.4
CPU cores per node: 4
CPU memory per host (GB): 8
GPU host: Tesla S1070
GPU frequency (GHz): 1.3
GPU chips per node: 4
GPU memory per host (GB): 16
CPU/GPU ratio: 1
Interconnect: IB QDR
Number of cluster nodes: 32
Number of CPU cores: 128
Number of GPU chips: 128
Number of racks: 5
Total power rating (kW): <45
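To make the per-cycle exchange of section 4.3 concrete, a hedged sketch of the combination of the 48-element post-synaptic current vectors is shown below. The thesis does not state whether point-to-point messages or a collective operation is used; MPI_Reduce with MPI_SUM is one natural way to express the element-wise summation performed by the master, and all names here are illustrative.

```c
#include <mpi.h>

#define NE 48   /* number of level 2 neurons */

/* One cycle of the exchange in section 4.3: every node has computed, on its
 * GPGPU, the 48-element partial current vector for its sub-image (local_I);
 * the vectors are summed element-wise into total_I on the master (rank 0),
 * which then updates the level 2 neurons and checks for a spike. */
void exchange_currents(const float *local_I, float *total_I, int rank)
{
    MPI_Reduce(local_I, total_I, NE, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* Master: update the 48 level 2 neurons with total_I and check
         * whether any of them fired (the recognition result). */
    }
}
```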

Fig 4.4. Training images.

CHAPTER V

Results

Both the CPU and the GPGPU versions of the models were able to recognize all the training images (Fig. 4.4). They were also able to recognize nine out of ten noisy versions of these images accurately (Fig. 5.1), with image #6 not being recognized.

Fig 5.1. Additional test images (ten noisy test images, numbered 1 through 10).

5.1 Single GPGPU card performances

The non-MPI neurons per second throughput of the GPGPUs and the CPUs is shown in Fig. 5.2 for the Izhikevich and Hodgkin-Huxley models. Note that the GeForce 9800 GX2 card was unable to implement the larger image sizes due to limited on-board memory. The performances of the two AMD processors were almost the same, and so only one of them (the 2.4 GHz version) is plotted in Fig. 5.2. The performances of both Tesla C1060 based platforms, with one and two C1060s, are shown (the two-C1060 platform is listed as C1060 x2). The results show that the performance of the GPGPUs is

higher than that of the general purpose processors examined. The fastest GPGPU and CPU utilized in this study were the Tesla S1070 and the quad core 2.67 GHz Xeon processor respectively. Comparing the performances of these two systems in Fig. 5.2 shows that the fastest GPGPU utilized (S1070) provided speedups of 5.6 and 84.4 over the fastest CPU (Xeon) for the Izhikevich and Hodgkin-Huxley models respectively (for the largest image size). The speedups of each GPGPU over the CPU on its own platform are shown in Tables 5.1 and 5.2. Given that the C1060 is paired with the fastest processor (an Intel Xeon processor with four cores), the speedup for this GPGPU was the lowest. The Izhikevich model provided a lower speedup as it is less computationally demanding (further discussion below).

Table 5.1. Speedups achieved on four GPGPU platforms for the Izhikevich model for different image sizes (rows: S1070, 9800 GX2, C1060 x1, C1060 x2).

Table 5.2. Speedups achieved on four GPGPU platforms for the Hodgkin-Huxley model for different image sizes (rows: S1070, 9800 GX2, C1060 x1, C1060 x2).

For both the Izhikevich and Hodgkin-Huxley models, several input image sizes were examined. In the Hodgkin-Huxley model each neuron requires many more parameters, thus increasing the overall memory

footprint of the model. As a result, the larger images could not be implemented for the Hodgkin-Huxley model due to the limited memory on the GeForce 9800 GX2 GPGPU.

Fig 5.2. Neurons/second throughput of GPGPU and CPU processors (S1070, C1060 x2, C1060, 9800 GX2, Xeon, AMD) for (a) the Izhikevich model and (b) the Hodgkin-Huxley model.

The runtime breakdown of the GPGPU versions of the models for the 768x768 image size is shown in Fig. 5.3. Both the level 1 and level 2 neuron computations (kernels 1 and 3 in Fig. 4.1), along with the level 2 neuron input weight computations (kernel 2), take place on the GPGPU. The rest of the runtime is for: 1) main memory access, including data transfer between the host (CPU) memory and the device (GPGPU) memory, and 2) CPU operations, including initialization and checking all 48 level 2 neuron voltages to determine which one fired (the final recognition result). Given that the number of level 2 neurons does not vary with the input image size, the computation time for the level 2 neurons remains constant for a given model. The runtime for the remaining components does vary with the number of level 1 neurons, and hence the input image size. It is seen in Fig. 5.3 that the Izhikevich model has a higher percentage of time spent on CPU and memory access compared to the Hodgkin-Huxley model. This leads to a higher overhead ratio in the Izhikevich model, and thus a lower speedup. The bytes of data required per flop for the models are listed in Table 5.3, which also lists the peak bytes per flop capability of the three GPGPUs examined (calculated from the specifications in [12], [13] and [14]). Table 5.4 also lists the GFLOPs actually achieved by the different GPGPU cards for one of the network configurations. The results show that the memory bandwidth requirement for the Izhikevich model is about 19 times the peak capability of the Tesla GPGPUs, while the bandwidth requirement for the Hodgkin-Huxley model is about 3.5 times the capability of these cards. This also causes the speedup of the Izhikevich model to be lower than that of the Hodgkin-Huxley model.

Table 5.3. Bytes/Flop requirement of the models and capabilities of the GPGPU cards.
Model: Izhikevich: 2.11
Model: Hodgkin-Huxley: 0.38
Card: S1070: approximately 0.11
Card: C1060: approximately 0.11
Card: 9800 GX2: approximately 0.11

Table 5.4. GFLOPs achieved by the different GPGPU cards for the Izhikevich and Hodgkin-Huxley models, together with the peak GFLOPs (rows: 9800 GX2, C1060, C1060 x2, S1070).

As shown in Table 5.4, the performance achieved by the double C1060 GPGPU configuration is about twice as high as the performance of a single C1060 GPGPU. Also, the performance of the Tesla S1070 GPGPU is about 4 times as high as that of a single C1060 GPGPU. This is because the Tesla C1060 has one Tesla T10 GPGPU inside with 4 GB of memory, while the Tesla S1070 has four such T10 GPGPUs, each with its own 4 GB memory bank. Thus the performance of the S1070 is about 4 times that of the C1060; the performance increases linearly with the number of cores being used. Even though the C1060 card is a more advanced GPGPU compared to the 9800 GX2, the performance of a single C1060 card is no better than that of a 9800 GX2 card. This is because the 9800 GX2 has a dual-PCB configuration (as described in section 2.3) which includes two G92 GPGPUs. The total number of CUDA cores on a 9800 GX2 is 256 (128 per G92 GPGPU), which is more than the total of 240 CUDA cores in a C1060 card. Hence the 9800 GX2 card has a peak performance of 1152 GFLOPs while the C1060 has a peak

performance of 933 GFLOPs. Moreover, the 9800 GX2 GPGPU has a smaller (1 GB) but faster (128 GB/s) memory compared with the larger (4 GB) but slower (102 GB/s) memory of the C1060 GPGPU. So both the larger number of CUDA cores and the faster memory help the 9800 GX2 card achieve better performance than a single C1060 card.

Fig 5.3. Timing breakdown by percentage of overall runtime for processing the 768x768 image using (a) the Izhikevich model and (b) the Hodgkin-Huxley model. (Components: L1 (GPU), L2 (GPU), L2 weight (GPU), and CPU/memory; platforms: 9800 GX2, C1060, C1060 x2, S1070.)

5.2 MPI performances

Since the performance of a single GPGPU is limited by both its capacity for parallelization and its on-board memory space, MPI is utilized here for the high performance implementation of larger networks. Both the memory space and the total amount of parallelism increase linearly with the number of nodes devoted to this parallel computation. To make full use of each node, larger images are segmented into multiple sub-images, with the individual sub-images processed concurrently using multiple GPGPU cores. The size of each sub-image is equal to the largest size feasible in a single GPGPU core (described in detail in section 5.3). As shown in Table 5.5 and Fig 5.4, the smaller of the two large images is broken into four segments for both the Izhikevich and Hodgkin-Huxley models. Similarly, the larger image is broken into nine sub-images and processed by nine GPGPU cores. The size of the sub-image processed in each GPGPU core is the same in both cases. Therefore the speedup for larger images is slightly less than the maximum seen for each of the models in the previous results, because of task synchronizations and the memory bandwidth limitation between the host (CPU) memory and the device (GPGPU) memory.

Table 5.5. GPGPU cluster run time (ms) for the two larger image sizes (columns: image size, number of nodes, Izhikevich run time, Hodgkin-Huxley run time).

Fig 5.4. Neurons/second throughput of the GPGPU cluster versus the number of T10 GPGPU cores, for the Izhikevich and Hodgkin-Huxley models.

The MPI timing breakdown is shown in Table 5.6. The MPI communication time is a very small fraction of the whole MPI run time and has almost no influence on the system performance, but it does change with the number of nodes used. The more nodes devoted to parallel processing, the longer it takes to communicate via the InfiniBand connections between the worker nodes and the master node. The communication time versus the number of nodes is plotted in Fig 5.5.

Table 5.6. MPI timing breakdown (ms). (For each image size and number of nodes, and for each of the Izhikevich and Hodgkin-Huxley models, the table lists the communication time, the level 2 neuron update time, the level 1 neuron update time, and the total run time.)

Fig 5.5. MPI communication time (ms) versus the number of T10 GPGPU cores.

5.3 Memory analysis

For the two models (Izhikevich and Hodgkin-Huxley) implemented in this study, a network with N_i level 1 neurons and N_e level 2 neurons has N_i x N_e synapses. Since all neuronal parameters were represented with 32-bit single precision floating point numbers, each neuron parameter requires 4 bytes of memory. As a result, each neuron of the Izhikevich model (with 4 parameters) requires 16 bytes, and each neuron of the Hodgkin-Huxley model (with 25 parameters) requires 100 bytes. The memory space required by the different network elements for varying network configurations is shown in Table 5.7.

Table 5.7. Memory required by the different network components (bytes).
Level 1 neurons (N_i of them): Izhikevich (4 N_i) x 4; Hodgkin-Huxley (25 N_i) x 4
Level 2 neurons (N_e of them): Izhikevich (4 N_e) x 4; Hodgkin-Huxley (25 N_e) x 4
Synaptic weights: (N_e N_i + 2 N_i + N_e) x 4 for both models
Total memory: Izhikevich (N_e N_i + 6 N_i + 5 N_e) x 4; Hodgkin-Huxley (N_e N_i + 27 N_i + 26 N_e) x 4
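As a worked example of Table 5.7, consider the configuration with N_i = 589,824 level 1 neurons and N_e = 48 level 2 neurons: the Izhikevich total is (48 x 589,824 + 6 x 589,824 + 5 x 48) x 4 bytes, or roughly 127 MB, and the Hodgkin-Huxley total is (48 x 589,824 + 27 x 589,824 + 26 x 48) x 4 bytes, or roughly 177 MB; both are dominated by the N_e N_i synaptic weight term of about 113 MB.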

Fig 5.6. Memory requirement for different configurations: required GPU memory versus number of neurons (x10^6) for the Izhikevich and Hodgkin-Huxley models, with the memory capacities of the GPGPU cluster, the NVIDIA S1070, the NVIDIA C1060, and the NVIDIA 9800 GX2 marked.

Based on Table 5.7, the total amount of memory space required by the different network configurations is plotted in Fig 5.6. As shown in Fig 5.6, the amount of memory required increases almost linearly with the number of neurons. The number of neurons is shown on the x-axis, ranging from 147k to 84,935k. The amount of memory required is shown on the y-axis and ranges from 28 MB to 23.4 GB. The NVIDIA GeForce 9800 GX2 card includes two G92 GPGPUs, each with a dedicated global memory of 512 MB, so for the Izhikevich model only the smaller image sizes could be implemented. Each T10 core in both the C1060 and S1070 platforms includes 4 GB of global memory, so considerably larger configurations can be implemented on each T10 GPGPU core. For the NVIDIA Tesla GPGPU cluster, 1 to 9 GPGPU cores are used in the MPI implementations, allowing the largest image sizes to be implemented.

Even larger images are implemented using MPI by segmenting them into multiple sub-images, because a single T10 GPGPU core can accommodate only image sizes up to a certain maximum. By using multiple T10 GPGPU cores, with each core processing one sub-image, a very large image can be reconstructed and processed. In this thesis, the two largest images are processed with four and nine nodes respectively, and the size of the sub-image processed in each GPGPU core is the same in both cases.

5.4 Firing rate analysis

In the CPU implementations of the models, the weight summations in eq. 7 can be carried out selectively by considering only the level 1 neurons that fired (i.e., those for which f(i) in eq. 7 is 1). In the version of the CUDA language constructs supported on the GPGPUs utilized, we have not seen any method to easily determine whether any of the level 1 neurons evaluated in parallel (over 100) have fired (this can be done only through a slow serial procedure). As a result, we evaluate eq. 7 for all level 1 neurons. Thus, a variation in the level 1 neuron firing rate affects the CPU runtime, but not the GPGPU runtime. Fig. 5.7 examines the speedup seen with variations in the level 1 firing rate for the Izhikevich and Hodgkin-Huxley models respectively on the GeForce 9800 GX2 platform over the AMD processor paired with it. In the Izhikevich model, there is a significant change in the speedup, with the highest speedup seen when all the level 1 neurons fired (providing a speedup of about 20 times). In this case the CPU run time is the highest, while the GPGPU runtime remains the same. In the Hodgkin-Huxley model, the weight computation time is a tiny fraction of the overall runtime (as seen in Fig. 5.3), and so the speedup varies far less with the firing rate.


More information

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS SUDHAKARAN.G APCF, AERO, VSSC, ISRO 914712564742 g_suhakaran@vssc.gov.in THOMAS.C.BABU APCF, AERO, VSSC, ISRO 914712565833

More information

Trends in High-Performance Computing for Power Grid Applications

Trends in High-Performance Computing for Power Grid Applications Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views

More information

Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1

Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1 Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion

More information

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,

More information

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),

More information

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Technology brief Introduction... 2 GPU-based computing... 2 ProLiant SL390s GPU-enabled architecture... 2 Optimizing

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy

More information

L20: GPU Architecture and Models

L20: GPU Architecture and Models L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.

More information

Oracle Database Scalability in VMware ESX VMware ESX 3.5

Oracle Database Scalability in VMware ESX VMware ESX 3.5 Performance Study Oracle Database Scalability in VMware ESX VMware ESX 3.5 Database applications running on individual physical servers represent a large consolidation opportunity. However enterprises

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

IP Video Rendering Basics

IP Video Rendering Basics CohuHD offers a broad line of High Definition network based cameras, positioning systems and VMS solutions designed for the performance requirements associated with critical infrastructure applications.

More information

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la

More information

The Central Processing Unit:

The Central Processing Unit: The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Objectives Identify the components of the central processing unit and how they work together and interact with memory Describe how

More information

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.

More information

GPGPU Computing. Yong Cao

GPGPU Computing. Yong Cao GPGPU Computing Yong Cao Why Graphics Card? It s powerful! A quiet trend Copyright 2009 by Yong Cao Why Graphics Card? It s powerful! Processor Processing Units FLOPs per Unit Clock Speed Processing Power

More information

Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture.

Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture. Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture. Chirag Gupta,Sumod Mohan K cgupta@clemson.edu, sumodm@clemson.edu Abstract In this project we propose a method to improve

More information

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures 11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the

More information

Evaluation of CUDA Fortran for the CFD code Strukti

Evaluation of CUDA Fortran for the CFD code Strukti Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center

More information

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,

More information

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

More information

HP ProLiant SL270s Gen8 Server. Evaluation Report

HP ProLiant SL270s Gen8 Server. Evaluation Report HP ProLiant SL270s Gen8 Server Evaluation Report Thomas Schoenemeyer, Hussein Harake and Daniel Peter Swiss National Supercomputing Centre (CSCS), Lugano Institute of Geophysics, ETH Zürich schoenemeyer@cscs.ch

More information

Choosing a Computer for Running SLX, P3D, and P5

Choosing a Computer for Running SLX, P3D, and P5 Choosing a Computer for Running SLX, P3D, and P5 This paper is based on my experience purchasing a new laptop in January, 2010. I ll lead you through my selection criteria and point you to some on-line

More information

Introduction to GPU Architecture

Introduction to GPU Architecture Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three

More information

Real-time processing the basis for PC Control

Real-time processing the basis for PC Control Beckhoff real-time kernels for DOS, Windows, Embedded OS and multi-core CPUs Real-time processing the basis for PC Control Beckhoff employs Microsoft operating systems for its PCbased control technology.

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware

More information

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology, Computer Engineering by Amol

More information

Computer Graphics Hardware An Overview

Computer Graphics Hardware An Overview Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and

More information

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Josef Pelikán Charles University in Prague, KSVI Department, Josef.Pelikan@mff.cuni.cz Abstract 1 Interconnect quality

More information

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance

More information

NVIDIA GeForce GTX 580 GPU Datasheet

NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines

More information

Chapter 2 Parallel Architecture, Software And Performance

Chapter 2 Parallel Architecture, Software And Performance Chapter 2 Parallel Architecture, Software And Performance UCSB CS140, T. Yang, 2014 Modified from texbook slides Roadmap Parallel hardware Parallel software Input and output Performance Parallel program

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

An examination of the dual-core capability of the new HP xw4300 Workstation

An examination of the dual-core capability of the new HP xw4300 Workstation An examination of the dual-core capability of the new HP xw4300 Workstation By employing single- and dual-core Intel Pentium processor technology, users have a choice of processing power options in a compact,

More information

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Qingyu Meng, Alan Humphrey, Martin Berzins Thanks to: John Schmidt and J. Davison de St. Germain, SCI Institute Justin Luitjens

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud.

The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud. White Paper 021313-3 Page 1 : A Software Framework for Parallel Programming* The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud. ABSTRACT Programming for Multicore,

More information

Comparing the performance of the Landmark Nexus reservoir simulator on HP servers

Comparing the performance of the Landmark Nexus reservoir simulator on HP servers WHITE PAPER Comparing the performance of the Landmark Nexus reservoir simulator on HP servers Landmark Software & Services SOFTWARE AND ASSET SOLUTIONS Comparing the performance of the Landmark Nexus

More information

~ Greetings from WSU CAPPLab ~

~ Greetings from WSU CAPPLab ~ ~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)

More information

Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors

Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors Joe Davis, Sandeep Patel, and Michela Taufer University of Delaware Outline Introduction Introduction to GPU programming Why MD

More information

Improved LS-DYNA Performance on Sun Servers

Improved LS-DYNA Performance on Sun Servers 8 th International LS-DYNA Users Conference Computing / Code Tech (2) Improved LS-DYNA Performance on Sun Servers Youn-Seo Roh, Ph.D. And Henry H. Fong Sun Microsystems, Inc. Abstract Current Sun platforms

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides

More information

Lattice QCD Performance. on Multi core Linux Servers

Lattice QCD Performance. on Multi core Linux Servers Lattice QCD Performance on Multi core Linux Servers Yang Suli * Department of Physics, Peking University, Beijing, 100871 Abstract At the moment, lattice quantum chromodynamics (lattice QCD) is the most

More information

Texture Cache Approximation on GPUs

Texture Cache Approximation on GPUs Texture Cache Approximation on GPUs Mark Sutherland Joshua San Miguel Natalie Enright Jerger {suther68,enright}@ece.utoronto.ca, joshua.sanmiguel@mail.utoronto.ca 1 Our Contribution GPU Core Cache Cache

More information

BLM 413E - Parallel Programming Lecture 3

BLM 413E - Parallel Programming Lecture 3 BLM 413E - Parallel Programming Lecture 3 FSMVU Bilgisayar Mühendisliği Öğr. Gör. Musa AYDIN 14.10.2015 2015-2016 M.A. 1 Parallel Programming Models Parallel Programming Models Overview There are several

More information

Recommended hardware system configurations for ANSYS users

Recommended hardware system configurations for ANSYS users Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range

More information

HP Workstations graphics card options

HP Workstations graphics card options Family data sheet HP Workstations graphics card options Quick reference guide Leading-edge professional graphics February 2013 A full range of graphics cards to meet your performance needs compare features

More information

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase

More information

Data Centric Systems (DCS)

Data Centric Systems (DCS) Data Centric Systems (DCS) Architecture and Solutions for High Performance Computing, Big Data and High Performance Analytics High Performance Computing with Data Centric Systems 1 Data Centric Systems

More information

Overview of HPC Resources at Vanderbilt

Overview of HPC Resources at Vanderbilt Overview of HPC Resources at Vanderbilt Will French Senior Application Developer and Research Computing Liaison Advanced Computing Center for Research and Education June 10, 2015 2 Computing Resources

More information

Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms

Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms Björn Rocker Hamburg, June 17th 2010 Engineering Mathematics and Computing Lab (EMCL) KIT University of the State

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas

More information

AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL. AMD Embedded Solutions 1

AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL. AMD Embedded Solutions 1 AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL AMD Embedded Solutions 1 Optimizing Parallel Processing Performance and Coding Efficiency with AMD APUs and Texas Multicore Technologies SequenceL Auto-parallelizing

More information

MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu

MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 1 MapReduce on GPUs Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 2 MapReduce MAP Shuffle Reduce 3 Hadoop Open-source MapReduce framework from Apache, written in Java Used by Yahoo!, Facebook, Ebay,

More information

Turbomachinery CFD on many-core platforms experiences and strategies

Turbomachinery CFD on many-core platforms experiences and strategies Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29

More information

CHAPTER 2: HARDWARE BASICS: INSIDE THE BOX

CHAPTER 2: HARDWARE BASICS: INSIDE THE BOX CHAPTER 2: HARDWARE BASICS: INSIDE THE BOX Multiple Choice: 1. Processing information involves: A. accepting information from the outside world. B. communication with another computer. C. performing arithmetic

More information

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007 Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

More information

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems 202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric

More information

ME964 High Performance Computing for Engineering Applications

ME964 High Performance Computing for Engineering Applications ME964 High Performance Computing for Engineering Applications Intro, GPU Computing February 9, 2012 Dan Negrut, 2012 ME964 UW-Madison "The Internet is a great way to get on the net. US Senator Bob Dole

More information

Discovering Computers 2011. Living in a Digital World

Discovering Computers 2011. Living in a Digital World Discovering Computers 2011 Living in a Digital World Objectives Overview Differentiate among various styles of system units on desktop computers, notebook computers, and mobile devices Identify chips,

More information

Building a Top500-class Supercomputing Cluster at LNS-BUAP

Building a Top500-class Supercomputing Cluster at LNS-BUAP Building a Top500-class Supercomputing Cluster at LNS-BUAP Dr. José Luis Ricardo Chávez Dr. Humberto Salazar Ibargüen Dr. Enrique Varela Carlos Laboratorio Nacional de Supercómputo Benemérita Universidad

More information

Lecture 1. Course Introduction

Lecture 1. Course Introduction Lecture 1 Course Introduction Welcome to CSE 262! Your instructor is Scott B. Baden Office hours (week 1) Tues/Thurs 3.30 to 4.30 Room 3244 EBU3B 2010 Scott B. Baden / CSE 262 /Spring 2011 2 Content Our

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers

Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers Haohuan Fu haohuan@tsinghua.edu.cn High Performance Geo-Computing (HPGC) Group Center for Earth System Science Tsinghua University

More information

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit.

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit. Objectives The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Identify the components of the central processing unit and how they work together and interact with memory Describe how

More information

Communicating with devices

Communicating with devices Introduction to I/O Where does the data for our CPU and memory come from or go to? Computers communicate with the outside world via I/O devices. Input devices supply computers with data to operate on.

More information

Optimizing Application Performance with CUDA Profiling Tools

Optimizing Application Performance with CUDA Profiling Tools Optimizing Application Performance with CUDA Profiling Tools Why Profile? Application Code GPU Compute-Intensive Functions Rest of Sequential CPU Code CPU 100 s of cores 10,000 s of threads Great memory

More information