ACCELERATION OF SPIKING NEURAL NETWORK ON GENERAL PURPOSE GRAPHICS PROCESSORS


ACCELERATION OF SPIKING NEURAL NETWORK ON GENERAL PURPOSE GRAPHICS PROCESSORS

Thesis Submitted to The School of Engineering of the UNIVERSITY OF DAYTON

In Partial Fulfillment of the Requirements for The Degree Master of Science in Electrical Engineering

by Bing Han

UNIVERSITY OF DAYTON, Dayton, Ohio, May 2010

ACCELERATION OF SPIKING NEURAL NETWORK ON GENERAL PURPOSE GRAPHICS PROCESSORS

APPROVED BY:

Tarek M. Taha, Ph.D., Advisory Committee Chairman, Department of Electrical and Computer Engineering
John Loomis, Ph.D., Committee Member, Department of Electrical and Computer Engineering
Eric Balster, Ph.D., Committee Member, Department of Electrical and Computer Engineering
Malcolm W. Daniels, Ph.D., Associate Dean, School of Engineering
Tony E. Saliba, Ph.D., Dean, School of Engineering

Copyright by Bing Han. All rights reserved. 2010

ABSTRACT

ACCELERATION OF SPIKING NEURAL NETWORK ON GENERAL PURPOSE GRAPHICS PROCESSORS

Name: Han, Bing
University of Dayton
Advisor: Dr. Tarek M. Taha

There is currently a strong push in the research community to develop biological scale implementations of neuron based vision models. Systems at this scale are computationally demanding and have generally utilized the more accurate neuron models, such as the Izhikevich and Hodgkin-Huxley models, rather than the more popular integrate and fire model. This thesis examines the feasibility of using GPGPUs to accelerate a spiking neural network based character recognition network in order to enable large scale neural systems. Two versions of the network, utilizing the Izhikevich and Hodgkin-Huxley models, are implemented. Three NVIDIA GPGPU platforms and one GPGPU cluster were examined. These include the GeForce 9800 GX2, the Tesla C1060, and the Tesla S1070 platforms, and the 32-node Tesla S1070 GPGPU cluster. Our results show that GPGPUs can provide significant speedups over conventional processors. In particular, the fastest GPGPU utilized, the Tesla S1070, provided speedups of 5.6 and 84.4 times over highly optimized implementations on the fastest CPU tested, a quad core 2.67 GHz Xeon processor, for the Izhikevich and Hodgkin-Huxley models respectively. The CPU implementation utilized all four cores and the vector data parallelism offered by the processor. The results indicate that GPGPUs are well suited for this application domain.

A large portion of the results presented in this thesis have been published in the April 2010 issue of Applied Optics [1].

ACKNOWLEDGEMENTS

My special thanks are in order to Dr. Tarek M. Taha, my advisor, for providing the time and equipment necessary for the work contained herein, and for directing this thesis and bringing it to its conclusion with patience and expertise.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES

CHAPTERS:

I. Introduction
II. Background
   2.1 Hodgkin-Huxley Model
   2.2 Izhikevich Model
   2.3 GPGPU Architecture
   2.4 Related work
III. Network Design
IV. GPGPU Mapping
   4.1 Neuronal Mapping
   4.2 Synapses Mapping
   4.3 MPI Mapping
   4.4 Experimental Setup
   4.5 GPGPU cluster configurations

V. Results
   5.1 Single GPGPU card performances
   5.2 MPI performances
   5.3 Memory analysis
   5.4 Firing rate analysis
VI. Conclusion
Bibliography
Appendix

LIST OF FIGURES

2.1 Spikes produced with the Hodgkin-Huxley model
2.2 Spikes produced with the Izhikevich model
2.3 Floating point operations per second for the Intel CPUs and NVIDIA GPUs
2.4 Memory bandwidth for the Intel CPUs and NVIDIA GPUs
2.5 Processing flow on CUDA
2.6 A simplified CUDA GPGPU architecture
3.1 Network used for testing spiking models
4.1 Algorithm flowchart
4.2 GPGPU cluster mapping
4.3 Architecture of the GPGPU cluster
4.4 Training images
5.1 Additional test images
5.2 Neurons/second throughput of GPGPU and CPU processors for (a) the Izhikevich model and (b) the Hodgkin-Huxley model
5.3 Timing breakdown by percentage of overall runtime for processing the 768x768 image using (a) the Izhikevich model and (b) the Hodgkin-Huxley model
5.4 Neurons/second throughput of GPGPU cluster
5.5 MPI communication time
5.6 Memory requirement for different configurations
5.7 Variation in speedup of GPGPU implementations on the 9800 GX2 platform with firing rate in the level 1 neurons for: (a) the Izhikevich model, and (b) the Hodgkin-Huxley model

LIST OF TABLES

4.1 Threads hierarchy and warps
4.2 Spiking network configurations evaluated
4.3 GPGPU cluster configuration
5.1 Speedups achieved on four GPGPU platforms for the Izhikevich model for different image sizes
5.2 Speedups achieved on four GPGPU platforms for the Hodgkin-Huxley model for different image sizes
5.3 Bytes/Flop requirement of the models and capabilities of the GPGPU cards
5.4 GFLOPs achieved by different GPGPU cards
5.5 GPGPU cluster run time
5.6 MPI timing breakdown
5.7 Memory required by different network components

CHAPTER I

Introduction

At present there is a strong interest in the research community to model biological scale image recognition systems such as the visual cortex of a rat or a human. Most small scale neural network image recognition systems generally utilize the integrate and fire spiking neuron model. Izhikevich points out, however, that this model is one of the least biologically accurate models. He states that the integrate and fire model cannot exhibit even the most fundamental properties of cortical spiking neurons, and for this reason it should be avoided by all means [2]. Izhikevich compares a set of 11 spiking neuron models [2] in terms of their biological accuracy and computational load. He shows that the five most biologically accurate models are (in order of biological accuracy) the: 1) Hodgkin-Huxley, 2) Izhikevich, 3) Wilson, 4) Hindmarsh-Rose, and 5) Morris-Lecar models. Of these, the Hodgkin-Huxley model is the most compute intensive, while the Izhikevich model is the most computationally efficient. As a result, current studies of large scale models of the cortex are generally using either the Izhikevich [3] or the Hodgkin-Huxley [4] spiking neuron models. One of the main problems with using the Hodgkin-Huxley model for large scale implementations is that it is computationally intensive. The recently published Izhikevich model is attractive

for large scale implementations as it is close to the Hodgkin-Huxley model in biological accuracy, but is similar to the integrate and fire model in computational intensity. Given that the visual cortex contains about 10^9 neurons, biological scale vision systems based on spiking neural networks would require high performance computation capabilities. General purpose graphics processing units (GPGPUs) have attracted significant attention recently for their low cost and high performance capabilities on scientific applications. These architectures are well suited for applications with high levels of data parallelism, such as biological scale cortical vision models. In this thesis we examine the performance of a spiking neural network based character recognition model on three NVIDIA GPGPU platforms and one NVIDIA GPGPU cluster: the GeForce 9800 GX2, the Tesla C1060, and the Tesla S1070 platforms, and the 32-node Tesla S1070 GPGPU cluster. The recognition phase of two versions of this pattern recognition model, based on the Izhikevich and Hodgkin-Huxley spiking neuron models, is examined. The model utilizes pulse coding to mimic the high speed recognition taking place in the mammalian brain [5] and spike timing dependent plasticity (STDP) for training [6]. It was trained to recognize a set of character images, scaled up to larger image sizes, and was able to recognize noisy versions of these images. The largest networks tested had over 84 million neurons and 4 billion synapses. An initial version of this model was presented in [7]. A large portion of the results presented in this thesis have been published in the April 2010 issue of Applied Optics [1]. The GPGPUs examined in this study utilize the Compute Unified Device Architecture (CUDA) from NVIDIA. This architecture allows code compatibility

between different generations of NVIDIA GPGPUs. All the GPGPU platforms utilized the same CUDA code. For each GPGPU, we also examined the performance of the general purpose processor on that platform using a highly optimized C version of the code. This C code utilized both data and thread parallelism (through explicit parallel coding in C) to ensure that all cores and the vector parallelism within the cores were utilized. The processors utilized were a 2.59 GHz dual core AMD Opteron processor, a 2.67 GHz quad core Intel Xeon processor, and a 2.4 GHz dual core AMD Opteron processor. The fastest GPGPU utilized, the Tesla S1070, provided speedups of 5.6 and 84.4 for the Izhikevich and Hodgkin-Huxley models respectively, when compared to highly optimized code implementations on the fastest CPU tested, the quad core 2.67 GHz Xeon processor. Based on a thorough review of the literature, this is the first study examining the application of GPGPUs for the acceleration of the more biologically accurate Izhikevich and Hodgkin-Huxley models applied to image recognition. Our results indicate that these models are well suited to GPGPU platforms. Chapter II of this thesis discusses background material, including a brief introduction to the two spiking neuron models considered for this study and a discussion of related work. Chapter III presents the character recognition model. Chapter IV examines its mapping to the GPGPU platforms and describes the experimental setup, Chapter V presents the results, and Chapter VI concludes the thesis.

CHAPTER II

Background

2.1 Hodgkin-Huxley Model

The Hodgkin-Huxley model [8] is considered to be one of the most biologically accurate spiking neuron models. It consists of four differential equations (eq. 1-4) and a large number of parameters. The differential equations describe the neuron membrane potential, the activation of Na and K currents, and the inactivation of Na currents. The model can exhibit almost all types of neuronal behavior if its parameters are properly tuned. This model is very important to the study of neuronal behavior and dynamics as its parameters are biophysically meaningful and measurable. A time step of 0.01 ms was utilized to update the four differential equations as this is the most commonly used value. Fig. 2.1 shows the spikes produced with this model in our study (using the parameters in Appendix A).

\frac{dV}{dt} = \frac{1}{C}\left\{ I - g_K n^4 (V - E_K) - g_{Na} m^3 h (V - E_{Na}) - g_L (V - E_L) \right\}   (1)

\frac{dn}{dt} = (n_\infty(V) - n) / \tau_n(V)   (2)

\frac{dm}{dt} = (m_\infty(V) - m) / \tau_m(V)   (3)

\frac{dh}{dt} = (h_\infty(V) - h) / \tau_h(V)   (4)
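For concreteness, a minimal C sketch of one forward-Euler update step of eqs. (1)-(4) is given below. It is only an illustration, not the thesis code: the rate functions and constants are the common textbook Hodgkin-Huxley values (which may differ from the exact parameters in Appendix A), and eqs. (2)-(4) are written in the equivalent alpha/beta form.

```c
#include <math.h>

/* Textbook Hodgkin-Huxley rate functions (assumed values; the thesis uses the
 * parameters listed in Appendix A, which may differ slightly). */
static float an(float V){ return 0.01f*(V + 55.0f)/(1.0f - expf(-(V + 55.0f)/10.0f)); }
static float bn(float V){ return 0.125f*expf(-(V + 65.0f)/80.0f); }
static float am(float V){ return 0.1f*(V + 40.0f)/(1.0f - expf(-(V + 40.0f)/10.0f)); }
static float bm(float V){ return 4.0f*expf(-(V + 65.0f)/18.0f); }
static float ah(float V){ return 0.07f*expf(-(V + 65.0f)/20.0f); }
static float bh(float V){ return 1.0f/(1.0f + expf(-(V + 35.0f)/10.0f)); }

typedef struct { float V, n, m, h; } HHNeuron;

/* One forward-Euler step of eqs. (1)-(4); dt = 0.01 ms as in the thesis. */
void hh_step(HHNeuron *nr, float I, float dt)
{
    const float C = 1.0f, gK = 36.0f, EK = -77.0f, gNa = 120.0f,
                ENa = 50.0f, gL = 0.3f, EL = -54.4f;       /* textbook constants */
    float V = nr->V, n = nr->n, m = nr->m, h = nr->h;
    float dV = (I - gK*n*n*n*n*(V - EK)
                  - gNa*m*m*m*h*(V - ENa)
                  - gL*(V - EL)) / C;                       /* eq. (1) */
    nr->V += dt * dV;
    nr->n += dt * (an(V)*(1.0f - n) - bn(V)*n);             /* eq. (2), alpha/beta form */
    nr->m += dt * (am(V)*(1.0f - m) - bm(V)*m);             /* eq. (3) */
    nr->h += dt * (ah(V)*(1.0f - h) - bh(V)*h);             /* eq. (4) */
}
```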

Fig 2.1. Spikes produced with the Hodgkin-Huxley model (membrane voltage in mV versus time in ms).

2.2 Izhikevich Model

Izhikevich proposed a new spiking neuron model in 2003 [9] that is based on only two differential equations (eq. 5-6) and four parameters. In a study comparing various spiking neuron models [2], Izhikevich found that the Izhikevich model required 13 flops to simulate one millisecond of a neuron's activity, while the Hodgkin-Huxley model required 1200 flops. The former was still able to reproduce almost all types of neural responses seen in biological experiments, in spite of the low flop count. In our study, a time step of 1 ms was utilized (as was done by Izhikevich in [9]). Fig. 2.2 shows the spikes produced with this model.

\frac{dv}{dt} = 0.04v^2 + 5v + 140 - u + I   (5)

\frac{du}{dt} = a(bv - u)   (6)

with the reset rule: if v \geq 30 mV, then v \leftarrow c and u \leftarrow u + d.
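A minimal C sketch of one 1 ms update of eqs. (5)-(6), including the reset rule, is shown below. The parameters a, b, c, and d are the four Izhikevich parameters (the values used in the thesis are listed in Appendix A); the struct and function names are illustrative.

```c
typedef struct { float v, u; } IzhNeuron;

/* One 1 ms update of eqs. (5)-(6) followed by the spike-and-reset rule. */
void izh_step(IzhNeuron *nr, float I, float a, float b, float c, float d)
{
    nr->v += 0.04f*nr->v*nr->v + 5.0f*nr->v + 140.0f - nr->u + I;   /* eq. (5) */
    nr->u += a * (b*nr->v - nr->u);                                  /* eq. (6) */
    if (nr->v >= 30.0f) {            /* spike: reset v and increment u */
        nr->v = c;
        nr->u += d;
    }
}
```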

Fig 2.2. Spikes produced with the Izhikevich model (membrane voltage in mV versus time in ms).

2.3 GPGPU Architecture

Graphics processing units (GPUs) were originally designed to relieve the CPU of computational tasks related to drawing onto the external display by means of a number of graphics primitive operations. They were dedicated to floating point calculations, and for many years GPUs were utilized only for graphics pipeline acceleration. The concept of the general purpose GPU (GPGPU) was not feasible until recent years, since GPUs previously did not have high enough computational capability. In recent years, strong market demand for real-time, high-definition 3D graphics has helped GPUs evolve into very powerful many-core processors with highly parallel, multithreaded architectures utilizing very high bandwidth memories [10]. Using the massive floating-point computational power of a modern graphics accelerator's shader pipeline, these powerful GPUs can now be utilized for general purpose computations (as opposed to graphics operations). GPGPUs provide a new low cost but high performance platform for many scientific studies and applications, such as physics-based simulation, CT reconstruction, and GPGPU based pattern recognition.

Fig 2.3. Floating point operations per second for Intel CPUs and NVIDIA GPUs (NV30 through GT200 versus Intel CPUs up to a 3.2 GHz Harpertown, Sep-02 to Jul-09; GT200 = GeForce GTX 280, G92 = GeForce 9800 GTX, G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800). Data from [10].

Fig 2.4. Memory bandwidth (GB/s) for Intel CPUs (Northwood, Prescott EE, Woodcrest, Harpertown) and NVIDIA GPUs (NV30 through G80 Ultra). Data from [10].

Fig 2.3 shows the discrepancy in floating point capability between different generations of Intel CPUs and NVIDIA GPUs. As shown in Fig 2.3, there is an increasing discrepancy in floating point performance between CPUs and GPGPUs. This is mainly because GPGPUs are specifically designed for compute-intensive and highly parallel applications. Unlike conventional CPUs, GPGPUs devote most of their transistors to data processing as opposed to data caching and flow control. Modern mainstream processors, such as multi-core CPUs and many-core GPGPUs, are parallel systems that continue to scale with Moore's law [10]. The main challenge for these systems is the development of application software that scales its parallelism transparently with the increasing number of processor cores. The Compute Unified Device Architecture (CUDA) is a general purpose parallel computing architecture that was first released by NVIDIA in late 2006. It has a software environment that allows developers to use C (with NVIDIA extensions) as the high level programming language to code algorithms for execution on their GPGPUs. According to NVIDIA documents [10], CUDA is a parallel programming language designed to make parallel programming easy, with a low learning curve. Programs developed using this language are intended to be portable across multiple NVIDIA GPGPU platforms. CUDA essentially hides the underlying hardware architecture of the GPU from the programmer. To do this, CUDA utilizes a structure that has three basic features: a hierarchy of threads, shared memories, and barrier synchronization [10]. These features are all accessible to the programmer via a set of language extensions. As a result, the programmer can easily partition a problem into coarse sub-problems that can be solved independently in parallel, and then these sub-problems are split into finer pieces to be

solved cooperatively in parallel. Such a decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables transparent scalability, since each sub-problem can be scheduled to be solved on any of the available processor cores. A compiled CUDA program can therefore execute on any number of processor cores, and only the runtime system needs to know the physical processor count. The CUDA processing flow is shown in Fig. 2.5: 1) copy the data to be processed from main memory to GPU memory, 2) the CPU instructs the GPU to begin processing, 3) the GPU executes the work in parallel in each core, and 4) copy the result from GPU memory back to main memory.

Fig 2.5. Processing flow on CUDA. Based on figure in [11].
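As an illustration of this flow, a minimal CUDA C sketch is shown below. The kernel, its contents, and all names are illustrative only (they are not the thesis code); the kernel simply scales an array to show the four steps.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel for step 3: one thread processes one array element.
__global__ void scale_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) data[i] *= 2.0f;
}

void run_on_gpu(float *host_data, int n)
{
    float *dev_data;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&dev_data, bytes);
    cudaMemcpy(dev_data, host_data, bytes, cudaMemcpyHostToDevice);  // 1. copy data in
    int threads = 256;                                               // 2./3. launch the kernel
    int blocks  = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(dev_data, n);
    cudaMemcpy(host_data, dev_data, bytes, cudaMemcpyDeviceToHost);  // 4. copy result back
    cudaFree(dev_data);
}
```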

NVIDIA's CUDA GPGPU architecture consists of a set of scalar processors (SPs) operating in parallel. Eight scalar processors are organized into a streaming multiprocessor (SM), with several SMs on a GPGPU (see Fig. 2.6). Each SM has an internal shared memory along with caches (a constant and a texture cache), a multithreaded instruction (MTI) unit, and special functional units. The SMs share a global memory. A large number of threads are typically assigned to each SM. The MTI unit can switch threads frequently to hide long global memory accesses from any of the threads. Both integer and floating point data formats are supported in the SPs. The NVIDIA GeForce 9800 GX2 GPGPU card utilized in this study has two 1.5 GHz G92 graphics processing chips, with each G92 GPGPU containing 128 CUDA scalar processors [12]. The two G92 GPGPUs are placed on two separate printed circuit boards with their own power circuitry and memory, while still requiring only a single PCI-Express 16x slot on the motherboard (a dual-PCB configuration). Each GPGPU has access to a 256-bit GDDR3 memory interface with 512 MB of memory. This amounts to 1 GB of memory for the full card with a peak theoretical bandwidth of 128 GB/s (64 GB/s per GPGPU) [12]. The peak performance of the NVIDIA 9800 GX2 is approximately 1152 GFLOPs (576 GFLOPs per GPGPU) [12].
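These peak figures follow from the usual counting of three floating point operations (a dual-issued multiply-add plus a multiply) per scalar processor per shader clock: 128 SPs x 1.5 GHz x 3 flops is roughly 576 GFLOPs per G92, or about 1152 GFLOPs for the two GPGPUs on the card.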

Fig 2.6. A simplified CUDA GPGPU architecture. Here SP = Scalar Processor, SM = Streaming Multiprocessor, SFU = Special Functional Unit, and MTI = Multi-Threaded Instruction unit. (Each SM contains eight SPs, two SFUs, an MTI unit, caches, and a shared memory; the SMs share a global memory.)

The Tesla C1060 GPGPU has a single graphics processing chip (based on the NVIDIA Tesla T10 GPGPU) with 240 processing cores working at 1.3 GHz [13]. The Tesla C1060 is capable of 933 GFLOPs of processing performance and comes standard with 4 GB of 512-bit GDDR3 memory at 102 GB/s bandwidth [13]. The Tesla S1070 GPGPU has four 1.3 GHz T10 GPGPU cores, each with access to 4 GB of GDDR3 memory (totaling 16 GB) [14]. The card is capable of 3.73 TFLOPs and has a combined memory bandwidth of 408 GB/s [14].

2.4 Related work

Several groups have studied image recognition using spiking neural networks. In general, these studies utilize the integrate and fire model. Thorpe developed SpikeNET [5], a large spiking neural network simulation software package. The system can be used for several image recognition applications, including identification of faces, fingerprints, and video images. Johansson and Lansner developed a large cluster based spiking network simulator of a rodent sized cortex [15]. They tested a small scale version of the system to

identify small pixel images. Baig [16] developed a temporal spiking network model based on integrate and fire neurons and applied it to identify online cursive handwritten characters. Gupta and Long [17] investigated the application of spiking networks for the recognition of simple characters. Other applications of spiking networks include instructing robots in navigation and grasping tasks [18], recognizing temporal sequences [19][20], and the robotic modeling of mouse whiskers [21]. At present several groups are developing biological scale implementations of spiking networks, but they are generally not examining the applications of these systems (primarily because they are modeling large scale neuronal dynamics seen in the brain). The Swiss institution EPFL and IBM are developing a highly biologically accurate brain simulation [4] at the sub-neuron level. They have utilized the Hodgkin-Huxley and the Wilfred Rall [22] models to simulate up to 100,000 neurons on an IBM BlueGene/L supercomputer. At the IBM Almaden Research Center, Ananthanarayanan and Modha [3] utilized the Izhikevich spiking neuron model to simulate 55 million randomly connected neurons (equivalent to a rat-scale cortical model) on a 32,768 processor IBM BlueGene/L supercomputer. Johansson et al. simulated a randomly connected model of 22 million neurons and 11 billion synapses using an 8,192 processor IBM BlueGene/L supercomputer [23]. A few studies have recently examined the performance of spiking neural network models on GPGPUs. Tiesel and Maida [24] evaluated a grid of 400x400 integrate and fire neurons on an NVIDIA 9800 GX2 GPGPU. In their single layer network, each neuron was connected to its four closest neighbors. This network was not trained to recognize any image. They compared a serial implementation on an Intel Core 2 Quad 2.4 GHz

processor to a parallel implementation on an NVIDIA 9800 GX2 GPGPU and achieved a speedup of 14.8 times. Nageswaran et al. [25] examined the implementation of 100,000 randomly connected neurons based on the more biologically accurate Izhikevich model. They compared a serial implementation of the model on a 2.13 GHz Intel Core 2 Duo against a parallel implementation on an NVIDIA GTX280 GPGPU and achieved a speedup of about 24 times. Nageswaran et al. [26] also developed a GPGPU based acceleration of spike based convolutions; a kernel speedup of 35 times was achieved by a single NVIDIA GTX 280 GPGPU over a CPU-only implementation. Fidjeland et al. [27] presented a GPGPU implementation (on a platform named NeMo) of the Izhikevich model and achieved a throughput of 400 million spikes per second. Nere and Lipasti [28] developed a GPGPU implementation of their own neocortex inspired pattern recognition model. A speedup of 373 was achieved over their unoptimized C++ serial implementation. In contrast to these implementations, we implement two spiking neural network models (Izhikevich and Hodgkin-Huxley) capable of image recognition on three different GPGPU platforms as well as a GPGPU cluster, and compare the performance of the three GPGPUs against their corresponding CPUs. Both data and thread level parallelism were utilized in the parallel implementations on those CPUs to ensure a fair comparison.

CHAPTER III

Network Design

A two layer network structure was utilized in this study, with the first layer acting as input neurons and the second layer as output neurons (see Fig. 3.1). Input images were presented to the first layer of neurons, with each image pixel corresponding to a separate input neuron. Thus the number of neurons in the first layer was equal to the number of pixels in the input image. Binary inputs were utilized in this study. If a pixel was on, a constant current was supplied to the corresponding input neuron, while no current was supplied if a pixel was off. The number of output neurons was equal to the number of training images. Each input neuron was connected to all the output neurons.

Fig 3.1. Network used for testing spiking models (level 1 input neurons fully connected to level 2 output neurons).

In the training process, images from the training set were presented sequentially to the input neurons. The weight matrix of each output neuron was updated using the STDP rule each cycle. In the testing phase, an input image is presented to the input neurons and, after a certain number of cycles, one output neuron fires, thus identifying the

input image. During each cycle, the level 1 neurons are first evaluated based on the input image, and a firing vector is updated to indicate which of the level 1 neurons fired that cycle. In the same cycle, the firing vector generated in the previous cycle is used to calculate the input current to each level 2 neuron. The level 2 neuron membrane potentials are then calculated based on their input current. A level 2 neuron's overall input current is the sum of all the individual currents received from the level 1 neurons connected to it. This input current for a level 2 neuron j is given by eq. 7 below:

I_j = \sum_i w(i,j) \, f(i)   (7)

where w is a weight matrix in which w(i,j) is the input weight from level 1 neuron i to level 2 neuron j, and f is a firing vector in which f(i) is 0 if level 1 neuron i does not fire and 1 if it does fire. Two versions of the model were developed, where the main difference was the set of equations utilized to update the potential of the neurons. In one case, the Hodgkin-Huxley equations presented in section 2.1 were used, while in the second case, the Izhikevich equations in section 2.2 were used. The parameters utilized in each case are specified in Appendix I.
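As a reference point for the GPGPU mapping discussed in Chapter IV, eq. 7 can be computed serially as in the C sketch below. The array names and memory layout are illustrative assumptions, not the thesis code; note that a serial (CPU) implementation can skip level 1 neurons that did not fire, a point that becomes relevant in section 5.4.

```c
/* Serial reference for eq. (7): I[j] = sum over i of w(i,j) * f(i).
 * w is assumed stored row-major as w[i*Ne + j]; f[i] is 0 or 1. */
void level2_input_currents(const float *w, const unsigned char *f,
                           float *I, int Ni, int Ne)
{
    for (int j = 0; j < Ne; ++j) I[j] = 0.0f;
    for (int i = 0; i < Ni; ++i) {
        if (!f[i]) continue;               /* skip level 1 neurons that did not fire */
        for (int j = 0; j < Ne; ++j)
            I[j] += w[(size_t)i * Ne + j];
    }
}
```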

CHAPTER IV

GPGPU Mapping

In order to achieve high performance from the GPGPU, all the processing cores need to be utilized at the same time. Each streaming multiprocessor can execute a group of threads (called a warp) concurrently on its scalar processors and can easily switch between warps [29]. To hide long memory latencies, it is recommended that hundreds to thousands of threads be assigned to the GPGPU. In case one warp is waiting on a memory access, the streaming multiprocessor can easily switch to other warps to hide this latency. Fig. 4.1 shows a flowchart of the main components of the algorithm. Three kernels are executed on the GPGPU to update the level 1 and level 2 neurons and their corresponding synaptic currents (the level 2 neurons' input weights). To achieve this high parallelism, the two mapping strategies implemented are described in the following sections.

Fig 4.1. Algorithm flowchart. (Read new image (CPU); update level 1 neurons (kernel 1); update synaptic currents for level 2 neurons (kernel 2); update level 2 neurons (kernel 3); if no level 2 neuron fired (CPU check), repeat from kernel 1, otherwise read the next image.)
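As a preview of the neuronal mapping described in section 4.1 below, kernel 1 of Fig. 4.1 can be sketched as a CUDA kernel in which one thread updates one level 1 neuron (the Izhikevich version is shown; all names and the exact parameter handling are illustrative assumptions, not the thesis code).

```cuda
// One thread updates one level 1 Izhikevich neuron and records whether it fired.
__global__ void update_level1(float *v, float *u, const float *I_in,
                              unsigned char *fired, int Ni,
                              float a, float b, float c, float d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per neuron
    if (i >= Ni) return;
    float vi = v[i], ui = u[i];
    vi += 0.04f*vi*vi + 5.0f*vi + 140.0f - ui + I_in[i];   // eq. (5)
    ui += a * (b*vi - ui);                                  // eq. (6)
    fired[i] = 0;
    if (vi >= 30.0f) { vi = c; ui += d; fired[i] = 1; }     // spike and reset
    v[i] = vi; u[i] = ui;
}
```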

4.1 Neuronal Mapping

The networks utilized in this study have high amounts of data parallelism. For example, the network contains up to 589,824 neurons in level 1, 48 neurons in level 2, and over 28 million synapses. All the level 1 neuron evaluations are carried out by kernel 1 (shown in Fig. 4.1). Each neuron is evaluated by a separate thread on the GPGPU, thus making the total number of threads available equal to the total number of neurons in this layer. This amounts to a very large number of threads (up to 589,824) available for execution. Before being mapped to GPGPU streaming multiprocessors, threads are arranged into a two level hierarchy: grid level and block level. All the threads of the same kernel (Fig. 4.1) belong to the same grid. Multiple kernels can be called from the host. Each grid can contain up to 65,535 thread blocks per grid dimension, while each thread block can contain up to 512 threads. To achieve high performance, no fewer than 128 threads should be assigned to each thread block. Each thread block is distributed to a different streaming multiprocessor. All the threads of the same thread block are always executed on the same streaming multiprocessor. Threads of a thread block are split into multiple warps, with 32 threads in a warp. One streaming multiprocessor can run 32 threads (1 warp) on its eight scalar processors concurrently. Streaming multiprocessors can efficiently switch among warps to hide long memory accesses. Unlike vector machines, which typically utilize SIMD (single instruction multiple data), operations in the CUDA architecture are carried out using SIMT (single instruction multiple thread). Each CUDA thread has its own program counter and register state, and thus can run independently of the others. In a SIMD processor, all elements of a vector are

operated on in lock step. To achieve high performance, it is necessary for all threads in a warp to follow the same execution path. Therefore, branches and divergence within a warp should be avoided as far as possible.

Table 4.1. Threads hierarchy and warps. (For each image size and for both the Izhikevich and Hodgkin-Huxley models, the table lists the number of grids, number of threads, threads per block, number of thread blocks, warp size, and number of warps.)

4.2 Synapses Mapping

The summation of the neuron weights described in eq. 7 is carried out for one level 2 neuron at a time (kernel 2 in Fig. 4.1). Here the neuronal mapping described in the previous section can no longer provide mass parallelism or high performance, because only the postsynaptic currents of the 48 level 2 neurons need to be calculated, so that mapping would provide only 48 threads. In this case, a much larger workload would be assigned to a much smaller number of threads, leading to very poor performance. So instead of mapping each neuron to a thread, each synaptic weight is mapped to one thread. Since there are over 28 million synapses in total, mass parallelism is again achieved by this mapping method. For a particular level 2 neuron, j, all the elements in the masked weight vector w(i,j)f(i) have to be added together to generate the final input current. This is done by splitting w(i,j)f(i) into multiple sub-vectors, with the sum of all the sub-vectors accumulated in GPGPU shared memory. When all the sub-vectors

have been added, a parallel reduction [30] is used to generate the input current by adding all the elements of the accumulated vector (a hedged sketch of such a kernel is shown below). Three kernels are executed on the GPGPU to update the level 1 and level 2 neurons and their corresponding synaptic currents (the level 2 neurons' input weights). Only the check of whether a level 2 neuron fired (and thus categorized an input image) is carried out on the CPU, as this is a serial operation. All the neuron parameters are implemented using single precision floating point variables on the GPGPU.

4.3 MPI Mapping

The limited memory associated with each GPGPU core (at most 4 GB in the architectures considered) sets a limit on the largest network size that can be evaluated in parallel. Thus, to evaluate a large neural network on a single GPGPU, it would be necessary to split the network into multiple parts and evaluate each part separately. This would make the evaluation slow. A cluster of GPGPUs offers a faster alternative, where multiple components of a large network can be evaluated in parallel on separate GPGPU cores. The Message Passing Interface (MPI) provides a good way for the different GPGPU cores to communicate. By using MPI, networks larger than what fits on a single GPGPU core can be implemented while still achieving very high performance. The architecture of the GPGPU cluster mapping is shown in Fig 4.2. In order to achieve good performance in the GPGPU cluster, the GPGPU to CPU ratio should be equal to one, so that each GPGPU core can be attached to a unique CPU and work as its private co-processor.
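Returning briefly to the synaptic current summation of section 4.2, a hedged sketch of kernel 2 is given below: each thread accumulates masked weights w(i,j)f(i) for a strided subset of the level 1 neurons, the per-thread sums are combined by a shared memory tree reduction, and one partial sum per block is written out for a final, much smaller summation (on the GPGPU or the host). The kernel assumes 256 threads per block; all names are illustrative assumptions, not the thesis code.

```cuda
// Partial sums of the masked weight vector w(i,j)*f(i) for one level 2 neuron j.
// Launch with 256 threads per block; block_sums has one entry per block.
__global__ void synapse_partial_sums(const float *w, const unsigned char *fired,
                                     float *block_sums, int Ni, int Ne, int j)
{
    __shared__ float partial[256];
    int tid = threadIdx.x;
    float sum = 0.0f;
    // Each thread accumulates a strided subset of the Ni synapses of neuron j.
    for (int i = blockIdx.x * blockDim.x + tid; i < Ni; i += blockDim.x * gridDim.x)
        sum += w[(size_t)i * Ne + j] * (float)fired[i];
    partial[tid] = sum;
    __syncthreads();
    // Tree reduction in shared memory (blockDim.x is a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = partial[0];
}
```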

Fig 4.2. GPGPU cluster mapping. (A master CPU, CPU0, coordinates worker CPUs 1 through N, each of which is paired with its own GPU, GPU1 through GPUN.)

From the MPI point of view, one Tesla S1070 card is treated as four individual T10 GPGPUs with 4 GB of memory each. Here the phrase "one node" does not mean a physical computer node (see section 4.4) but a virtual node consisting of a CPU core with DRAM and a T10 GPGPU core together with its 4 GB of on-board memory. In each node, both the update of the subset of level 1 neurons and the corresponding weight summation are carried out in the same manner as in the non-MPI GPGPU implementation described in the earlier sections. The 48 level 2 neurons are instead updated by the master CPU. To make full use of each node, larger images are segmented into multiple sub-images whose size is equal to the largest size feasible in a single GPGPU core (described in detail in section 5.3). These sub-images are processed individually and concurrently using multiple GPGPU cores. In the N-node GPGPU cluster based MPI implementation (as shown in Fig 4.2), the algorithm is mapped to the GPGPU cluster as follows:

1. Each node reads its sub-image from the hard drive and transmits it to its GPGPU's on-board memory.

2. The GPGPU core on each node evaluates the level 1 neurons of its sub-image each cycle to determine which of them fired; after calculating the post-synaptic current for this sub-image, a 48-element vector (the post-synaptic current contribution of the sub-image) is transmitted to the master CPU.

3. The master CPU then adds the 48-element vectors from the N nodes to get the total post-synaptic current. The master CPU then updates the level 2 neurons until one of them fires, indicating the final recognition result.

The number of nodes used in this thesis ranges from 4 to 9. The performance and timing breakdown of this MPI implementation are shown in the results chapter.

4.4 Experimental Setup

The models were implemented on three GPGPU platforms and one GPGPU cluster. These are:

1) An Ubuntu Linux based PC with two 2.59 GHz dual core AMD Opteron 285 processors, 4 GB of DRAM, and an NVIDIA GeForce 9800 GX2 graphics card with two G92 GPGPUs.

2) A Red Hat Enterprise Linux based PC with one 2.67 GHz quad core Xeon 5550 processor and 12 GB of DRAM. Two versions of this system were utilized: one had a single NVIDIA Tesla C1060 GPGPU, and the other had two NVIDIA C1060 GPGPUs, both placed on the same motherboard but in two different PCI-Express slots.

3) A Fedora 10 Linux based PC with two 2.4 GHz dual core AMD Opteron 2216 processors, 8 GB of DRAM, and an NVIDIA Tesla S1070 GPGPU.

4) A 32-node Fedora 10 Linux based cluster with an InfiniBand interconnect; each node is equipped with two 2.4 GHz dual core AMD Opteron 2216 processors, 8 GB of DRAM, and an NVIDIA Tesla S1070 GPGPU.

A C implementation of the models was run on one of the AMD Opteron processors for systems 1 and 3 above, and on the Intel Xeon processor. The C code utilized the SSE3 SIMD extensions for data parallelism. It also utilized the POSIX threads library to run on all available cores (two on the AMD processors and four on the Xeon processor). The programs were compiled with -O3 optimizations using gcc. The GPGPU implementation of the model was compiled with the CUDA 2.3 nvcc compiler. Four networks with varying input image sizes were developed to examine the two spiking neural network models. The overall network structure was kept similar to the design shown in Fig. 3.1, with two layers of nodes per network. Each of the level 2 neurons was connected to all of the neurons from level 1. Table 4.2 shows the number of neurons per layer for all the network sizes implemented. In this study, all the spiking networks of different sizes were trained to recognize the same number of images. A set of images was generated initially and then scaled linearly to the different network input sizes. Fig. 4.4 shows the training images used. The images represented the 26 upper case letters (A-Z), 10 numerals (0-9), 8 Greek letters, and 4 symbols.

Table 4.2. Spiking network configurations evaluated.
Input image size: 384x384 | 768x768 | 1536x1536 | 3072x3072
Total neurons: 147,504 | 589,872 | 2,359,344 | 9,437,232
Level 2 neurons: 48 | 48 | 48 | 48
Level 1 neurons: 147,456 | 589,824 | 2,359,296 | 9,437,184

4.5 GPGPU cluster configurations

As shown in Fig 4.2, the three basic components of the GPGPU cluster utilized in this study are the host nodes, the GPGPUs, and the interconnect between the host nodes. The GPGPU cluster used in this study is the Accelerator Cluster (AC) [31] at the National Center for Supercomputing Applications (NCSA) [32]. The structure of the AC cluster is shown in Fig 4.3. It has 32 nodes connected by InfiniBand. Each node is an HP xw9400 workstation equipped with two 2.4 GHz AMD Opteron dual-core 2216 processors with 1 MB of L2 cache and a 1 GHz HyperTransport link, and 8 GB (4x2 GB) of DDR2-667 memory. Each node has two PCIe Gen1 x16 slots and one PCIe x8 slot. The two x16 slots are used to connect to a single Tesla S1070 Computing System (4 GPGPUs) and the x8 slot is used to connect to an InfiniBand QDR adapter. Each node of the cluster runs Fedora 10 Linux. Detailed configurations and the architecture are presented in Table 4.3 and Fig 4.3. This cluster is well balanced because:

1. Node interconnection: An InfiniBand QDR interconnect is used to match the GPGPU/host bandwidth. In contrast to slower Ethernet interconnections (1-2 GB/s) through a network card, the InfiniBand QDR interconnect (4 GB/s) connects the different nodes via PCIe slots and QDR adapters [31].

2. CPU/GPGPU ratio: This is equal to one in order to simplify the MPI based implementations [31].

3. Host memory: This is larger than the on-board memory of a GPGPU core in order to enable full utilization of the GPGPU core on that node [31].

Fig 4.3. Architecture of the GPGPU cluster: each HP xw9400 workstation node holds two dual core CPUs with DRAM, connected over PCIe x16 interfaces to the four T10 GPGPUs (each with its own DRAM) of a Tesla S1070, and the nodes are connected by QDR InfiniBand. (Although all 32 nodes have the same configuration, only node 0 is shown in detail.) Based on figure in [31].

Table 4.3. GPGPU cluster configuration (based on table in [31]).
CPU host: HP xw9400
CPU: Dual-core 2216 AMD Opteron
CPU frequency (GHz): 2.4
CPU cores per node: 4
CPU memory per host (GB): 8
GPU host: Tesla S1070
GPU frequency (GHz): 1.3
GPU chips per node: 4
GPU memory per host (GB): 16
CPU/GPU ratio: 1
Interconnect: IB QDR
Number of cluster nodes: 32
Number of CPU cores: 128
Number of GPU chips: 128
Number of racks: 5
Total power rating (kW): <45
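To make the per-cycle exchange of section 4.3 concrete, a hedged sketch of the combination of the 48-element post-synaptic current vectors is shown below. The thesis does not state whether point-to-point messages or a collective operation is used; MPI_Reduce with MPI_SUM is one natural way to express the element-wise summation performed by the master, and all names here are illustrative.

```c
#include <mpi.h>

#define NE 48   /* number of level 2 neurons */

/* One cycle of the exchange in section 4.3: every node has computed, on its
 * GPGPU, the 48-element partial current vector for its sub-image (local_I);
 * the vectors are summed element-wise into total_I on the master (rank 0),
 * which then updates the level 2 neurons and checks for a spike. */
void exchange_currents(const float *local_I, float *total_I, int rank)
{
    MPI_Reduce(local_I, total_I, NE, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* Master: update the 48 level 2 neurons with total_I and check
         * whether any of them fired (the recognition result). */
    }
}
```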

Fig 4.4. Training images.

CHAPTER V

Results

Both the CPU and the GPGPU versions of the models were able to recognize all the training images (Fig. 4.4). They were also able to recognize nine out of ten noisy versions of these images accurately (Fig. 5.1), with image #6 not being recognized.

Fig 5.1. Additional test images (ten noisy test images, numbered 1 through 10).

5.1 Single GPGPU card performances

The non-MPI neurons per second throughput of the GPGPUs and the CPUs is shown in Fig. 5.2 for the Izhikevich and Hodgkin-Huxley models. Note that the GeForce 9800 GX2 card was unable to implement the larger image sizes due to limited on-board memory. The performances of the two AMD processors were almost the same, and so only one of them (the 2.4 GHz version) is plotted in Fig. 5.2. The performances of both Tesla C1060 based platforms, with one and two C1060s, are shown (the two-C1060 platform is listed as C1060 x2). The results show that the performance of the GPGPUs is

higher than that of the general purpose processors examined. The fastest GPGPU and CPU utilized in this study were the Tesla S1070 and the quad core 2.67 GHz Xeon processor respectively. Comparing the performances of these two systems in Fig. 5.2 shows that the fastest GPGPU utilized (S1070) provided speedups of 5.6 and 84.4 over the fastest CPU (Xeon) for the Izhikevich and Hodgkin-Huxley models respectively (for the largest image size). The speedups of each GPGPU over the CPU on its own platform are shown in Tables 5.1 and 5.2. Given that the C1060 is paired with the fastest processor (an Intel Xeon processor with four cores), the speedup for this GPGPU was the lowest. The Izhikevich model provided a lower speedup as it is less computationally demanding (further discussion below).

Table 5.1. Speedups achieved on four GPGPU platforms for the Izhikevich model for different image sizes (rows: S1070, 9800 GX2, C1060 x1, C1060 x2).

Table 5.2. Speedups achieved on four GPGPU platforms for the Hodgkin-Huxley model for different image sizes (rows: S1070, 9800 GX2, C1060 x1, C1060 x2).

For both the Izhikevich and Hodgkin-Huxley models, several input image sizes were examined. In the Hodgkin-Huxley model each neuron requires many more parameters, thus increasing the overall memory

footprint of the model. As a result, the larger images could not be implemented for the Hodgkin-Huxley model due to the limited memory on the GeForce 9800 GX2 GPGPU.

Fig 5.2. Neurons/second throughput of GPGPU and CPU processors (S1070, C1060 x2, C1060, 9800 GX2, Xeon, AMD) for (a) the Izhikevich model and (b) the Hodgkin-Huxley model.

The runtime breakdown of the GPGPU versions of the models for the 768x768 image size is shown in Fig. 5.3. Both the level 1 and level 2 neuron computations (kernels 1 and 3 in Fig. 4.1), along with the level 2 neuron input weight computations (kernel 2), take place on the GPGPU. The rest of the runtime is for: 1) main memory access, including data transfer between the host (CPU) memory and the device (GPGPU) memory, and 2) CPU operations, including initialization and checking all 48 level 2 neuron voltages to determine which one fired (the final recognition result). Given that the number of level 2 neurons does not vary with the input image size, the computation time for the level 2 neurons remains constant for a given model. The runtime for the remaining components does vary with the number of level 1 neurons, and hence the input image size. It is seen in Fig. 5.3 that the Izhikevich model has a higher percentage of time spent on CPU and memory access compared to the Hodgkin-Huxley model. This leads to a higher overhead ratio in the Izhikevich model, and thus a lower speedup. The bytes of data required per flop for the models are listed in Table 5.3, which also lists the peak bytes per flop capability of the three GPGPUs examined (calculated from the specifications in [12], [13] and [14]). Table 5.4 also lists the GFLOPs actually achieved by the different GPGPU cards for one of the network configurations. The results show that the memory bandwidth requirement for the Izhikevich model is about 19 times the peak capability of the Tesla GPGPUs, while the bandwidth requirement for the Hodgkin-Huxley model is about 3.5 times the capability of these cards. This also causes the speedup of the Izhikevich model to be lower than that of the Hodgkin-Huxley model.

Table 5.3. Bytes/Flop requirement of the models and capabilities of the GPGPU cards.
Model: Izhikevich: 2.11
Model: Hodgkin-Huxley: 0.38
Card: S1070: approximately 0.11
Card: C1060: approximately 0.11
Card: 9800 GX2: approximately 0.11

Table 5.4. GFLOPs achieved by the different GPGPU cards for the Izhikevich and Hodgkin-Huxley models, together with the peak GFLOPs (rows: 9800 GX2, C1060, C1060 x2, S1070).

As shown in Table 5.4, the performance achieved by the double C1060 GPGPU configuration is about twice as high as the performance of a single C1060 GPGPU. Also, the performance of the Tesla S1070 GPGPU is about 4 times as high as that of a single C1060 GPGPU. This is because the Tesla C1060 has one Tesla T10 GPGPU inside with 4 GB of memory, while the Tesla S1070 has four such T10 GPGPUs, each with its own 4 GB memory bank. Thus the performance of the S1070 is about 4 times that of the C1060; the performance increases linearly with the number of cores being used. Even though the C1060 card is a more advanced GPGPU compared to the 9800 GX2, the performance of a single C1060 card is no better than that of a 9800 GX2 card. This is because the 9800 GX2 has a dual-PCB configuration (as described in section 2.3) which includes two G92 GPGPUs. The total number of CUDA cores on a 9800 GX2 is 256 (128 per G92 GPGPU), which is more than the total of 240 CUDA cores in a C1060 card. Hence the 9800 GX2 card has a peak performance of 1152 GFLOPs while the C1060 has a peak

performance of 933 GFLOPs. Moreover, the 9800 GX2 GPGPU has a smaller (1 GB) but faster (128 GB/s) memory compared with the larger (4 GB) but slower (102 GB/s) memory of the C1060 GPGPU. So both the larger number of CUDA cores and the faster memory help the 9800 GX2 card achieve better performance than a single C1060 card.

Fig 5.3. Timing breakdown by percentage of overall runtime for processing the 768x768 image using (a) the Izhikevich model and (b) the Hodgkin-Huxley model. (Components: L1 (GPU), L2 (GPU), L2 weight (GPU), and CPU/memory; platforms: 9800 GX2, C1060, C1060 x2, S1070.)

5.2 MPI performances

Since the performance of a single GPGPU is limited by both its capacity for parallelization and its on-board memory space, MPI is utilized here for the high performance implementation of larger networks. Both the memory space and the total amount of parallelism increase linearly with the number of nodes devoted to this parallel computation. To make full use of each node, larger images are segmented into multiple sub-images, with the individual sub-images processed concurrently using multiple GPGPU cores. The size of each sub-image is equal to the largest size feasible in a single GPGPU core (described in detail in section 5.3). As shown in Table 5.5 and Fig 5.4, the smaller of the two large images is broken into four segments for both the Izhikevich and Hodgkin-Huxley models. Similarly, the larger image is broken into nine sub-images and processed by nine GPGPU cores. The size of the sub-image processed in each GPGPU core is the same in both cases. Therefore the speedup for larger images is slightly less than the maximum seen for each of the models in the previous results, because of task synchronizations and the memory bandwidth limitation between the host (CPU) memory and the device (GPGPU) memory.

Table 5.5. GPGPU cluster run time (ms) for the two larger image sizes (columns: image size, number of nodes, Izhikevich run time, Hodgkin-Huxley run time).

Fig 5.4. Neurons/second throughput of the GPGPU cluster versus the number of T10 GPGPU cores, for the Izhikevich and Hodgkin-Huxley models.

The MPI timing breakdown is shown in Table 5.6. The MPI communication time is a very small fraction of the whole MPI run time and has almost no influence on the system performance, but it does change with the number of nodes used. The more nodes devoted to parallel processing, the longer it takes to communicate via the InfiniBand connections between the worker nodes and the master node. The communication time versus the number of nodes is plotted in Fig 5.5.

Table 5.6. MPI timing breakdown (ms). (For each image size and number of nodes, and for each of the Izhikevich and Hodgkin-Huxley models, the table lists the communication time, the level 2 neuron update time, the level 1 neuron update time, and the total run time.)

Fig 5.5. MPI communication time (ms) versus the number of T10 GPGPU cores.

5.3 Memory analysis

For the two models (Izhikevich and Hodgkin-Huxley) implemented in this study, a network with N_i level 1 neurons and N_e level 2 neurons has N_i x N_e synapses. Since all neuronal parameters were represented with 32-bit single precision floating point numbers, each neuron parameter requires 4 bytes of memory. As a result, each neuron of the Izhikevich model (with 4 parameters) requires 16 bytes, and each neuron of the Hodgkin-Huxley model (with 25 parameters) requires 100 bytes. The memory space required by the different network elements for varying network configurations is shown in Table 5.7.

Table 5.7. Memory required by the different network components (bytes).
Level 1 neurons (N_i of them): Izhikevich (4 N_i) x 4; Hodgkin-Huxley (25 N_i) x 4
Level 2 neurons (N_e of them): Izhikevich (4 N_e) x 4; Hodgkin-Huxley (25 N_e) x 4
Synaptic weights: (N_e N_i + 2 N_i + N_e) x 4 for both models
Total memory: Izhikevich (N_e N_i + 6 N_i + 5 N_e) x 4; Hodgkin-Huxley (N_e N_i + 27 N_i + 26 N_e) x 4
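As a worked example of Table 5.7, consider the configuration with N_i = 589,824 level 1 neurons and N_e = 48 level 2 neurons: the Izhikevich total is (48 x 589,824 + 6 x 589,824 + 5 x 48) x 4 bytes, or roughly 127 MB, and the Hodgkin-Huxley total is (48 x 589,824 + 27 x 589,824 + 26 x 48) x 4 bytes, or roughly 177 MB; both are dominated by the N_e N_i synaptic weight term of about 113 MB.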

Fig 5.6. Memory requirement for different configurations: required GPU memory versus number of neurons (x10^6) for the Izhikevich and Hodgkin-Huxley models, with the memory capacities of the GPGPU cluster, the NVIDIA S1070, the NVIDIA C1060, and the NVIDIA 9800 GX2 marked.

Based on Table 5.7, the total amount of memory space required by the different network configurations is plotted in Fig 5.6. As shown in Fig 5.6, the amount of memory required increases almost linearly with the number of neurons. The number of neurons is shown on the x-axis, ranging from 147k to 84,935k. The amount of memory required is shown on the y-axis and ranges from 28 MB to 23.4 GB. The NVIDIA GeForce 9800 GX2 card includes two G92 GPGPUs, each with a dedicated global memory of 512 MB, so for the Izhikevich model only the smaller image sizes could be implemented. Each T10 core in both the C1060 and S1070 platforms includes 4 GB of global memory, so considerably larger configurations can be implemented on each T10 GPGPU core. For the NVIDIA Tesla GPGPU cluster, 1 to 9 GPGPU cores are used in the MPI implementations, allowing the largest image sizes to be implemented.

Even larger images are implemented using MPI by segmenting them into multiple sub-images, because a single T10 GPGPU core can accommodate only image sizes up to a certain maximum. By using multiple T10 GPGPU cores, with each core processing one sub-image, a very large image can be reconstructed and processed. In this thesis, the two largest images are processed with four and nine nodes respectively, and the size of the sub-image processed in each GPGPU core is the same in both cases.

5.4 Firing rate analysis

In the CPU implementations of the models, the weight summations in eq. 7 can be carried out selectively by considering only the level 1 neurons that fired (i.e., those for which f(i) in eq. 7 is 1). In the version of the CUDA language constructs supported on the GPGPUs utilized, we have not seen any method to easily determine whether any of the level 1 neurons evaluated in parallel (over 100) have fired (this can be done only through a slow serial procedure). As a result, we evaluate eq. 7 for all level 1 neurons. Thus, a variation in the level 1 neuron firing rate affects the CPU runtime, but not the GPGPU runtime. Fig. 5.7 examines the speedup seen with variations in the level 1 firing rate for the Izhikevich and Hodgkin-Huxley models respectively on the GeForce 9800 GX2 platform over the AMD processor paired with it. In the Izhikevich model, there is a significant change in the speedup, with the highest speedup seen when all the level 1 neurons fired (providing a speedup of about 20 times). In this case the CPU run time is the highest, while the GPGPU runtime remains the same. In the Hodgkin-Huxley model, the weight computation time is a tiny fraction of the overall runtime (as seen in Fig. 5.3), and so the speedup varies far less with the firing rate.


More information

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS SUDHAKARAN.G APCF, AERO, VSSC, ISRO 914712564742 g_suhakaran@vssc.gov.in THOMAS.C.BABU APCF, AERO, VSSC, ISRO 914712565833

More information

Trends in High-Performance Computing for Power Grid Applications

Trends in High-Performance Computing for Power Grid Applications Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views

More information

Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1

Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1 Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion

More information

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,

More information

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),

More information

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Technology brief Introduction... 2 GPU-based computing... 2 ProLiant SL390s GPU-enabled architecture... 2 Optimizing

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy

More information

L20: GPU Architecture and Models

L20: GPU Architecture and Models L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.

More information

Oracle Database Scalability in VMware ESX VMware ESX 3.5

Oracle Database Scalability in VMware ESX VMware ESX 3.5 Performance Study Oracle Database Scalability in VMware ESX VMware ESX 3.5 Database applications running on individual physical servers represent a large consolidation opportunity. However enterprises

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

IP Video Rendering Basics

IP Video Rendering Basics CohuHD offers a broad line of High Definition network based cameras, positioning systems and VMS solutions designed for the performance requirements associated with critical infrastructure applications.

More information

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la

More information

The Central Processing Unit:

The Central Processing Unit: The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Objectives Identify the components of the central processing unit and how they work together and interact with memory Describe how

More information

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.

More information

GPGPU Computing. Yong Cao

GPGPU Computing. Yong Cao GPGPU Computing Yong Cao Why Graphics Card? It s powerful! A quiet trend Copyright 2009 by Yong Cao Why Graphics Card? It s powerful! Processor Processing Units FLOPs per Unit Clock Speed Processing Power

More information

Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture.

Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture. Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture. Chirag Gupta,Sumod Mohan K cgupta@clemson.edu, sumodm@clemson.edu Abstract In this project we propose a method to improve

More information

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures 11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the

More information

Evaluation of CUDA Fortran for the CFD code Strukti

Evaluation of CUDA Fortran for the CFD code Strukti Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center

More information

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,

More information

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

More information

HP ProLiant SL270s Gen8 Server. Evaluation Report

HP ProLiant SL270s Gen8 Server. Evaluation Report HP ProLiant SL270s Gen8 Server Evaluation Report Thomas Schoenemeyer, Hussein Harake and Daniel Peter Swiss National Supercomputing Centre (CSCS), Lugano Institute of Geophysics, ETH Zürich schoenemeyer@cscs.ch

More information

Choosing a Computer for Running SLX, P3D, and P5

Choosing a Computer for Running SLX, P3D, and P5 Choosing a Computer for Running SLX, P3D, and P5 This paper is based on my experience purchasing a new laptop in January, 2010. I ll lead you through my selection criteria and point you to some on-line

More information

Introduction to GPU Architecture

Introduction to GPU Architecture Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three

More information

Real-time processing the basis for PC Control

Real-time processing the basis for PC Control Beckhoff real-time kernels for DOS, Windows, Embedded OS and multi-core CPUs Real-time processing the basis for PC Control Beckhoff employs Microsoft operating systems for its PCbased control technology.

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware

More information

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA

Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology, Computer Engineering by Amol

More information

Computer Graphics Hardware An Overview

Computer Graphics Hardware An Overview Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and

More information

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Josef Pelikán Charles University in Prague, KSVI Department, Josef.Pelikan@mff.cuni.cz Abstract 1 Interconnect quality

More information

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance

More information

NVIDIA GeForce GTX 580 GPU Datasheet

NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines

More information

Chapter 2 Parallel Architecture, Software And Performance

Chapter 2 Parallel Architecture, Software And Performance Chapter 2 Parallel Architecture, Software And Performance UCSB CS140, T. Yang, 2014 Modified from texbook slides Roadmap Parallel hardware Parallel software Input and output Performance Parallel program

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

An examination of the dual-core capability of the new HP xw4300 Workstation

An examination of the dual-core capability of the new HP xw4300 Workstation An examination of the dual-core capability of the new HP xw4300 Workstation By employing single- and dual-core Intel Pentium processor technology, users have a choice of processing power options in a compact,

More information

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Qingyu Meng, Alan Humphrey, Martin Berzins Thanks to: John Schmidt and J. Davison de St. Germain, SCI Institute Justin Luitjens

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud.

The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud. White Paper 021313-3 Page 1 : A Software Framework for Parallel Programming* The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud. ABSTRACT Programming for Multicore,

More information

Comparing the performance of the Landmark Nexus reservoir simulator on HP servers

Comparing the performance of the Landmark Nexus reservoir simulator on HP servers WHITE PAPER Comparing the performance of the Landmark Nexus reservoir simulator on HP servers Landmark Software & Services SOFTWARE AND ASSET SOLUTIONS Comparing the performance of the Landmark Nexus

More information

~ Greetings from WSU CAPPLab ~

~ Greetings from WSU CAPPLab ~ ~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)

More information

Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors

Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors Joe Davis, Sandeep Patel, and Michela Taufer University of Delaware Outline Introduction Introduction to GPU programming Why MD

More information

Improved LS-DYNA Performance on Sun Servers

Improved LS-DYNA Performance on Sun Servers 8 th International LS-DYNA Users Conference Computing / Code Tech (2) Improved LS-DYNA Performance on Sun Servers Youn-Seo Roh, Ph.D. And Henry H. Fong Sun Microsystems, Inc. Abstract Current Sun platforms

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides

More information

Lattice QCD Performance. on Multi core Linux Servers

Lattice QCD Performance. on Multi core Linux Servers Lattice QCD Performance on Multi core Linux Servers Yang Suli * Department of Physics, Peking University, Beijing, 100871 Abstract At the moment, lattice quantum chromodynamics (lattice QCD) is the most

More information

Texture Cache Approximation on GPUs

Texture Cache Approximation on GPUs Texture Cache Approximation on GPUs Mark Sutherland Joshua San Miguel Natalie Enright Jerger {suther68,enright}@ece.utoronto.ca, joshua.sanmiguel@mail.utoronto.ca 1 Our Contribution GPU Core Cache Cache

More information

BLM 413E - Parallel Programming Lecture 3

BLM 413E - Parallel Programming Lecture 3 BLM 413E - Parallel Programming Lecture 3 FSMVU Bilgisayar Mühendisliği Öğr. Gör. Musa AYDIN 14.10.2015 2015-2016 M.A. 1 Parallel Programming Models Parallel Programming Models Overview There are several

More information

Recommended hardware system configurations for ANSYS users

Recommended hardware system configurations for ANSYS users Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range

More information

HP Workstations graphics card options

HP Workstations graphics card options Family data sheet HP Workstations graphics card options Quick reference guide Leading-edge professional graphics February 2013 A full range of graphics cards to meet your performance needs compare features

More information

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase

More information

Data Centric Systems (DCS)

Data Centric Systems (DCS) Data Centric Systems (DCS) Architecture and Solutions for High Performance Computing, Big Data and High Performance Analytics High Performance Computing with Data Centric Systems 1 Data Centric Systems

More information

Overview of HPC Resources at Vanderbilt

Overview of HPC Resources at Vanderbilt Overview of HPC Resources at Vanderbilt Will French Senior Application Developer and Research Computing Liaison Advanced Computing Center for Research and Education June 10, 2015 2 Computing Resources

More information

Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms

Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms Björn Rocker Hamburg, June 17th 2010 Engineering Mathematics and Computing Lab (EMCL) KIT University of the State

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas

More information

AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL. AMD Embedded Solutions 1

AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL. AMD Embedded Solutions 1 AMD WHITE PAPER GETTING STARTED WITH SEQUENCEL AMD Embedded Solutions 1 Optimizing Parallel Processing Performance and Coding Efficiency with AMD APUs and Texas Multicore Technologies SequenceL Auto-parallelizing

More information

MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu

MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 1 MapReduce on GPUs Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 2 MapReduce MAP Shuffle Reduce 3 Hadoop Open-source MapReduce framework from Apache, written in Java Used by Yahoo!, Facebook, Ebay,

More information

Turbomachinery CFD on many-core platforms experiences and strategies

Turbomachinery CFD on many-core platforms experiences and strategies Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29

More information

CHAPTER 2: HARDWARE BASICS: INSIDE THE BOX

CHAPTER 2: HARDWARE BASICS: INSIDE THE BOX CHAPTER 2: HARDWARE BASICS: INSIDE THE BOX Multiple Choice: 1. Processing information involves: A. accepting information from the outside world. B. communication with another computer. C. performing arithmetic

More information

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007 Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

More information

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems 202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric

More information

ME964 High Performance Computing for Engineering Applications

ME964 High Performance Computing for Engineering Applications ME964 High Performance Computing for Engineering Applications Intro, GPU Computing February 9, 2012 Dan Negrut, 2012 ME964 UW-Madison "The Internet is a great way to get on the net. US Senator Bob Dole

More information

Discovering Computers 2011. Living in a Digital World

Discovering Computers 2011. Living in a Digital World Discovering Computers 2011 Living in a Digital World Objectives Overview Differentiate among various styles of system units on desktop computers, notebook computers, and mobile devices Identify chips,

More information

Building a Top500-class Supercomputing Cluster at LNS-BUAP

Building a Top500-class Supercomputing Cluster at LNS-BUAP Building a Top500-class Supercomputing Cluster at LNS-BUAP Dr. José Luis Ricardo Chávez Dr. Humberto Salazar Ibargüen Dr. Enrique Varela Carlos Laboratorio Nacional de Supercómputo Benemérita Universidad

More information

Lecture 1. Course Introduction

Lecture 1. Course Introduction Lecture 1 Course Introduction Welcome to CSE 262! Your instructor is Scott B. Baden Office hours (week 1) Tues/Thurs 3.30 to 4.30 Room 3244 EBU3B 2010 Scott B. Baden / CSE 262 /Spring 2011 2 Content Our

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers

Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers Haohuan Fu haohuan@tsinghua.edu.cn High Performance Geo-Computing (HPGC) Group Center for Earth System Science Tsinghua University

More information

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit.

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit. Objectives The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Identify the components of the central processing unit and how they work together and interact with memory Describe how

More information

Communicating with devices

Communicating with devices Introduction to I/O Where does the data for our CPU and memory come from or go to? Computers communicate with the outside world via I/O devices. Input devices supply computers with data to operate on.

More information

Optimizing Application Performance with CUDA Profiling Tools

Optimizing Application Performance with CUDA Profiling Tools Optimizing Application Performance with CUDA Profiling Tools Why Profile? Application Code GPU Compute-Intensive Functions Rest of Sequential CPU Code CPU 100 s of cores 10,000 s of threads Great memory

More information