ACCELERATION OF SPIKING NEURAL NETWORK ON GENERAL PURPOSE GRAPHICS PROCESSORS




ACCELERATION OF SPIKING NEURAL NETWORK ON GENERAL PURPOSE GRAPHICS PROCESSORS Thesis Submitted to The School of Engineering of the UNIVERSITY OF DAYTON In Partial Fulfillment of the Requirements for The Degree Master of Science in Electrical Engineering by Bing Han UNIVERSITY OF DAYTON Dayton, Ohio May 2010

ACCELERATION OF SPIKING NEURAL NETWORK ON GENERAL PURPOSE GRAPHICS PROCESSORS APPROVED BY: Tarek M. Taha, Ph.D. Advisory Committee Chairman Department of Electrical and Computer Engineering John Loomis, Ph.D. Committee Member Department of Electrical and Computer Engineering Eric Balster, Ph.D. Committee Member Department of Electrical and Computer Engineering Malcolm W. Daniels, Ph.D. Associate Dean School of Engineering Tony E. Saliba, Ph.D. Dean, School of Engineering

© Copyright by Bing Han, 2010. All rights reserved.

ABSTRACT ACCELERATION OF SPIKING NEURAL NETWORK ON GENERAL PURPOSE GRAPHICS PROCESSORS Name: Han, Bing University of Dayton Advisor: Dr. Tarek M. Taha There is currently a strong push in the research community to develop biological scale implementations of neuron based vision models. Systems at this scale are computationally demanding and have generally utilized more accurate neuron models, such as the Izhikevich and Hodgkin-Huxley models, in place of the more popular integrate and fire model. This thesis examines the feasibility of using GPGPUs for accelerating a spiking neural network based character recognition network to enable large scale neural systems. Two versions of the network utilizing the Izhikevich and Hodgkin-Huxley models are implemented. Three NVIDIA GPGPU platforms and one GPGPU cluster were examined. These include the GeForce 9800 GX2, the Tesla C1060, the Tesla S1070 platforms, and the 32-node Tesla S1070 GPGPU cluster. Our results show that the GPGPUs can provide significant speedups over conventional processors. In particular, the fastest GPGPU utilized, the Tesla S1070, provided speedups of 5.6 and 84.4 times over highly optimized implementations on the fastest CPU tested, a quad core 2.67 GHz Xeon processor, for the Izhikevich and Hodgkin-Huxley models respectively. The CPU implementation utilized all four cores and the vector data parallelism offered by the processor. The results indicate that GPGPUs are well suited for this application domain.

A large portion of the results presented in this thesis has been published in the April 2010 issue of Applied Optics [1].

ACKNOWLEDGEMENTS My special thanks are in order to Dr. Tarek M. Taha, my advisor, for providing the time and equipment necessary for the work contained herein, and for directing this thesis and bringing it to its conclusion with patience and expertise.

TABLE OF CONTENTS

ABSTRACT... iii
ACKNOWLEDGEMENTS... v
LIST OF FIGURES... viii
LIST OF TABLES... x

CHAPTERS:

I. Introduction... 1
II. Background... 4
   2.1 Hodgkin-Huxley Model... 4
   2.2 Izhikevich Model... 5
   2.3 GPGPU Architecture... 6
   2.4 Related work... 11
III. Network Design... 14
IV. GPGPU Mapping... 16
   4.1 Neuronal Mapping... 17
   4.2 Synapses Mapping... 18
   4.3 MPI Mapping... 19
   4.4 Experimental Setup... 21
   4.5 GPGPU cluster configurations... 23

V. Results... 26
   5.1 Single GPGPU card performances... 26
   5.2 MPI performances... 32
   5.3 Memory analysis... 34
   5.4 Firing rate analysis... 36
VI. Conclusion... 38
Bibliography... 40
Appendix... 43

LIST OF FIGURES

2.1 Spikes produced with the Hodgkin Huxley model... 5
2.2 Spikes produced with the Izhikevich model... 6
2.3 Floating point operations per second for the Intel CPUs and NVIDIA GPUs... 7
2.4 Memory bandwidth for the Intel CPUs and NVIDIA GPUs... 7
2.5 Processing flow on CUDA... 9
2.6 A simplified CUDA GPGPU architecture... 11
3.1 Network used for testing spiking models... 14
4.1 Algorithm flowchart... 16
4.2 GPGPU cluster mapping... 20
4.3 Architecture of the GPGPU cluster... 24
4.4 Training images... 25
5.1 Additional test images... 26
5.2 Neurons/second throughput of GPGPU and CPU processors for (a) the Izhikevich model and (b) the Hodgkin Huxley model... 28
5.3 Timing breakdown by percentage of overall runtime for processing the 768x768 image using (a) Izhikevich model and (b) Hodgkin Huxley model... 31
5.4 Neurons/second throughput of GPGPU cluster... 33
5.5 MPI communication time... 34
5.6 Memory requirement for different configurations... 35

5.7 Variation in speedup of GPGPU implementations on 9800 GX2 platform with firing rate in the level 1 neurons for (a) the Izhikevich model and (b) the Hodgkin Huxley model... 37

LIST OF TABLES

4.1 Threads hierarchy and warps... 18
4.2 Spiking network configurations evaluated... 22
4.3 GPGPU cluster configuration... 24
5.1 Speedups achieved on four GPGPU platforms for Izhikevich model for different image sizes... 27
5.2 Speedups achieved on four GPGPU platforms for Hodgkin Huxley model for different image sizes... 27
5.3 Bytes/Flop requirement of the models and capabilities of the GPGPU cards... 30
5.4 GFLOPs achieved by different GPGPU cards... 30
5.5 GPGPU cluster run time... 32
5.6 MPI timing breakdown... 33
5.7 Memory required by different network component... 34

CHAPTER I Introduction At present there is a strong interest in the research community to model biological scale image recognition systems such as the visual cortex of a rat or a human. Most small scale neural network image recognition systems generally utilize the integrate and fire spiking neuron model. Izhikevich points out, however, that this model is one of the least biologically accurate models. He states that the integrate and fire model cannot exhibit even the most fundamental properties of cortical spiking neurons, and for this reason it should be avoided by all means [2]. Izhikevich compares a set of 11 spiking neuron models [2] in terms of their biological accuracy and computational load. He shows that the five most biologically accurate models are, in order of biological accuracy: 1) Hodgkin-Huxley, 2) Izhikevich, 3) Wilson, 4) Hindmarsh-Rose, and 5) Morris-Lecar. Of these, the Hodgkin-Huxley model is the most compute intensive, while the Izhikevich model is the most computationally efficient. As a result, current studies of large scale models of the cortex are generally using either the Izhikevich [3] or the Hodgkin-Huxley [4] spiking neuron models. One of the main problems with using the Hodgkin-Huxley model for large scale implementations is that it is computationally intensive. The recently published Izhikevich model is attractive

for large scale implementations as it is close to the Hodgkin-Huxley model in biological accuracy, but is similar to the integrate and fire model in computational intensity. Given that the visual cortex contains 10^9 neurons, biological scale vision systems based on spiking neural networks would require high performance computation capabilities. General purpose graphics processing units (GPGPUs) have attracted significant attention recently for their low cost and high performance capabilities on scientific applications. These architectures are well suited for applications with high levels of data parallelism, such as biological scale cortical vision models. In this thesis we examine the performance of a spiking neural network based character recognition model on three NVIDIA GPGPU platforms and one NVIDIA GPGPU cluster: the GeForce 9800 GX2, the Tesla C1060, the Tesla S1070 platforms, and the 32-node Tesla S1070 GPGPU cluster. The recognition phase of two versions of this pattern recognition model, based on the Izhikevich and Hodgkin-Huxley spiking neuron models, is examined. The model utilizes pulse coding to mimic the high speed recognition taking place in the mammalian brain [5] and spike timing dependent plasticity (STDP) for training [6]. It was trained to recognize a set of 48 24×24 pixel images of characters scaled up to 9216×9216 pixels and was able to recognize noisy versions of these images. The largest networks tested had over 84 million neurons and 4 billion synapses implemented. An initial version of this model was presented in [7]. A large portion of the results presented in this thesis has been published in the April 2010 issue of Applied Optics [1]. The GPGPUs examined in this study utilize the Compute Unified Device Architecture (CUDA) from NVIDIA. This architecture allows code compatibility

between different generations of NVIDIA GPGPUs. All the GPGPU platforms utilized the same CUDA code. For each GPGPU, we also examined the performance of the general purpose processor on that platform using a highly optimized C version of the code. This C code utilized both data and thread parallelism (through explicit parallel coding in C) to ensure that all cores and the vector parallelism within the cores were utilized. The processors utilized were a 2.59 GHz dual core AMD Opteron processor, a 2.67 GHz quad core Intel Xeon processor, and a 2.4 GHz dual core AMD Opteron processor. The fastest GPGPU utilized, the Tesla S1070, provided speedups of 5.6 and 84.4 times for the Izhikevich and Hodgkin Huxley models respectively, when compared to highly optimized code implementations on the fastest CPU tested, the quad core 2.67 GHz Xeon processor. Based on a thorough review of the literature, this is the first study examining the application of GPGPUs for the acceleration of the more biologically accurate Izhikevich and Hodgkin-Huxley models applied to image recognition. Our results indicate that these models are well suited to GPGPU platforms. Chapter II of this thesis discusses background material including a brief introduction to the two spiking network models considered for this study and a discussion of related work. Chapter III presents the character recognition model and examines its mapping to the GPGPU platforms. Chapters IV and V describe the experimental setup and results of the model performance, while chapter VI concludes the thesis.

CHAPTER II Background 2.1 Hodgkin-Huxley Model The Hodgkin-Huxley model [8] is considered to be one of the most biologically accurate spiking neuron models. It consists of four differential equations (eq. 1-4) and a large number of parameters. The differential equations describe the neuron membrane potential, the activation of Na and K currents, and the inactivation of Na currents. The model can exhibit almost all types of neuronal behavior if its parameters are properly tuned. This model is very important to the study of neuronal behavior and dynamics as its parameters are biophysically meaningful and measurable. A time step of 0.01 ms was utilized to update the four differential equations as this is the most commonly used value. Fig. 2.1 shows the spikes produced with this model in our study (using the parameters in Appendix A).

\frac{dV}{dt} = \frac{1}{C}\left\{ I - g_K n^4 (V - E_K) - g_{Na} m^3 h (V - E_{Na}) - g_L (V - E_L) \right\}   (1)
\frac{dn}{dt} = (n_\infty(V) - n)/\tau_n(V)   (2)
\frac{dm}{dt} = (m_\infty(V) - m)/\tau_m(V)   (3)
\frac{dh}{dt} = (h_\infty(V) - h)/\tau_h(V)   (4)
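To make the update procedure concrete, the following is a minimal C sketch of one forward-Euler step of eqs. (1)-(4), using the 0.01 ms time step mentioned above. The structure layout, the function name, and the externally supplied steady-state and time-constant functions (whose definitions would come from the parameters in Appendix A) are illustrative assumptions rather than the thesis code.

/* State of one Hodgkin-Huxley neuron. */
typedef struct {
    float V;        /* membrane potential (mV) */
    float n, m, h;  /* gating variables        */
} HHNeuron;

/* Steady-state and time-constant functions from eqs. (2)-(4); their
 * definitions depend on the Appendix A parameters and are assumed to be
 * provided elsewhere. */
extern float n_inf(float V); extern float tau_n(float V);
extern float m_inf(float V); extern float tau_m(float V);
extern float h_inf(float V); extern float tau_h(float V);

/* One forward-Euler step of eqs. (1)-(4) with injected current I (dt in ms). */
void hh_step(HHNeuron *nrn, float I, float dt, float C,
             float gK, float gNa, float gL,
             float EK, float ENa, float EL)
{
    float V  = nrn->V;
    float n4 = nrn->n * nrn->n * nrn->n * nrn->n;
    float m3 = nrn->m * nrn->m * nrn->m;

    float dV = (1.0f / C) * (I - gK * n4 * (V - EK)
                               - gNa * m3 * nrn->h * (V - ENa)
                               - gL * (V - EL));
    nrn->V += dt * dV;                                  /* eq. (1) */
    nrn->n += dt * (n_inf(V) - nrn->n) / tau_n(V);      /* eq. (2) */
    nrn->m += dt * (m_inf(V) - nrn->m) / tau_m(V);      /* eq. (3) */
    nrn->h += dt * (h_inf(V) - nrn->h) / tau_h(V);      /* eq. (4) */
}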

Fig 2.1. Spikes produced with the Hodgkin-Huxley model (membrane voltage in mV versus time in ms).

2.2 Izhikevich Model Izhikevich proposed a new spiking neuron model in 2003 [9] that is based on only two differential equations (eq. 5-6) and four parameters. In a study comparing various spiking neuron models [2], Izhikevich found that the Izhikevich model required 13 flops to simulate one millisecond of a neuron's activity, while the Hodgkin-Huxley model required 1200 flops. The former was still able to reproduce almost all types of neural responses seen in biological experiments, in spite of the low flop count. In our study, a time step of 1 ms was utilized (as was done by Izhikevich in [9]). Fig. 2.2 shows the spikes produced with this model.

\frac{dv}{dt} = 0.04v^2 + 5v + 140 - u + I   (5)
\frac{du}{dt} = a(bv - u)   (6)
if v \ge 30 mV, then v \leftarrow c and u \leftarrow u + d
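As a rough illustration, a single 1 ms forward-Euler update of eqs. (5)-(6) followed by the reset rule can be written as the C sketch below; the structure, the function name, and the convention of passing a, b, c, d explicitly are assumptions made for illustration, not the thesis code.

/* State of one Izhikevich neuron. */
typedef struct { float v, u; } IzhNeuron;

/* One 1 ms update of eqs. (5)-(6) followed by the spike/reset rule.
 * a, b, c, d are the four per-neuron parameters; returns 1 if the
 * neuron fired this step, 0 otherwise. */
int izh_step(IzhNeuron *nrn, float I, float a, float b, float c, float d)
{
    nrn->v += 0.04f * nrn->v * nrn->v + 5.0f * nrn->v + 140.0f - nrn->u + I;
    nrn->u += a * (b * nrn->v - nrn->u);
    if (nrn->v >= 30.0f) {   /* spike detected: apply the reset rule */
        nrn->v = c;
        nrn->u += d;
        return 1;
    }
    return 0;
}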

Fig 2.2. Spikes produced with the Izhikevich model (membrane voltage in mV versus time in ms).

2.3 GPGPU Architecture Graphics processing units (GPUs) were originally designed to offload from the CPU the computational tasks related to drawing onto the external display, using a number of graphics primitive operations. They were dedicated to floating point calculations, and for many years GPUs were only utilized to accelerate the graphics pipeline. The concept of the general purpose GPU (GPGPU) was not feasible until recent years, since GPUs previously did not have high enough computational capability. In recent years, strong market demand for real-time, high-definition 3D graphics has helped GPUs evolve into very powerful many-core processors with highly parallel, multithreaded architectures and very high bandwidth memories [10]. Using the massive floating-point computational power of a modern graphics accelerator's shader pipeline, these powerful GPUs can now be utilized for general purpose computations (as opposed to graphics operations). GPGPUs provide a new low cost but high performance platform for many scientific studies and applications, such as physics-based simulation, CT reconstruction, and GPGPU-based pattern recognition.

Fig 2.3. Floating point operations per second for Intel CPUs and NVIDIA GPUs. Data from [10].

Fig 2.4. Memory bandwidth for Intel CPUs and NVIDIA GPUs. Data from [10].

Fig 2.3 shows the discrepancy in floating point capability between different generations of Intel CPUs and NVIDIA GPUs. As shown in Fig 2.3, there is an increasing discrepancy in the floating point performance between CPUs and GPGPUs. This is mainly because GPGPUs are specifically designed for compute-intensive and highly parallel applications. Unlike conventional CPUs, GPGPUs devote most of their transistors to data processing as opposed to data caching and flow control. Modern mainstream processors, such as multi-core CPUs and many-core GPGPUs, are parallel systems that continue to scale with Moore's law [10]. The main challenge for these systems is the development of application software that scales its parallelism transparently with the increasing number of processor cores. The Compute Unified Device Architecture (CUDA) is a general purpose parallel computing architecture that was first released by NVIDIA in late 2006. It comes with a software environment that allows developers to use C (with NVIDIA extensions) as the high level programming language to code algorithms for execution on their GPGPUs. According to NVIDIA documents [10], CUDA is a parallel programming language designed to make parallel programming easy, with a low learning curve. The programs developed using this language are supposed to be portable across multiple NVIDIA GPGPU platforms. CUDA essentially hides the underlying hardware architecture of the GPU from the programmer. To do this CUDA utilizes a structure that has three basic features: a hierarchy of threads, shared memories, and barrier synchronization [10]. These features are all accessible to the programmer via a set of language extensions. As a result, the programmer can easily partition a problem into coarse sub-problems that can be solved independently in parallel, and then these sub-problems are split into finer pieces to be

solved cooperatively in parallel. Such a decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables transparent scalability, since each sub-problem can be scheduled to be solved on any of the available processor cores. A compiled CUDA program can therefore execute on any number of processor cores, and only the runtime system needs to know the physical processor count. The CUDA processing flow is shown in Fig. 2.5.

Fig 2.5. Processing flow on CUDA: 1) copy processing data from main memory to GPU memory, 2) the CPU instructs the GPU to process, 3) the GPU executes in parallel on each core, and 4) copy the result back to main memory. Based on figure in [11].
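A minimal, self-contained CUDA sketch of this four-step flow is shown below; the kernel, array contents, and sizes are placeholders chosen only to illustrate the host-device copies and the kernel launch, and are not taken from the thesis implementation.

#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void update_state(float *v, int n)          /* placeholder kernel   */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += 1.0f;                           /* stand-in computation */
}

int main(void)
{
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    float *h_v = (float *)calloc(N, sizeof(float));    /* host buffer   */
    float *d_v = NULL;
    cudaMalloc((void **)&d_v, bytes);                  /* device buffer */

    cudaMemcpy(d_v, h_v, bytes, cudaMemcpyHostToDevice);    /* 1. copy data in      */
    update_state<<<(N + 255) / 256, 256>>>(d_v, N);         /* 2-3. launch and run  */
    cudaMemcpy(h_v, d_v, bytes, cudaMemcpyDeviceToHost);    /* 4. copy results back */

    cudaFree(d_v);
    free(h_v);
    return 0;
}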

NVIDIA's CUDA GPGPU architecture consists of a set of scalar processors (SPs) operating in parallel. Eight scalar processors are organized into a streaming multiprocessor (SM), with several SMs on a GPGPU (see Fig. 2.6). Each SM has an internal shared memory along with caches (a constant and a texture cache), a multithreaded instruction (MTI) unit, and special functional units. The SMs share a global memory. A large number of threads are typically assigned to each SM. The MTI unit can switch threads frequently to hide long global memory accesses from any of the threads. Both integer and floating point data formats are supported in the SPs. The NVIDIA GeForce 9800 GX2 GPGPU card utilized in this study has two 1.5 GHz G92 graphics processing chips, with each G92 GPGPU containing 128 CUDA scalar processors [12]. The two G92 GPGPUs are placed on two separate printed circuit boards with their own power circuitry and memory while still only requiring a single PCI-Express 16x slot on the motherboard (a dual-PCB configuration). Each GPGPU has access to a 256-bit DDR3 memory interface with 512 MB of memory. This amounts to 1 GB of memory for the full card with a peak theoretical bandwidth of 128 GB/s (64 GB/s per GPGPU) [12]. The peak performance of the NVIDIA 9800 GX2 is approximately 1152 GFLOPs (576 GFLOPs per GPGPU) [12].

Fig 2.6. A simplified CUDA GPGPU architecture. Here SP = Scalar Processor, SM = Streaming Multiprocessor, SFU = Special Functional Unit, and MTI = Multi-Threaded Instruction Unit.

The Tesla C1060 GPGPU has a single graphics processing chip (based on the NVIDIA Tesla T10 GPGPU) with 240 processing cores working at 1.3 GHz [13]. The Tesla C1060 is capable of 933 GFLOPs of processing performance and comes standard with 4 GB of 512-bit DDR3 memory at 102 GB/s bandwidth [13]. The Tesla S1070 GPGPU has four 1.3 GHz T10 GPGPU cores, each with access to 4 GB of DDR3 memory (totaling 16 GB) [14]. The card is capable of 3.73 TFLOPs and has a combined memory bandwidth of 408 GB/s [14].

2.4 Related work Several groups have studied image recognition using spiking neural networks. In general, these studies utilize the integrate and fire model. Thorpe developed SpikeNET [5], a large spiking neural network simulation package. The system can be used for several image recognition applications, including identification of faces, fingerprints, and video images. Johansson and Lansner developed a large cluster based spiking network simulator of a rodent sized cortex [15]. They tested a small scale version of the system to

identify 128×128 pixel images. Baig [16] developed a temporal spiking network model based on integrate and fire neurons and applied it to identify online cursive handwritten characters. Gupta and Long [17] investigated the application of spiking networks for the recognition of simple characters. Other applications of spiking networks include instructing robots in navigation and grasping tasks [18], recognizing temporal sequences [19][20], and the robotic modeling of mouse whiskers [21]. At present several groups are developing biological scale implementations of spiking networks, but are generally not examining the applications of these systems (primarily as they are modeling large scale neuronal dynamics seen in the brain). The Swiss institution EPFL and IBM are developing a highly biologically accurate brain simulation [4] at the subneuron level. They have utilized the Hodgkin-Huxley and the Wilfred Rall [22] models to simulate up to 100,000 neurons on an IBM BlueGene/L supercomputer. At the IBM Almaden Research Center, Ananthanarayanan and Modha [3] utilized the Izhikevich spiking neuron model to simulate 55 million randomly connected neurons (equivalent to a rat-scale cortical model) on a 32,768 processor IBM BlueGene/L supercomputer. Johansson et al. simulated a randomly connected model of 22 million neurons and 11 billion synapses using an 8,192 processor IBM BlueGene/L supercomputer [23]. A few studies have recently examined the performance of spiking neural network models on GPGPUs. Tiesel and Maida [24] evaluated a grid of 400x400 integrate and fire neurons on an NVIDIA 9800 GX2 GPGPU. In their single layer network, each neuron was connected to its four closest neighbors. This network was not trained to recognize any image. They compared a serial implementation on an Intel Core 2 Quad 2.4 GHz

processor to a parallel implementation on an NVIDIA 9800 GX2 GPGPU and achieved a speedup of 14.8 times. Nageswaran et al. [25] examined the implementation of 100,000 randomly connected neurons based on the more biologically accurate Izhikevich model. They compared a serial implementation of the model on a 2.13 GHz Intel Core 2 Duo against a parallel implementation on an NVIDIA GTX280 GPGPU and achieved a speedup of about 24 times. Nageswaran et al. [26] also developed an acceleration of spike based convolutions using GPGPUs. A kernel speedup of 35 times was achieved by a single NVIDIA GTX 280 GPGPU over a CPU only implementation. Fidjeland et al. [27] presented a GPGPU implementation (on a platform named NeMo) of the Izhikevich model and achieved a throughput of 400 million spikes per second. Nere and Lipasti [28] developed a GPGPU implementation of their own neocortex inspired pattern recognition model. A speedup of 373 was achieved over their unoptimized C++ serial implementation. In contrast to these implementations, we implement two spiking neural network models (Izhikevich and Hodgkin-Huxley) capable of image recognition on three different GPGPU platforms as well as a GPGPU cluster and compare the performances of the three GPGPUs against their corresponding CPUs. Both data and thread level parallelism were utilized for the parallel implementations on those CPUs to ensure a fair comparison.

CHAPTER III Network Design A two layer network structure was utilized in this study with the first layer acting as input neurons and the second layer as output neurons (see Fig. 3.1). Input images were presented to the first layer of neurons, with each image pixel corresponding to a separate input neuron. Thus the number of neurons in the first layer was equal to the number of pixels in the input image. Binary inputs were utilized in this study. If a pixel was on, a constant current was supplied to the input, while no current was supplied if a pixel was off. The number of output neurons was equal to the number of training images. Each input neuron was connected to all the output neurons.

Fig 3.1. Network used for testing spiking models (level 1 input neurons fully connected to level 2 output neurons).

In the training process, images from the training set were presented sequentially to the input neurons. The weight matrix of each output neuron was updated using the STDP rule each cycle. In the testing phase, an input image is presented to the input neurons and after a certain number of cycles, one output neuron fires, thus identifying the

input image. During each cycle, the level 1 neurons are first evaluated based on the input image and a firing vector is updated to indicate which of the level 1 neurons fired that cycle. In the same cycle, the firing vector generated in the previous cycle is used to calculate the input current to each level 2 neuron. The level 2 neuron membrane potentials are then calculated based on their input current. A level 2 neuron's overall input current is the sum of all the individual currents received from the level 1 neurons connected to it. This input current for a level 2 neuron is given by eq. 7 below:

I_j = \sum_i w(i,j) f(i)   (7)

where w is a weight matrix in which w(i,j) is the input weight from level 1 neuron i to level 2 neuron j, and f is a firing vector in which f(i) is 0 if level 1 neuron i does not fire, and is 1 if the neuron does fire. Two versions of the model were developed where the main difference was the equations utilized to update the potential of the neurons. In one case, the Hodgkin-Huxley equations presented in section 2.1 were used, while in the second case, the Izhikevich equations in section 2.2 were used. The parameters utilized in each case are specified in Appendix I.
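For reference, a plain C sketch of eq. 7 for a single level 2 neuron might look as follows; the row-major weight layout w[i*n2 + j] and the function name are assumptions made only for illustration.

/* Input current to level 2 neuron j (eq. 7): sum of the weights from all
 * level 1 neurons that fired this cycle. n1 and n2 are the numbers of
 * level 1 and level 2 neurons. */
float level2_input_current(const float *w, const int *f, int n1, int n2, int j)
{
    float I = 0.0f;
    for (int i = 0; i < n1; ++i)
        if (f[i])               /* f(i) = 1 only if level 1 neuron i fired */
            I += w[i * n2 + j];
    return I;
}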

CHAPTER IV GPGPU Mapping In order to achieve high performance from the GPGPU, all the processing cores need to be utilized at the same time. Each streaming multiprocessor can execute a group of threads (called a warp) concurrently on its scalar processors and can easily switch between warps [29]. To hide long memory latencies, it is recommended that hundreds to thousands of threads be assigned to the GPGPU. In case one warp is waiting on a memory access, the streaming multiprocessor can easily switch to other warps to hide this latency. Fig. 4.1 shows a flowchart of the main components of the algorithm. Three kernels are executed on the GPGPU to update the level 1 and level 2 neurons and their corresponding synaptic currents (the level 2 neurons' input weights). To achieve this high parallelism, the two mapping strategies implemented are described in the following sections.

Fig 4.1. Algorithm flowchart: read a new image (CPU); update level 1 neurons (kernel 1); update synaptic currents for level 2 neurons (kernel 2); update level 2 neurons (kernel 3); repeat until a level 2 neuron fires (checked on the CPU).

4.1 Neuronal Mapping The networks utilized in this study have high amounts of data parallelism. For example, the 768×768 network contains 589,824 level 1 neurons, 48 level 2 neurons, and over 28 million synapses. All the level 1 neuron evaluations are carried out by kernel 1 (shown in Fig. 4.1). Each neuron is evaluated by a separate thread on the GPGPU, thus resulting in the total number of threads available being equal to the total number of neurons in this layer. This amounts to a very large number of threads (up to 589,824) available for execution. Before being mapped to GPGPU streaming multiprocessors, threads are arranged into a two level hierarchy: grid level and block level. All the threads of the same kernel (shown as kernels in Fig. 4.1) belong to the same grid. Multiple kernels can be called from the host. Each grid can contain up to 65536 thread blocks, while each thread block can contain up to 512 threads. To achieve high performance, no fewer than 128 threads need to be assigned to each thread block. Each thread block is distributed to a different streaming multiprocessor. All the threads of the same thread block are consistently executed on the same streaming multiprocessor. Threads of a thread block are split into multiple warps, with 32 threads in a warp. One streaming multiprocessor can run 32 threads (1 warp) on its eight scalar processors concurrently. Streaming multiprocessors can efficiently switch among warps to hide long memory accesses. Unlike vector machines, which typically utilize SIMD (single instruction multiple data), operations in the CUDA architecture are carried out using SIMT (single instruction multiple thread). Each CUDA thread has its own program counter and register state, and thus can run independently of the others. In a SIMD processor, all elements of a vector are

operated on in lock step. To achieve high performance, it is necessary for all threads in a warp to follow the same execution path; branches and divergence within a warp should therefore be avoided as much as possible.

Table 4.1. Threads hierarchy and warps (values are identical for the Izhikevich and Hodgkin-Huxley models).
Image size              | 384×384 | 768×768 | 1536×1536 | 3072×3072
Number of grids         | 3       | 3       | 3         | 3
Number of threads       | 147,456 | 589,824 | 2,359,296 | 9,437,184
Threads per block       | 256     | 256     | 256       | 256
Number of thread blocks | 576     | 2,304   | 9,216     | 36,864
Warp size               | 32      | 32      | 32        | 32
Number of warps         | 4,608   | 18,432  | 73,728    | 294,912

4.2 Synapses Mapping The summation of the neuron weights described in eq. 7 is carried out for one level 2 neuron at a time (kernel 2 in Fig. 4.1). The neuronal mapping described in the previous section no longer provides enough parallelism here, because only the 48 level 2 neurons' postsynaptic currents need to be calculated, so that mapping would provide only 48 threads. Far more work would be assigned to far fewer threads, leading to very poor performance. So instead of mapping each neuron to a thread, each synaptic weight is mapped to one thread. Since there are over 28 million synapses in total, massive parallelism is again achieved by this mapping. For a particular level 2 neuron j, all the elements in the masked weight vector w(i,j)f(i) have to be added together to generate a final input current. This is done by splitting w(i,j)f(i) into multiple sub-vectors, with the sum of all the sub-vectors accumulated in GPGPU shared memory. When all the sub-vectors

have been added, parallel reduction [30] is used to generate an input current by adding all the elements of the accumulated vector (a CUDA sketch of this kernel is given below). Three kernels are executed on the GPGPU to update the level 1 and level 2 neurons and their corresponding synaptic currents (the level 2 neurons' input weights). Only the check of whether a level 2 neuron fired (and thus categorized an input image) is carried out on the CPU as this is a serial operation. All the neuron parameters are implemented using single precision floating point variables on the GPGPU.

4.3 MPI Mapping The limited memory associated with each GPGPU core (at most 4 GB in the architectures considered) sets a limit on the largest network size that can be evaluated in parallel. Thus to evaluate a large neural network, it would be necessary to split the network into multiple parts and evaluate each part separately. This would make the evaluation slow. A cluster of GPGPUs offers a faster alternative where multiple components of a large network can be evaluated in parallel on separate GPGPU cores. The Message Passing Interface, or simply MPI, provides a good way for the different GPGPU cores to communicate. By using MPI, networks of sizes 6144×6144, 9216×9216, and above can be implemented with very high performance. The architecture of the GPGPU cluster mapping is shown in Fig 4.2. In order to achieve good performance in the GPGPU cluster, the GPGPU to CPU ratio should be equal to one, so that each GPGPU core can be attached to a unique CPU core and work as its private co-processor.
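Before turning to the cluster mapping details, the kernel 2 strategy of section 4.2 can be sketched as the following CUDA kernel: each thread handles one synapse of level 2 neuron j, per-block partial sums are accumulated in shared memory, and a tree-style parallel reduction [30] collapses them; a second pass (not shown) then sums the per-block results. The kernel name, the fixed 256-thread block size, and the row-major weight layout are illustrative assumptions rather than the thesis code.

__global__ void level2_weight_sums(const float *w,      /* w[i * n2 + j]             */
                                   const int   *f,      /* level 1 firing vector     */
                                   float *block_sums,   /* one partial sum per block */
                                   int n1, int n2, int j)
{
    __shared__ float partial[256];                 /* assumes blockDim.x == 256 */
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    /* each thread loads one masked weight element w(i,j)f(i) */
    partial[tid] = (i < n1 && f[i]) ? w[i * n2 + j] : 0.0f;
    __syncthreads();

    /* tree reduction of the block's partial sums in shared memory */
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        block_sums[blockIdx.x] = partial[0];       /* reduced again in a second pass */
}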

Fig 4.2. GPGPU cluster mapping (a master CPU coordinates N worker CPUs, each attached to its own GPU).

From the MPI point of view, one Tesla S1070 card is considered as four individual T10 GPGPUs with 4 GB of memory each. Here the phrase "one node" does not mean a physical computer node (see section 4.4) but a virtual node consisting of a CPU core with DRAM and a T10 GPGPU core with its 4 GB of on-board memory. In each node, both the update of the subset of level 1 neurons and the corresponding weight summation are carried out in the same manner as the non-MPI GPGPU implementation described in earlier sections. The 48 level 2 neurons are instead evaluated by the master CPU. To make full use of each node, larger images are segmented into multiple sub-images, each with the largest size feasible on a single GPGPU core (described in detail in section 5.3). These sub-images are processed individually and concurrently using multiple GPGPU cores. In the N-node GPGPU cluster based MPI implementation (shown in Fig 4.2), the algorithm is mapped to the GPGPU cluster as follows:

1. Each node reads its sub-image from the hard drive and transmits it to its GPGPU's on-board memory.
2. The GPGPU core on each node evaluates its sub-image until it finds that level 1 neurons have fired. After calculating the post-synaptic current for this sub-image, a 48-element vector (the post-synaptic current contribution of the sub-image) is transmitted to the master CPU.
3. The master CPU then adds the 48-element vectors from the N nodes to get the total post-synaptic current. The master CPU then updates the level 2 neurons until it finds that a level 2 neuron has fired, indicating the final recognition result.

The number of nodes used in this thesis ranges from 4 to 9. The performance and timing breakdown of this MPI implementation are shown in the results chapter.

4.4 Experimental Setup The models were implemented on three GPGPU platforms and one GPGPU cluster. These are:
1) An Ubuntu Linux based PC with two 2.59 GHz dual core AMD Opteron 285 processors, 4 GB of DRAM, and an NVIDIA GeForce 9800 GX2 graphics card with two G92 GPGPUs.
2) A Red Hat Enterprise Linux based PC with one 2.67 GHz quad core Xeon 5550 processor and 12 GB of DRAM. Two versions of this system were utilized: one had a single NVIDIA Tesla C1060 GPGPU, and the other had two NVIDIA C1060 GPGPUs placed on the same motherboard in two different PCI-Express 2.0 16x slots.
3) A Fedora 10 Linux based PC with two 2.4 GHz dual core AMD Opteron 2216 processors, 8 GB of DRAM, and an NVIDIA Tesla S1070 GPGPU.

4) A 32-node Fedora 10 Linux based cluster with an InfiniBand interconnect; each node is equipped with two 2.4 GHz dual core AMD Opteron 2216 processors, 8 GB of DRAM, and an NVIDIA Tesla S1070 GPGPU.

A C implementation of the models was run on one of the AMD Opteron processors for systems 1 and 3 above, and on the Intel Xeon processor. The C code utilized the SSE3 SIMD extensions for data parallelism. It also utilized the POSIX threads library to run on all available cores (two on the AMD processors and four on the Xeon processor). The programs were compiled with -O3 optimizations using gcc. The GPGPU implementation of the models utilized the CUDA 2.3 nvcc compiler. Four networks with varying input image sizes were developed to examine the two spiking neural network models. The overall network structure was kept similar to the design shown in Fig. 3.1, with two layers of nodes per network. Each of the level 2 neurons was connected to all of the neurons from level 1. Table 4.2 shows the number of neurons per layer for all the network sizes implemented. In this study all the spiking networks of different sizes were trained to recognize the same number of images. A set of 48 24×24 pixel images was generated initially and then scaled linearly to the different network input sizes. Fig. 4.4 shows the training images used. The images represented the 26 upper case letters (A-Z), 10 numerals (0-9), 8 Greek letters, and 4 symbols.

Table 4.2. Spiking network configurations evaluated.
Input image size | 384×384 | 768×768 | 1536×1536 | 3072×3072
Total neurons    | 147,504 | 589,872 | 2,359,344 | 9,437,232
Level 2 neurons  | 48      | 48      | 48        | 48
Level 1 neurons  | 147,456 | 589,824 | 2,359,296 | 9,437,184
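As a rough sketch of how the thread parallelism in the CPU reference code can be organized (the SSE3 data parallelism is omitted here), the level 1 neuron updates can be divided into contiguous slices, one per POSIX thread. The slice structure, the four-thread count, the izh_step() helper from the earlier Izhikevich sketch, and the placeholder a, b, c, d values are all illustrative assumptions, not the thesis implementation.

#include <pthread.h>

#define NUM_CPU_THREADS 4            /* e.g. the four cores of the Xeon 5550 */

typedef struct {
    IzhNeuron   *nrn;   /* level 1 neuron array      */
    const float *I;     /* per-neuron input currents */
    int          begin, end;
} Slice;

static void *update_slice(void *arg)
{
    Slice *s = (Slice *)arg;
    for (int i = s->begin; i < s->end; ++i)
        /* placeholder a, b, c, d values; the real ones come from Appendix A */
        (void)izh_step(&s->nrn[i], s->I[i], 0.02f, 0.2f, -65.0f, 8.0f);
    return NULL;
}

void update_level1_cpu(IzhNeuron *nrn, const float *I, int n)
{
    pthread_t tid[NUM_CPU_THREADS];
    Slice     s[NUM_CPU_THREADS];

    for (int t = 0; t < NUM_CPU_THREADS; ++t) {
        s[t].nrn   = nrn;
        s[t].I     = I;
        s[t].begin = t * n / NUM_CPU_THREADS;
        s[t].end   = (t + 1) * n / NUM_CPU_THREADS;
        pthread_create(&tid[t], NULL, update_slice, &s[t]);
    }
    for (int t = 0; t < NUM_CPU_THREADS; ++t)
        pthread_join(tid[t], NULL);
}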

4.5 GPGPU cluster configurations As shown in Fig 4.2, the three basic components of the GPGPU cluster utilized in this study are the host nodes, the GPGPUs, and the interconnect between the host nodes. The GPGPU cluster used in this study is the Accelerator Cluster (AC) [31] at the National Center for Supercomputing Applications (NCSA) [32]. The structure of the AC cluster is shown in Fig 4.3. It has 32 nodes connected by InfiniBand. Each node is an HP xw9400 workstation equipped with two 2.4 GHz AMD Opteron dual-core 2216 processors with 1 MB of L2 cache, a 1 GHz HyperTransport link, and 8 GB (4x2 GB) of DDR2-667 memory. Each node has two PCIe Gen1 x16 slots and one PCIe x8 slot. The two x16 slots are used to connect to a single Tesla S1070 Computing System (4 GPGPUs) and the x8 slot is used to connect to an InfiniBand QDR adapter. Each node of the cluster runs Fedora 10 Linux. The detailed configuration and architecture are presented in Table 4.3 and Fig 4.3. This cluster is well balanced because:
1. Node interconnection: an InfiniBand QDR interconnect is used to match the GPGPU/host bandwidth. In contrast to slower network-card based Ethernet interconnections (1-2 GB/s), the InfiniBand QDR interconnection (4 GB/s) connects the different nodes via PCIe slots and QDR adapters [31].
2. CPU/GPGPU ratio: this is equal to one in order to simplify the MPI based implementations [31].
3. Host memory: this is larger than the on-board memory of a GPGPU core in order to enable full utilization of the GPGPU core on that node [31].

Fig 4.3. Architecture of the GPGPU cluster: each HP xw9400 node holds two dual core CPUs with DRAM and connects over two PCIe x16 slots to a Tesla S1070 (four T10 GPGPUs), with the nodes linked by QDR InfiniBand (although all 32 nodes have the same configuration, only node 0 is shown in detail) (based on figure in [31]).

Table 4.3. GPGPU cluster configuration (based on table in [31]).
CPU host                 | HP xw9400
CPU                      | Dual-core 2216 AMD Opteron
CPU frequency (GHz)      | 2.4
CPU cores per node       | 4
CPU memory per host (GB) | 8
GPU host                 | Tesla S1070
GPU frequency (GHz)      | 1.3
GPU chips per node       | 4
GPU memory per host (GB) | 16
CPU/GPU ratio            | 1
Interconnect             | IB QDR
# of cluster nodes       | 32
# of CPU cores           | 128
# of GPU chips           | 128
# of racks               | 5
Total power rating (kW)  | <45

Fig 4.4. Training images.

CHAPTER V Results Both the CPU and the GPGPU versions of the models were able to recognize all the training images (Fig. 4.4). They were also able to recognize nine out of ten noisy versions of these images accurately (Fig. 5.1), with image #6 not being recognized.

Fig 5.1. Additional test images.

5.1 Single GPGPU card performances The non-MPI neurons per second throughput of the GPGPUs and the CPUs is shown in Fig. 5.2 for the Izhikevich and Hodgkin Huxley models. Note that the GeForce 9800 GX2 card was unable to implement the larger image sizes due to limited on-board memory. The performances of both AMD processors were almost the same, so only one of them (the 2.4 GHz version) is plotted in Fig. 5.2. The performances of both Tesla C1060 based platforms, with one and two C1060s, are shown (the two-C1060 platform is listed as C1060 ×2). The results show that the performance of the GPGPUs is

higher than that of the general purpose processors examined. The fastest GPGPU and CPU utilized in this study were the Tesla S1070 and the quad core 2.67 GHz Xeon processor respectively. Comparing the performances of these two systems in Fig. 5.2 shows that the fastest GPGPU utilized (S1070) provided speedups of 5.6 and 84.4 over the fastest CPU (Xeon) for the Izhikevich and Hodgkin Huxley models respectively (for the largest image size). The speedups of each GPGPU over the CPU on its own platform are shown in Tables 5.1 and 5.2. Given that the C1060 is paired with the fastest processor (a quad core Intel Xeon), the speedup for this GPGPU was the lowest. The Izhikevich model provided a lower speedup as it is less computationally demanding (further discussion below).

Table 5.1. Speedups achieved on four GPGPU platforms for the Izhikevich model for different image sizes.
           | 384×384 | 768×768 | 1536×1536 | 3072×3072
S1070      | 11.7    | 20.1    | 23.7      | 24.6
9800 GX2   | 5.1     | 8.8     | 9.7       | --
C1060 (×1) | 1.2     | 1.5     | 1.6       | 1.7
C1060 (×2) | 2.3     | 3.0     | 3.3       | 3.6

Table 5.2. Speedups achieved on four GPGPU platforms for the Hodgkin Huxley model for different image sizes.
           | 384×384 | 768×768 | 1536×1536 | 3072×3072
S1070      | 188.4   | 199.4   | 190.5     | 177.0
9800 GX2   | 74.4    | 74.6    | --        | --
C1060 (×1) | 22.8    | 22.4    | 21.5      | 21.5
C1060 (×2) | 45.6    | 45.1    | 43.0      | 43.0

For both the Izhikevich and Hodgkin-Huxley models, image sizes of 384×384, 768×768, 1536×1536 and 3072×3072 pixels were examined. In the Hodgkin-Huxley model each neuron requires many more parameters, thus increasing the overall memory

footprint of the model. As a result, the 1536×1536 and 3072×3072 images could not be implemented for the Hodgkin-Huxley model due to limited memory on the GeForce 9800 GX2 GPGPU.

Fig 5.2. Neurons/second throughput of GPGPU and CPU processors for (a) the Izhikevich model and (b) the Hodgkin Huxley model.

The runtime breakdown of the GPGPU versions of the models for the 768x768 image size is shown in Fig. 5.3. Both the level 1 and level 2 neuron computations (kernels 1 and 3 in Fig. 4.1), along with the level 2 neuron input weight computations (kernel 2), take place on the GPGPU. The rest of the runtime is for: 1) main memory access, including data transfer between the host (CPU) memory and the device (GPGPU) memory, and 2) CPU operations, including initialization and checking all 48 level-two neurons' voltages to find out which one fired (to determine the final recognition result). Given that the number of level 2 neurons does not vary with the input image size, the computation time for the level 2 neurons remains constant for a given model. The runtimes for the remaining components do vary with the number of level 1 neurons, and hence with the input image size. It is seen in Fig. 5.3 that the Izhikevich model has a higher percentage of time spent on CPU and memory access, compared to the Hodgkin-Huxley model. This leads to a higher overhead ratio in the Izhikevich model, and thus a lower speedup. The bytes of data required per flop for the models are listed in Table 5.3. It also lists the peak bytes per flop capability of the three GPGPUs examined (calculated from the specifications in [12], [13] and [14]). Table 5.4 lists the GFLOPs actually achieved by the different GPGPU cards for the 768×768 network configuration. The results show that the memory bandwidth requirement for the Izhikevich model is about 19 times the peak capability of the Tesla GPGPUs, while the bandwidth requirement for the Hodgkin-Huxley model is about 3.5 times higher than the capability of these cards. This also leads to the speedup of the Izhikevich model being lower than that of the Hodgkin-Huxley model.

Table 5.3. Bytes/Flop requirement of the models and capabilities of the GPGPU cards.
Model/Platform        | Bytes/Flop
Model: Izhikevich     | 2.11
Model: Hodgkin Huxley | 0.38
Card: S1070           | 0.11
Card: C1060           | 0.11
Card: 9800 GX2        | 0.11

Table 5.4. GFLOPs achieved by different GPGPU cards.
GPGPUs   | Peak | Izhikevich | Hodgkin-Huxley
9800GX2  | 1152 | 97.68      | 860.34
C1060    | 933  | 70.19      | 456.52
C1060 ×2 | 1866 | 133.35     | 942.22
S1070    | 3732 | 197.88     | 1881.18

As shown in Table 5.4, the performance achieved by the double C1060 GPGPU configuration is about twice as high as the performance of a single C1060 GPGPU. Also, the performance of the Tesla S1070 GPGPU is about 4 times as high as a single C1060 GPGPU. This is because the Tesla C1060 has one Tesla T10 GPGPU inside with 4 GB of memory, while the Tesla S1070 has four such T10 GPGPUs, each with its own 4 GB memory bank. Thus the performance of the S1070 is about 4 times that of the C1060, and the performance increases linearly with the number of cores being used. Even though the C1060 card is a more advanced GPGPU than the 9800 GX2, the performance of a single C1060 card is no better than a 9800 GX2 card. This is because the 9800 GX2 has a dual-PCB configuration (as described in section 2.3) which includes two G92 chips. The total number of CUDA cores on a 9800 GX2 is 256 (128 per G92 GPGPU), more than the total of 240 CUDA cores in a C1060 card. Hence the 9800 GX2 card has a peak performance of 1152 GFLOPs while the C1060 has a peak

performance of 933 GFLOPs. What is more, the 9800 GX2 GPGPU has a smaller (1 GB) but faster (128 GB/s) memory compared with the larger (4 GB) but slower (102 GB/s) memory of the C1060 GPGPU. Both the larger number of CUDA cores and the faster memory help the 9800 GX2 card achieve better performance than a single C1060 card.

Fig 5.3. Timing breakdown by percentage of overall runtime for processing the 768x768 image using (a) the Izhikevich model and (b) the Hodgkin Huxley model.

5.2 MPI performances Since the performance of a single GPGPU is limited by both its capacity for parallelization and its on-board memory space, MPI is utilized here for the high performance implementation of larger networks such as 6144×6144 and 9216×9216. Both the memory space and the total amount of parallelization increase linearly with the number of nodes devoted to this parallel computation. To make full use of each node, larger images are segmented into multiple sub-images, with the individual sub-images processed concurrently using multiple GPGPU cores. The size of each sub-image is equal to the largest size feasible in a single GPGPU core (described in detail in section 5.3). As shown in Table 5.5 and Fig 5.4, a 6144×6144 image is broken into four 3072×3072 segments for both the Izhikevich and Hodgkin-Huxley models. Similarly, the 9216×9216 image size is broken into nine 3072×3072 sub-images and processed by nine GPGPU cores. The size of each sub-image processed in each GPGPU core is consistently 3072×3072. Therefore the speedup for larger images is slightly less than the maximum seen for each of the models in the previous results, because of task synchronizations and memory bandwidth limitations between host (CPU) memory and device (GPGPU) memory.

Table 5.5. GPGPU cluster run time (ms).
Image size | Nodes | Izhikevich | Hodgkin-Huxley
3072×3072  | 1     | 124.86     | 17838.37
6144×6144  | 4     | 128.36     | 17843.92
9216×9216  | 9     | 131.19     | 17847.06

Fig 5.4. Neurons/second throughput of the GPGPU cluster.

The MPI timing breakdown is shown in Table 5.6. The MPI communication time is a very small fraction of the whole MPI run time and has almost no influence on the system performance, but it still grows with the number of nodes used: the more nodes devoted to parallel processing, the longer it takes to communicate via the InfiniBand connections between the worker nodes and the master node. The communication time versus the number of nodes is plotted in Fig 5.5.

Table 5.6. MPI timing breakdown (ms).
Image size                            | 3072×3072 | 6144×6144 | 9216×9216
Number of nodes                       | 1         | 4         | 9
Izhikevich: Communication time        | 0.000002  | 0.000294  | 0.00048
Izhikevich: Level 2 neuron update     | 0.44      | 0.44      | 0.44
Izhikevich: Level 1 neuron update     | 124.42    | 127.92    | 130.75
Izhikevich: Total run time            | 124.86    | 128.36    | 131.19
Hodgkin-Huxley: Communication time    | 0.000002  | 0.000291  | 0.00053
Hodgkin-Huxley: Level 2 neuron update | 38.72     | 38.72     | 38.72
Hodgkin-Huxley: Level 1 neuron update | 17799.65  | 17805.2   | 17808.34
Hodgkin-Huxley: Total run time        | 17838.37  | 17843.92  | 17847.06
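The step in which the master CPU adds the 48-element current vectors from the worker nodes (section 4.3) maps naturally onto a single collective call; the sketch below uses a generic MPI_Reduce and is not taken from the thesis code.

#include <mpi.h>

/* Element-wise sum of each node's 48-element post-synaptic current vector,
 * delivered to rank 0 (the master CPU), which then updates the level 2 neurons. */
void combine_level2_currents(const float local_I[48], float total_I[48])
{
    MPI_Reduce(local_I, total_I, 48, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
}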

Fig 5.5. MPI communication time versus the number of T10 GPGPU cores.

5.3 Memory analysis For the two models (Izhikevich and Hodgkin-Huxley) implemented in this study, a network with Ni level 1 neurons and Ne level 2 neurons will have Ni×Ne synapses. Since all neuronal parameters were represented with 32-bit single precision floating point numbers, each neuron parameter requires 4 bytes of memory. As a result, each neuron of the Izhikevich model (with 4 parameters) requires 16 bytes, and each neuron of the Hodgkin Huxley model (with 25 parameters) requires 100 bytes. The memory space required by the different network elements for varying network configurations is shown in Table 5.7.

Table 5.7. Memory required by the different network components (bytes).
SNN component      | Izhikevich                | Hodgkin-Huxley
Ni level 1 neurons | (4 Ni) × 4                | (25 Ni) × 4
Ne level 2 neurons | (4 Ne) × 4                | (25 Ne) × 4
Synaptic weights   | (Ne Ni + 2 Ni + Ne) × 4   | (Ne Ni + 2 Ni + Ne) × 4
Total memory       | (Ne Ni + 6 Ni + 5 Ne) × 4 | (Ne Ni + 27 Ni + 26 Ne) × 4
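The totals in Table 5.7 can be written directly as functions of the two neuron counts; the sketch below (with n1 level 1 neurons, n2 level 2 neurons, and 4 bytes per single-precision value) simply restates the table's formulas and is included only for illustration.

/* Total bytes for the Izhikevich network of Table 5.7. */
double izhikevich_bytes(double n1, double n2)
{
    return (n1 * n2 + 6.0 * n1 + 5.0 * n2) * 4.0;
}

/* Total bytes for the Hodgkin-Huxley network of Table 5.7. */
double hodgkin_huxley_bytes(double n1, double n2)
{
    return (n1 * n2 + 27.0 * n1 + 26.0 * n2) * 4.0;
}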

Fig 5.6. Memory requirement for different configurations (required GPU memory versus number of neurons, with the 9800 GX2, C1060, S1070, and GPU cluster memory capacities marked).

Based on Table 5.7, the total amount of memory required by the different network configurations is plotted in Fig 5.6. As shown in Fig 5.6, the amount of memory required increases almost linearly with the number of neurons. The number of neurons, shown on the x-axis, ranges from 147k to 84,935k. The amount of memory required, shown on the y-axis, ranges from 28 MB to 23.4 GB. The NVIDIA GeForce 9800 GX2 card includes two G92 GPGPUs, each with a dedicated global memory of 512 MB, so for the Izhikevich model image sizes up to 1536×1536 could be implemented. Each T10 core in both the C1060 and S1070 platforms includes 4 GB of global memory, so configurations up to 3072×3072 can be implemented on each T10 GPGPU core. For the NVIDIA Tesla GPGPU cluster, 1 to 9 GPGPU cores were used in the MPI implementations, with image sizes up to 9216×9216 implemented.

Even larger images are implemented using MPI by segmenting them into multiple 3072×3072 sub-images, because a single T10 GPGPU core can accommodate image sizes of at most 3072×3072. By using multiple T10 GPGPU cores, with each core processing one 3072×3072 sub-image, very large images can be reconstructed and processed. In this thesis, images of sizes 6144×6144 and 9216×9216 are processed with four and nine nodes respectively. The size of the sub-image processed in each GPGPU core is consistently 3072×3072 pixels.

5.4 Firing rate analysis In the CPU implementations of the models, the weight summations in eq. 7 can be carried out selectively by considering only the level 1 neurons that fired (i.e., those for which f(i) in eq. 7 is 1). With the CUDA language constructs supported on the GPGPUs utilized, we have not seen any method to easily determine whether any of the level 1 neurons evaluated in parallel (over 100) have fired (this can be done only through a slow serial procedure). As a result, we evaluate eq. 7 for all level 1 neurons. Thus, a variation in the level 1 neuron firing rate would affect the CPU runtime, but not the GPGPU runtime. Fig. 5.7 examines the speedup seen with variations in the level 1 firing rate for the Izhikevich and Hodgkin-Huxley models respectively on the GeForce 9800 GX2 platform over the AMD processor paired with it. In the Izhikevich model, there is a significant change in the speedup, with the highest speedup seen when all the level 1 neurons fired (providing a speedup of about 20 times). In this case the CPU run time is the highest, while the GPGPU runtime remains the same. In the Hodgkin-Huxley model, the weight computation time is a tiny fraction of the overall runtime (as seen in Fig. 5.3),