Parallel Numerical Simulation of Visual Neurons for Analysis of Optical Illusion




2012 Third International Conference on Networking and Computing
978-0-7695-4893-7/12 $26.00 © 2012 IEEE. DOI 10.1109/ICNC.2012.27

Akira Egashira, Shunji Satoh, Hidetsugu Irie and Tsutomu Yoshinaga
Graduate School of Information Systems, University of Electro-Communications, 1-5-1 Chofugaoka, Chofu-shi 182-8585, Tokyo, Japan
Email: egashira@comp.is.uec.ac.jp, {shun, irie, yosinaga}@is.uec.ac.jp

Abstract

The detailed mechanism by which visual neurons in the human brain cause optical illusions is not well understood, and numerical simulation is helpful for analyzing the human visual system. This paper describes implementation techniques for a parallel numerical simulation, run on a GPU-accelerated PC cluster, that helps in understanding optical illusions. Our parallel acceleration techniques comprise three points. First, the input images of the numerical simulation are processed efficiently by dividing them among multiple computation nodes using MPI (Message Passing Interface). Second, convolution, which is the dominant computation for the optical flow, is accelerated by GPU. Finally, an algorithm for computing convolution that is specialized for the analysis of optical illusions is proposed to speed up the simulation. Our experimental results give an interesting insight: the optical flow values for images that cause an optical illusion are quite different from those for images that do not. We also demonstrate that our implementation of the simulation works efficiently on the GPU-accelerated PC cluster.

Index Terms: Parallel computing; CUDA; GPU; MPI; Numerical simulation; Convolution; Visual neural system.

I. INTRODUCTION

Understanding the human brain is important not only to clarify its competence but also to develop engineering applications. The visual information processing function of humans is one of the challenging topics to be examined, and computational numerical simulation is one approach to it alongside clinical trials.

An optical illusion is an interesting phenomenon in which the visually perceived image differs from objective reality. Fig. 1 shows a popular image called the rotating snake, which causes an optical illusion: the circular mosaic pattern is perceived as rotating. The detailed mechanism inside the human brain has not been sufficiently clarified, so numerical simulation is helpful for judging whether a hypothetical model is plausible. In the numerical simulation of visual neurons using a linear model [1], this hypothetical model is provided as an image processing filter (henceforth, kernel).

As prior research, the Blue Brain Project [2] conducts fine-grained neural system simulation that adopts a so-called compartment model [2]. Although the compartment model can describe the mathematics at the level of individual neurons, it does not suit the analysis of theories at the level of visual function because it is too fine-grained and time consuming. Hence, we use the linear model in the visual numerical simulation. In the linear model, the output of a visual neuron is calculated by convolutions of the inputs to the neuron with synapse weights, as explained in Section III. The convolutions are simple but enormous in number, so computing them efficiently is the key to speeding up the simulation. Our previous work presents techniques for implementing an efficient parallel computation of the convolutions on a GPU-accelerated PC cluster [1].

In this paper, we use similarly developed software to compute the optical flow of input images. The optical flow is the apparent motion pattern of objects in a visual scene, and it is represented by vector values. We show that the peak value of the optical flow for an input image created from the rotating snake becomes larger than expected. These results are quite interesting, since the difference between the computed peak and the expected value results in the rotating snake being perceived as moving. To the best of our knowledge, this is the first work to consider the reason for motion perception of the rotating snake based on visual numerical simulation.

Another contribution of this paper is an efficient implementation of the visual numerical simulation on a GPU-accelerated PC cluster. Using the simulator developed in our previous work [1] as a baseline program, we optimized it for the purpose of investigating optical illusions. The baseline program performs the visual simulation in two steps. The first step divides an input 2D image into regions and assigns these regions one by one to the nodes of the PC cluster. The second step computes convolutions for the assigned regions in parallel using GPUs. The convolution for optical illusion analysis has some special features that derive from the regularity of the input data. By exploiting these features, we modify the convolution algorithm so as to reduce the amount of calculation. The algorithm changes the 3D convolution section into a 2D convolution. Since the convolution section is accelerated, we can also omit the data exchange among nodes. Finally, we achieve a 77 % reduction in execution time compared with the baseline program.

The rest of the paper is organized as follows. Section II explains related studies. Section III explains the methodology for examining optical illusions and the mathematical model for the visual simulation. Section IV describes our parallel implementation and acceleration methods. Section V shows a preliminary experiment and its results. Section VI presents evaluation results and discusses performance. Finally, Section VII concludes this paper.

Fig. 1. An image called rotating snake, which is popular for causing optical illusion.

II. RELATED WORK

Our study relates to the fields of HPC (High Performance Computing) and human visual simulation. We summarize some strongly related studies in these two fields.

RIKEN reports an implementation of a neural simulator [3]. The target area of the simulator is the so-called V1 field in the human brain, which relates to the basic part of the visual system. The US Air Force also studies a neural simulator using a PS3 cluster [4]. This study conducts a V1 model simulation with 265 million neurons incorporated across 1.6 million cortical columns. In comparison with these two studies, this paper differs in its motivation for visual simulation. Our motivation is to clarify the functional mechanism of the optical illusion.

An optical illusion is caused by the mechanism of optical flow in the human visual system. On the physiological side, there are some experiments on optical illusions [5]. A numerical simulation, on the other hand, is expected to clarify the theoretical mechanism of visual functions. However, computer simulation is time consuming because of its vast amount of calculation. This is one of the reasons that prevent a detailed computational simulation of the human brain. Hence, we adopt parallel processing to accelerate human visual simulation.

Our improvements in visual simulation have been made in the following order.
1) 1D optical illusion simulation system [6]
2) 2D optical flow simulation system [1]
3) 2D optical illusion simulation system (this paper)

The first study simplifies input images into one-dimensional (1D) data and treats the small involuntary eye movement as an optical flow in a wide sense. It is considered that the human brain has the ability to cancel this small eye movement in order to perceive objects as being at rest. We demonstrated that a moving pattern of 1D input brightness generated from the rotating snake image can be perceived with larger peak values of optical flow than the actual speed. The second study extends the first simulator to treat general two-dimensional (2D) movement. It shows scalable performance in computing optical flow on a GPU-accelerated PC cluster. Based on these previous works, this paper presents an analysis of optical illusion using more realistic 2D inputs than [1], as well as a high-performance visual simulation.

III. OPTICAL ILLUSION

A. Mechanism of optical illusion

When a human watches objects, the eyes are not actually at a standstill.
Eyes are constantly moving, and this action is called small involuntary eye movement. Although our sight is expected to move constantly because of this eye movement, stationary objects are properly perceived as keeping still. The reason is that the human visual system corrects the information from the retina using the optical flow detected in the brain. In the case of an optical illusion, however, the visual system may be deceived by detecting an erroneous optical flow. As shown in Fig. 2, incorrect optical flow leads to the optical illusion. Based on this assumption, our methodology for analyzing the optical illusion compares V_eye and V_detect, where V_eye is the moving speed of the input patterns given to a visual simulator and V_detect is the optical flow computed by the simulator.

Fig. 2. A methodology to analyze optical illusion.

B. Mathematical model for analysis of optical illusion

Here we explain a mathematical model of the human visual system. We use a linear model that is given by a 3D convolution function. To simulate the human visual system, the convolution uses spatio-temporal kernels, which work as image filters. Convolutions with various kernels therefore make it possible to simulate various kinds of image processing.

Equations (1) show a spatial kernel. g(x) is a Gaussian function and dg(x) is its derivative in the x direction. The parameter σ is the standard deviation, which determines the shape of these functions.

$$ g(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{x^2}{2\sigma^2}}, \qquad dg(x) = -\frac{x}{\sigma^3\sqrt{2\pi}}\, e^{-\frac{x^2}{2\sigma^2}} \tag{1} $$

Similarly, equations (2) are the functions for a temporal kernel. h(t) describes a time lag of n-th order and dh(t) is its time derivative. The parameter τ is a time constant.

$$ h(t) = \frac{t^{\,n-1}}{\tau^{n}\,\Gamma(n)}\, e^{-\frac{t}{\tau}}, \qquad dh(t) = \frac{t^{\,n-2}}{\tau^{\,n+1}\,\Gamma(n)}\,\bigl((n-1)\tau - t\bigr)\, e^{-\frac{t}{\tau}} \tag{2} $$

(t: frame number, n: the number of stages in the human visual system (n = 8) [7])

Using the spatial and temporal kernels above, our 3D kernels K_x, K_y and K_t are defined by equations (3). These kernels extend in three dimensions, as shown in Fig. 3. Each of them is a so-called separable kernel: every element is the product of three values taken from 1D arrays for the spatial coordinates x and y and the temporal coordinate t.
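The kernel construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' GPU code: the function names (`sample_axes`, `outer3`, `build_kernels`) are hypothetical, the sampling ranges follow Table I (x and y from -3.0 to 3.0 in steps of 0.2, t from 0 to 29), and the derivative signs are assumed to be those of the ordinary derivatives of g and h.

```python
import math

def g(x, sigma):
    """Gaussian of equation (1)."""
    return math.exp(-x * x / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def dg(x, sigma):
    """x-derivative of the Gaussian, equation (1)."""
    return -x / (sigma ** 3 * math.sqrt(2.0 * math.pi)) * math.exp(-x * x / (2.0 * sigma ** 2))

def h(t, tau, n=8):
    """n-th order time lag of equation (2); taken as zero for t < 0."""
    if t < 0:
        return 0.0
    return t ** (n - 1) / (tau ** n * math.gamma(n)) * math.exp(-t / tau)

def dh(t, tau, n=8):
    """Time derivative of h, equation (2); taken as zero for t < 0."""
    if t < 0:
        return 0.0
    return (t ** (n - 2) / (tau ** (n + 1) * math.gamma(n))
            * ((n - 1) * tau - t) * math.exp(-t / tau))

def sample_axes():
    """Sampling points assumed from Table I: 31 spatial samples, 30 frames."""
    xs = [(i - 15) * 0.2 for i in range(31)]   # -3.0 ... 3.0, exactly 0.0 at the centre
    ts = [float(k) for k in range(30)]         # 0 ... 29
    return xs, list(xs), ts

def outer3(ax, ay, at):
    """Separable 3D kernel K[t][y][x] = ax[x] * ay[y] * at[t]."""
    return [[[a * b * c for a in ax] for b in ay] for c in at]

def build_kernels(sigma, tau, n=8):
    """Return (K_x, K_y, K_t) as products of 1D arrays."""
    xs, ys, ts = sample_axes()
    gx, dgx = [g(x, sigma) for x in xs], [dg(x, sigma) for x in xs]
    gy, dgy = [g(y, sigma) for y in ys], [dg(y, sigma) for y in ys]
    ht, dht = [h(t, tau, n) for t in ts], [dh(t, tau, n) for t in ts]
    return outer3(dgx, gy, ht), outer3(gx, dgy, ht), outer3(gx, gy, dht)

if __name__ == "__main__":
    Kx, Ky, Kt = build_kernels(sigma=0.9, tau=1.1)
    print(len(Kx), len(Kx[0]), len(Kx[0][0]))  # frames, rows, columns
```

Because each kernel is a product of three 1D arrays, only 31 + 31 + 30 function evaluations are needed per kernel instead of one per element. The values σ = 0.9 and τ = 1.1 in the usage lines are the ones the paper reports using in Section V.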

Fig. 3. An example of 3D kernel (K_x(x, y, t)).

$$ K_x(x, y, t) = dg(x)\, g(y)\, h(t), \qquad K_y(x, y, t) = g(x)\, dg(y)\, h(t), \qquad K_t(x, y, t) = g(x)\, g(y)\, dh(t) \tag{3} $$

The input data of our visual simulator are movie data [1]. To simulate eye movement, we simply move an input image to the right or to the left at a fixed speed. In our experiments, the movement speed is 1 pixel per second (pixel/sec) to the right (dx = 1) or to the left (dx = -1). Thus, the pixels or brightness patterns generated from the input images are shifted right or left.

This optical flow calculation method requires the three partial derivatives ∂I/∂x, ∂I/∂y and ∂I/∂t (henceforth I_x, I_y and I_t, respectively), where I(x, y, t) is the brightness value at coordinates x, y and t. To simulate the human visual system, our simulation system calculates these derivatives by convolutions of the input I with a kernel K.

$$ I_x(x, y, t) = \sum_{i=-\frac{n}{2}}^{\frac{n}{2}} \; \sum_{j=-\frac{n}{2}}^{\frac{n}{2}} \; \sum_{k=0}^{m} K_x(i, j, k)\, I(x - i,\; y - j,\; t - k) \tag{4} $$

(n: spatial size of kernel, m: temporal size of kernel)

Similarly, I_y and I_t are calculated from convolutions of I with K_y and K_t, respectively. The optical flow is calculated by the Lucas-Kanade method [8]. The details are shown in equation (5). q_1, q_2, ..., q_n, which appear on the right side of this equation, are the spatio-temporal coordinates of the pixels inside the window, which is the summation range.

$$ \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} \sum_i I_x(q_i)^2 + \epsilon & \sum_i I_x(q_i)\, I_y(q_i) \\ \sum_i I_x(q_i)\, I_y(q_i) & \sum_i I_y(q_i)^2 + \epsilon \end{bmatrix}^{-1} \begin{bmatrix} -\sum_i I_x(q_i)\, I_t(q_i) \\ -\sum_i I_y(q_i)\, I_t(q_i) \end{bmatrix} \tag{5} $$

(u: x-directional value of optical flow, v: y-directional value of optical flow, ε: the parameter for avoiding the aperture problem [9])

IV. IMPLEMENTATION

Although the optical flow can be computed in a way similar to [1], the required kernel size and the resolution of the convolution differ as follows.

Wide spatio-temporal convolution size: In our simulation program, the temporal size of the spatio-temporal convolution is 30 frames. This temporal size is much larger than in general image processing applications and helps to support various kernel types.

High precision of convolution: In order to apply a precise kernel value to the brightness value at each x coordinate, it is necessary to compute the convolution at a very small interval. As shown in Fig. 4, we compute it at every 0.2 pixel so that the Gaussian kernel draws a smooth curve. The same small interval is applied to the kernel in the y direction. The intervals for each direction are shown in Table I.

Fig. 4. Correspondence of Gaussian kernel values to x-coordinates in a case of 0.2 pixel pitch.

TABLE I
RANGES AND INTERVALS OF A KERNEL.

               min     max    interval
x-direction   -3.0     3.0    0.2
y-direction   -3.0     3.0    0.2
t-direction    0.0    29.0    1.0

Table I shows the parameters that realize a realistic kernel for the convolution in our simulation. The range and sampling interval along each of the three dimensions are shown as min, max and interval, respectively. Accordingly, the convolution in our simulation requires a large amount of calculation. The convolutions consist of multiplications and summations of the 3D kernel (31x31x30) with the corresponding pixel values. In like manner, the optical flow requires summations and matrix operations. The summations add up adjacent convolution results (I_x, I_y, I_t) within a window of size 15x15. The matrix operations need several calculations, as shown in equation (5). Equation (6) shows the number of floating-point (FP) operations per pixel that our simulation program requires to compute the optical flow using the 3D kernel. This huge amount of computation dominates the execution time of our simulation.

$$ \underbrace{\{4\,(31 \times 31 \times 30)\} \times 3}_{\text{\# of FP operations in convolutions}} + \underbrace{\{5\,(15 \times 15) + 13\} \times 2}_{\text{\# of FP operations in calculations for optical flow}} = 348236 \tag{6} $$

For example, in a case where the input resolution is 32x32, the total calculation amount reaches 3.3 × 2^30 operations.

We use the implementation in [1] as a baseline. The main characteristic of the baseline is parallel processing on a GPU-accelerated PC cluster. It divides the 2D input frame data into as many regions as there are nodes in the PC cluster. Then the convolution, which dominates the computation of the optical flow, is performed on the GPU of each node. Next, we explain the basic procedure of the baseline, and after that we introduce three additional acceleration techniques for analyzing the optical illusion.

A. An execution work flow

The operation flow of our baseline simulation program is as follows.
1) Initialization: The root node loads the input movie data from a local directory. The input data are divided into mesh partitions and scattered to the other nodes. Each node transfers the received input data to its corresponding GPU.
2) Convolution: Convolutions are performed by the GPU at each node.
3) Data exchange: Wing area data are exchanged between neighbor nodes after the convolution.
4) Optical flow: Each node calculates optical flow values by referring to the convolution data.
5) Gather: Finally, the root node gathers the calculated optical flow results from all other nodes and generates the final optical flow.

B. Simulate eye's movement by GPU (gpu_eye_sim)

This acceleration, called gpu_eye_sim, simulates the eye movement on the GPU, as illustrated in Fig. 6. gpu_eye_sim requires no movie data as input, only a single image. In gpu_eye_sim, the GPU simulates the eye movement by moving a region of interest (ROI) and fetching the frame data from the moved ROI. To avoid fetching frame data from outside the image, the image is provided with a sufficient margin (Fig. 5). All processes except the movie generation process are the same in the baseline program and gpu_eye_sim.

The baseline program requires large GPU memory regions to store the input movie data. In this experiment, on the other hand, we use only static images and emulate their movement inside the GPU. This aims to improve the efficiency of GPU memory access. Besides, gpu_eye_sim reduces the communication overhead caused by scattering the input data frame by frame from the root node.

Fig. 5. A concept of avoiding a frame data from outside of the image.

Fig. 6. A concept of gpu eye's movement simulation.

C. Packing the 3D convolution into 2D (packed_conv)

As shown in Fig. 7, the convolution for the optical illusion simulation has some specific properties. The input frame data are simple pictures in which a specific brightness pattern shifts along the x dimension, so the 3D convolution is a collection of simple 2D convolutions. Taking these characteristics into consideration, the 3D kernel data are integrated into 2D so that the convolution can be performed with a reduced amount of calculation.

In Fig. 7, green and red lines represent kernel and input data, respectively. The left side of the figure shows the collection of 2D convolutions of the input frames with the kernel data. As shown in the middle part of Fig. 7, this collection of convolutions is equivalent to calculating convolutions of the input image with shifted kernel data. Finally, this collection of shifted kernel data can be packed into a single 2D kernel. In this way, the algorithm packs the 3D convolution into a 2D one. As shown in Fig. 8, the packed kernel is generated by superposing the shifted 2D kernels. The superposition is done by matrix additions between the corresponding kernel data elements.

Fig. 7. A concept of packing convolution.

TABLE II
A COMPARISON OF THE NUMBER OF FP OPERATIONS.

                          number of floating-point calculations [× 2^30]
input image resolution    baseline    packed convolution
240x240                     181          19
480x480                     727          74
720x720                    1635         168

Table II compares the number of calculations required for the convolutions by the baseline and by the packed convolution. The latter reduces the number of calculations to approximately 1/10.
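The packing idea can be checked on a toy example. The sketch below is an illustration under simplifying assumptions, not the authors' CUDA implementation: it uses an integer shift of dx pixels per frame and a unit sampling pitch (the paper samples the kernel every 0.2 pixel), treats pixels outside the image as zero, and all function names are hypothetical. Since frame t is the still image S shifted by dx·t, the sample I(x - i, y - j, t - k) equals S at column (x - dx·t) - (i - dx·k), so slice k of the 3D kernel can be shifted by -dx·k columns and added into one 2D kernel.

```python
def frame(S, t, dx, x, y):
    """Brightness I(x, y, t) of a movie that shifts still image S by dx pixels per frame."""
    xs = x - dx * t
    if 0 <= y < len(S) and 0 <= xs < len(S[0]):
        return S[y][xs]
    return 0.0  # zero outside the image (assumption of this sketch)

def conv3d_at(S, K, dx, x, y, t):
    """Direct 3D convolution in the form of equation (4), with K indexed as K[k][j][i]."""
    m, n = len(K), len(K[0])
    r = n // 2
    total = 0.0
    for k in range(m):
        for j in range(-r, r + 1):
            for i in range(-r, r + 1):
                total += K[k][j + r][i + r] * frame(S, t - k, dx, x - i, y - j)
    return total

def pack_kernel(K, dx):
    """Superpose the m shifted 2D slices of K into one 2D kernel P; c is its centre column."""
    m, n = len(K), len(K[0])
    r = n // 2
    pad = abs(dx) * (m - 1)
    P = [[0.0] * (n + 2 * pad) for _ in range(n)]
    c = r + pad
    for k in range(m):
        for j in range(n):
            for i in range(-r, r + 1):
                P[j][c + i - dx * k] += K[k][j][i + r]  # slice k shifted by -dx*k columns
    return P, c

def conv2d_at(S, P, c, x0, y):
    """2D convolution of the still image with the packed kernel at column x0 = x - dx*t."""
    r = len(P) // 2
    total = 0.0
    for j in range(-r, r + 1):
        for ip in range(-c, len(P[0]) - c):
            xs, ys = x0 - ip, y - j
            if 0 <= ys < len(S) and 0 <= xs < len(S[0]):
                total += P[j + r][ip + c] * S[ys][xs]
    return total

if __name__ == "__main__":
    S = [[float((3 * x + 5 * y) % 7) for x in range(8)] for y in range(5)]
    K = [[[float(1 + k + 2 * j - i) for i in range(3)] for j in range(3)] for k in range(3)]
    dx, t = 1, 2
    P, c = pack_kernel(K, dx)
    worst = max(abs(conv3d_at(S, K, dx, x, y, t) - conv2d_at(S, P, c, x - dx * t, y))
                for y in range(5) for x in range(8))
    print("largest difference between 3D and packed 2D results:", worst)
```

In this sketch the direct form needs n·n·m multiplications per output value, whereas the packed form needs at most n·(n + 2·|dx|·(m - 1)), and the packed kernel only has to be built once per shift direction. The exact saving in the paper depends on its sub-pixel sampling and is the one reported in Table II.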

Fig. 8. A concept illustration of packing kernel (more detail).

D. Omitting data exchange (no_exchange)

Since the packed convolution reduces the amount of calculation, communication among the nodes of the PC cluster becomes the critical part. In general, the gather collective communication is already performed optimally by the OpenMPI library [10]. Therefore, we optimize the data exchange section. In this optimization, the dynamic exchange of wing data among neighbors is omitted entirely. Instead, the wing area data are distributed to adjacent nodes with overlap, so that each node computes the convolution results it requires by itself.

V. EXPERIMENTS

A. Parameter Tuning

A preliminary experiment has been carried out to verify that our simulation system can compute a correct optical flow. To do so, parameter tuning is required to decide on proper kernels. First, parameter tuning has been done for τ and σ, which appear in equations (1) and (2). As shown in Fig. 9, we prepared a simple image whose brightness pattern is inclined along the x dimension, and then created a movie by shifting the edge right or left at a fixed speed of ± 1 pixel/sec. The optical flow for these inputs is computed while varying τ and σ comprehensively. The computed optical flow is compared with the input speed, that is, ± 1 pixel/sec, and the error between them is obtained. Fig. 10 shows the result of the parameter tuning. The yellow area in this figure is the range of τ and σ that yields an error of less than ± 1 %. We use parameter values in this area for the rest of the experiments.

Fig. 9. Parameter tuning to compute reasonable optical flow.

Fig. 10. Result of the kernel check test. Yellow area shows combinations of τ and σ passed. The parameter tuning narrows down the τ and σ parameter range to 3 %.

B. Analysis of optical illusion

After the parameter tuning, we use two types of input movies, one generated from the rotating snake (Fig. 11) and one from a slightly different image (Fig. 12), by moving them at 1 pixel/sec in the right or left direction along the x dimension. Note that the former, hereafter called rotating snake, is known to cause an optical illusion, whereas the latter, called non-rotating snake, does not. We then compute the optical flow values for each image and compare their peak values for the following four cases.

case 1: A peak value when rotating snake is shifted to the left direction.
case 2: A peak value when non-rotating snake is shifted to the left direction.
case 3: A peak value when rotating snake is shifted to the right direction.
case 4: A peak value when non-rotating snake is shifted to the right direction.

Fig. 11. An input image generated from the rotating snake (rotating snake).

Fig. 12. A different pattern of brightness compared to the rotating snake (non-rotating snake).

Table III shows the optical flow obtained with the parameters τ = 1.1 and σ = 0.9, which are selected from the combinations in Fig. 10. Fig. 13 illustrates the variation of the optical flow value at each coordinate x.

V_r-: A peak value of optical flow in case 1.
V_n-: A peak value of optical flow in case 2.
V_r+: A peak value of optical flow in case 3.
V_n+: A peak value of optical flow in case 4.

Fig. 13. Comparisons of optical flow values [pixel/sec] along x in cases shifted to the right direction (V_r+, V_n+ and V_eye+) and to the left direction (V_r-, V_n- and V_eye-). (a) Optical flow values in a case shifted to the right direction. (b) Optical flow values in a case shifted to the left direction.

TABLE III
COMPARISON OF PEAK VALUES OF THE OPTICAL FLOW

                 peak value of optical flow [pixel/sec]
τ      σ        V_n+     V_r+     V_n-     V_r-
1.1    0.9      0.98     0.52    -0.98    -0.51

There is an important finding. The absolute values of V_r+ and V_r- are considerably smaller than those of V_n+ and V_n-. This could be a reason for the optical illusion, because the mechanism in the human brain that cancels the small involuntary eye movement cannot completely negate the movement for a particular pattern such as the rotating snake. Namely, when it cancels 0.52 pixel/sec in case 3, the rotating snake is perceived to move at 0.48 pixel/sec (V_illusion), as shown in equation (7). Similarly, the rotating snake is perceived to move at -0.49 pixel/sec (V_illusion) in case 1.

$$ \underbrace{1.0}_{V_{eye+}} - \underbrace{0.52}_{V_{detected}\,(=V_{r+})} = \underbrace{0.48}_{V_{illusion}} \tag{7} $$

VI. PERFORMANCE OF PARALLEL EXECUTION

This section discusses the performance of the visual simulation programs on a GPU-accelerated PC cluster. We use a 16-node cluster in which each node is equipped with a GPU. Table IV shows the hardware and software specification of the cluster nodes. We measure the execution time and throughput (GFLOPS) of the visual simulation programs when computing the optical flow for three sizes of input movie data. From the results, we found that the performance difference between gpu_eye_sim and the baseline program is negligible. Thus, hereafter we show the results of three implementations: gpu_eye_sim, packed_conv and no_exchange.

A. Comparison between gpu_eye_sim and packed_conv

The performance improvement of packed_conv over gpu_eye_sim is shown in Fig. 14. The main reason for this speedup is the reduced number of operations needed to compute the convolutions. We note that a 70 % reduction in execution time relative to gpu_eye_sim is attained for the case of 720x720 frames on 16 nodes.
On the other hand, the throughput of packed_conv drops compared with gpu_eye_sim. The main reason for this degradation is the reduced number of operations needed to compute the convolutions. The second reason is the communication overhead of the gather and data exchange sections in packed_conv.

TABLE IV
EXPERIMENTAL ENVIRONMENT.

CPU                                 Intel Xeon Quad-Core CPU W3520
  Clock speed                       2.67 GHz
  Memory                            6 GB
GPU                                 NVIDIA C1060 (GT200 architecture)
  Clock speed                       1.296 GHz
  Number of Streaming Processors    240
  Peak performance                  933 GFLOPS
  Memory                            4 GB
  Memory bandwidth                  102 GB/sec
Graphics bus                        PCI Express x16 Generation 2.0
OS                                  CentOS 5.3
C Compiler                          Intel C compiler 11.1
CUDA                                CUDA Toolkit 3.2

TABLE V
A BREAKDOWN OF EACH SECTION OF gpu_eye_sim AND packed_conv. THE DATA ARE EXECUTION TIMES FOR A 720x720 FRAME USING 16 NODES. THE NUMBERS IN BRACKETS ARE THE PERCENTAGE OF THE TOTAL EXECUTION TIME.

                        gpu_eye_sim          packed_conv
convolution [sec]       0.1399 (83.9 %)      0.0234 (46.7 %)
data exchange [sec]     0.0031 ( 1.9 %)      0.0035 ( 7.0 %)
optical flow [sec]      0.0007 ( 0.4 %)      0.0007 ( 1.4 %)
gather [sec]            0.0229 (13.8 %)      0.0225 (44.9 %)

Table V is a breakdown of each section of gpu_eye_sim and packed_conv. In packed_conv, the percentage of the communication sections (the data exchange and gather sections) is higher than in gpu_eye_sim. This higher share of communication represents overhead that degrades the performance of packed_conv.

B. Comparison between packed_conv and no_exchange

Fig. 16 shows the performance improvement of no_exchange over packed_conv. The improvement increases with the number of nodes, since frequent data exchange among nodes leads to larger communication overhead. In addition, as shown in Fig. 17, the effect of omitting the data exchange becomes more significant as the image size becomes smaller.

C. Comparison between all acceleration plans

Fig. 18 shows the overall result of the evaluation experiment. As shown in the figure, the scalability of each acceleration plan can be confirmed. However, the effects of these acceleration plans gradually become limited.
This result is caused by the lack of improvement in the gather section.

Fig. 14. Performance comparison between gpu_eye_sim and packed_conv for 720x720, 480x480 and 240x240 inputs. (a) Execution time [sec]. (b) Throughput [GFLOPS].

Fig. 15. Performance growth rate [%] between packed_conv and no_exchange for 720x720, 480x480 and 240x240 inputs.

Fig. 16. Performance comparison between packed_conv and no_exchange for 720x720, 480x480 and 240x240 inputs. (a) Execution time [sec]. (b) Throughput [GFLOPS].

Fig. 17. Transition of percentage of data exchanging of packed_conv in a case using 16 nodes.

Fig. 18. Whole evaluation experiment result (execution time) in a 16 nodes case.

VII. CONCLUSION

The achievements of our study fall under two perspectives: a simulation of the human visual system, and parallel acceleration with MPI and CUDA. In this paper, we have considered a mechanism that causes optical illusion, based on numerical simulation of visual neurons. Our outcomes are twofold. First, we found that the peak values of optical flow became quite larger than the moving speed of the input scene when an input pattern that causes the optical illusion (rotating snake) is used. Second, the absolute values of the optical flow in the right and left directions are considerably different for the rotating snake input. These two results are considered to be reasons why the mechanism in the human brain that cancels the small involuntary eye movement cannot work well for the rotating snake.

Another contribution of this paper is a set of acceleration techniques for the numerical simulation of visual neurons on a GPU-accelerated PC cluster, presented as a case study for the analysis of optical illusion. Finally, we achieve a 77 % reduction in execution time compared with the baseline program for the input size of 720x720 using 16 nodes.

ACKNOWLEDGMENT

This research is supported in part by JSPS Grants-in-Aid for Scientific Research (C) Nos.22542 and 245371.

REFERENCES

[1] J. Ohmura et al., "Multi-GPU acceleration of optical flow computation in visual functional simulation," 3rd International Workshop on Parallel and Distributed Algorithms and Applications, pp. 228-234, 2011.
[2] H. Markram, "The blue brain project," Neuroscience, vol. 7, pp. 153-160, 2006.
[3] H. Sasaki, S. Satoh, and S. Usui, "Neural implementation of coarse-to-fine processing in V1 simple neurons," Neurocomputing, vol. 73, pp. 867-873, 2010.
[4] R. E. Pino, M. Moore, J. Rogers, and Q. Wu, "A columnar V1/V2 visual cortex model and emulation using a PS3 Cell-BE array," 2011, pp. 1667-1674.
[5] I. Kuriki, H. Ashida, I. Murakami, and A. Kitaoka, "Functional brain imaging of the rotating snakes illusion by fMRI," Journal of Vision, vol. 8, pp. 1-10, 2008.
[6] Y. Sato, S. Satoh, T. Miyoshi, H. Irie, and T. Yoshinaga, "Parallel numerical simulation for the linear model of visual neurons with MPI," SIG Technical Report (IPSJ), vol. 2011-HPC-129, pp. 1-8, 2011, (in Japanese).
[7] S. Shunji and U. Shiro, "Fractional derivative of Gaussian functions: A model for spatio-temporal receptive fields of V1 simple cells," IEICE technical report, vol. 108, pp. 141-146, 2009, (in Japanese).
[8] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," Proceedings of the Seventh International Joint Conference on Artificial Intelligence (IJCAI-81), pp. 674-679, 1981.
[9] Y. Weiss, E. P. Simoncelli, and E. H. Adelson, "Motion illusions as optimal percepts," Nature Neuroscience, vol. 5, no. 6, pp. 598-604, 2002.
[10] OpenMPI: Open source high performance computing, http://www.unidata.ucar.edu/software/netcdf/.