2012 Third International Conference on Networking and Computing

Parallel Numerical Simulation of Visual Neurons for Analysis of Optical Illusion

Akira Egashira, Shunji Satoh, Hidetsugu Irie and Tsutomu Yoshinaga
Graduate School of Information Systems, University of Electro-Communications, 1-5-1 Chofugaoka, Chofu-shi 182-8585, Tokyo, Japan
Email: egashira@comp.is.uec.ac.jp, {shun, irie, yosinaga}@is.uec.ac.jp

Abstract
The detailed mechanism of optical illusion caused by visual neurons in the human brain is not well understood, and numerical simulation helps to analyze the human visual system. This paper describes implementation techniques for a parallel numerical simulation that helps in understanding optical illusion, using a GPU-accelerated PC cluster. Our parallel acceleration techniques comprise the following three points. First, the input images of the numerical simulation are processed efficiently by dividing them among multiple computation nodes using MPI (Message Passing Interface). Second, convolution, which is the dominant computation for the optical flow, is accelerated by GPU. Finally, a convolution algorithm specialized for the analysis of optical illusion is proposed to speed up the simulation. Our experimental results give an interesting insight: the optical flow values for images that cause optical illusion are quite different from those for images that do not. We also demonstrate that our implementation of the simulation works efficiently on the GPU-accelerated PC cluster.

Index Terms: Parallel computing; CUDA; GPU; MPI; Numerical simulation; Convolution; Visual neural system.

I. INTRODUCTION
Understanding the human brain is important not only to clarify its competence but also to develop engineering applications. The visual information processing function of humans is one of the challenging topics to be examined, and computational numerical simulation is one approach alongside clinical trials. Optical illusion is an interesting phenomenon in which visually perceived images differ from objective reality. Fig.
1 shows a popular image called rotating snake, which causes an optical illusion; that is, the circular mosaic pattern is perceived as rotating. The detailed mechanism inside the human brain has not been sufficiently clarified; therefore, numerical simulation is helpful for considering whether a hypothetical model is plausible. This hypothetical model is provided by an image processing filter (henceforth kernel) in the numerical simulation of visual neurons using a linear model [1].

As prior research, the Blue Brain Project [2] conducts fine-grained neural system simulation that adopts a so-called compartment model [2]. Although the compartment model can describe mathematics at the level of individual neurons, it is not suited to analyzing theory at the level of visual function because it is too fine-grained and time consuming. Hence, we use the linear model in the visual numerical simulation.

In the linear model, an output from a visual neuron is calculated by convolutions of the inputs to the neuron with synapse weights, as explained in Section III. The convolutions are simple but enormous in number. Efficient computation of the convolutions is a key to speeding up the simulation. Our previous work presents techniques to implement efficient parallel computation of the convolutions on a GPU-accelerated PC cluster [1]. In this paper, we use similarly developed software to compute the optical flow of input images. The optical flow is the apparent motion pattern of objects in a visual scene, and it is represented by vector values. We show that the peak value of the optical flow for an input image created from the rotating snake becomes larger than expected. These results are quite interesting, since the difference between the computed peak and the expected value results in the rotating snake being perceived as moving. To the best of our knowledge, this is the first work to consider the reason for motion perception of the rotating snake based on visual numerical simulation.

Another contribution of this paper is showing an efficient implementation of the visual numerical simulation on a GPU-accelerated PC cluster.
Utilizing a simulator developed in our previous work [1] as a baseline program, we optimized it for the purpose of investigating optical illusion. The baseline program performs the visual simulation in two steps. The first step is dividing an input 2D image into regions and assigning these regions one by one to each node of the PC cluster. The second step is computing convolutions for the assigned regions in parallel utilizing GPUs. Meanwhile, the computation of convolution for optical illusion has some special features, which derive from the regularity of the input data. By applying these features, we modify the convolution algorithm so as to reduce the amount of calculation. The algorithm changes a 3D convolution section into a 2D convolution. Since the convolution section is accelerated, we can also omit the data exchange among nodes. Finally, we achieve a 77 % reduction in execution time compared with the baseline program.

The rest of the paper is organized as follows. Section II explains related studies. Section III explains the methodology for examining optical illusion and the mathematical model for the visual simulation. Section IV describes our parallel implementation and acceleration methods. Section V shows a preliminary experiment and its results. Section VI presents evaluation results and discusses performance. Finally, Section VII concludes this paper.

978-0-7695-4893-7/12 $26.00 © 2012 IEEE. DOI 10.1109/ICNC.2012.27

II. RELATED WORK
Our study relates to the fields of HPC (High Performance Computing) and human visual simulation. We summarize some strongly related studies in these two fields.

RIKEN reports an implementation of a neural simulator [3]. The simulator's target area is the so-called V1 field in the human brain, which relates to the basic part of the visual system. The US Air Force also studies a
neural simulator using a PS3 cluster [4]. This study conducts V1 model simulation with 265 million neurons incorporated across 1.6 million cortical columns. In comparison with these two studies, this paper differs in the motivation for visual simulation. Our motivation is clarifying the functional mechanism of optical illusion.

Fig. 1. An image called rotating snake, which is popular for causing optical illusion.

Optical illusion is caused by a mechanism of optical flow in the human visual system. Among physiological works, there are some experiments on optical illusion [5]. On the other hand, numerical simulation is expected to clarify the theoretical mechanism of visual functions. However, computer simulation is time consuming because of its vast amount of calculation. This is one of the reasons that prevent detailed computational simulation of the human brain. Hence, we adopt parallel processing to accelerate human visual simulation.

Our improvements in visual simulation have been made in the following order.
1) 1D optical illusion simulation system [6]
2) 2D optical flow simulation system [1]
3) 2D optical illusion simulation system (this paper)

The first study simplifies input images into single-dimensional (1D) data, and treats small involuntary eye movement as optical flow in a wide sense. It is considered that the human brain has the ability to cancel this small eye movement in order to perceive objects in a state of rest. We demonstrated that a moving pattern of 1D input brightness generated from the rotating snake image can be perceived with larger peak values of optical flow than the actual speed. The second study extends the first simulator to treat general two-dimensional (2D) movies. It shows scalable performance in computing optical flow on a GPU-accelerated PC cluster. Based on our previous works, this paper presents an analysis of optical illusion using more realistic 2D inputs compared to [1], as well as conducting high-performance visual simulation.

III. OPTICAL ILLUSION
A. Mechanism of optical illusion
When a human watches objects, the eyes are actually not at a standstill.
Eyes are constantly moving, and this action is called small involuntary eye movement. Although our sight is expected to move constantly because of this eye movement, stationary objects are properly perceived as keeping still. The reason is that the human visual system corrects the information from the retina with the optical flow detected in the brain. However, in the case of optical illusion, the visual system may be deceived by detecting erroneous optical flow. As shown in Fig. 2, incorrect optical flow leads to the optical illusion. Based on this assumption, our methodology for analyzing the optical illusion is to compare V_eye and V_detect, where V_eye is the moving speed of the input patterns given to the visual simulator and V_detect is the optical flow computed by the simulator.

Fig. 2. A methodology to analyze optical illusion.

B. Mathematical model for analysis of optical illusion
Here we explain a mathematical model of the human visual system. We use a linear model which is given by a 3D convolution function. To simulate the human visual system, the convolution utilizes spatio-temporal kernels, which work as image filters. Therefore, convolutions with various kernels provide simulation capability for various kinds of image processing.

Equations (1) show a spatial kernel. g(x) describes a Gaussian function and dg(x) is its x-directional derivative. The parameter σ is a standard deviation, which determines the shape of these functions.

$$
g(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{x^2}{2\sigma^2}}, \qquad
dg(x) = -\frac{x}{\sigma^3\sqrt{2\pi}}\, e^{-\frac{x^2}{2\sigma^2}} \tag{1}
$$

Similarly, equations (2) are functions for a temporal kernel. h(t) describes a time lag of n-th order and dh(t) is its time derivative. The parameter τ is a time constant.

$$
h(t) = \frac{t^{\,n-1}}{\tau^{n}\,\Gamma(n)}\, e^{-\frac{t}{\tau}}, \qquad
dh(t) = \frac{t^{\,n-2}}{\tau^{n+1}\,\Gamma(n)}\,\bigl((n-1)\tau - t\bigr)\, e^{-\frac{t}{\tau}} \tag{2}
$$

(t: frame number, n: the number of stages in the human visual system (n = 8) [7])

Using the above-mentioned spatial and temporal kernels, our 3D kernels, K_x, K_y and K_t, are defined by equations (3). These kernels are expanded three-dimensionally as shown in Fig. 3.
This is called a separable kernel: each of its elements is calculated by multiplying three values taken from 1D arrays for the spatial coordinates x and y and the temporal coordinate t.
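As a minimal sketch of how such a separable kernel could be assembled, the following Python code samples the four 1D functions of equations (1) and (2) and expands them into 3D kernels by element-wise products. The helper names (`sample`, `separable_kernel`), the sampling grids and the values of σ and τ are illustrative assumptions, not the authors' implementation.

```python
import math

def g(x, sigma):
    """Gaussian g(x) of Eq. (1)."""
    return math.exp(-x * x / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def dg(x, sigma):
    """x-directional derivative dg(x) of Eq. (1)."""
    return -x / (sigma ** 3 * math.sqrt(2.0 * math.pi)) * math.exp(-x * x / (2.0 * sigma ** 2))

def h(t, tau, n=8):
    """n-th order time lag h(t) of Eq. (2)."""
    return t ** (n - 1) / (tau ** n * math.gamma(n)) * math.exp(-t / tau)

def dh(t, tau, n=8):
    """Time derivative dh(t) of Eq. (2)."""
    return t ** (n - 2) / (tau ** (n + 1) * math.gamma(n)) * ((n - 1) * tau - t) * math.exp(-t / tau)

def sample(f, lo, hi, step):
    """Sample f on the grid lo, lo + step, ..., hi (hypothetical helper)."""
    count = round((hi - lo) / step) + 1
    return [f(lo + i * step) for i in range(count)]

def separable_kernel(ax, ay, at):
    """Expand three 1D arrays into a 3D kernel indexed as K[t][y][x]."""
    return [[[vx * vy * vt for vx in ax] for vy in ay] for vt in at]

# Assumed, illustrative parameters and sampling grids.
SIGMA, TAU = 0.9, 1.1
gx = sample(lambda x: g(x, SIGMA), -3.0, 3.0, 0.2)
dgx = sample(lambda x: dg(x, SIGMA), -3.0, 3.0, 0.2)
ht = sample(lambda t: h(t, TAU), 0.0, 29.0, 1.0)
dht = sample(lambda t: dh(t, TAU), 0.0, 29.0, 1.0)

K_x = separable_kernel(dgx, gx, ht)   # dg(x) * g(y) * h(t)
K_y = separable_kernel(gx, dgx, ht)   # g(x) * dg(y) * h(t)
K_t = separable_kernel(gx, gx, dht)   # g(x) * g(y) * dh(t)
```

Because each kernel is a product of three 1D arrays, only the 1D arrays need to be stored; the full 3D array is expanded here only to make the structure explicit.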
Fig. 3. An example of 3D kernel (K_x(x, y, t)).

$$
\begin{aligned}
K_x(x, y, t) &= dg(x)\, g(y)\, h(t) \\
K_y(x, y, t) &= g(x)\, dg(y)\, h(t) \\
K_t(x, y, t) &= g(x)\, g(y)\, dh(t)
\end{aligned} \tag{3}
$$

Input data of our visual simulator are movie data [1]. To simulate eye movement, we simply move an input image to the right or to the left at a fixed speed. In our experiments, the movement speed is 1 pixel per second (pixel/sec) in the right (dx = 1) or the left direction (dx = −1). Thus, pixels or brightness patterns generated from input images are shifted right or left.

This optical flow calculation method requires the three differentials ∂I/∂x, ∂I/∂y and ∂I/∂t (henceforth I_x, I_y and I_t, respectively), where I(x, y, t) is the brightness value at coordinates x, y and t. To simulate the human visual system, our simulation system calculates the differentials by convolutions of the input I with a kernel K.

$$
I_x(x, y, t) = \sum_{i=-n/2}^{n/2}\; \sum_{j=-n/2}^{n/2}\; \sum_{k=0}^{m} K_x(i, j, k)\, I(x-i,\, y-j,\, t-k) \tag{4}
$$

(n: spatial size of kernel, m: temporal size of kernel)

Similarly, I_y and I_t are calculated from convolutions of K_y and I or K_t and I.

Optical flow is calculated by the Lucas-Kanade method [8]. The details are shown in equation (5). q_1, q_2, ..., q_n, which appear on the right side of this equation, are the spatio-temporal coordinates of the pixels inside the window, which is the summation range.

$$
\begin{bmatrix} u \\ v \end{bmatrix} =
\begin{bmatrix}
\sum_i I_x(q_i)^2 + \epsilon & \sum_i I_x(q_i)\, I_y(q_i) \\
\sum_i I_x(q_i)\, I_y(q_i) & \sum_i I_y(q_i)^2 + \epsilon
\end{bmatrix}^{-1}
\begin{bmatrix}
-\sum_i I_x(q_i)\, I_t(q_i) \\
-\sum_i I_y(q_i)\, I_t(q_i)
\end{bmatrix} \tag{5}
$$

(u: x-directional value of optical flow, v: y-directional value of optical flow, ε: the parameter for avoiding the aperture problem [9])

IV. IMPLEMENTATION
Although optical flow can be computed similarly to [1], the required kernel size and resolution of convolution differ as follows.

Wide spatio-temporal convolution size: In our simulation program, the temporal convolution size is 30 frames. This temporal size is much larger than in general image processing applications and contributes to supporting various kernel types.

High precision of convolution: In order to apply a precise kernel value to the brightness value at each x-dimensional coordinate, it is necessary to compute the convolution at a very small interval. As shown in Fig. 4, we compute it at every 0.2 pixel interval so that a Gaussian kernel draws a smooth curve. The same small interval is applied to the y-direction kernel. The intervals for each direction are shown in Table I.

Fig. 4. Correspondence of Gaussian kernel values to x-coordinates in a case of 0.2 pixel pitch.

TABLE I. RANGES AND INTERVALS OF A KERNEL.
              min     max     interval
x-direction   -3.0    3.0     0.2
y-direction   -3.0    3.0     0.2
t-direction    0.0    29.0    1.0

Table I shows the parameters used to realize a realistic kernel for convolution in our simulation. The range and sampling interval along each of the three dimensions are shown as min, max, and interval, respectively.

Accordingly, the convolution in our simulation requires a large amount of calculation. The convolutions consist of multiplications and summations over the 3D kernel (31x31x30) and the corresponding pixel values. Likewise, the optical flow requires summations and matrix operations. The summations add up adjacent convolution results (I_x, I_y, I_t) within a window (15x15). The matrix operations need several calculations, as shown in equation (5). Equation (6) shows the number of floating-point operations (fp) required per pixel to compute optical flow using the 3D kernel in our simulation program. This huge amount of computation dominates the simulation time.

$$
\underbrace{\{4 \times (31 \times 31 \times 30)\} \times 3}_{\text{\# of fp in convolutions}}
+ \underbrace{\{5 \times (15 \times 15) + 13\} \times 2}_{\text{\# of fp in calculations for optical flow}}
= 348236 \tag{6}
$$

For example, when the input resolution is 32x32, the total amount of calculation reaches 3.3 × 2^30 operations.

We use the implementation in [1] as a baseline. A main characteristic of the baseline is parallel processing on a GPU-accelerated PC cluster. It divides the 2D input frame data into as many regions as there are nodes in the PC cluster. Then, the dominant computation of convolution to obtain the optical flow is performed on a GPU in each node. Next, we explain the basic procedure of the baseline; after that we introduce 3 additional accelerated implementations to analyze the
optical illusion.

Fig. 5. A concept of avoiding fetching frame data from outside of the image.
Fig. 6. A concept of GPU eye movement simulation.
Fig. 7. A concept of packing convolution.

TABLE II. A COMPARISON OF THE NUMBER OF FP OPERATIONS.
Number of floating-point calculations [×2^30]
input image resolution    baseline    packed convolution
24x24                     181         19
48x48                     727         74
72x72                     1635        168

A. An execution work flow
The operation flow of our baseline simulation program is as follows.
1) Initialization: The root node loads input movie data from a local directory. The input data are divided into mesh partitions and scattered to the other nodes. Each node transfers the received input data to its corresponding GPU.
2) Convolution: Convolutions are performed by the GPU at each node.
3) Data exchange: Wing area data are exchanged between neighboring nodes after the convolution.
4) Optical flow: Each node calculates optical flow values by referring to the convolution data.
5) Gather: Finally, the root node gathers the calculated optical flow results from all other nodes and generates the final optical flow.

B. Simulate eye's movement by GPU (gpu_eye_sim)
This acceleration, called gpu_eye_sim, simulates eye movement on the GPU, as illustrated in Fig. 6. gpu_eye_sim requires no movie data but only image data as input. In gpu_eye_sim, the GPU simulates eye movement by moving a region of interest (ROI) and fetching frame data from the moved ROI. To avoid fetching frame data from outside of the image, the image is provided with enough margin (Fig. 5). All processes except the movie generation process are the same in the baseline program and gpu_eye_sim.

The baseline program requires large GPU memory regions to store the input movie data. In contrast, in this experiment we use only static images and emulate their movement inside the GPU. This aims to improve the efficiency of GPU memory access. Besides, gpu_eye_sim reduces the communication overhead caused by scattering input data frame by frame from the root node.

C. Packing the 3D convolution into 2D (packed_conv)
As shown in Fig.
7, the convolution for the optical illusion simulation has some specific properties. The input frame data are simple pictures that shift a specific brightness pattern along the x-dimension. The 3D convolution is a collection of simple 2D convolutions. Taking these characteristics into consideration, the 3D kernel data are integrated into 2D so that the convolution can be performed with a reduced amount of calculation.

In Fig. 7, green and red lines represent kernel and input data, respectively. The left side of the figure shows the collection of 2D convolutions of the input frames with the kernel data. As shown in the middle part of Fig. 7, this collection of convolutions is equivalent to calculating convolutions of the input image with shifted kernel data. Finally, this collection of shifted kernel data can be packed into a single 2D kernel. In this way, the algorithm packs the 3D convolution into a 2D one. As shown in Fig. 8, the packed kernel is generated by superposing the shifted 2D kernels. The superposition is done by matrix addition of the corresponding kernel data elements.

Table II compares the number of operations required by the baseline and by the packed convolution. The latter reduces the number of calculations to approximately 1/10.
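The packing idea can be checked on a toy example. The sketch below assumes that frame t is the still image shifted by an integer number of samples, I(x, y, t) = image(x − v·t, y), and that pixels outside the image are zero (standing in for the margin). Under these assumptions the 3D sum Σ K(i, j, k) · I(x − i, y − j, t − k) equals a 2D sum over the still image with the packed kernel Kp(i', j) = Σ_k K(i' + v·k, j, k). All function names and sizes are hypothetical and chosen for illustration only.

```python
import random
from collections import defaultdict

def pixel(image, x, y):
    """Brightness at (x, y); zero outside the image (stands in for the margin)."""
    if 0 <= y < len(image) and 0 <= x < len(image[0]):
        return image[y][x]
    return 0.0

def conv3d_at(image, kernel, v, x, y, t):
    """Direct 3D convolution at (x, y, t) for frames I(x, y, t) = image(x - v*t, y)."""
    r = len(kernel[0]) // 2  # spatial radius of the (square) kernel slices
    total = 0.0
    for k, plane in enumerate(kernel):
        for j, row in enumerate(plane):
            for i, w in enumerate(row):
                total += w * pixel(image, x - (i - r) - v * (t - k), y - (j - r))
    return total

def pack_kernel(kernel, v):
    """Superpose the 2D slices, slice k shifted by -v*k along x, into one 2D kernel."""
    r = len(kernel[0]) // 2
    packed = defaultdict(float)
    for k, plane in enumerate(kernel):
        for j, row in enumerate(plane):
            for i, w in enumerate(row):
                packed[(i - r - v * k, j - r)] += w
    return dict(packed)

def conv2d_at(image, packed, v, x, y, t):
    """2D convolution of the still image with the packed kernel."""
    x0 = x - v * t  # position of (x, t) in the coordinates of the still image
    return sum(w * pixel(image, x0 - dx, y - dy) for (dx, dy), w in packed.items())

# Toy data: a 6x8 image and a kernel with 4 temporal slices of 3x3 (assumed sizes).
rng = random.Random(0)
image = [[rng.random() for _ in range(8)] for _ in range(6)]
kernel = [[[rng.random() for _ in range(3)] for _ in range(3)] for _ in range(4)]
V = 1  # shift per frame in sample units (assumed to be an integer)
packed = pack_kernel(kernel, V)
```

In this toy case the direct sum uses 3 × 3 × 4 = 36 products per output value, while the packed kernel has 18 entries. The ratio for the kernel sizes used in the paper depends on the actual shift per frame, which this sketch does not model.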
Fig. 8. A conceptual illustration of packing the kernel (more detail).
Fig. 9. Parameter tuning to compute reasonable optical flow.
Fig. 10. Result of the kernel check test. The yellow area shows the combinations of τ and σ that passed. The parameter tuning narrows down the τ and σ parameter range to 3 %.

D. Omitting data exchange (no_exchange)
Since the packed convolution reduces the amount of calculation, communication among nodes in the PC cluster becomes a critical part. In general, gather collective communication is performed optimally by the OpenMPI library [10]. Therefore, we optimize the data exchange section. In this optimization, the dynamic exchange of wing data among neighbors is omitted entirely; instead, the required convolution results are computed on each node by distributing the input data so that the wing areas overlap between adjacent nodes.

V. EXPERIMENTS
A. Parameter Tuning
A preliminary experiment was carried out to verify that our simulation system can compute a correct optical flow. To do so, parameter tuning is required to decide proper kernels.

First, parameter tuning was done for τ and σ, which appear in equations (1) and (2). As shown in Fig. 9, we prepared a simple image whose brightness pattern is inclined along the x-dimension, then created a movie by shifting the edge right or left at a fixed speed of ±1 pixel/sec. Optical flow for the inputs was computed while varying τ and σ comprehensively. The computed optical flow was compared with the input speed, that is ±1 pixel/sec, and the error between them was obtained.

Fig. 10 shows the result of the parameter tuning. The yellow area in this figure is the range of τ and σ that gives an error of less than ±1 %. We use parameter values in this area for the rest of the experiments.

Fig. 11. An input image generated from the rotating snake (rotating snake).
Fig. 12. A different pattern of brightness compared to the rotating snake (non-rotating snake).

B. Analysis of optical illusion
After the parameter tuning, we use two types of input movies generated from the rotating snake (Fig. 11) and a slightly different one (Fig. 12), by moving them at 1 pixel/sec in the right or left
direction along the x-dimension. Note that the former, hereafter called rotating snake, is known to cause optical illusion, and the latter, called non-rotating snake, does not. Then, we compute the optical flow values for each image and compare their peak values for the following four cases.

case 1: A peak value when rotating snake is shifted to the left.
case 2: A peak value when non-rotating snake is shifted to the left.
case 3: A peak value when rotating snake is shifted to the right.
case 4: A peak value when non-rotating snake is shifted to the right.

Table III shows the optical flow obtained when we use the parameters τ = 1.1 and σ = 0.9, which are selected from the combinations in Fig. 10. Fig. 13 illustrates the variation of the optical flow value at each coordinate x.

V_r−: A peak value of optical flow in case 1.
V_n−: A peak value of optical flow in case 2.
V_r+: A peak value of optical flow in case 3.
V_n+: A peak value of optical flow in case 4.

There is an important finding. The absolute values of V_r+ and V_r− are much smaller than those of V_n+ and V_n−. It could be a reason
to cause optical illusion, because the cancellation of the small involuntary eye movement in the human brain cannot completely negate the movement for a particular pattern like the rotating snake. Namely, when it cancels 0.52 pixel/sec in case 3, the rotating snake is perceived to move at 0.48 pixel/sec (V_illusion), as shown in equation (7). Similarly, the rotating snake is perceived to move at −0.49 pixel/sec (V_illusion) in case 1.

$$
\underbrace{1.0}_{V_{eye+}} - \underbrace{0.52}_{V_{detected}\,(=V_{r+})} = \underbrace{0.48}_{V_{illusion}} \tag{7}
$$

Fig. 13. Comparisons of optical flow values in cases shifted to the right direction (V_r+, V_n+ and V_eye+) and to the left direction (V_r−, V_n− and V_eye−). (a) Optical flow values in a case shifted to the right direction. (b) Optical flow values in a case shifted to the left direction. Horizontal axis: x; vertical axis: optical flow [pixel/sec].

TABLE III. COMPARISON OF PEAK VALUES OF THE OPTICAL FLOW
Peak value of optical flow [pixel/sec]
τ      σ      V_n+    V_r+    V_n−     V_r−
1.1    0.9    0.98    0.52    -0.98    -0.51

VI. PERFORMANCE OF PARALLEL EXECUTION
This section discusses the performance of the visual simulation programs on a GPU-accelerated PC cluster. We use a 16-node cluster with a GPU on each node. Table IV shows the hardware and software specifications of the cluster nodes. We measure the execution time and throughput (GFLOPS) of the visual simulation programs when computing the optical flow for three sizes of input movie data. From the results, we found that the performance difference between gpu_eye_sim and the baseline program is negligible. Thus, hereafter we show the results of 3 implementations: gpu_eye_sim, packed_conv and no_exchange.

A. Comparison between gpu_eye_sim and packed_conv
The performance improvement of packed_conv over gpu_eye_sim is shown in Fig. 14. The main reason for this speedup is the reduced number of operations needed to compute the convolutions. We notice that a 70 % reduction of the execution time is attained for the case of 72x72 frames on 16 nodes, compared with the execution time of gpu_eye_sim.
On the other hand, the throughput of packed_conv drops compared with gpu_eye_sim. The main reason for this degradation is the reduced number of operations needed to compute the convolutions. The second reason is the communication overhead of the gather and exchange sections in packed_conv.

TABLE IV. EXPERIMENTAL ENVIRONMENT.
CPU                                 Intel Xeon Quad-Core CPU W3520
  Clock speed                       2.67 GHz
  Memory                            6 GB
GPU                                 NVIDIA C1060 (GT200 architecture)
  Clock speed                       1.296 GHz
  Number of Streaming Processors    240
  Peak performance                  933 GFLOPS
  Memory                            4 GB
  Memory bandwidth                  102 GB/sec
Graphics bus                        PCI Express x16 Generation 2.0
OS                                  CentOS 5.3
C Compiler                          Intel C compiler 11.1
CUDA                                CUDA Toolkit 3.2

TABLE V. A BREAKDOWN OF EACH SECTION OF gpu_eye_sim AND packed_conv. THE DATA ARE EXECUTION TIMES FOR 72X72 FRAMES USING 16 NODES. THE NUMBERS IN BRACKETS ARE PERCENTAGES OF THE TOTAL EXECUTION TIME.
section                gpu_eye_sim        packed_conv
convolution [sec]      0.1399 (83.9 %)    0.0234 (46.7 %)
data exchange [sec]    0.0031 ( 1.9 %)    0.0035 ( 7.0 %)
optical flow [sec]     0.0007 ( 0.4 %)    0.0007 ( 1.4 %)
gather [sec]           0.0229 (13.8 %)    0.0225 (44.9 %)

Table V is a breakdown of each section of gpu_eye_sim and packed_conv. In packed_conv, the percentages of the communication sections (data exchange and gather) are higher than in gpu_eye_sim. This higher share of communication indicates that communication overhead degrades the performance of packed_conv.

B. Comparison between packed_conv and no_exchange
Fig. 16 shows the performance improvement of no_exchange over packed_conv. The improvement increases with the number of nodes, since frequent data exchange among nodes leads to larger communication overhead. In addition, as shown in Fig. 17, omitting the data exchange has a more significant effect as the image size becomes smaller.

C. Comparison between all acceleration plans
Fig. 18 shows the overall result of the evaluation experiment. As shown in the figure, scalability can be confirmed for each acceleration plan. However, the effects of these acceleration plans gradually become limited.
This result is caused by the lack of improvement in the gather section.

VII. CONCLUSION
The achievements of our study have two perspectives: a simulation of the human visual system and parallel acceleration with MPI and CUDA.

In this paper, we have considered a mechanism that causes optical illusion based on numerical simulation of visual neurons. Our outcomes are twofold. First, we found that the peak values of optical flow became much larger than the moving speed of the input scene when an input pattern (rotating snake) which causes the optical illusion is used. Second, the absolute values of the optical flow in the right and left directions are considerably different for the rotating snake input. These two results are considered to be reasons why the cancellation mechanism for small involuntary eye movement in the human brain cannot work well for the rotating snake.

Another contribution of this paper is showing acceleration techniques for the numerical simulation of visual neurons on a
GPU-accelerated PC cluster, especially as a case study for the analysis of optical illusion. Finally, we achieved a 77 % reduction in execution time compared with the baseline program for the input size of 72x72, using 16 nodes.

Fig. 14. Performance comparison between gpu_eye_sim and packed_conv for 72x72, 48x48 and 24x24 inputs. (a) Execution time [sec]. (b) Throughput [GFLOPS].
Fig. 15. Performance comparison between packed_conv and no_exchange for 72x72, 48x48 and 24x24 inputs. (a) Execution time [sec]. (b) Throughput [GFLOPS].
Fig. 16. Performance growth rate [%] between packed_conv and no_exchange for 72x72, 48x48 and 24x24 inputs.
Fig. 17. Transition of the percentage of data exchange in packed_conv in a case using 16 nodes.
Fig. 18. Whole evaluation experiment result (execution time) in a 16-node case.

ACKNOWLEDGMENT
This research is supported in part by JSPS Grants-in-Aid for Scientific Research (C) Nos.22542 and 245371.

REFERENCES
[1] J. Ohmura et al., "Multi-GPU acceleration of optical flow computation in visual functional simulation," 3rd International Workshop on Parallel and Distributed Algorithms and Applications, pp. 228–234, 2011.
[2] H. Markram, "The blue brain project," Neuroscience, vol. 7, pp. 153–160, 2006.
[3] H. Sasaki, S. Satoh, and S. Usui, "Neural implementation of coarse-to-fine processing in V1 simple neurons," Neurocomputing, vol. 73, pp. 867–873, 2010.
[4] R. E. Pino, M. Moore, J. Rogers, and Q. Wu, "A columnar V1/V2 visual cortex model and emulation using a PS3 Cell-BE array," 2011, pp. 1667–1674.
[5] I. Kuriki, H. Ashida, I. Murakami, and A.
Kitaoka, "Functional brain imaging of the rotating snakes illusion by fMRI," Journal of Vision, vol. 8, pp. 1–10, 2008.
[6] Y. Sato, S. Satoh, T. Miyoshi, H. Irie, and T. Yoshinaga, "Parallel numerical simulation for the linear model of visual neurons with MPI," SIG Technical Report (IPSJ), vol. 2011-HPC-129, pp. 1–8, 2011, (in Japanese).
[7] S. Shunji and U. Shiro, "Fractional derivative of Gaussian functions: A model for spatio-temporal receptive fields of V1 simple cells," IEICE technical report, vol. 108, pp. 141–146, 2009, (in Japanese).
[8] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," Proceedings of the Seventh International Joint Conference on Artificial Intelligence (IJCAI-81), pp. 674–679, 1981.
[9] Y. Weiss, E. P. Simoncelli, and E. H. Adelson, "Motion illusions as optimal percepts," Nature Neuroscience, vol. 6, pp. 598–604, 2002.
[10] "OpenMPI: Open source high performance computing," http://www.unidata.ucar.edu/software/netcdf/.