Implementation of emulated digital CNN-UM architecture on programmable logic devices and its applications

Theses of the Ph.D. dissertation
Zoltán Nagy
Scientific adviser: Dr. Péter Szolgay
Doctoral School of Information Sciences
University of Pannonia
Veszprém, 2007

Introduction

Although the scaling-down of semiconductor devices addresses the problem of increasing computational needs, some problems remain difficult to solve on traditional digital computers. Typical examples are pattern recognition, data organization, clustering and the solution of partial differential equations. Neural networks have proved to be more suitable for these applications than digital computers, but they are not used extensively in industrial applications because of the imperfections of neural hardware. The most important drawback of a general neural network is that quick reprogramming is not possible, which restricts its use to very specific applications. Additionally, assuming a fully connected neural network is a major obstacle to implementation because the complexity increases exponentially with the number of processors. Cellular Neural Networks (CNN) solve this interconnection bottleneck by arranging the processing elements in a square grid and connecting each cell only to its local neighborhood. This approach makes it possible to integrate a large number of analog processors on a single chip. CNN was found to be very efficient in real-time image and signal processing tasks where the computation is carried out by some kind of spatio-temporal phenomenon. However, the limited accuracy of current analog VLSI CNN chips does not make it possible to solve partial differential equations accurately enough to use the results in engineering applications. These limitations can be overcome by using a digital architecture to emulate the CNN dynamics, but the speed of such architectures is an order of magnitude lower than that of their analog counterparts. Designing a full-custom digital VLSI architecture is very time consuming and costly, especially when only a small number of chips are manufactured. The development costs of an emulated digital CNN architecture can be reduced by using programmable devices during the implementation.
The main advantage of reconfigurable devices is that they make it possible to design and implement a digital architecture without any concern for the manufacturing technology. Additionally, technology changes become easier because only small portions of the design have to be redesigned, or no redesign is required at all.

Research and applied methods

In the course of my work I investigated how an emulated digital CNN-UM architecture can be implemented on reconfigurable devices. During this exploration I aimed to develop and use design techniques that make it possible to design an emulated digital CNN-UM with configurable computational precision. The capabilities of the basic CNN-UM architecture were extended to support arbitrary sized templates and to emulate multi-layer CNN arrays. The architecture was optimized for both area and speed. The ModelSim VHDL simulator from Model Technology was used to simulate the different emulated digital CNN-UM architectures, while the Foundation ISE development system from Xilinx was used for the FPGA implementation. The implemented CNN-UM processors were tested on the XSV-300 prototyping board from XESS Corporation. I also investigated the computational precision required to solve different partial differential equations on emulated digital CNN-UM architectures. Several numerical methods based on the finite difference method were used to solve the partial differential equations, and the solutions were computed with proprietary software. The CNN-UM architectures optimized for solving partial differential equations were implemented using the C-based Handel-C high-level hardware description language and the DK development system. The DK development system makes it possible to synthesize behavioral models described in Handel-C, which increased the flexibility and shortened the design time compared to the traditional RTL-level VHDL approach. The emulated digital processors optimized for solving partial differential equations were implemented on the RC-200 prototyping board from Celoxica.

New scientific results

1. Thesis: Feasibility of emulated digital CNN-UM processor implementations on FPGA circuits

The CASTLE emulated digital CNN-UM architecture, which was designed at SZTAKI, makes it possible to emulate the CNN dynamics with different computational precisions (1, 6 and 12 bits). The computing performance can be increased by reducing the precision, but only a small portion of the chip is used in the low-precision modes. The predefined precisions are appropriate for general image processing tasks, but higher accuracy is often required, e.g. in the modeling of biological systems or the solution of partial differential equations. On recent analog and digital VLSI CNN-UM implementations only nearest-neighbor templates can be used. Larger templates can be used after template decomposition, but not every CNN template can be decomposed. In these cases software simulation must be used to compute the CNN dynamics, but its performance is very low due to the increased computing requirements. Complicated biological and physical systems can be modeled very efficiently with multi-layer CNNs. However, the analog VLSI CNN implementations either cannot be used in multi-layer applications or their accuracy is not satisfactory. Thus software simulation is required for the analysis of multi-layer CNN dynamics, which is very slow, especially when the array size is large or the time constants of the layers are very different. To solve these problems a new emulated digital CNN-UM family called Falcon was developed.
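To make concrete what emulating the CNN dynamics involves, the following minimal sketch integrates the standard Chua-Yang CNN state equation, dx/dt = -x + A*y + B*u + z, with the forward Euler method for a nearest-neighbor (3 × 3) template, and optionally rounds the state to a given number of fractional bits. The function names, the zero boundary condition, the time step and the rounding scheme are illustrative assumptions; this is not a description of the CASTLE or Falcon datapath.

```python
def cnn_output(x):
    """Standard piecewise-linear CNN output: y = 0.5 * (|x + 1| - |x - 1|)."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def apply_template(grid, tmpl, i, j):
    """Weighted sum over the 3x3 neighborhood of cell (i, j); cells outside the array count as 0."""
    rows, cols = len(grid), len(grid[0])
    acc = 0.0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            r, c = i + di, j + dj
            if 0 <= r < rows and 0 <= c < cols:
                acc += tmpl[di + 1][dj + 1] * grid[r][c]
    return acc


def euler_step(x, u, A, B, z, h=0.1, frac_bits=None):
    """One forward Euler step of dx/dt = -x + A*y + B*u + z for every cell."""
    y = [[cnn_output(v) for v in row] for row in x]
    x_new = []
    for i, row in enumerate(x):
        new_row = []
        for j, v in enumerate(row):
            dx = -v + apply_template(y, A, i, j) + apply_template(u, B, i, j) + z
            v_new = v + h * dx
            if frac_bits is not None:
                # Assumed fixed-point model: round to frac_bits fractional bits.
                scale = 1 << frac_bits
                v_new = round(v_new * scale) / scale
            new_row.append(v_new)
        x_new.append(new_row)
    return x_new


if __name__ == "__main__":
    ZERO = [[0.0] * 3 for _ in range(3)]
    A = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]  # self-feedback only
    x = [[0.5, -0.5], [-0.25, 0.75]]
    u = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(300):
        x = euler_step(x, u, A, ZERO, 0.0, h=0.1, frac_bits=12)
    print(x)  # each cell settles near +2 or -2, matching the sign of its initial state
```

Passing a different `frac_bits` value gives a quick way to see how a coarser state representation changes the trajectory of the same template.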
I have shown that the FPGA implementation of the Falcon emulated digital CNN-UM has orders of magnitude higher computing performance than software simulation running on a 3.0 GHz Pentium 4 processor. The capabilities of the Falcon emulated digital CNN-UM were extended to make the application of arbitrary sized templates and the emulation of multi-layer CNNs possible.

1.1. Implementation and optimization of a configurable emulated digital CNN-UM processor on Xilinx FPGA circuits

Based on the CASTLE emulated digital CNN-UM architecture I have designed a new configurable emulated digital CNN-UM processor and optimized it for FPGA circuits. With this new architecture, called Falcon, arbitrary sized CNN arrays can be emulated with configurable computing precision. The main parameters of the processor, such as the width of the cell array, the bit width of the state, input and template values, the number of space-variant templates, and the number and arrangement of the processor cores in the architecture, can be set in the synthesizable RTL description. By changing these parameters the size and performance of the Falcon architecture can be optimized for the given application. I have shown that the clock frequency of the Falcon architecture is 147-429 MHz, depending on the computing precision, when implemented on Xilinx Virtex-II Pro FPGAs. A new cell value is computed in 3 clock cycles, thus the performance of the processor is 49-143 million cell iterations/s. I have shown that this computing performance is 3.5-10.4 times higher than that of a Pentium 4 processor running at a 3.0 GHz clock frequency. The performance of the Falcon architecture can be further increased by using more processing elements. The number of processor cores that can be implemented on the largest Virtex-II Pro 125 FPGA is 11-185, depending on the computing precision.

1.2. Implementation of a CNN-UM for arbitrary sized templates

I have worked out a new method to run arbitrary sized templates on emulated digital architectures. I have designed a new emulated digital architecture in which the template size can be configured in the synthesizable RTL-level description. The number of functional units changes automatically according to the configuration parameters in the RTL description. For an n × n template, n multipliers are required, which can compute a new cell value in n clock cycles. The control unit of the processor adapts automatically to the length of the different iteration cycles. I have shown that the larger number of functional units does not influence the operating speed: the clock frequency is independent of the template size, and 147-429 MHz can be achieved on Virtex-II Pro FPGAs. Due to the longer iteration cycle, the performance of the Falcon architecture decreases to 29-85 million cell iterations/s in the case of 5 × 5 templates. I have shown that this computing performance is 3.3-9.8 times higher than that of a Pentium 4 processor running at a 3.0 GHz clock frequency. The increased number of functional units reduces the number of implementable processors. In the case of 5 × 5 templates, 6-111 processors can be implemented on the Virtex-II Pro 125 FPGA, depending on the computing precision.

1.3. Implementation of a multi-layer CNN-UM

I have extended the capabilities of the Falcon emulated digital CNN-UM architecture to emulate multi-layer CNN cell arrays. The new architecture emulates a fully connected multi-layer CNN, thus every layer is connected to every other layer with templates of globally configurable size.
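The throughput figures quoted in theses 1.1 and 1.2 follow from dividing the clock frequency by the number of clock cycles needed per cell update. The short sketch below reproduces this arithmetic. The function name is hypothetical, and the linear scaling with the number of cores is an idealized assumption that ignores memory bandwidth limits.

```python
def cell_updates_per_second(clock_hz, template_size, cores=1):
    """Cell updates per second when an n x n template takes n clock cycles per cell."""
    cycles_per_cell = template_size  # n multipliers working for n clock cycles
    return cores * clock_hz / cycles_per_cell


if __name__ == "__main__":
    for n in (3, 5):
        low = cell_updates_per_second(147e6, n) / 1e6
        high = cell_updates_per_second(429e6, n) / 1e6
        print(f"{n} x {n} template: {low:.1f} to {high:.1f} million cell iterations/s")
```

For 3 × 3 templates this gives 49 to 143 million cell iterations/s, and for 5 × 5 templates about 29.4 to 85.8 million, consistent with the 29-85 million quoted for a single processor.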
The multi-layer Falcon architecture is constructed from the main elements of the single-layer processor. In the case of r layers, r memory units and r interconnected arithmetic units are required for each layer (r × r altogether). The number of clock cycles required to compute a new cell value is independent of the number of layers and depends only on the template size. The area requirement of the multi-layer processor is greatly increased by the many inter-layer connections. In the case of a 3-layer network with 3 × 3 templates the number of implementable processors is 1-20 on the Virtex-II Pro 125 FPGA. I have shown that this area increase does not affect the operating frequency, and 147-429 MHz can be achieved. In the case of a 3-layer network with 3 × 3 templates the multi-layer Falcon processor is 49-143 times faster than a Pentium 4 processor running at a 3.0 GHz clock frequency.

1.4. Area optimization of the Falcon emulated digital CNN-UM architecture on FPGAs by using distributed arithmetic

I have designed an area-optimized version of the arithmetic unit of the Falcon emulated digital CNN-UM architecture by using distributed arithmetic. This architecture can be used to run space-invariant templates. I have shown that the area requirement of the optimized arithmetic unit is about 40% smaller while its computing performance is unchanged. Additionally, the new arithmetic unit is more scalable than the conventional one. With the conventional arithmetic unit and an n × n template, the template operation can be carried out by using 1, n or n² multipliers, and the computation takes n², n or 1 clock cycles, respectively. I have shown that with distributed arithmetic the cycle time depends on the precision of the state variable; for example, if the precision is 12 bits, the cycle time can be 1, 2, 3, 4, 6 or 12 clock cycles.

2. Thesis: Using application-specific emulated digital CNN-UM in the solution of partial differential equations

The solution of partial differential equations (PDE) has long been one of the most important fields of mathematics, due to the frequent occurrence of spatio-temporal dynamics in many branches of physics, engineering and other sciences. The array structure and local connectivity of the CNN paradigm make it a natural framework for solving partial differential equations by finite differencing. In most cases, however, a multi-layer CNN is required. Additionally, in the case of some important equations, for example the Navier-Stokes equations, the interaction between the cells is nonlinear. With the recent analog VLSI CNN-UM chips only an approximation of the multi-layer behavior is possible, and the implementation of the nonlinear interactions is an additional problem. The 7-8 bit accuracy and the 128 × 128 array size of the recent analog VLSI CNN-UM chips are not sufficient in some engineering applications. With the Falcon emulated digital architecture the array size and the number of layers are not limiting factors. The accuracy of the solution, however, should be examined from a different aspect: what is the minimal precision required to obtain a correct solution? The template operators required to solve partial differential equations on CNN are usually symmetrical or space invariant, or the ratio of the template values is constant. These properties make it possible to specialize the Falcon emulated digital CNN-UM architecture for the given partial differential equation. Such specialized processors require a smaller area, and their performance can be improved significantly.
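As an illustration of how finite differencing maps a PDE onto a locally connected cell array, consider the 2D diffusion equation ∂u/∂t = D∇²u. Replacing the Laplacian with the five-point central difference yields a space-invariant, symmetrical 3 × 3 template. The sketch below is a generic textbook example with assumed names, periodic boundaries and forward Euler time stepping; it is not one of the equations solved in the dissertation.

```python
def laplace_template(D, dx):
    """Five-point finite-difference Laplacian scaled by D / dx**2, written as a 3x3 template."""
    c = D / (dx * dx)
    return [[0.0, c, 0.0],
            [c, -4.0 * c, c],
            [0.0, c, 0.0]]


def diffusion_step(u, tmpl, dt):
    """One forward Euler step of du/dt = D * laplacian(u) with periodic boundaries."""
    rows, cols = len(u), len(u[0])
    new = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    acc += tmpl[di + 1][dj + 1] * u[(i + di) % rows][(j + dj) % cols]
            new[i][j] = u[i][j] + dt * acc
    return new


if __name__ == "__main__":
    tmpl = laplace_template(D=1.0, dx=1.0)
    u = [[0.0] * 8 for _ in range(8)]
    u[4][4] = 1.0  # initial heat spot
    for _ in range(20):
        u = diffusion_step(u, tmpl, dt=0.2)  # dt * D / dx**2 <= 0.25 keeps the scheme stable
    print(sum(map(sum, u)))  # total stays close to 1.0: diffusion conserves the sum
```

Because the same nine coefficients are applied at every cell, only one template has to be stored, which is the kind of regularity that a specialized arithmetic unit can exploit.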
In these cases the conventional VHDL-based RTL-level design method is very time consuming, thus high-level synthesis methods should be used during the design of the processors.

2.1. The effect of the computing precision on the accuracy of the solution of partial differential equations

I have developed two new heuristic methods which can be used to determine the optimal computing precision for the fixed-point solution of partial differential equations and systems of ordinary differential equations. The efficiency of the methods was proved by algorithmic considerations and experiments. I have tested the new heuristic methods on different types of partial differential equations and systems of ordinary differential equations. I have shown experimentally that the new heuristic methods are general.

2.2. Application of high-level synthesis and rapid prototyping techniques in the design of partial differential equation solver architectures

I have examined the solution of two partial differential equations (tactile sensor, barotropic ocean model) and designed a new computing architecture to solve these equations, which fits well into the structure of the emulated digital CNN architecture and permits fast and efficient computation. I have introduced a new method which can be used to design specialized emulated digital architectures for the solution of partial differential equations in a fraction of the time required by conventional design methods. I have demonstrated the operation and efficiency of the method on the solution of these two partial differential equations. The architecture makes it possible to emulate locally connected cell arrays with arbitrary cell characteristics. To change the characteristics of the cell only the arithmetic unit has to be modified; with a high-level hardware description language this can be done more simply, and its simulation is orders of magnitude faster than with the conventional VHDL-based approach.

Application of the results

Shortly after the publication of the theory of Cellular Neural Networks, many analogic algorithms were published to solve a wide variety of tasks. However, the lack of an appropriate hardware platform made the practical application of the results difficult. The introduction of the first analog VLSI CNN-UM chips boosted the research and made the implementation of the theoretical results possible. Although the computing performance of the analog VLSI CNN-UM chips is very significant, their accuracy is inadequate in some cases. Additionally, they are very sensitive to different types of noise. To overcome these difficulties emulated digital CNN-UM architectures were designed, which are slower than their analog counterparts but have much better accuracy and noise immunity. Both solutions share a common drawback, however: only nearest-neighbor templates can be used on these architectures. The Falcon emulated digital CNN-UM architecture presented in the dissertation makes it possible to run analogic algorithms with high accuracy requirements, while its computing performance is comparable to that of the analog implementations. The configurable computing precision makes it possible to optimize the resource requirements of the different analogic algorithms. On the extended Falcon architecture arbitrary sized templates can be used and multi-layer CNN cell arrays can be emulated. The multi-layer CNN can be used in the solution of the state equations of complex dynamical systems and of partial differential equations. An example of such a dynamical system is a qualitatively correct mammalian retina model.
The usefulness of the Falcon emulated digital CNN-UM architecture is demonstrated through the solution of several different partial differential equations. Two heuristic methods are presented to determine the optimal computing precision, which makes it possible to reduce the area, power and I/O requirements of the architecture. The Falcon emulated digital CNN-UM architecture can be used very efficiently to solve problems where the dynamics of the system have to be determined with high accuracy.
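The effect of the computing precision on a solution can be explored with a simple experiment: run the same finite-difference iteration in floating point and in fixed point with a given number of fractional bits, then compare the results. The sketch below does this for a 1D diffusion iteration. It only illustrates the trade-off and is not one of the two heuristic methods of the dissertation; all names and parameter values are assumptions.

```python
import math


def quantize(v, frac_bits):
    """Round v to the nearest multiple of 2**-frac_bits (simple fixed-point model)."""
    scale = 1 << frac_bits
    return round(v * scale) / scale


def diffuse(u, steps, lam=0.25, frac_bits=None):
    """Explicit 1D diffusion with periodic boundaries; lam = D * dt / dx**2 (stable for lam <= 0.5)."""
    n = len(u)
    if frac_bits is not None:
        u = [quantize(v, frac_bits) for v in u]
    for _ in range(steps):
        # u[i - 1] wraps around for i = 0, which gives the periodic boundary.
        nxt = [u[i] + lam * (u[i - 1] - 2.0 * u[i] + u[(i + 1) % n]) for i in range(n)]
        if frac_bits is not None:
            nxt = [quantize(v, frac_bits) for v in nxt]
        u = nxt
    return u


def max_error(frac_bits, steps=50, n=32):
    """Largest deviation of the fixed-point result from the floating-point reference."""
    u0 = [math.exp(-((i - n / 2) / 4.0) ** 2) for i in range(n)]
    ref = diffuse(u0, steps)
    fix = diffuse(u0, steps, frac_bits=frac_bits)
    return max(abs(a - b) for a, b in zip(ref, fix))


if __name__ == "__main__":
    for bits in (6, 8, 12, 16, 24):
        print(bits, max_error(bits))
```

For this stable scheme the worst-case deviation is bounded by roughly (steps + 1) · 2^-(frac_bits + 1), so each additional fractional bit halves the bound. Sweeping `frac_bits` in this way is one simple means of finding the smallest word length that still meets a given error tolerance.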

The Author's Publications

Journal papers

[1] Z. Nagy, P. Szolgay, Configurable Multi-Layer CNN-UM Emulator on FPGA, IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, Vol. 50, pp. 774-778, 2003
[2] Z. Nagy, P. Szolgay, Fast and efficient multi-layer CNN-UM emulator using FPGA, Periodica Polytechnica Electrical Engineering, Vol. 47, No. 1-2, pp. 57-70, 2003
[3] P. Kozma, Z. Nagy, P. Szolgay, Seismic wave propagation modelling on emulated digital CNN-UM architecture, Periodica Polytechnica Electrical Engineering, Vol. 49, No. 3-4, pp. 183-193, 2005
[4] Z. Nagy, P. Szolgay, Solving Partial Differential Equations On Emulated Digital CNN-UM Architectures, International Journal Functional Differential Equations, Vol. 13, No. 1, pp. 61-87, 2006, ISSN: 0793-1786
[5] Z. Nagy, Zs. Vörösházi, P. Szolgay, Emulated digital CNN-UM solution of partial differential equations, International Journal of Circuit Theory and Applications, Vol. 34, Issue 4, pp. 445-470, 2006, DOI: 10.1002/cta.363

International conference papers

[6] A. Katona, Z. Nagy, The functional test of a real-time image processor model, Proceedings of INTCOM99, Budapest, Hungary, 1999
[7] A. Katona, Z. Nagy, The implementation of a real-time image processor on FPGA, Proceedings of INTCOM2000, Veszprém, Hungary, 9-14 September, 2000
[8] Z. Nagy, P. Szolgay, An emulated digital CNN-UM implementation on FPGA with programmable accuracy, Proceedings of the 4th IEEE DDECS Workshop, Győr, Hungary, April 18-20, 2001
[9] Z. Nagy, P. Szolgay, Fast and efficient multi-layer CNN-UM emulator using FPGA, Proceedings of the 3rd Conference of PhD Students in Computer Science, Szeged, Hungary, July 1-4, 2002
[10] Z. Nagy, P. Szolgay, Configurable Multi-Layer CNN-UM Emulator on FPGA, Proceedings of the 7th IEEE International Workshop on Cellular Neural Networks and their Applications, CNNA 2002, Frankfurt/Main, Germany, July 22-24, 2002
[11] Z. Nagy, P. Szolgay, Configurable multi-layer CNN-UM emulator on FPGA using Distributed Arithmetic, Proceedings of the 9th IEEE International Conference on Electronics, Circuits and Systems, Dubrovnik, Croatia, September 15-18, 2002
[12] Z. Nagy, P. Szolgay, Numerical solution of a class of PDEs by using emulated digital CNN-UM on FPGAs, Proceedings of the 16th European Conference on Circuits Theory and Design, Cracow, September 1-4, 2003
[13] Z. Nagy, Zs. Szolgay, P. Szolgay, Tactile Sensor Modeling by Using Emulated Digital CNN-UM, Proceedings of the 8th IEEE International Workshop on Cellular Neural Networks and their Applications, CNNA 2004, Budapest, Hungary, July 22-24, 2004
[14] Z. Nagy, P. Szolgay, Emulated Digital CNN-UM Implementation of a Barotropic Ocean Model, Proceedings of the International Joint Conference on Neural Networks, IJCNN 2004, Budapest, Hungary, July 25-29, 2004

[15] L. Beke, Z. Nagy, P. Szolgay, Low-cost CNN-UM global analogic programming unit implementation on FPGA, Proceedings of the 8th IEEE International Workshop on Cellular Neural Networks and their Applications, CNNA 2004, Budapest, Hungary, July 22-24, 2004
[16] Z. Nagy, Zs. Vörösházi, P. Szolgay, An Emulated Digital Retina Model Implementation on FPGA, Proceedings of the 9th IEEE International Workshop on Cellular Neural Networks and their Applications, CNNA 2005, Hsin-chu, Taiwan, May 28-30, 2005
[17] Z. Nagy, P. Szolgay, Emulated Digital CNN-UM Implementation of a 3-dimensional Ocean Model on FPGAs, Proceedings of the 8th Military and Aerospace Programmable Logic Devices International Conference, MAPLD2005, Washington DC, USA, September 7-9, 2005, http://klabs.org/mapld05/abstracts/153_nagy_a.html
[18] Z. Nagy, Zs. Vörösházi, P. Szolgay, Mammalian retina model implementation on emulated digital FPGA, Joint Hungarian-Austrian Conference on Image Processing and Pattern Recognition, ISBN 3-85403-192-0, pp. 295-302, Veszprém, 2005
[19] Z. Nagy, Zs. Vörösházi, P. Szolgay, An advanced emulated digital retina model on FPGA to implement a real-time test environment, Proceedings of the 2006 IEEE International Symposium on Circuits and Systems, ISCAS2006, Island of Kos, Greece, May 21-24, 2006
[20] Z. Nagy, Zs. Vörösházi, P. Szolgay, A Real-time Mammalian Retina Model Implementation on FPGA, Proceedings of the 10th IEEE International Workshop on Cellular Neural Networks and their Applications, CNNA2006, Istanbul, Turkey, August 28-30, 2006
[21] Zs. Vörösházi, Z. Nagy, A. Kiss, P. Szolgay, An Embedded CNN-UM Global Analogic Programming Unit implementation on FPGA, Proceedings of the 10th IEEE International Workshop on Cellular Neural Networks and their Applications, CNNA2006, Istanbul, Turkey, August 28-30, 2006
[22] Z. Kincses, Z. Nagy, P. Szolgay, Implementation of Nonlinear Template Runner Emulated Digital CNN-UM on FPGA, Proceedings of the 10th IEEE International Workshop on Cellular Neural Networks and their Applications, CNNA2006, Istanbul, Turkey, August 28-30, 2006
[23] S. Kocsárdi, Z. Nagy, S. Kostianev, P. Szolgay, FPGA Based Implementation of Water Reinjection in Geothermal Structure, Proceedings of the 10th IEEE International Workshop on Cellular Neural Networks and their Applications, CNNA2006, Istanbul, Turkey, August 28-30, 2006
[24] P. Sonkoly, P. Kozma, Z. Nagy, P. Szolgay, Acoustic Wave Propagation Modeling on 3D CNN-UM Architecture, Proceedings of the 10th IEEE International Workshop on Cellular Neural Networks and their Applications, CNNA2006, Istanbul, Turkey, August 28-30, 2006
[25] P. Szolgay, S. Kocsárdi, Z. Nagy, P. Sonkoly, Zs. Vörösházi, Complex computational problems in cellular architectures, RSEE 2006, Proceedings of the 6th international conference on renewable sources and environmental electro-technologies, pp. 111-115, Stana De Vele, 2006