Investigation of emulated-digital CNN-UM architectures: Retina model and Cellular Wave-Computing Architecture implementation on FPGA




University of Pannonia
Information Science and Technology Doctoral School

Investigation of emulated-digital CNN-UM architectures: Retina model and Cellular Wave-Computing Architecture implementation on FPGA

Theses of Ph.D. dissertation
Zsolt Vörösházi
Supervisor: Péter Szolgay DSc
Veszprém, Hungary, 2009

I. Motivations and aims

Nowadays, both analog and digital circuit technology and fabrication are continuously improving and complementing each other. This improvement is well characterized by the scaling-down (miniaturization) effect described by Moore's law. For high-performance, real-time, near-sensor signal processing tasks, the choice between the two technologies is primarily determined by the application. To support the decision, critical and typical physical parameters of the complex, large-scale integrated VLSI circuits are calculated, such as area (A), speed (S), and dissipated power (P). Recently, parallel array processing has become the focus of state-of-the-art analog circuit technology and its digital counterpart. However, this design methodology raised an important problem: most designers and researchers intended to construct a globally interconnected processor array structure, but its complexity grows exponentially with the number of processor elements in the array.

Cellular Neural / Nonlinear Networks (CNN) are defined as an analog, non-linear, parallel computing array structure consisting of many elementary processor units (e.g. nuclei) arranged in a 2-dimensional regular grid. They can be implemented not only as a single-layer but also as a multi-layer architecture. The processor elements are locally connected (discrete) in space, but they operate continuously in time. The program of the CNN network (called a "template") is defined by the strength of the local interconnections between the elementary cells, in other words by setting the matrix of weight factors. The result of the computation is derived from both the spatio-temporal dynamics of the processing elements and the template operations together (called the analog transient).

If each elementary CNN cell is extended with a local analog and logic memory unit, a local control unit, and an optical sensor input, and a global control unit is added to this integrated cell array, the CNN Universal Machine (CNN-UM) architecture can be constructed, more recently defined as the Cellular Wave Computing Architecture. The CNN-UM is universal in the Turing-machine sense, and it can also work as a non-linear computing operator. Each elementary instruction of the CNN-UM defines a complex, spatio-temporal dynamic behavior. At present, this novel computing architecture based on the CNN paradigm has been implemented on several different platforms. The first hardware prototypes of CNN networks were analog / mixed-signal VLSI chips. The huge computing performance of these CNN chips (a few TeraOPS, i.e. 10^12 operations/s) far exceeds the performance of all other digital processor implementations, and their power dissipation is very low. However, they have some disadvantages that impede their widespread use in industrial applications: they suffer from noise sensitivity, lack of flexibility, and, most importantly, limited analog accuracy (about 7-8 bits in I/O), which gives inaccurate solutions in most signal processing tasks.

The simplest, most accurate, and most flexible, but slowest CNN-UM implementation is the CNN software simulator running on a traditional computer. The software simulator is generally used to ease the template design and optimization process. Moreover, during the measurements the speed-up of the different CNN-UM approaches is compared to the computing performance of the CNN software simulator, which is taken as unity. Alternatively, CNN software simulation can be accelerated by many-core technologies, such as GPU-based (Graphics Processing Unit) implementations using NVIDIA CUDA, or the IBM Cell architecture. The emulated-digital CNN-UM solution is the best compromise between the analog VLSI CNN-UM implementations and the software simulators regarding computing performance and accuracy. Emulated-digital solutions take many different physical forms: an ASIC (Application Specific Integrated Circuit) such as the CASTLE array processor, a DSP-based (Digital Signal Processor) design such as the CNN-HAC prototyping board, or an FPGA-based (Field Programmable Gate Array) design such as the FALCON architecture. In the emulated-digital approach the behavior of the analog CNN cell network is approximated by a model discretized in space and time, while the locally connected digital processor elements are arranged in an array.

Hence, the nature of the CNN provides a flexible and effective computing structure for the complex spatio-temporal dynamical computations of various bio-inspired systems (such as a retina); moreover, it makes it possible to generate the activity patterns in video real-time. The neuromorph structure of the multi-layer CNN retina model is derived from both morphological and electro-physiological information measured by neurobiologists.
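
To make the phrase "discretized in space and time" concrete, the following is a minimal Python sketch of one Forward Euler step of a single-layer CNN with 3×3 templates. It assumes the standard normalized CNN state equation dx/dt = -x + ΣA·y + ΣB·u + z with the piecewise-linear output y = 0.5(|x + 1| - |x - 1|), which this booklet does not spell out. The function names, the zero boundary condition, and the floating-point arithmetic are illustrative assumptions, not the Falcon design.

```python
def output(x):
    """Standard CNN piecewise-linear output: y = 0.5 * (|x + 1| - |x - 1|)."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def euler_step(x, u, A, B, z, h):
    """One Forward Euler step of a single-layer CNN with 3x3 templates.

    x, u : 2-D lists holding the state and the input image
    A, B : 3x3 feedback and feed-forward templates
    z    : bias, h : time-step
    Cells outside the grid contribute zero (an assumed boundary condition).
    """
    rows, cols = len(x), len(x[0])
    y = [[output(v) for v in row] for row in x]
    new_x = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = z
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    k, l = i + di, j + dj
                    if 0 <= k < rows and 0 <= l < cols:
                        acc += A[di + 1][dj + 1] * y[k][l]
                        acc += B[di + 1][dj + 1] * u[k][l]
            # x(t + h) = x(t) + h * (-x + sum(A*y) + sum(B*u) + z)
            new_x[i][j] = x[i][j] + h * (-x[i][j] + acc)
    return new_x
```

Each cell reads only its 3×3 neighborhood, which is why the same update maps naturally onto an array of locally connected processing elements.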

According to the latest results of neurobiological investigations, a mammalian (rabbit) retina consists of about 10-12 different types of ganglion channels, but further channels might be discovered as the methodology and measurement techniques improve. Each channel is made up of a stack of several (at least 10) diversely interconnected strata, on each of which a large number of simple processor elements (neurons) are arranged in a two-dimensional structure. The difficulty of this evolving computing problem is that, beyond the increased computing power requirements, a large number of CNN layers with different physical timing parameters and various connectivity properties must be handled.

Universality of a CNN-UM network is based on the stored-program principle, which can be realized by integrating an embedded Global Analogic Programming Unit (hereafter GAPU) into the cell network. The GAPU is responsible for controlling the sequential instructions of complex, sophisticated analogic (analog-logic) CNN algorithms; moreover, it can store the values (input, state, bias) necessary to perform the computations.

I have chosen FPGA-based reconfigurable computing devices both for the implementation of the neuromorph structured mammalian retina model and for the elaboration of the GAPU. The reason is that today's modern FPGAs provide a good alternative for performing complex spatio-temporal, multi-layer CNN dynamical computations at high precision owing to their advantageous features, such as high flexibility and computing performance, rapid prototyping, and low cost (in low volume). Therefore, it is worthwhile to review the different CNN-UM approaches, especially the FPGA-based emulated-digital CNN-UM implementation, and to examine how its inherent computational potential can be exploited in the solution of various real-time processing tasks.

II. Methodology of the research

The topic of the dissertation is, on the one hand, the implementation of the neuromorph structured, multi-layer mammalian retina model and, on the other hand, the implementation of the Global Analogic Programming Unit (GAPU) on FPGA architecture. During the software development and setup phase of the different test applications I used several industry-standard programming and simulation EDA tools, such as the Xilinx ISE synthesis tools with the EDK embedded processor development kit, the Celoxica/Agility DK Design Suite supporting the Handel-C high-level description language, the Mentor ModelSim simulator, and the MATLAB programming tool. For hardware prototyping I chose various development boards that cover distinct directions and levels of FPGA evolution (e.g. Celoxica RC203 and RC2000 boards with Virtex-II, a Xilinx V2Pro board equipped with Virtex-II Pro, and finally a Xilinx ML-506 board with a Virtex-5 FPGA).

I intended to use modern hardware-software co-design and co-verification techniques, which are the state-of-the-art and most popular form of FPGA-based reconfigurable computing (RC) implementation. In hardware-software co-design tasks partitioning is the key problem: the designer can freely determine the division into hardware and software parts, although the partitioning may also depend on the available resources and the achievable performance. Reconfigurable FPGA architectures make it possible to perform optimized DSP operations at high speed by utilizing the internal dedicated building blocks, e.g. calculating a convolution by means of Multiply-Accumulate (MAC) DSP blocks. As of 2009, the largest and most powerful high-end Xilinx FPGA belongs to the Virtex-6 SX family (XC6VSX475T), which contains up to 2016 dedicated MAC DSP blocks.

During the research work I first examined how the bio-inspired mammalian retina model can be implemented on an FPGA-based emulated-digital CNN-UM architecture, which is an open and emerging computing problem. Although several analog VLSI CNN-UM devices with complex cells (e.g. CACE1k, CACE2k) are available, they can only handle external layers of a given retina channel. With this type of CNN approach, at most 2 or 3 strata of an OPL (Outer Plexiform Layer) can be modeled simultaneously in video real-time (about 25 fps). Because of this main limitation I have chosen the emulated-digital CNN-UM approach, which makes it possible to implement different retina channels with various configurable timing and connectivity parameters, also supposing a globally connected multi-layer structure. The aim of the proposed multi-channel, multi-layer retina model implementation on FPGA is to obtain qualitatively correct results compared to the original neurobiological measurements in order to mimic the behavior of the living retina. Furthermore, the implementation provides real-time processing and is several orders of magnitude faster than a software simulator. With the help of this FPGA-based implementation, the model parameters of the mammalian retina can be verified rapidly and set correctly. (The run-time of a multi-layer retina model using a software simulator on a conventional PC may take several hours, depending on the size of the model and its parameters.)

The complexity of this task comes from handling a large number of strata and the very different parameters of the layers, such as feed-forward and feedback connections, couplings, and time constants. The governing equations, which describe the dynamics of the neuromorph structured retina model, do not have an exact analytical solution. Therefore, during the investigations I compared the results of different numerical integration formulas (e.g. the Forward Euler method and higher-order Runge-Kutta methods) for solving these types of ODEs. I examined the critical parameters in the following key aspects: simulation time-step, resource utilization (area), computing performance, and accuracy. During the implementation I attempted to elaborate an FPGA-based reconfigurable computing architecture that can be configured in an arbitrary manner, so that, owing to the applied design methodology and rapid prototyping platform, the behavior of various bio-inspired multi-channel vision models can be explored in real-time.

The implemented emulated-digital CNN-UM architecture on FPGA makes it possible to compute complex, spatio-temporal dynamics described by coupled ODEs. Because a fixed-point computing method is used, it was important to know how accurately the novel implementation approximates the results of the neurobiological measurements. Moreover, the area requirements and the largest achievable performance were measured at different computing precisions. Considering scalable precision, another vital question was the lowest computing accuracy at which the model still gives qualitatively acceptable responses. (According to the original neurobiological measurements, a real mammalian retina works at about 6 bits of analog computing precision.)
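
The trade-off between integration formulas mentioned above can be illustrated with a toy comparison. The sketch below is an assumption-laden example, not the retina model: it integrates the scalar equation dx/dt = (-x + z)/tau, which has the known solution x(t) = z + (x0 - z)·exp(-t/tau), with Forward Euler and with the classical fourth-order Runge-Kutta formula at the same time-step.

```python
import math


def f(x, z=1.0, tau=1.0):
    """Right-hand side of the toy cell equation dx/dt = (-x + z) / tau."""
    return (-x + z) / tau


def euler(x, h):
    """One Forward Euler step (one evaluation of f)."""
    return x + h * f(x)


def rk4(x, h):
    """One classical fourth-order Runge-Kutta step (four evaluations of f)."""
    k1 = f(x)
    k2 = f(x + 0.5 * h * k1)
    k3 = f(x + 0.5 * h * k2)
    k4 = f(x + h * k3)
    return x + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def integrate(step, x0, h, n):
    """Apply `step` n times starting from x0."""
    x = x0
    for _ in range(n):
        x = step(x, h)
    return x


exact = 1.0 - math.exp(-1.0)  # x(1) for x(0) = 0, z = 1, tau = 1
err_euler = abs(integrate(euler, 0.0, 0.1, 10) - exact)
err_rk4 = abs(integrate(rk4, 0.0, 0.1, 10) - exact)
print(f"Forward Euler error: {err_euler:.2e}, RK4 error: {err_rk4:.2e}")
```

Runge-Kutta is far more accurate per step but needs four right-hand-side evaluations, which in hardware means more arithmetic resources or more clock cycles per iteration; this is the kind of area, speed, and accuracy balance the investigation weighs.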

The CNN templates and the settings of the interactions are calculated from the parameter tables of the CNN-based neuromorph retina model, whereas the behavior of the model is derived from neurobiological measurements. On the one hand, both the templates and the algorithms of the multi-layer retina model have been implemented on a software simulator and tested on different conventional microprocessors. To achieve better performance, the ANSI-C/C++ source code is extended with optimized functions of the Intel Integrated Performance Library, a package optimized for image and signal processing tasks. On the other hand, the entire retina system on the FPGA-based emulated-digital CNN-UM architecture was constructed by modifying and extending the Falcon emulated-digital CNN-UM architecture. Different kinds of prototyping platforms equipped with various Xilinx FPGAs were used for the neuromorph retina model calculations. Finally, the computed results have been compared to the original neurobiological measurements to verify the effectiveness of the proposed FPGA-based CNN-UM architecture.

The dissertation also deals with the implementation of the Global Analogic Programming Unit on the reconfigurable emulated-digital CNN-UM architecture using FPGA. Owing to its modular structure and operation, the GAPU can be simply integrated with the previously elaborated original Falcon CNN processing architecture, so that a universal Cellular Wave Computer architecture can be implemented on FPGA. During the implementation process I integrated the Xilinx MicroBlaze embedded soft-processor IP core, which has a RISC instruction set, into the GAPU. Then the embedded system, acting as a global CNN control unit, was extended with some storage elements. Finally, this novel architecture was integrated with the modified Falcon-ML processor array architecture and the Vector Processing Elements. These improvements make it possible to exploit the large computing performance of the Falcon processor effectively in order to construct a fully functional, standalone, real-time image processing system.

With the proposed embedded GAPU implementation, on the one hand, the template sequences of complex analogic CNN algorithms can be executed easily; on the other hand, the GAPU is capable of controlling program organizing constructions (e.g. iteration, branch) and I/O instructions, similarly to various commercial Visual System-on-Chip implementations. (Without the embedded GAPU, only a single instruction or template operation could be handled at a time via a host PC, which limited the performance of the Falcon processor significantly.) During the verification and performance tests, the computing time required to calculate the CNN equations was measured repeatedly (50 times). From this set the best run-time was selected and compared to the estimated performance of different commercial CNN-UM implementations. The quality of the proposed embedded, emulated-digital CNN-UM GAPU implementation is demonstrated and verified by running a complex analogic CNN algorithm that requires consecutive steps of template operations and template replacements.
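
The stored-program idea described above (template sequences plus iteration and branching, held next to the processor array instead of on a host PC) can be sketched as a tiny interpreter. Everything in this Python sketch is hypothetical: the instruction names `apply`, `repeat`, and `until_stable`, the tuple encoding, and the use of plain Python functions as stand-ins for template operations are illustrative assumptions and do not reflect the actual GAPU instruction set.

```python
def run_program(program, image, templates):
    """Run a list of (hypothetical) instructions on `image` and return the result.

    ("apply", name)         apply templates[name] once
    ("repeat", n, body)     run the nested instruction list n times
    ("until_stable", body)  repeat body until the image stops changing
    """
    for instr in program:
        op = instr[0]
        if op == "apply":
            image = templates[instr[1]](image)
        elif op == "repeat":
            for _ in range(instr[1]):
                image = run_program(instr[2], image, templates)
        elif op == "until_stable":
            while True:
                new_image = run_program(instr[1], image, templates)
                if new_image == image:  # data-dependent branch
                    break
                image = new_image
        else:
            raise ValueError(f"unknown instruction: {op}")
    return image


# Stand-in "template": decrement every pixel, saturating at zero.
templates = {"dec": lambda img: [max(v - 1, 0) for v in img]}
program = [("repeat", 2, [("apply", "dec")]),
           ("until_stable", [("apply", "dec")])]
result = run_program(program, [3, 5], templates)
print(result)
```

Because the whole program and its loop condition are evaluated next to the array, no round trip to a host is needed between template steps, which is the bottleneck the embedded GAPU is meant to remove.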

Based on real experiments, several important issues relating to acceleration efficiency, computing accuracy, cell size, and area consumption are discussed and compared to the results of the software simulator and of competing state-of-the-art CNN-UM implementations. The research work was carried out at the Cellular Neural Network Applications Laboratory of the Department of Image Processing and Neurocomputing (now the Department of Electrical Engineering and Information Systems), University of Pannonia.

III. New Scientific Results

Thesis Group 1: Implementation of a CNN-based neuromorph mammalian retina model on FPGA architecture

I have implemented novel single- and multi-channel, multi-layer retina models on a reconfigurable emulated-digital CNN-UM architecture by applying the latest results of neurobiological measurements relating to the CNN-based framework of the neuromorph mammalian (e.g. rabbit) retina model. The difficulty of this challenging task lies in the solution of a complex spatio-temporal problem that requires huge computing power. Real-time processing capability has been verified and tested on three different FPGA-based prototyping systems: Celoxica RC2000, Digilent XUPV2P, and Xilinx ML-506. I have shown experimentally that the single- and multi-channel retina model implementations on FPGA achieve a performance increase of orders of magnitude over the software solutions while still providing high flexibility in parameter settings. This implementation also makes it possible to handle various neuromorph retina models or biological systems more easily and effectively owing to its rapid parameter tuning and refining procedure.

Related publications: [1],[3],[4],[5],[6],[7],[9],[12]

Thesis 1.1: I have elaborated a reconfigurable emulated-digital CNN-UM computing architecture (Falcon-ML) that is suitable for implementing CNN-based neuromorph, single- and multi-channel, multi-layer mammalian retina model computations on FPGA effectively.

The architecture is tailored to the single- and multi-channel, multi-layer retina model; therefore I have completely redesigned the Arithmetic Unit for calculations with diffusion-type and Gaussian-type symmetrical templates and with intra-/inter-layer zero-neighborhood connections. The Template Memory unit has also been expanded in order to store the various parameters related to the connections of the multi-channel, multi-layer retina structure.

Thesis 1.2: I have shown experimentally that the performance of the optimized retina processor elements for calculating CNN dynamics can be significantly improved by decreasing the computing precision. Not knowing the exact analytical solution of this complex spatio-temporal problem, I have considered the double precision floating point numerical implementation as the accurate solution. In general, it is inferred that at least 22-bit computing precision is necessary to obtain qualitatively correct results from various CNN-based neuromorph mammalian retina channels.

I have compared the results of different fixed-point computations to the double precision floating point results and to the neurobiological measurements. At low precisions (less than 14 bits) the error values of the FPGA-based neuromorph retina model implementation are very high because the model does not respond to the input stimulus. At least 16-18 bit precision is required to get some response on the output of the model. If the CNN dynamics of the retina model have to be computed more accurately, at least 28-30 bit precision is required.

Thesis 1.3: I have given equivalent transformations between the computing performance, the image size, the number of layers, and the precision of the elaborated FPGA-based single- and multi-channel neuromorph retina model implementation. These critical parameters determine the limitations of the FPGA implementation. At the qualitatively correct 22-bit state precision the elaborated architecture achieves 14 to 1400 times higher performance than the software solution running optimized code on an Intel Core2 Duo E8400 microprocessor.

The multi-layer CNN simulation kernel is written in C using optimized functions of the Image Processing Library from Intel. To emulate a single retina channel at least 10 CNN layers are required, while every further ganglion channel increases the complexity of the CNN network by 7 additional layers. I have implemented the CNN-based neuromorph retina model on several different FPGA-based experimental systems (using Virtex-II, Virtex-II Pro, and Virtex-5 FPGAs) and explored the maximal number of implementable Falcon processing elements on each platform, while I have estimated the results for the largest Virtex-6 FPGA.
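
One plausible mechanism behind "the model does not respond to the input stimulus" at low precision is that the per-step state increment falls below half of the least significant bit and is rounded away. The Python sketch below illustrates this on a toy cell, dx/dt = -x + u. It is an assumption-based illustration: `frac_bits` counts fractional bits only (not the total word width quoted in the thesis), and the rounding scheme is not claimed to match the arithmetic unit of the actual retina processor.

```python
def quantize(value, frac_bits):
    """Round `value` to a fixed-point grid with `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def euler_fixed(x0, u, h, steps, frac_bits):
    """Forward Euler for dx/dt = -x + u, quantizing the state after every step."""
    x = quantize(x0, frac_bits)
    for _ in range(steps):
        x = quantize(x + h * (-x + u), frac_bits)
    return x


# A weak stimulus with a small time-step: the first increment is h * u = 1e-4.
low = euler_fixed(0.0, 0.01, 0.01, 100, frac_bits=8)    # LSB = 1/256
high = euler_fixed(0.0, 0.01, 0.01, 100, frac_bits=20)  # LSB = 1/1048576
print(low, high)
```

With 8 fractional bits the state never leaves zero, while with 20 fractional bits it follows the expected transient. The precise thresholds for the real model depend on its time constants and template values, which this toy example does not reproduce.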

With 22-bit state precision and a 2-7 ms time-step, at image sizes of 64×64 (on Virtex-II) and 474×474 (on Virtex-6), 1 to 48 parallel retina channels can be implemented and emulated in real-time, depending on the dedicated FPGA resources. Larger images can be processed if more BRAM memories are available, but the processing will not be done in real-time. To process higher resolution images an external on-board memory is required. In this case the processing is slower by at least half an order and at most three orders of magnitude than with internal BRAM memories, due to the memory I/O bandwidth limitation.
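
The layer counts quoted above translate into a simple lower bound, sketched here in Python. The function names are hypothetical, and `state_bits` counts only one state value per cell per layer; it ignores inputs, templates, and any buffering, so it is a rough illustration rather than an actual BRAM budget.

```python
def cnn_layers(channels):
    """Minimum CNN layers for `channels` retina channels:
    10 for the first channel, 7 more for each further channel."""
    if channels < 1:
        raise ValueError("at least one channel is required")
    return 10 + 7 * (channels - 1)


def state_bits(channels, width, height, precision_bits=22):
    """Bits needed to hold one state value per cell per layer (state only)."""
    return cnn_layers(channels) * width * height * precision_bits


print(cnn_layers(1), cnn_layers(2), cnn_layers(48))
print(state_bits(1, 64, 64))
```

For example, a single channel on a 64×64 image at 22-bit precision needs at least 10 · 64 · 64 · 22 = 901,120 bits of state storage under these assumptions.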

Thesis Group 2: Implementation of embedded CNN-UM Global Analogical Programming Unit as a Cellular Wave Computer on FPGA architecture

I have implemented a Global Analogical Programming Unit on an FPGA-based emulated-digital CNN-UM architecture to obtain a fully functional Cellular Wave Computing architecture. I have completely redesigned the local Control Unit of the Falcon reconfigurable emulated-digital CNN processor and optimized it for communication with the GAPU; the modified processor is called the Falcon Processing Element (FPE). I have elaborated a new GAPU architecture to control the sequential template and/or arithmetic-logic operations and the program organizing instructions of analogic CNN algorithms. To perform arithmetic and logic operations a new Vector Processing Element (VPE) has been implemented. Finally, the processing array consisting of VPE and FPE units has been integrated with the GAPU implementation. I have demonstrated the operation and effectiveness of the proposed embedded GAPU architecture by executing a complex skeletonization analogic CNN algorithm. The real-time image processing capability of this autonomous system has been verified and tested on different prototyping systems. I have shown experimentally that on the largest FPGA architecture a performance advantage of at least two orders of magnitude can be achieved over the software simulator, while it also provides a speed-up of several times over competing analog VLSI CNN-UM implementations.

Related publications: [2],[8],[10],[11]

Thesis 2.1: I have elaborated and implemented a new emulated-digital CNN-UM GAPU architecture as a Cellular Wave Computer on FPGA by integrating an embedded Xilinx MicroBlaze soft-processor core to control the sequential and program organizing instructions of analogic CNN algorithms effectively.

Based on the original reconfigurable emulated-digital CNN-UM processor (FALCON) I have elaborated a new computing architecture, called the Falcon Processing Element (FPE), which is optimized for communication with the GAPU. The local Control Unit of the original Falcon architecture has been completely redesigned. I have implemented a new Vector Processing Element (VPE) to perform the arithmetic and logic operations by utilizing the dedicated resources of the FPGA. The processing array of FPE and VPE units is integrated with the GAPU implementation. Without the GAPU, the full processing time of the previous solutions is dominated by the communication time between the host PC and the Falcon PE, which is needed in each step of the algorithm for downloading the template sequences, the input and initial state images, and the program organizing instructions (such as branch, cycle, etc.), and for uploading the result of the computation. I have reduced the communication time by storing these parameters and instructions in the internal registers of the GAPU, similarly to the standard CNN-UM structure. The embedded GAPU communicates directly with the Falcon PEs across the high-speed PLB bus of the Xilinx MicroBlaze soft core at the frequency of the FPGA's internal clock. Therefore, the Falcon PE is utilized more efficiently (during 91% of the full computing time) when performing complex analogic CNN algorithms.

Thesis 2.2: I have shown experimentally that the implementation and integration of FPEs and VPEs with a GAPU on the reconfigurable emulated-digital CNN-UM system is optimal at 16-bit computing precision, where the number of implementable Falcon and Vector PEs is the largest.

The 18-bit state precision gives the optimal resource occupancy, since it is best suited to the bit width of the dedicated BlockSelectRAM memories (e.g. BRAM18k) and multiplier blocks (e.g. MULT18×18). However, with a Xilinx MicroBlaze embedded soft-processor core the supported high-speed communication bus (e.g. the PLB bus) can be defined as 128 bits wide (a multiple of 16 bits); therefore the practical computing precision of the FPE is also 16 bits. The proposed GAPU implementation embedded with a Xilinx MicroBlaze IP core occupies only a small amount of the available logic and dedicated chip resources, so neither the number of implementable Falcon and Vector Processing Elements nor the cumulative computing performance is decreased significantly.

Thesis 2.3: I have shown experimentally that the elaborated FPGA-based CNN-UM GAPU implementation provides several orders of magnitude faster processing than software simulation and may also outperform the current analog VLSI CNN-UM systems, depending on the selected FPGA. The computing performance is determined by the image size, the accuracy of the solution, and the number of available dedicated memory resources.
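
The bus-width argument in Thesis 2.2 can be checked with a small packing example: eight 16-bit values fill a 128-bit word exactly, whereas 18-bit values do not divide it evenly. The Python sketch below is illustrative only; the lane ordering (value 0 in the lowest bits) and the function names are assumptions, not the actual PLB data layout of the design.

```python
WORD_BITS = 128   # bus width quoted in the text
VALUE_BITS = 16   # practical FPE precision quoted in the text
PER_WORD = WORD_BITS // VALUE_BITS


def pack(values):
    """Pack up to PER_WORD signed 16-bit integers into one 128-bit word."""
    if len(values) > PER_WORD:
        raise ValueError("too many values for one bus word")
    word = 0
    for i, v in enumerate(values):
        if not -(1 << 15) <= v < (1 << 15):
            raise ValueError("value does not fit in 16 bits")
        word |= (v & 0xFFFF) << (i * VALUE_BITS)  # two's complement lane
    return word


def unpack(word, count=PER_WORD):
    """Recover `count` signed 16-bit integers from a packed word."""
    out = []
    for i in range(count):
        raw = (word >> (i * VALUE_BITS)) & 0xFFFF
        out.append(raw - 0x10000 if raw & 0x8000 else raw)
    return out


sample = [1, -1, 32767, -32768, 0, 5, -5, 100]
print(PER_WORD, WORD_BITS % 18, unpack(pack(sample)) == sample)
```

With 18-bit values only seven would fit, leaving two bits of every bus word unused, which is consistent with the choice of 16 bits as the practical precision.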
I have implemented the embedded GAPU architecture on several different FPGA-based experimental systems (using Virtex-II, Virtex-II Pro, and Virtex-5 FPGAs). I have measured the maximal performance on each platform, while I have estimated the results for the largest Virtex-6 FPGA. The skeletonization algorithm was selected and executed using nearest-neighborhood templates to measure the performance. The functionality of the GAPU was examined on both 128×128 and 512×512 sized images by running 10 Forward Euler iterations, assuming the optimal 16-bit state and constant precision and 8-bit template precision. The CNN software simulation kernel of the skeletonization algorithm is written in C using optimized functions of the Intel Image Processing Library. Depending on the dedicated resources, the cumulative performance of the Falcon processing array extended with the proposed GAPU implementation can reach 1.33 billion CNN cell iterations per second or 135 billion CNN operations per second. Depending on the selected image size (128×128 or 512×512), a 3- to 150-fold speed-up can be achieved over the software simulator running optimized code on an Intel Core2Duo E8400 microprocessor. The performance of the GAPU implementation can reach or may even exceed (by one order of magnitude) the performance of the analog ASIC VLSI CNN-UM chips (e.g. ACE16k, Q-Eye).
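The measurement setup above (Forward Euler iterations with nearest-neighborhood templates) can be sketched in software as follows. This is a minimal illustration and not the author's simulation kernel: it assumes the standard normalized Chua-Yang CNN state equation, uses floating-point arithmetic instead of the 16-bit fixed-point datapath, applies a fixed boundary value, and introduces the hypothetical names `euler_step`, `cell_iterations` and `ideal_time_seconds`. The template values are supplied by the caller; the skeletonization templates themselves are not reproduced here.

```python
def output(x):
    """Standard CNN output nonlinearity: the state clamped to [-1, 1]."""
    return max(-1.0, min(1.0, x))


def euler_step(state, inp, A, B, z, h, boundary=0.0):
    """One Forward Euler step of x' = -x + A*y + B*u + z with 3x3 templates.

    state, inp: 2-D lists of floats with identical shape
    A, B: 3x3 feedback and control templates; z: bias; h: time step
    boundary: assumed fixed value outside the array (a value in [-1, 1])
    """
    rows, cols = len(state), len(state[0])

    def at(img, r, c):
        return img[r][c] if 0 <= r < rows and 0 <= c < cols else boundary

    new_state = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = z
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += A[dr + 1][dc + 1] * output(at(state, r + dr, c + dc))
                    acc += B[dr + 1][dc + 1] * at(inp, r + dr, c + dc)
            new_state[r][c] = state[r][c] + h * (-state[r][c] + acc)
    return new_state


def cell_iterations(width, height, iterations):
    """Number of cell updates needed for the given image size and step count."""
    return width * height * iterations


def ideal_time_seconds(n_cell_iterations, rate_per_second):
    """Idealized run time at a given peak rate, ignoring all overhead."""
    return n_cell_iterations / rate_per_second
```

For a 512×512 image and 10 iterations, `cell_iterations` gives 2,621,440 cell updates. At the quoted peak rate of 1.33 billion cell iterations per second this corresponds to roughly 2 ms, which is only an idealized lower bound because it ignores communication and control overhead.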

IV. Possible Applications

Several years ago I had the opportunity to participate in the long design and verification process of the emulated-digital CNN-UM array architecture called CASTLE, which was implemented at the Analogical and Neural Computing Laboratory, Hungarian Academy of Sciences. I acquired fundamental knowledge of high-level hardware description languages and of the full-custom ASIC VLSI layout design and simulation procedure by elaborating the global timing and control unit of the CASTLE array processor architecture. The implementation of the standalone FPGA-based CNN-UM system with GAPU integration benefits from this know-how. Instead of using the expensive and long development procedure of full-custom ASIC VLSI technology, I decided to apply reconfigurable computing (RC) devices for CNN computations. Reconfigurable computing architectures, such as FPGAs, make it possible to implement low-cost, flexible and reprogrammable emulated-digital CNN-UM systems for various application areas.

In the dissertation the investigated emulated-digital CNN-UM architectures have been implemented on reconfigurable computing devices, and they can be employed in the following possible application areas.

On the one hand, different neuromorphic, multi-layer, multi-channel retina models can be analyzed, or bio-inspired systems can be modeled on FPGA, where high-speed processing capability is essential. High computing performance has been achieved by using simple, locally interconnected processing elements arranged in a large array. Moreover, this implementation is orders of magnitude faster than software simulators, which, together with its rapid reconfiguration capability, makes it possible to examine differently organized retina models in video real time.
With the proposed implementation it might be possible to explore and understand the relation between the stimulation of a neuron in its corresponding receptive field and the recorded spiking patterns for a given ganglion channel. The quality of the retina model can be examined by comparing the FPGA-based measurements with the results of neurobiological measurements. Knowing the differences between them, the structure and parameters of the retina model can be tuned and refined. Using a properly defined retina model, a smart vision system can be implemented on FPGA, which makes more effective object recognition, tracking and classification possible, for example in surveillance or reconnaissance applications.

On the other hand, the Falcon reconfigurable emulated-digital CNN-UM processor array embedded with the GAPU implementation gives a fully functional, stand-alone image processing system. Using the GAPU implementation, complex analogic CNN algorithms can be executed in real time. It makes it possible to perform sequences of template operations, analog and logic operations, and program organizing instructions on a single FPGA-based system. The GAPU implementation can be easily integrated with the previously elaborated single-layer Falcon architecture (Falcon-SL), the previously mentioned Falcon multi-layer retina architecture (Falcon-ML), or the nonlinear template runner architecture (Falcon-Nonlinear) as well. Therefore, the wide applicability of the GAPU implementation is further expanded towards low-cost, smart and complex image processing systems.

IV. List of Publications

Journal Papers

[1] Z. Nagy, Zs. Vörösházi, P. Szolgay: Emulated Digital CNN-UM Solution of Partial Differential Equations. International Journal of Circuit Theory and Applications, Wiley, Vol. 34, Special Issue on CNN Technology (Part 2), July-Aug. 2006, pp. 445-470 (IF: 0.686, 2006), ISSN: 0098-9886.

[2] Zs. Vörösházi, A. Kiss, Z. Nagy, P. Szolgay: Implementation of embedded emulated-digital CNN-UM Global Analogic Programming Unit on FPGA and its application. International Journal of Circuit Theory and Applications, Wiley, Vol. 36, Special Issue: Cellular Wave Computing Architecture, July-Sep. 2008, pp. 589-603 (IF: 2.389, 2008), ISSN: 0098-9886.

[3] Zs. Vörösházi, Z. Nagy, P. Szolgay: FPGA-Based Real Time, Multichannel Emulated-Digital Retina Model Implementation. EURASIP Journal on Advances in Signal Processing, Hindawi, Vol. 2009, Special Issue on CNN Technology for Spatiotemporal Signal Processing (IF: 1.055, 2007), ISSN: 1687-6172.

International Conference Papers

[4] Z. Nagy, Zs. Vörösházi, P. Szolgay: An Emulated Digital Retina Model implementation on FPGA. CNNA 2005, 9th IEEE International Workshop on Cellular Neural Networks and their Applications, Hsinchu, Taiwan, 28-30 May, 2005, pp. 278-281.

[5] Z. Nagy, Zs. Vörösházi, P. Szolgay: Mammalian Retina Model Implementation on Emulated Digital FPGA. HACIPPR 2005, 5th Joint Hungarian-Austrian Conference on Image Processing and Pattern Recognition, Veszprém, Hungary, 11-13 May, 2005, pp. 295-302.

[6] Zs. Vörösházi, Z. Nagy, P. Szolgay: An Advanced emulated digital Retina Model on FPGA to implement a real-time test environment. ISCAS 2006, IEEE International Symposium on Circuits and Systems, Kos, Greece, 21-24 May, 2006, pp. 278-281.

[7] P. Szolgay, S. Kocsárdi, Z. Nagy, P. Sonkoly, Zs. Vörösházi: Complex Computational Problems in Cellular Architectures. RSEE 2006, Oradea, Romania, 8-10 June, 2006, pp. 111-115.

[8] Zs. Vörösházi, A. Kiss, Z. Nagy, P. Szolgay: An embedded CNN-UM Global Analogic Programming Unit implementation on FPGA. CNNA 2006, 10th IEEE International Workshop on Cellular Neural Networks and their Applications, Istanbul, Turkey, 28-30 Aug, 2006, pp. 318-322.

[9] Z. Nagy, Zs. Vörösházi, P. Szolgay: A Real-time Mammalian Retina Model Implementation on FPGA. CNNA 2006, 10th IEEE International Workshop on Cellular Neural Networks and their Applications, Istanbul, Turkey, 28-30 Aug, 2006 (live demo).

[10] Zs. Vörösházi, A. Kiss, Z. Nagy, P. Szolgay: FPGA Based Emulated-Digital CNN-UM Implementation with GAPU. CNNA 2008, 11th IEEE International Workshop on Cellular Neural Networks and their Applications, Santiago de Compostela, Spain, 14-16 July, 2008, pp. 175-180.

[11] Zs. Vörösházi, A. Kiss, Z. Nagy, P. Szolgay: A Standalone FPGA Based Emulated-Digital CNN-UM System. CNNA 2008, 11th IEEE International Workshop on Cellular Neural Networks and their Applications, Santiago de Compostela, Spain, 14-16 July, 2008 (live demo), pp. 4.

[12] Zs. Vörösházi, Z. Nagy, P. Szolgay: An Advanced Real-Time, Multi-Channel Emulated-Digital Retina Model Implementation on FPGA. CNNA 2008, 11th IEEE International Workshop on Cellular Neural Networks and their Applications, Santiago de Compostela, Spain, 14-16 July, 2008 (live demo), pp. 6.