Cognitive Computational Models for Intelligent Engineering Systems


Cognitive Computational Models for Intelligent Engineering Systems

Ph.D. Dissertation

Barna Reskó

Supervisor: Péter Baranyi, D.Sc.
Co-supervisor: Péter Korondi, D.Sc.

Budapest, 2008

I, the undersigned Barna Reskó, declare that I prepared this doctoral dissertation myself, and that I used only the sources listed. Every part that I have taken from another source, either verbatim or rephrased with identical content, is clearly marked with a reference to the source. The reviews of the dissertation and the minutes of the defense are available at the Dean's Office of the Faculty of Electrical Engineering and Informatics of the Budapest University of Technology and Economics.

Budapest, June 25.
Barna Reskó

Acknowledgments

I would like to express my gratitude to my supervisors, Péter Baranyi and Péter Korondi, for their scientific guidance and encouragement throughout my research in Hungary and abroad. Without their unconditional support this dissertation would have been impossible. I wish to thank Prof. Hideki Hashimoto for the opportunity to spend a year and a half doing research in his laboratory at The University of Tokyo. Also, I would like to express my gratitude to Prof. Kesheng Wang for his supervision and scientific guidance at the Norwegian University of Science and Technology. I appreciate the support of the Ministry of Education, Culture, Sports, Science and Technology of Japan, the Norwegian Research Council, the Ministry of Education of the Republic of Hungary and the Computer and Automation Research Institute of the Hungarian Academy of Sciences for covering my financial needs during my doctoral studies in Japan, Norway and Hungary. Grateful acknowledgments are due to my friends and colleagues, Zoltán Petres, Ádám Csapó, András Róka, Szabolcs Nagy and Andor Gaudia, for their enthusiastic support in managing my responsibilities in Hungary while I was doing my research in Japan and Norway. I would like to thank my father for the strength and belief he left behind for me. I cannot be thankful enough to my family for their self-devotion and encouragement even in the hardest times.

Contents

1 Introduction
    1.1 Cognitive informatics and its directions
    1.2 Cognitive informatical approach to vision
    1.3 General description of the technical problem
    1.4 Structure of the dissertation
    1.5 Formalisms of the dissertation
2 Goals of the dissertation

I Preliminaries

3 Research related to the neurobiology of vision
    3.1 The pathway of visual information processing
    The eye-retina system
    The lateral geniculate nucleus
    The visual cortex
    The primary visual cortex (V1)
    Architecture of the primary visual cortex
    Maps in the visual cortex
    Simple and complex functional units
    Lateral inhibition
    Eye-motions
    Involuntary eye movements
4 Vision as a cognitive function
    Is the brain modular or unitary?
    Object recognition
    The importance of corners
5 Cognitive vision models
    Ice-cube model
    Simple functional unit receptive field models
    Classical receptive fields
    Non-classical receptive fields with surround suppression
    Computational model of contour integration
    Cognitive models for object recognition
    Geometric blur
    5.3.2 Recognition algorithm
    Hardware implementations of models in cognitive informatics

II Theoretical achievements

6 The Visual Feature Array Concept
    Information processing structures in the VFA concept
    Cognitive units
    Data arrays
    Operations
    Uniform model of the primary visual cortex in the VFA concept
    Structure of the VFA model
    Simple cognitive units in the VFA model
    Complex cognitive units in the VFA model
    Input filter operations in the VFA model
    End stopping filtering
    Gabor function based filtering
    Foveated filtering
    Summary of filtering operations
    Lateral operations in the VFA model
    Contour integration in the VFA model
    Lateral inhibition in the VFA model
    Projective operations in the VFA model
    Projections
    Corner and vertex detection
    Discussion
    Comparison with edge detection operators
    Evaluation of corner detection in the VFA model
    Computational complexity of the VFA model
    Cognitive plausibility of the VFA model
7 Opto-mechatronical computation of the VFA model
    Oriented motion blur filtering
    Filter model
    Contour integration
    Computation in the VFA model using the oriented motion blur filter
    Opto-mechatronical device for motion blur filtering
    Motion blur hardware principle
    Implementation of the hardware
    Experimental results
    Hardware evaluation
    Oriented contour detection
    Discussion
    Computational complexity and quality
    Comparison with other image processing hardware solutions
    7.4.3 Cognitive plausibility of the opto-mechatronical filtering
8 Cognitive informatics model of the Eye-retina system
    Non-overlapped receptive field model
    Optical overlap of receptive fields
    Schematic eye model
    Effects of eye-lens dynamics and aberrations on retinal blur
    Lens blur based retina model
    Software simulation
    Laboratory experiment
    Discussion
    Hypotheses

III Applications

9 Object recognition system based on the VFA model
    Specification of the object recognition system
    Design of the object recognition system
    VFA component
    Object recognition engine
    Abstract image categorization
    Formulation of the image categorization problem
    Test images
    Model acquisition
    Recognition results
    Discussion
Robot guiding in industrial environments
    Formulation of the robot guiding problem
    Robot localization using the VFA based object recognition system
    Object recognition system setup
    Recognition process
    Experimental results
    Discussion

IV Conclusion

Theses

A Output images from the VFA model
    A.1 Filtering operations
    A.1.1 End stopping filtering
    A.1.2 Gabor function based filtering
    A.1.3 Foveated filtering operation
    A.2 Lateral operations
    A.2.1 Contour integration
    A.2.2 The content of V after the application of lateral operations
    A.3 Opto-mechatronical computation of the VFA model
    A.3.1 Simulation results
    A.3.2 Experimental results
B Discussion and comparison of the results of the dissertation
    B.1 Comparison of contour detection
    B.2 Comparison of corner detection
    B.3 Comparison of Gabor function based and opto-mechatronical VFA implementations
    B.4 Robustness to noise of the opto-mechatronical filtering

Author's publications
Bibliography

List of Figures

3.1 The two types of receptive fields, on/off and off/on, with white color indicating excitatory and dark color indicating inhibitory afferent connections (a). The receptive field placed over a homogeneous light as shown in (b) yields a low response, while the same receptive field placed over a strong contour (c) yields a high response.
Contrary to the previous theory, it was shown that receptive fields of ganglion cells of the same type do not overlap in the mammalian central fovea.
The cortical areas of visual information processing.
In the experiment of Roger Tootell the cells processing the viewed cross show a retinotopic arrangement.
The orientation map in the primary visual cortex of a tree shrew. The different colors represent columns that have different orientation preferences [15].
The three types of cortical receptive fields oriented 60 degrees from horizontal.
An example of a contour integrated into a circle from Gabor patches ([18]).
The recognition of objects is strongly based on the perception of vertices and corners. For a human the flashlight is easy to recognize on image (a) where the corners are visible, but it takes much longer to recognize on image (b) with the corners occluded.
The ice-cube model proposed by Hubel and Wiesel. Reprinted from [45].
Intensity map of a Gabor function. Grey peripheral pixels indicate zero values (g_{λ,σ,θ,φ}(x, y) = 0), brighter and darker pixels represent positive and negative values respectively.
The results of surround suppression. The input image is (a), the gradient magnitude is (b), the anisotropic surround suppression is (c) and the isotropic surround suppression is (d). Reprint from [38].
The geometric blur operation. The red dot indicates the fixation point, the left side shows the input image, the right side shows the result of the geometric blur. Reprint from [9].

6.1 The VFA concept.
The Visual Feature Array model. Rectangular shapes represent data arrays, elliptical shapes represent operations.
The binary filter matrices of the VFA model that represent the orientation tuned, end-inhibited functionality of V1. The above matrices are referred to in the model as R(0,5), R(22.5,5), R(45,5), etc.
Intensity map (left) and 3D surface (right) of a Gabor function based filter kernel for contour integration.
An example of the contour integration. Result obtained after 30 iterations with c =
A simulation of lateral inhibition with α = 0.1 (a), α = 0.2 (b), α = 0.3 (c), α = 0.35 (d).
Projective operations in the VFA model. The projections allow the extraction of complex visual features, such as crossings, endpoints, corners and vertices.
The receptive field model used for endpoint detection.
The motion blur of two different angles will cause the ratios between the integrals of p_1 and p_2 to be different.
The mirror can be rotated around axis a_1, causing a translation of the projected image on the image sensor. The mirror and axis a_1 can be rotated around axis a_2, which modifies the orientation of the translation caused by the rotation around a_1.
The mechanical hardware to move the mirror.
The time function of the mirror vibration with a time period of 62.5 ms. The percentage of the period spent in any position is shown on the graph below the time function of the vibration. The same is shown for the contour integration filter. There is a similarity between them; however, there is an undesirable noise on the time function of the vibration.
The test setup to capture motion blurred images. The image is reflected into the camera by the vibrating mirror.
Overlapping and non-overlapping filtering architecture.
Non-overlapping processing structure. The input image is tiled using a mosaic arrangement of the filter matrix F in a non-overlapped manner.
The convolution based edge detection (left) and the non-overlapped filtering (right).
Accommodation-dependent schematic human eye model.
Calculated RMS spot size obtained using optical parameters for different accommodation stimulus levels at a constant image distance.

8.6 The input images taken using a sharp focus s = 3 m (a), a less sharp focus s = 2.5 m (b), an optimal focus causing a blur diameter of 5 pixels s = 2 m (c), and a larger focus deviation s = 1.5 m (d).
The results obtained using a sharp focus s = 3 m (a), a less sharp focus s = 2.5 m (b), an optimal focus causing a blur diameter of 5 pixels s = 2 m (c), and a larger focus deviation s = 1.5 m (d).
The VFA model with the attached object recognition engine.
The system proposed by Berg in [9], and its amended version using the VFA model.
The images used in the experimental object recognition task.
The pixel neighborhood sampling pattern used in the recognition engine in the context of the image categorization problem. The pattern is composed of 61 pixels.
The nodes in the data array of complex cognitive units detected by the VFA model, using Gabor function based filtering (top) and opto-mechatronical filtering (bottom). The number of nodes is n = 20 and the number of orientation layers is h(θ) =
The geometry of the robot localization problem. Three objects determine the position of the robot; the fourth object adds redundancy and allows error detection.
The images used in the three examples. Rows 1 to 3 contain the images taken in the three scenarios. Horizontal image positions correspond to columns of A.
A.1 Original test image (a) and the result of the primary edge detection (b).
A.2 The reconstruction of the edge-detected image from line segments of 3 pixels (a), 9 pixels (b), 33 pixels (c) and the reconstructed image (d).
A.3 The iso-orientation layers of the data array V with the values of θ shown under each image. The superposition of layers is shown in the lower-right corner. Red dots on the original image indicate the detected corners, for details see section
A.4 The original image of a mobile robot (a), the data array V obtained using the Gabor function based filtering operation (b) and using the foveated input filtering operation with the lower corner of the laptop as the center of foveation (c).
A.5 The overall activation of V with c = 0 (a), c = 0.3 (b), c = 0.15 (c) and c = 0.16 (d), through 100 iterations.

A.6 The content of data array V before the first iteration (a), and after the 100th iteration using a threshold c = 0 (b), c = 0.15 (c), and c = 0.16 (d).
A.7 The iso-orientation layers of the data array V at different orientations. The values of θ are indicated under each image. The superposition of layers is shown in the lower-right corner. Red dots on the original image indicate the detected corners, for details see section
A.8 The original image (e) is blurred horizontally (a) and vertically (b). Their edge detected versions are shown in (c) and (d) respectively. The overlapped projection of (c) and (d) yields (f), which shows how the corners compose an intersection, useful in the vertex detection of the VFA sub-model. The contour integration ability of the motion blur filter is demonstrated by the small gap in the rectangle: the top edge is considered to be continuous.
A.9 Horizontal (a) and vertical (c) blurred images, and their edge detected counterparts (b) and (d) respectively.
A.10 Horizontal (a) and vertical (c) blurred images, and their edge detected counterparts (b) and (d) respectively.
B.1 Images used in the subjective evaluation of the VFA model.
B.2 Corner detection results compared. The VFA based corner detection results (left), and the results of the Rosten algorithm [76, 77] (right).
B.3 Results obtained by the Gabor function based input filtering and contour integration (left) and the opto-mechatronical filtering (right).
B.4 The corner detection results obtained using the opto-mechatronical filtering (a), and the same operation on images with four different kinds of noise: Gaussian (b), Poisson (c), Speckle (d) and Salt & Pepper (e).

Chapter 1

Introduction

1.1 Cognitive informatics and its directions

The new results of the cognitive sciences, together with the emergence of modern tools of high computational capacity, have led to the appearance of cognitive informatics (CI), described on a webpage of the University of Calgary as "a cutting-edge and multidisciplinary research area that tackles the fundamental problems shared by modern informatics, computation, software engineering, AI, cybernetics, cognitive science, neuropsychology and medical science. CI is the transdisciplinary study of cognitive and information sciences, which investigates into the internal information processing mechanisms and processes of the natural intelligence - human brains and minds - and their engineering applications in computing and Information and Communication Technology (ICT) industries". This dissertation does not intend to achieve results in the research field of classical medical biology; rather, by considering its approach, it proposes models for technical applications of cognitive informatics that are capable of solving complex practical problems.

Concerning the national relevance of the methods of cognitive science, the names of Béla Julesz and Csaba Pléh definitely have to be mentioned. There are also different technical approaches in Hungary, such as those dealing with the dynamical behavior [4] and cellular level modeling [21] of neurons, and with information theoretical questions [1], building on the approach and methods of the cognitive sciences. As an outstanding example in Hungary, Tamás Roska and his research group should be mentioned: they have achieved excellent results in implementing the retina-like CNN analogic computer, with which they have successfully solved several technical problems.

Moore's law accurately predicted the exponential growth of the computational capacity of silicon based computational tools over the last decades, which inspired the emergence of classical AI. The level of intelligence of such AI systems, however, did not follow the exponential development of the hardware they were deployed on. By now it should be obvious that even if the complexity of computers is comparable to that of the human brain (at least in terms of the number of basic computational units, i.e. transistors and neurons), a computer cannot solve many of the problems that are easily solved by basic human intelligence. This difference is even more apparent if we compare the speed of a transistor (in the range of nanoseconds) and a neuron (in the range of milliseconds), used as basic computational units. The key to the computational capacity of the brain, besides the large number of computational units (neurons), lies in the broad interneuronal connections, working simultaneously in a very complex and as yet undiscovered parallel and recurrent structure. This explains why the efforts to conceive intelligent artificial systems beyond the limits of transistors have turned towards neurobiology and the cognitive sciences in search of new directions.

The technical approach to cognitive and biological methods is not a novelty; their relation is in fact characterized by three main directions.

The direction of biology inspired intelligent methods covers the algorithms and techniques in the fields of artificial intelligence, mathematics and numerical modeling that are based on the neurobiological and cognitive aspects of brain functions [7, 37, 78]. These methods typically use higher order mathematics and perform exhaustive algorithmic calculations, and were designed to suit the underlying mathematical toolbox and to run on a von Neumann type processor.

The direction of biological modeling aggregates the models of informatics, mathematics, information theory and other technical fields that were conceived with the goal of a deeper understanding of the biological structure, or "hardware", and the functions of the cerebral cortex. The biological models are made to support or refute theories in biology and the cognitive sciences. The field of neurocomputation is closely related to this direction, where the informatical models of neurobiology are often used to solve problems of distant areas. The goal of this direction could be grasped as the design of the "cognitive hardware".

The direction of cognitive informatical modeling includes models that aim to provide solutions to problems in technical informatics, as well as in the information and communication technology industries, by considering the approaches and methods of cognitive science. It is not the goal of these models to perfectly copy the structure and characteristics of biological systems, but to resemble in functionality the cognitive processes according to which they were designed. Cognitive informatics is the methodology of artificial or imitated cognitive processes implemented on the tools of informatics. The goal of this area could be summarized as the design of the "cognitive software". It is important to note that, in view of the potential applications of such models, their efficient parallel implementation is a basic requirement.

The border between the above fields is quite ambiguous, with many overlaps. The main difference between them lies in their goals and their approach to the cognitive sciences. My research work presents new results in the field of cognitive informatics, and as such it builds on both neural-network-like numerical tools and other parallel computational tools on the implementation level. It is well known that the medical, biological and brain research areas of cognitive science mainly deal with anatomical, cellular, neurobiological and physiological modeling based on different paradigms of the cerebral cortex. In my research work I consider the above models with the intention of using their approach in the cognitive informatical modeling of particular technical problems.

1.2 Cognitive informatical approach to vision

An intelligent system is composed of three main units, responsible for the perception of the environment, the processing of the acquired information, and actuation in the environment. There are no sharp borders between the three units; intelligent behavior cannot be limited to the processing unit only. This is why the methods of cognitive science are relevant not only to processing, but also to perception and actuation. Many brain researchers consider the "eye as a window to the brain", which explains why neurobiological and cognitive research efforts targeting natural intelligence have turned towards vision. Visual information processing is the most important perceptual modality, since the largest amount of information about the environment is gained through our eyes [58]. Consequently, a wide range of vision related cognitive research has already been inspired by the aspects of neurobiology [6, 15, 17, 29, 34, 51, 60, 66, 89], cognition [12, 35, 71] and neurocomputation [25, 38]. David Hubel was the first to unveil the structure and functionalities of the primary visual cortex [45] and to describe orientation selectivity. It was later shown by Shevelev that the primary visual cortex is responsible for encoding the seen corners and crossings [79-81]. Based on the results of cognitive and biological research, many researchers started to deal with the modeling of vision from biological [46] and cognitive aspects [22, 48, 49, 93]. Based on their approach, new vision models of cognitive informatics have emerged, whose goal was not to imitate the visual cortex on the cellular level, but to apply cognitive functions in solving complicated problems from the field of technical informatics [9, 10, 16, 26, 38, 39, 54, 62, 73, 74, 91, 92].
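As a concrete illustration, the orientation selectivity mentioned here is commonly approximated in such models with a bank of Gabor filters, one kernel per preferred orientation. The sketch below is illustrative only; the kernel size, wavelength and sigma are my own choices, not parameters taken from this dissertation.

```python
import numpy as np

def gabor_kernel(size, wavelength, sigma, theta, phase=0.0):
    """Gabor kernel: a Gaussian-windowed sinusoid tuned to one
    orientation, in the spirit of V1 simple-cell models."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates so the sinusoid varies across orientation theta.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * x_t / wavelength + phase)
    return envelope * carrier

def orientation_responses(image, thetas, size=9, wavelength=4.0, sigma=2.0):
    """Convolve the image with one Gabor kernel per orientation and
    return a stack of response maps (one per orientation)."""
    h, w = image.shape
    half = size // 2
    padded = np.pad(image, half, mode="edge")
    stack = []
    for theta in thetas:
        k = gabor_kernel(size, wavelength, sigma, theta)
        resp = np.zeros_like(image, dtype=float)
        for i in range(h):
            for j in range(w):
                resp[i, j] = np.sum(padded[i:i + size, j:j + size] * k)
        stack.append(np.abs(resp))
    return np.stack(stack)

# A vertical bar excites the theta = 0 kernel (varying across x) most.
img = np.zeros((16, 16))
img[:, 8] = 1.0
thetas = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
layers = orientation_responses(img, thetas)
best = int(np.argmax([layer.max() for layer in layers]))
```

Each response map plays the role of one orientation-tuned layer; the orientation whose kernel matches the stimulus yields the strongest response.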
Most of these models consider the orientation selectivity described by Hubel, but on the implementation level they are very diverse and incompatible. Beyond the models of cognitive informatics, parallel computational hardware tools have appeared, built by analogy with the neural structure of the visual cortex. These tools allow the fast and parallel execution of simple cognitive models of visual processing. Such tools include the pixel-wise processing units integrated into image sensors [2, 3, 21] and FPGA based implementations [67].

1.3 General description of the technical problem

The research areas covered by the methods of cognitive science have significantly broadened and have received increasing attention in the last few years. Fields of engineering science dealing with highly complex intelligent systems tend to build more and more often on the results of cognitive science. An example is the cognitive informatical tools of modern image processing, object recognition and machine vision, designed with the motivation of cognitive science. Consequently, it is an important task to model the processes described by cognitive science using leading edge computational tools, and to provide a toolbox of cognitive informatics for solving complex engineering problems. The scientific problem in the focus of this dissertation is to provide vision based cognitive informatical models.

1.4 Structure of the dissertation

The dissertation is divided into four parts. Part I discusses the preliminaries of the dissertation. In Part II the new theoretical results are discussed, while in Part III the applications of the theoretical results are presented. Finally, Part IV concludes the dissertation.

Part I gives an overview of the research results in cognitive science with an outlook on neurobiological aspects, informatics and engineering. The results of the dissertation are based directly or indirectly on the contents of this part. In some cases a broader discussion is given for the sake of completeness. Chapter 3 presents the scientific results of vision related neurobiology. The new results of the dissertation consider the different approaches presented in this chapter. In Chapter 4 the cognitive aspects of vision are discussed; this chapter concentrates on vision on the functional level. Chapter 5 presents existing computational models of vision inspired by either biology or cognition. The new results presented in the dissertation are closely related to, built upon, or attached to the models described in this chapter.

Part II presents the theoretical results of the dissertation. Chapter 6 discusses, from several aspects, the proposed cognitive computational concept and a deployed uniform model of the primary visual cortex. The structure and operation of the model are dealt with in this chapter. Chapter 7 gives a new, eye-motion inspired orientation sensitive filtering method based on motion blur, which allows near-constant-time computation of the model proposed in Chapter 6. Finally, Chapter 8 presents a model of the eye-retina system based on the retina structure and the fluctuations of accommodation.

Part III is devoted to applications based on the theoretical results of Part II. Chapter 9 presents the proposed model combined with an existing object recognition engine, and Chapter 11 presents its application in mobile robot localization. Part IV summarizes the new results and achievements of the dissertation.

1.5 Formalisms of the dissertation

The formalisms used in the dissertation are summarized in the table below.

Category                  Style                                Example
Scalar                    small character, italic              a, b, c
Vector                    small character, bold                a, b, c
Matrix                    capital character, bold              A, B, C
Matrix size               n-dimensional matrix                 A ∈ R^(d_1 × ... × d_n)
Matrix element            lowercase characters                 A_(i,j,k)
Matrix parameter          uppercase character in parentheses   A(t)
Function                  small character                      f(x, y)
Operation                 capital character                    F
Fixed filter parameters   lowercase characters                 α, b
Convolution operator      star                                 ∗
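For reference, the convolution denoted by the star operator can be sketched in a few lines. This is an illustrative NumPy implementation under my own assumptions (the "valid" boundary handling and the 2×2 averaging kernel are examples, not choices made in the dissertation):

```python
import numpy as np

def conv2d_valid(a, f):
    """2-D 'valid' convolution a * f: flip the kernel, then slide it
    over every position where it fully overlaps the input."""
    fr = f[::-1, ::-1]  # kernel flip distinguishes convolution from correlation
    h = a.shape[0] - f.shape[0] + 1
    w = a.shape[1] - f.shape[1] + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(a[i:i + f.shape[0], j:j + f.shape[1]] * fr)
    return out

# Example: a uniform 2x2 averaging kernel over a 3x3 intensity ramp.
A = np.arange(9, dtype=float).reshape(3, 3)
F = np.full((2, 2), 0.25)
out = conv2d_valid(A, F)  # averages of each 2x2 block: [[2, 3], [5, 6]]
```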

Chapter 2

Goals of the dissertation

Goal 1: Cognitive informatical model of the primary visual cortex

It is well known that many cognitive informatical models are applied with the aim of solving problems that are too hard for a computer but easy for a human. The goal of such models, besides solving specific tasks, is to have functional properties similar to those of the brain. In constructing such models it is natural to consider a functional similarity with the brain on the level of basic components as well. The first step of visual information processing in the primary visual cortex is the simultaneous, parallel extraction of features, which is of great importance from the point of view of higher level visual information processing, such as object recognition. The cognitive informatical models known from the literature deal with the modeling of this function, but on the conceptual and implementation level they are very different and incompatible. Based on the scientific literature one can conclude that there is no generally accepted uniform cognitive informatical model for the feature detecting functionality of the primary visual cortex. Also, the known cognitive informatical models do not define operations on the data structures of V1-like models. Based on the above, the specific goals of the dissertation were:

- To propose a concept that provides a uniform and general framework for the informatical modeling of cognitive functions.
- To work out a cognitive informatical model within the proposed concept, which implements the major low level feature detection functionalities of the primary visual cortex. It is a further goal for the model to be directly compatible with informatical models implementing higher order cognitive functions, and as such to be applicable in solving a wide range of technical problems.
- To achieve maximal cognitive plausibility according to the scientific literature in neurobiology and the cognitive sciences, while keeping the model's efficiency in solving technical problems.
- To make the model directly implementable on modern parallel computational tools.

Goal 2: Opto-mechatronical implementation

The functions of the primary visual cortex are of high computational complexity, which explains why high speed computability of the related informatical models is a basic requirement for their efficient implementation. The primary visual cortex can perform the modeled functions at very high speed by virtue of its parallel computational structure and high information connectivity. Such high connectivity is not achievable with modern silicon based integrated circuit manufacturing technologies. In this respect, the goal of the dissertation was to design and implement an opto-mechatronical device that performs the major, computationally complex functions of the primary visual cortex. A basic requirement towards the device was to perform the cognitive functions at a quality comparable to that of the original model, but at a computational cost several orders of magnitude lower.

Goal 3: Informatical role of optical aberrations

The non-overlapped arrangement of ganglion receptive fields discovered by Packer and Dacey in 2002 is in conceptual conflict with classical, convolution based linear filtering methods. Based on this I proposed to solve the following particular problems:

- To investigate the above mentioned conflict and to propose a cognitive informatical model for its solution. In the design of the cognitive informatical model I intend to consider the effect of optical aberrations, the fluctuations of accommodation and the disjunctive arrangement of ganglion receptive fields.
- To implement the model in the form of an experimental hardware-software system, and to prove the solution of the conflict by means of laboratory experiments.

Goal 4: Application of the proposed model

The proposed cognitive informatical model performs the first step of visual information processing using the tools of technical informatics. Based on Goal 1, the model has to be compatible with informatical models implementing higher order cognitive processes with the purpose of solving technical problems, such as image categorization and robot localization. Based on this I considered the following goals:

- To attach the model to a modern object recognition engine, such that the image primitives describing the objects and required by the engine are provided by the proposed model. The resulting system has to be applicable in solving categorization and recognition tasks.
- To categorize abstract (hand drawn, wire-line) images using the object recognition system.
- To support the localization system of a mobile robot at the laboratory of NTNU by the recognition of surrounding key objects using the proposed object recognition system.
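The conceptual contrast underlying Goal 3, between classical overlapped (sliding-window) filtering and a non-overlapped, tiled arrangement of receptive fields, can be sketched as follows. The code is an illustrative comparison under my own assumptions; the function names, the averaging filter and the image size are not taken from the dissertation.

```python
import numpy as np

def overlapped_filtering(image, f):
    """Classical convolution-style filtering: the window is evaluated
    at every pixel, so neighboring receptive fields overlap."""
    k = f.shape[0]
    h = image.shape[0] - k + 1
    w = image.shape[1] - k + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * f)
    return out

def non_overlapped_filtering(image, f):
    """Tiled (mosaic) filtering: the image is cut into disjoint k-by-k
    blocks and the filter is applied once per block, so neighboring
    receptive fields never share pixels."""
    k = f.shape[0]
    h, w = image.shape[0] // k, image.shape[1] // k
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            block = image[i * k:(i + 1) * k, j * k:(j + 1) * k]
            out[i, j] = np.sum(block * f)
    return out

rng = np.random.default_rng(0)
img = rng.random((12, 12))
f = np.ones((3, 3)) / 9.0
dense = overlapped_filtering(img, f)      # one response per pixel position
tiled = non_overlapped_filtering(img, f)  # one response per disjoint block
```

The overlapped variant produces a dense response map, while the tiled variant produces one response per block, which is the kind of disjoint receptive-field arrangement the conflict in Goal 3 refers to.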

Part I

Preliminaries

Chapter 3

Research related to the neurobiology of vision

This chapter presents some basic results in the field of the neurobiology and physiology of vision, related to the research work presented later in this dissertation.

3.1 The pathway of visual information processing

The main inspiration for the presented research work is the mammalian visual system and the visual information processing it performs. This section introduces the visual pathway, following the steps of visual information processing from the eyes to the visual cortex.

The eye-retina system

Light arriving from the outside world enters the eye through its lens and is projected onto the retina, a photosensitive layer on the posterior surface of the eyeball, where visual processing begins. The retina is populated with photoreceptors, including about 120 million rods and 6-7 million cones responsible for phototransduction [68]. The rods are sensitive to light intensity in a wide range of the light spectrum. Cones, on the other hand, come in three types, each sensitive to a different color spectrum [24]. These photoreceptors modulate the activity of the bipolar cells, which in turn connect with more than one million special neurons, the so-called ganglion cells, in each eye. The axons of the ganglion cells leave the eye at the optic disc and form the optic nerve, which carries visual information from the retina to the visual cortex in the brain.

The bipolar cells and the ganglion cells are organized in such a way that each cell responds to light falling on a small circular patch of the retina, which defines the cell's receptive field. Both bipolar cells and ganglion cells have two basic types of receptive fields: on-center/off-surround and off-center/on-surround. The center and its surround are always antagonistic and tend to cancel each other's activity if homogeneous light is projected on the receptive field [6, 51].

Figure 3.1: The two types of receptive fields, on/off and off/on, with white color indicating excitatory and dark color indicating inhibitory afferent connections (a). The receptive field placed over a homogeneous light as shown in (b) yields a low response, while the same receptive field placed over a strong contour (c) yields a high response.

Consequently, the on/off or off/on arrangement of the receptive field makes a ganglion cell more responsive to differences in the level of illumination between the center and the surround of its receptive field. Uniform illumination of the visual field is less effective in activating a ganglion cell than a well placed spot, line or edge passing through the center of the cell's receptive field (Figure 3.1).

The photoreceptor density on the retina is highest in the fovea and decreases towards the peripheral regions. The size of the receptive fields also increases towards the periphery. In the fovea, the central part of the receptive field is composed of one single photoreceptor, while in the peripheral regions both the central and the surround regions are composed of several photoreceptor cells [59]. The size of the receptive field in foveal regions is about 20 µm, while in the peripheral regions it is about 600 µm [66].

For a long time, experiments had shown extensive overlaps between receptive fields on the retina. However, when comparing the number of axons of photoreceptor cells to the number of axons in the optic nerve, it was discovered that the 130 million axons of rods and cones are condensed into 1.2 million axons in the optic nerve. Because of this, it was assumed that the retina performs some kind of information compression [60]. With the evolution of experimental methods, it became possible to make a distinction between many kinds of ganglion cells.
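The antagonistic center-surround behaviour described above can be illustrated with a difference-of-Gaussians filter. The sketch below is not from the dissertation; the function names and the parameter values (center and surround widths, patch size) are illustrative assumptions.

```python
import numpy as np

def dog_kernel(size=15, sigma_c=1.0, sigma_s=3.0):
    """On-center/off-surround receptive field modeled as a difference of
    Gaussians: a narrow excitatory center minus a wide inhibitory surround."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    center = np.exp(-r2 / (2 * sigma_c ** 2)) / (2 * np.pi * sigma_c ** 2)
    surround = np.exp(-r2 / (2 * sigma_s ** 2)) / (2 * np.pi * sigma_s ** 2)
    return center - surround

def response(kernel, patch):
    """Net activation of the cell for an image patch covering its field."""
    return float((kernel * patch).sum())

k = dog_kernel()
uniform = np.ones((15, 15))        # homogeneous illumination over the field
edge = np.zeros((15, 15))
edge[:, 8:] = 1.0                  # strong vertical contrast edge
```

Under uniform light the center and the surround nearly cancel, while the contrast edge breaks the balance and drives the cell, matching the behaviour sketched in Figure 3.1.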
Devries and Baylor [29] were able to distinguish between 11 kinds of ganglion cells, based on their receptive fields and response characteristics. It was also shown that receptive fields of ganglion cells of the same type do not overlap in the central fovea; the centers of these receptive fields are located one diameter apart, as shown in Figure 3.2. These measurements were confirmed by Packer and Dacey [66].

Figure 3.2: Contrary to the previous theory, it was shown that receptive fields of ganglion cells of the same type do not overlap in the mammalian central fovea.

The lateral geniculate nucleus

Optic nerve fibres from the eyes terminate at two bodies in the thalamus (a structure in the middle of the brain) known as the Lateral Geniculate Nuclei (or LGN for short). The LGN cells mediate the visual input received from the retina towards the primary visual cortex. Besides input from the retina, the LGN receives a large amount of input from the cortex itself. The exact role of these connections has not yet been revealed, but it is assumed that these inputs modulate the LGN, assisting visual attention [65]. The scope of the dissertation does not allow a deeper treatment of the structure and functionality of the LGN; further details on this subject can be found in [57].

The visual cortex

Much of the primate cortex is devoted to visual processing. In the macaque monkey at least 50% of the neocortex appears to be directly involved in vision, with over twenty distinct areas. Some of the areas concerned are quite well understood, others are still a complete mystery. The visual cortex is usually divided into five visual areas, called V1, V2, V3, V4 and MT (or V5). Nearly all visual information reaches the cortex via V1, the largest and most important visual cortical area. Because of its stripy appearance this area is also known as the striate cortex. Other areas of the visual cortex are known as the extrastriate visual cortex; the more important ones are V2, V3, V4 and MT (Figure 3.3). Below, some basic information about visual areas V1 and V2 is given. Details about V3, V4 and MT are out of the focus of this dissertation; further information about them can be found in [14, 17, 53, 61].

Area V1

In primates nearly all visual information arriving from the LGN enters the cortex via area V1. This region represents about 15% of the whole neocortical surface in the macaque monkey, though it is probably only about 5% of the neocortex in the human. It is the most complex region of the cortex, with at least 6 identifiable layers (layer 1 is closest to the cortical surface, layer 6 adjoins the white matter below), even though it is only about 0.5 mm thick in the monkey.

Figure 3.3: The cortical areas of visual information processing.

V1 performs basic visual feature extraction, such as oriented edge detection, spatial frequency and disparity perception. The representation of the visual field in V1 is retinotopic; however, a very strong distortion can be detected [89]. The part of the visual field that is projected onto the foveal region of the retina occupies a much larger area of V1 than the peripheral regions, as shown by the experiments of Tootell in Figure 3.4. This is an obvious consequence of the change in photoreceptor density from the fovea towards the peripheral regions of the retina, described in the section on the eye-retina system: the central part of the visual field is scanned by many more photoreceptors, which in turn need many more cortical neurons for further processing. This explains why vision is more sensitive to details around the point of fixation projected on the fovea.

Figure 3.4: In the experiment of Roger Tootell the cells processing the viewed cross show a retinotopic arrangement.

Area V2

Area V2 has a long common border with V1. It receives strong but patchy feedforward input from V1 and sends strong connections to V3, V4, and V5, as well as strong feedback connections to V1. In contrast to V1, V2 has a rather disorderly topographic organization. The responses of many V2 neurons are modulated by more complex properties, such as the orientation of illusory contours and whether the stimulus is part of the figure or the background [71].

Summary of the information processing of the visual cortex

In spite of the broad and exhaustive research done on the areas of the visual cortex in the last decades, there are still many ambiguities in the structure and functionality of the different visual areas. The explanations of measured phenomena are in many cases contradictory, or simply missing. What is known is that the processing of visual stimulation is carried out by the different areas of the visual cortex in one way or another; these areas send information to each other through a complex network of neuronal connections. The role of V1 is of great importance: being the first step in visual information processing, nearly all areas of the visual cortex receive input from V1 and hence depend on it. Unlike other visual areas, V1 is quite well studied [45], and there is strong agreement about its cognitive structure and functionalities, detailed in the next section.

3.2 The primary visual cortex (V1)

In the previous section the pathway of visual information processing was presented, from the eyes to the different areas of the visual cortex. This section concentrates on the V1 area of the visual cortex, with a detailed description of its cognitive architecture and functionalities.

Architecture of the primary visual cortex

The architecture of the primary visual cortex has been worked out in great detail. The three basic organizing principles of V1 are:
- the laminar arrangement of neurons (horizontal organization),
- the columnar arrangement of neurons (vertical organization),
- the regular spacing of anatomical and functional groups of neurons (periodic organization).

Brodmann introduced 6 cortical layers, layer 1 being the most dorsal and layer 6 the most ventral. Each layer contains neurons of different types and functionality. The term "cortical column" refers to the notion that cells are arranged vertically from the surface of the cortex to the white matter, and has been hypothesized to represent a basic functional unit for sensory processing. A cortical column can thus be defined on the basis of anatomical features (e.g. stereotyped patterns of pyramidal cell apical dendrite bundles), functional features (e.g. columns of cortical cells all responding to the same stimulus orientation), or both. In the following sections the focus will be on the functional features represented by cortical columns.

The regular spacing of anatomical and functional groups of neurons means that neurons responsive to a certain feature can be detected with a spatial regularity, forming a topographic representation of the visual field. Such a regular representation exists for each feature. Neighboring neuron groups responsive to different features represent the same (or nearly the same) part of the visual field. The spatial organization of the groups of different features is explained in [8].

Maps in the visual cortex

As mentioned previously, the neurons of the visual cortex are arranged into columns, forming functional units. The word column suggests a vertical organization, which is in fact the case, since the neurons in different cortical layers above each other tend to respond to the same type of input stimulation at the same place of the visual field. The functional examination of the visual cortex can be done by measuring the action potentials of neurons inside the functional units. Hubel and Wiesel studied the responses of neurons in the visual cortex, and defined several columns responsive to different visual features [45-47]. In the following sections orientation selectivity, the main feature detected by V1, will be discussed.

Orientation selectivity

V1 is the first site where strong orientation and direction selectivities are observed in the macaque monkey [47]. Hubel and Wiesel used electrodes to measure action potentials of V1 neurons while showing oriented bars and edges to the macaque monkey. Many neurons in V1 respond best to an edge or bar of light at a specific orientation. This preferred orientation remains constant while penetrating vertically to the surface of the cortex, but varies gradually if the penetration is parallel with the surface. The measurements showed that a gradual shift of the orientation spans 180 degrees while traveling about 1 mm in the cortex. They also found abrupt changes in the orientation. The geometry of the orientation map in V1 was not clear at the time. Hubel and Wiesel proposed the ice-cube model to describe the columns of orientation selectivity and ocular dominance, which will be detailed in the chapter on cognitive vision models.

Figure 3.5: The orientation map in the primary visual cortex of a tree shrew. The different colors represent columns that have different orientation preferences [15].

Later, the overall map of orientation columns could be simulated by computers [85-87], and visualized by optical [15] and fMRI [48] imaging methods. Cortical tissue changes its reflectance properties very slightly when neurons are active, so by examining changes in the light reflected from the cortical surface while visual stimuli of varying orientations are presented, one can build up a picture of the complete map (Figure 3.5). The periodicity of this pattern varies depending on the species and the location in the cortex, and also varies substantially between individuals [44]: in fact each ocular dominance pattern is apparently as unique as a fingerprint. A notable feature is the presence of pinwheels, point singularities around which all orientations are represented in a radial pattern.

The organized arrangement of orientation maps found in the mammalian visual cortex has not been found in rodents or lagomorphs. Examinations performed with a highly visual rodent, the gray squirrel, revealed that no orientation map exists in its visual cortex, although a robust orientation tuning of single cells was found. Therefore, it seems unlikely that orientation maps are important for the orientation tuning of single cells. In vertical electrode penetrations, little evidence was found for a columnar organization of orientation-selective neurons. The conclusion is that an orderly, columnar arrangement of functional response properties is not a universal characteristic of cortical architecture [43]. The higher level spatial organization of neurons in more developed animals thus has a cause or purpose different from enabling excellent orientation perception.

Simple and complex functional units

In this section a more detailed discussion is given about the orientation selectivity of cortical neurons.
Columns in the primary visual cortex can be classified into two major classes according to their response characteristics: simple and complex functional units [46]. Simple functional units tend to receive input mostly from the LGN, while complex functional units receive projections mostly from other V1 cells. Both of these functional units exhibit a property known as orientation selectivity, meaning that they do not respond simply to light or dark in the visual field, but more typically to bars or edges of light with a particular orientation [47].

Figure 3.6: The three types of cortical receptive fields oriented 60 degrees from horizontal.

While simple functional units respond to an oriented edge at a particular position of the visual field, complex functional units exhibit more elaborate functionalities, such as length-tuning, crossing detection and directional selectivity. Hubel and Wiesel first described complex functional units in which the response to a stimulus increases with the length of the stimulus up to some optimum value, after which further increases in length decrease the response [45]. Functional units with a preference for crossings and corners of oriented lines in the visual field were also detected in the primary visual cortex [79-81]. These functional units also receive their inputs from other cortical columns, and are thus considered complex functional units as well.

Orientation selective functional units

In the case of the retina and the LGN, the receptive fields of the ganglion cells and LGN neurons have a center-surround shape. The situation is similar for the functional units in the V1 layers receiving input from the LGN. Different response characteristics can be recorded in cells that build on the outputs of these cortical center-surround functional units. They do not fire for ambient light or for a spot of light, but they activate strongly on oriented edges or lines. The receptive field characteristics of such orientation selective cortical functional units can be measured by a small spot of light scanning through the visual field, while measuring the functional units' responses.
The measurements have shown that there are three basic oriented receptive field structures [45] (Figure 3.6):
- a white line over a dark background,
- a dark line over a white background,
- a transition from a dark region to a white region.

The exact receptive field characteristics of cortical orientation selective cells were measured by Daugman [25], who showed that the difference between a cortical neuron's input weights and an appropriately parameterized Gabor function gives a zero-mean error function. According to these results, the input weights of the neurons can be modeled by Gabor functions. The phase parameter of the Gabor function allows tuning between the on-center/off-surround, the off-center/on-surround, and the antagonistic versions of the receptive fields. The actual shape of the resulting receptive field is elliptical, not rectangular, as was first supposed by Hubel.

An orientation selective simple functional unit increases its output activity when the length of an appropriately oriented stimulating line is increased. The response intensity reaches a maximum value when the length of the line reaches the length of the receptive field, and remains at a constant high level afterwards. On the other hand, there are functional units that also increase their output activity with the increasing length of the line, but beyond a limit their activity falls back to zero. Such functional units responding to oriented line segments are called end-stopping functional units [45].

Crossing, corner and vertex sensitive functional units

In the visual cortex there are complex functional units that receive input from other simple functional units and show more sophisticated response characteristics. Here we concentrate on those having their highest output activities when some kind of vertex is presented as a stimulus to the receptive field. The crossing of two differently oriented lines produces an intersection. Experimental evidence suggests that complex functional units are responsible for the detection of such nodes and intersections. Through research on the cortical V1 area, some functional units were discovered to be sensitive to different kinds of line segment intersections.
Some of them were responsive to both intersections and corners, some were responsive only to intersections or only to corners, some were tuned to the orientations of the two lines composing the intersection, while others were responsive only to the angle between the lines of the intersection, irrespective of their orientations [79-81].

Lateral inhibition

In this section we concentrate on the orientation selective simple functional units in the primary visual cortex, and discuss how they are mutually interconnected within the cortex. According to laboratory measurements, the response characteristics of these simple functional units can be described by Gabor functions [25]. The first step of the cortical representation of the visual field can thus be considered a linear spatial filtering of the image received from the eyes, the filters being oriented Gabor functions.

Single cell electrode measurements of the visual cortex indicate that the angular accuracy of orientation selective functional units is much sharper than that of Gabor functions. Moreover, aligned oriented edges or lines tend to merge into chains, and a line segment among a set of differently oriented line segments tends to pop out from the image. These phenomena cannot be explained by the Gabor function based linear filter, and suggest the presence of lateral connections between simple functional units.
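The oriented Gabor filtering stage described above can be written down directly. The sketch below is a hypothetical transcription of the Gabor receptive field model, not code from the dissertation; the default parameter values and the rotation convention (with the minus sign) are the usual assumptions.

```python
import numpy as np

def gabor(x, y, lam=8.0, sigma=4.0, theta=0.0, phi=0.0, gamma=0.2):
    """Oriented Gabor receptive field: a Gaussian envelope times an
    oriented cosine grating. theta rotates the filter, phi shifts the
    phase between even (line) and odd (edge) symmetry, gamma is the
    spatial aspect ratio of the elliptical envelope."""
    xt = x * np.cos(theta) + y * np.sin(theta)
    yt = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xt ** 2 + gamma ** 2 * yt ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * xt / lam + phi)
```

Evaluating this function on a pixel grid for several values of theta gives a bank of oriented filters; the lateral interactions discussed next act on top of this linear stage.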

Figure 3.7: An example of a contour integrated into a circle from Gabor patches [18].

The lateral interactions between oriented spatial filtering cells have been studied in a contrast masking paradigm by Polat and Sagi [69, 70]. They found that three different functional interactions exist between Gabor patches:
- short-range, non-oriented, suppressive;
- long-range, side-way, oriented, excitatory;
- long-range, aligned, oriented, excitatory.

These results explain the effect of sharp orientation tuning: due to the first interaction, in one retinotopic location several differently tuned functional units compete with each other in a "winner-take-all" manner. The one with the best match between its receptive field orientation and the visual stimulus activates, suppressing the others with even slightly different orientation preferences. Contour integration can be explained by the third interaction: given that co-aligned functional units excite each other, they tend to form a chain [33] or closed loop [50] of Gabor patches. One example is shown in Figure 3.7, where the circle in the center of the image pops out for the human observer.

3.3 Eye-motions

The neural system has developed in a way that it activates on changes in the environment rather than on constant stimuli. This applies to the visual system from the retina to the visual cortex: vision responds to changes both in space and time, and does not respond to constant stimuli. The Gabor patches discussed in the previous sections implement sensitivity to spatial changes of illumination.

The world as our environment is composed of both static and dynamic scenes. The perception of static scenes would be impossible if the eyes were motionless in a static position: in such a case the seen image would simply fade out [30, 75]. To avoid this effect, the eyes perform several involuntary motions and constantly change the viewing direction. It is the saccadic motion that moves the gaze from one fixation point to another through several degrees in a very short time. Such a motion is necessary because of the foveated cellular organization of the retina: in order to observe different detailed regions of the scene, the gaze has to pass through these regions one after another [94]. The saccadic motion of the eyes can be controlled voluntarily. Recent findings confirm that saccades have a key role in higher-order, cortical object recognition [28, 40]. Studies have shown that the eyes are not motionless even when they fixate on a static point of the scene; three different eye motions were observed during fixation [56].

Involuntary eye movements

The eye motions during fixation cannot be controlled voluntarily. The three major kinds of eyeball movements widely accepted today are microsaccades, drifts and tremors (also referred to as nystagmus) [56]. The accommodation of the eye lens to enable sharp focus is also considered an involuntary eye movement. Microsaccades are abrupt jerks in fixation. These jerks take about 25 ms, and have amplitudes of several hundred receptive fields. Drifts and tremors both occur between microsaccades. Tremors are involuntary, rhythmic oscillations of the eye with frequencies of about 90 Hz and amplitudes of roughly the diameter of a cone on the fovea (that is, the diameter of the smallest photoreceptor cell). There is currently no conclusive evidence on the functional role of tremors.
Drifts are a unidirectional wandering of the eye, with an amplitude of about a dozen photoreceptor cells, and are much slower than tremors. Drifts are conjectured to play an important role in the compensation of microsaccades [56]. The three kinds of eye movements seem to be important in maintaining visual acuity [13, 23]: artificially eliminating these movements, researchers found that vision faded away [30, 75].

During steady viewing of a target, the accommodation response fluctuates by a small amount around the mean accommodation level. These fluctuations of accommodation are dominated by low frequency oscillations, and include a high frequency oscillation whose peak frequency is typically in the range 0.9-2.5 Hz [84]. It was also found that the low frequency fluctuations of accommodation increase with small pupils, i.e. a larger depth of field implies a stronger fluctuation. The role of fluctuations in accommodation is still unclear; however, they may be useful in accommodation control and maintenance.

Chapter 4

Vision as a cognitive function

This chapter introduces the cerebral cortex from the functional point of view, with special attention to visual information processing. The findings presented here are related to cognitive science.

4.1 Is the brain modular or unitary?

Humans and highly developed animals perform sophisticated cognitive functions, such as perception in different modalities (including vision), controlled motion, thinking, awareness, language, memory, attention, etc. All these functions have been shown to originate from the cerebral cortex. The cerebral cortex can be subdivided into regions from several aspects: developmental differences (neo-, paleo- and archicortex), position (frontal, lateral, occipital, etc. lobes), or function (52 Brodmann areas). Of course, the subdivision of the cerebral cortex can be carried down to the neuronal level, or even further; it can be considered a network of very complex neuronal interconnections. The issue for the brain theorist is whether to map complex cognitive functions onto the interactions of the rather large, anatomically defined brain regions, or onto the very small and numerous components, the neurons. The first approach looks at the brain as a system with a modular structure, while the second looks at it as a system with a unitary structure.

For the scientists trying to figure out how each module works, the brain becomes an enormously complex puzzle, and in some experimental situations tempts them to believe that function is well localized, even down to a single neuron that does only one thing. At other times, brain studies suggest the opposite: that localized function can move around, and that distributed function is required for cognition. This paradoxical view is similar to the dualistic aspect of quantum physics, where one set of observations suggests that light acts like particles and another set suggests that light acts like waves. In brain terms, we have to allow for

events that are both localized and distributed; discrepant observations are not contradictory but rather complementary [35].

The different anatomical parts of the brain show different degrees of modular architecture. The brain stem and the subcortical parts are composed of very well separated regions, and are responsible for very basic functions like breathing, body temperature control, etc. The cortex also has different levels of modularity. The visual cortex, for example, is highly modular, with a layered, columnar architecture and orientation maps. The frontal lobe, on the other hand, lacks any structural modularity: the neurons are densely interconnected in a random manner, forming almost a complete graph. The anatomy thus seems to support the theory which claims that the cerebral cortex has both modular and unitary architectural features. The primary visual cortex, being a rather old part of the cerebral cortex, shows a high level of modularity, and cognitive functions can be tightly associated with its modules [45].

4.2 Object recognition

Vision comprises more than the extraction of visual features from the image projected on the retina: more complex shapes, forms and objects are also recognized, at larger synaptic distances from the retina. The actual synaptic distance can be estimated from the reaction time of test persons whose task is to press a button when they recognize a certain shape or object. There is thus a link between low level features and object level representations, and there has been a great deal of research on the way humans represent and recognize objects. The exact cognitive processes that lead from the visual features extracted in the primary visual cortex to object recognition are still unknown.

The importance of corners

It was shown by Biederman [11, 12] that the corners of an image play a substantial role in object recognition.
If the contours of an object are occluded, recognition remains easy and fast, while occluding the corners may make recognizing the object difficult, slow, or even impossible. This can be verified by looking at the images in Figure 4.1. This result suggests that the extraction of corners in V1 is an important step between the simple functional units of the primary visual cortex and object recognition in higher cortical areas.

Figure 4.1: The recognition of objects is strongly based on the perception of vertices and corners. For a human the flashlight is easy to recognize in image (a), where the corners are visible, but it takes much longer to recognize in image (b), with the corners occluded.

Chapter 5

Cognitive vision models

With the advances in computer systems and microelectronics, very complex computational tools have emerged that can run sophisticated software systems. The urge to build complex and intelligent software systems for solving computationally hard technical problems has led researchers to turn towards the brain, a natural system with a very high level of complexity and intelligence, for new directions and ideas. As a result, a large number of cognitive informatics models and systems have been developed, many of them dealing with visual information processing, or cognitive vision. This chapter gives an overview of cognitive vision models, along with their implementations using both software and hardware computational tools.

5.1 Ice-cube model

The architecture of the primary visual cortex was described in chapter 3. Hubel and Wiesel, who discovered the orientation selectivity of V1 neurons and their columnar organization, proposed the ice-cube model of the visual cortex [45]. They suggested that neurons showing both ocular dominance and orientation selectivity can be grouped according to their functionality and placed next to each other. The measurements suggested an angular resolution of 10 degrees over the 180 degree range. The result was a series of 2 x 18 blocks (representing columns) placed next to each other with neighboring orientation preferences and ocular dominance (Figure 5.1). A set of 36 such blocks aggregates the functionality of the primary visual cortex processing the information arriving from one small patch of the visual field.

The ice-cube model was the first that could be considered a cognitive model of vision, although it had no relation to technical informatics. It explained the regularity found in both ocular dominance and orientation selectivity, and also the orthogonality between the stripes of these two features.
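The block structure of one ice-cube hypercolumn can be made concrete with a small indexing sketch. The function below is purely illustrative (the dissertation describes no such code); it assumes the 10-degree orientation step suggested by the measurements.

```python
def column_index(eye, orientation_deg):
    """Index of a block inside one ice-cube hypercolumn: 2 ocular
    dominance slabs crossed with 18 orientation columns (10-degree
    steps spanning 180 degrees) give the 2 x 18 = 36 blocks serving
    one small patch of the visual field."""
    eyes = {"L": 0, "R": 1}
    ori = (int(orientation_deg) % 180) // 10   # 18 orientation columns
    return eyes[eye] * 18 + ori
```

All 36 indices are distinct, and orientations 180 degrees apart map to the same column, mirroring the fact that a bar's orientation is only defined modulo 180 degrees.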
Later anatomical studies showed that the spatial arrangement of orientation selective cells is much more complex, and orientation maps can be defined on V1, as described in chapter 3. The ice-cube model of Hubel and Wiesel is however still correct regarding the functionality, if not the anatomical position, of the V1 functional units.

Figure 5.1: The ice-cube model proposed by Hubel and Wiesel. Reprinted from [45].

5.2 Simple functional unit receptive field models

Classical receptive fields

The notion of simple functional unit receptive fields (today referred to as classical receptive fields) was defined as the area on the retina or the visual field from where the output of the actual functional unit can be directly influenced. Daugman showed that the spatial summation properties of simple functional units can be modeled by a family of two-dimensional Gabor functions [25]. This receptive field function is described by the two-dimensional function g_{\lambda,\sigma,\theta,\varphi}(x,y), (x,y) \in R^2, with a set of fixed parameters indicated as subscripts:

g_{\lambda,\sigma,\theta,\varphi}(x,y) = e^{-\frac{\tilde{x}^2 + \gamma^2 \tilde{y}^2}{2\sigma^2}} \cos\left(2\pi\frac{\tilde{x}}{\lambda} + \varphi\right),  (5.1)
\tilde{x} = x\cos\theta + y\sin\theta,  \tilde{y} = -x\sin\theta + y\cos\theta,

where \gamma is the spatial aspect ratio, which defines the ellipticity of the receptive field and is chosen to be 0.2 in normal circumstances. The standard deviation \sigma defines the size of the receptive field. \lambda defines the wavelength to which the receptive field is sensitive, so that the ratio \sigma/\lambda determines the number of parallel excitatory and inhibitory stripe zones present in the receptive field. The values of these parameters can be tuned to obtain optimal results depending on the problem type.

Non-classical receptive fields with surround suppression

In chapter 3 the lateral interactions between V1 neurons were discussed. Computational models have been elaborated to model such long distance lateral interactions between neurons, and to simulate their effects on visual information processing. Petkov proposed the non-classical receptive field model with a surround suppression effect as the model of the broad lateral inhibitory connections in V1. The edge detection was based on oriented derivative functions: Gabor functions in [39] and oriented derivatives of Gaussian functions in [38].

Figure 5.2: Intensity map of a Gabor function. Grey peripheral pixels indicate zero values (g_{\lambda,\sigma,\theta,\varphi}(x,y) = 0), brighter and darker pixels represent positive and negative values respectively.

The final output of the functional units was calculated using a surround suppression filter. For this, a suppression kernel S_\sigma was defined based on the difference of two Gaussian functions:

D_\sigma(x,y) = \frac{1}{2\pi(4\sigma)^2} e^{-\frac{x^2+y^2}{2(4\sigma)^2}} - \frac{1}{2\pi\sigma^2} e^{-\frac{x^2+y^2}{2\sigma^2}}.  (5.2)

The weighting function S_\sigma for the surround suppression was then defined as

S_\sigma(x,y) = \frac{\mathrm{ramp}(D_\sigma(x,y))}{\|\mathrm{ramp}(D_\sigma)\|_1},  (5.3)

where the function ramp is defined as

\mathrm{ramp}(z) = \begin{cases} 0 & z < 0 \\ z & z \geq 0 \end{cases}  (5.4)

and \|\cdot\|_1 is the L^1 norm. The surround suppression weighting function S_\sigma is convolved with the gradient magnitude function M_\sigma obtained as a result of using a gradient detection operator on the original image. The surround suppression was used in an anisotropic and an isotropic way. In the anisotropic case, the suppression is strongest between gradients of the same orientation, and decreases with a growing difference between the gradient angles. In the isotropic case all gradients are included in the suppression with the same weight, irrespective of their orientation. The results of the method on synthetic images are shown in Figure 5.3.
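The suppression kernel of equations (5.2)-(5.4) translates into a few lines of code. This is a sketch under an assumed kernel size and sigma, not Petkov's reference implementation.

```python
import numpy as np

def dog(size=31, sigma=2.0):
    """Difference of two Gaussians of equation (5.2): a wide Gaussian
    (standard deviation 4*sigma) minus a narrow one (sigma)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    wide = np.exp(-r2 / (2 * (4 * sigma) ** 2)) / (2 * np.pi * (4 * sigma) ** 2)
    narrow = np.exp(-r2 / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    return wide - narrow

def suppression_kernel(size=31, sigma=2.0):
    """Surround suppression weights of equations (5.3)-(5.4): half-wave
    rectify the DoG (the ramp function) and normalize by the L1 norm."""
    d = np.maximum(dog(size, sigma), 0.0)   # ramp()
    return d / np.abs(d).sum()              # L1 normalization

sk = suppression_kernel()
```

Convolving this kernel with the gradient magnitude image gives the suppression term; note that the negative central lobe of the DoG is removed by the ramp, so suppression comes only from the surround ring.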

Figure 5.3: The results of surround suppression. The input image is (a), the gradient magnitude is (b), the anisotropic surround suppression is (c) and the isotropic surround suppression is (d). Reprinted from [38].

In the work of Petkov the different orientations of gradients are calculated from the image and used as continuous values. No significance is given to the distribution of angles or to the interactions between them.

5.2.3 Computational model of contour integration

The contour integration property of V1 neurons was discussed in an earlier section. V1 neurons perform lateral interactions whereby they excite those neighbors that are activated by a collinear line segment, and inhibit those that are not collinear but have similar orientations. A biologically plausible computational model for contour integration was proposed by Itti et al. in [62]. In their work a hyper-kernel was defined, describing the weights of the excitatory and inhibitory lateral connections between neighboring neurons. Each hyper-kernel slice has a reach of 12 pixels (reaching out to a span of 12 neurons) for excitation and ten for inhibition. One slice of the hyper-kernel represents the weights between two oriented contours. The orientation of contours in the image is quantized into 12 different orientations, which yields 144 slices for the hyper-kernel in total. These represent all the possible connections between two neurons. Excitation is strongly sensitive to the preferred orientations of the two neurons, while inhibition is mostly sensitive to their relative spatial location. The model calculates the output of each neuron by taking into account the oriented edge magnitude of the input, and the effect of the other neurons on the given neuron. This calculation is done iteratively until convergence. The result is a contour integrated and surround suppressed contour image with 12 distinct orientations. Another contour integration method has been proposed by Ma and Wang [55].
In their work they propose a hierarchical contour integration method based on the characteristics of edge elements. The method consists of anisotropic edge extension and statistics-based contour line connection.
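To make the hyper-kernel idea concrete, the following sketch defines an illustrative lateral-connection weight between two oriented units: excitation for collinear neighbors with similar orientation, inhibition for similarly oriented but laterally displaced (parallel) flankers. Only the 12-pixel excitatory and 10-pixel inhibitory reaches come from the text; the profile shapes and constants are assumptions, not Itti et al.'s exact kernel.

```python
import numpy as np

def lateral_weight(dx, dy, theta_pre, theta_post,
                   exc_reach=12, inh_reach=10):
    """Illustrative lateral weight between two oriented units.

    Excitation: the neighbour lies close to the line through the origin
    with orientation theta_pre AND has a similar orientation (collinear).
    Inhibition: similar orientation, but displaced off that line
    (a parallel flanker)."""
    d = np.hypot(dx, dy)
    if d == 0:
        return 0.0
    # angular difference between the two preferred orientations (mod pi)
    dtheta = abs(theta_pre - theta_post) % np.pi
    dtheta = min(dtheta, np.pi - dtheta)
    # perpendicular offset of the neighbour from the collinear axis
    off = abs(-dx * np.sin(theta_pre) + dy * np.cos(theta_pre))
    similar = np.exp(-(dtheta / 0.3) ** 2)
    if off < 1.0 and d <= exc_reach:          # collinear -> excite
        return +similar * np.exp(-d / exc_reach)
    if d <= inh_reach:                        # parallel flanker -> inhibit
        return -similar * np.exp(-d / inh_reach)
    return 0.0

# a collinear, iso-oriented neighbour is excited ...
w_collinear = lateral_weight(5, 0, 0.0, 0.0)
# ... while a parallel unit displaced sideways is inhibited
w_flanker = lateral_weight(0, 5, 0.0, 0.0)
```

Evaluating such a weight for every quantised orientation pair and every offset within reach would populate one full hyper-kernel.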

5.3 Cognitive models for object recognition

Object recognition is one of the ultimate goals of computer vision. The problem in object recognition is to determine which, if any, of a given set of objects appear in a given image or image sequence. Object recognition is thus a problem of matching models from a database against representations of those models extracted from the image luminance data. Of course, the representation of the object model is extremely important. Clearly, it is impossible to keep a database that has examples of every view of an object under every possible lighting condition. Two approaches have been developed to deal with the many possible transformations that an object may undergo in the imaging process: the first is to determine the transformation in question and then try to undo its effects; the second is to find measurements of the object that are invariant to these types of transformations. There are two stages in any recognition system. The first is the acquisition stage, where an object epitome library is constructed from certain descriptions of the objects. The second is recognition, where the system is presented with a perspective image and determines the location and identity of any known objects in the image. Generally, the most reliable type of object information available from an image is geometric information, so object recognition systems draw upon a library of geometric models containing information about the shape of known objects. The geometric structure of an object is based on features (lines, corners, curves, etc.) that come from the object, and on the geometric relations between these features. The representation of the object thus has to contain the features and the geometric relations between them. The representation has to be invariant to transformations that do not change the object. For example, the distance between two points is unchanged under a Euclidean transformation (translation or rotation).
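The invariance claim above can be verified numerically. In this short numpy check (point set, rotation angle and translation are arbitrary illustrative values), the pairwise distances of a point set are unchanged by a rigid transformation:

```python
import numpy as np

# Pairwise point distances are preserved by a Euclidean transformation
# (rotation plus translation), which is what makes them usable as
# geometric relations in an object model.
rng = np.random.default_rng(0)
points = rng.random((5, 2))                   # five random 2-D feature points

phi = 0.7                                     # arbitrary rotation angle
R = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])
t = np.array([3.0, -1.5])                     # arbitrary translation
moved = points @ R.T + t

def pairwise_dist(p):
    return np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)

d_before = pairwise_dist(points)
d_after = pairwise_dist(moved)
```

The two distance matrices agree to numerical precision, while the raw coordinates of course do not.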
The main problems to solve in finding an efficient object representation are:

- finding a robust and invariant feature representation,
- finding a robust and invariant geometric representation,
- finding a correspondence algorithm between an object model and potential object representations.

5.3.1 Geometric blur

A visual feature can be described by a representation of its neighborhood. A light intensity window as a description is not robust, because a small change in the surroundings of the same feature may result in confusion with other features. In search of a suitable feature description, researchers have turned to the structure of the retina, which creates the representation of visual features later used for object recognition in the brain. The ganglion cell density on the

retina decreases towards the peripheral regions, as discussed earlier. Also, the size of the ganglion receptive fields increases towards the peripheral regions, causing a strongly blurred peripheral vision. The result is a detailed description close to the fixation point, and a less detailed, fuzzy description far from the fixation point. Based on the retinal representation of visual features, Alex Berg at UC Berkeley proposed the geometric blur operator [10]. The geometric blur operator is defined on an input image with a fixation point. The output is an image filtered with a spatially varying Gaussian kernel $G^{(\sigma)}$, whose standard deviation $\sigma$ increases proportionally with the distance of the center of the Gaussian kernel from the fixation point. The output image can thus be computed as:

$$I^{(\sigma)} = I * G^{(\sigma)}. \tag{5.5}$$

The geometric blur around the point $(x_0,y_0)$ is then

$$B_{(x_0,y_0)}(x,y) = I^{(\alpha\|(x,y)\| + \beta)}_{x_0-x,\,y_0-y}, \tag{5.6}$$

where $\alpha$ and $\beta$ are constants that determine the amount of blur. The geometric blur is easiest to understand by looking at the result of the operator applied to a T shape, shown in Figure 5.4. The geometric blur performs best on sparse signals, because the blurred regions are then unlikely to overlap and interfere with each other. For this reason, a better description of a pixel's surroundings is obtained if the operator is applied to an orientation selective contour map of the input intensity image. In practice the geometric blur is performed on a discrete number of orientation channels.

Figure 5.4: The geometric blur operation. The red dot indicates the fixation point; the left side shows the input image, the right side shows the result of the geometric blur. Reprinted from [9].

A description of a pixel's surroundings can be obtained by sampling the geometric blur only at discrete positions. A sparse sampling is satisfactory, because a rapid change in the illumination of the blurred image is unlikely over a short distance.
The sampling density should be proportional to the size of the Gaussian kernels applied at the given distance from the fixation point. A possible sampling pattern is shown in Figure
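The operator can be sketched naively in numpy: each output pixel is a Gaussian average of its neighbourhood, with the standard deviation growing linearly with the distance from the fixation point, mirroring $\sigma = \alpha d + \beta$ of equation (5.6). The values of alpha, beta and the kernel radius below are illustrative, not those of Berg's implementation.

```python
import numpy as np

def geometric_blur(img, fx, fy, alpha=0.2, beta=0.5, radius=5):
    """Naive geometric blur: per-pixel Gaussian averaging with
    sigma = alpha * distance-from-fixation + beta."""
    h, w = img.shape
    out = np.zeros((h, w))
    r = np.arange(-radius, radius + 1)
    kx, ky = np.meshgrid(r, r)
    pad = np.pad(img.astype(float), radius, mode='edge')
    for y in range(h):
        for x in range(w):
            sigma = alpha * np.hypot(x - fx, y - fy) + beta
            k = np.exp(-(kx**2 + ky**2) / (2 * sigma**2))
            k /= k.sum()                       # normalised Gaussian kernel
            patch = pad[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            out[y, x] = (patch * k).sum()
    return out

# two impulses: one at the fixation point, one far from it
img = np.zeros((21, 21))
img[10, 10] = 1.0
img[10, 18] = 1.0
blurred = geometric_blur(img, fx=10, fy=10)
```

The impulse at the fixation point keeps a sharp, high peak, while the distant impulse is smeared out, which is exactly the foveated behavior described above.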

5.3.2 Recognition algorithm

The recognition algorithm attempts to find the best matching correspondence to an object model stored in the epitome library. An object model is composed of feature points and the geometric relations between them. A feature point is represented using the geometric blur operator, as discussed in the previous section. The geometric relations between feature points are described by the length and angle of the segment between them. A transformation cost is defined between the object model and a set of feature points that may represent the sought object in the image. The set of feature points is obtained by scanning the image for locations with similar geometric blur descriptors. The cost function is composed of a term for pixel match quality and a geometric distortion cost. The pixel match cost $C_{match}$ is the sum of the correlations between matching pixel pairs in the model and the image:

$$C_{match} = \sum_i c(p_i, p'_i), \tag{5.7}$$

where $c(p_i, p'_i)$ is the correlation between the geometric blur descriptors of the model point $p_i$ and the candidate point $p'_i$. The distortion term $C_{distortion}$ is defined as:

$$C_{distortion} = \sum_{i,j} \mathrm{distortion}(r_{i,j}, s_{i,j}), \tag{5.8}$$

where $r_{i,j}$ and $s_{i,j}$ are vectors between feature point pairs in the model and in the candidate object. The distortion between two vectors is calculated from the difference in their lengths and angles. For more details see [9]. Finding the best matching object is formulated as an Integer Quadratic Programming problem, and a gradient method is used to find a solution to the matching problem [9].

5.4 Hardware implementations of models in cognitive informatics

The application of cognitive informatics models to practical problems requires an efficient implementation on a hardware platform. Given that such models are of high computational complexity, they require high performance computational tools.
The most obvious hardware solution would be a conventional PC with a fast processor and a large amount of RAM. The structure of a PC, on the other hand, does not perfectly suit the structure of the models it has to implement. The reason is that modern processors are designed to perform long chains of complex operations. Cognitive informatics models, on the other hand, typically perform a large number of simple, simultaneous operations organized in a sophisticated network. Their operation is better described by the SIMD (Single Instruction Multiple Data) hardware model.

To overcome the limitations of conventional processors in cognitive informatics, the application of special and in some cases unusual hardware tools has emerged. Using digital signal processors (DSP) for image processing is one solution [67], but the application of field programmable gate arrays (FPGA) is also common in image processing. Even more specific hardware solutions have been developed to perform a large number of similar, simple operations at very high speed [2, 21].
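The SIMD point above can be illustrated in software: a vectorised numpy expression applies one simple instruction to every element of a large data array at once, whereas the explicit loop below performs the same work as a chain of scalar operations. The operation itself (a half-wave rectification) is an arbitrary illustrative choice.

```python
import numpy as np

# SIMD-style: a single instruction applied to all data elements at once.
v = np.random.default_rng(1).random(10_000)
simd_style = np.maximum(v - 0.5, 0.0)

# Scalar chain: the same work expressed as one operation per element.
scalar_style = np.empty_like(v)
for i, z in enumerate(v):
    scalar_style[i] = z - 0.5 if z > 0.5 else 0.0
```

Both produce identical results; the difference lies in how the work maps onto hardware, which is why SIMD-oriented platforms suit such models better than a general-purpose instruction stream.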

Part II

Theoretical achievements

Chapter 6

The Visual Feature Array Concept

In the present dissertation the concept of the Visual Feature Array (VFA) is proposed. The VFA is an informatics concept which aims to provide solutions for implementing cognitive functions similar to those found in the cerebral cortex. The present concept concentrates on visual perception, but it is applicable to other perceptual modalities as well. The question addressed by the VFA concept concerning the cerebral cortex is what the cognitive processes are responsible for. This question contrasts with the main questions of Adorjan [1], who seeks answers to how and why things happen in the cerebral cortex. These two questions are outside the scope of the VFA concept, which, again, concentrates on the cognitive processes from a functional point of view. The heart of the proposed concept is the VFA model, a cognitive informatics model based on the information processing structure defined in the VFA concept. The VFA model within the concept obtains information from the environment, performs operations on that information, and in turn supports higher order models of cognitive informatics. In the proposed concept, using the proposed VFA model, cognitive functions in accordance with those performed in the primary visual cortex can be implemented. The implementation can be described by different properties, such as complexity, precision and cognitive relevance, which depend on the conceptual context and internal characteristics of the VFA model. The VFA concept is illustrated in Figure 6.1.

Figure 6.1: The VFA concept.

6.1 Information processing structures in the VFA concept

The information processing structures of the VFA concept describe the units and relations available for constructing a VFA model. VFA models can be composed of cognitive units organized in data arrays, and of operations performed over them. The models constructed in the VFA concept are functionally similar to the information processing procedures of the cortex, where a large number of basic functional units affect the output of another large number of basic functional units, according to the topology and strength of their connections. Based on this, three basic model components are defined in the VFA concept.

6.1.1 Cognitive units

As discussed earlier, the visual cortex, and in general any cortical area performing perceptual tasks, is characterized by a columnar architecture. A column in the cortex can be considered the basic functional unit, and is composed of a set of interconnected neurons. By analogy with columns, the notion of cognitive units is defined. In models of the VFA concept a cognitive unit represents the output of a basic functional unit in the cortex. The output value of a cognitive unit can be any nonnegative real number.

6.1.2 Data arrays

The data arrays are multidimensional arrays whose elements contain cognitive units. The arrays were conceived to represent the output of the modeled cognitive structures. In the models deployed in the proposed concept the cognitive units in the arrays represent visual features; however, features of other perceptual modalities could also be represented. This is why the proposed concept was named Visual Feature Array.

6.1.3 Operations

An operation takes a data array of n dimensions as its operand, and places its result in a data array of m dimensions. The operations of the VFA concept are of SIMD type. The operations have static and running parameters, the latter taking discrete values on a limited interval.
Three basic operation types are defined:

- filtering operations,
- lateral operations,
- projective operations.

Filtering operations

Filtering operations are written in the form $F: \mathbb{R}^{d_1 \times \dots \times d_n} \to \mathbb{R}^{d_1 \times \dots \times d_m}$, $d_i \in \mathbb{N}$. They take a data array $A \in \mathbb{R}^{d_1 \times \dots \times d_n}$ and output a data array $B \in \mathbb{R}^{d_1 \times \dots \times d_m}$. The performed operation is described by $B = F(A)$. The relation between the data array dimensions can be written as $n \leq m$ and $m = n + k$, where $k \in \mathbb{N}$ is the number of independent filter parameters. The last $k$-dimensional subspace of $B$ belonging to one element of $A$ contains the results of the filtering operation performed over that element of $A$.

Lateral operations

Lateral operations are written in the form $L: \mathbb{R}^{d_1 \times \dots \times d_n} \to \mathbb{R}^{d_1 \times \dots \times d_n}$, $d_i \in \mathbb{N}$. They take the data array $A \in \mathbb{R}^{d_1 \times \dots \times d_n}$ and output a data array $B \in \mathbb{R}^{d_1 \times \dots \times d_n}$ in such a way that the input data array is replaced by the output data array. This can also be written using the temporal parameter $t \in \mathbb{N}$ in the upper index: $A^{(t)} = L(A^{(t-1)})$, where $t = 1$ indicates the first input to the operation. Lateral operations thus allow the definition of iterative or recurrent functionalities.

Projective operations

Projective operations are written in the form $P: \mathbb{R}^{d_1 \times \dots \times d_n} \to \mathbb{R}^{d_1 \times \dots \times d_m}$, $d_i \in \mathbb{N}$. They take a data array $A \in \mathbb{R}^{d_1 \times \dots \times d_n}$ and output a data array $B \in \mathbb{R}^{d_1 \times \dots \times d_m}$. The relation between the data array dimensions can be written as $n \geq m$ and $m = n - k$, $k \in \mathbb{N}$. Such operations apply the same commutative operator to all the values in a $k$-dimensional subspace of the input data array. The result is stored in the output data array, which lacks the $k$-dimensional subspace in which the operation was performed. The performed operation is described by $B = P(A)$.

6.2 Uniform model of the primary visual cortex in the VFA concept

A uniform model is proposed within the VFA concept to provide a cognitive informatics model of the cognitive information processing performed in the primary visual cortex. The proposed model will be referred to as the VFA model.
The VFA model obtains visual information through its input, implements cognitive functions by performing operations on the input information, and outputs abstracted pieces of information in matrix form, directly applicable by other cognitive informatics models.

6.2.1 Structure of the VFA model

The proposed VFA model of visual information processing from the retina to the primary visual cortex is shown in Figure 6.2. The input to the model is a

grey-scale image of size $x_{max} \times y_{max}$, stored in the first 2D data array, denoted by $I \in \mathbb{N}^{x_{max} \times y_{max}}$. A filtering operation $F: \mathbb{N}^{x_{max} \times y_{max}} \to \mathbb{N}^{x_{max} \times y_{max} \times h(p_1) \times \dots \times h(p_k)}$ is performed on the input image $I$, $k$ being the number of filter parameters, $p_i$ being the $i$-th filter parameter and $h(p_i)$ being the number of possible discrete values of filter parameter $p_i$. The result is the data array $V \in \mathbb{N}^{x_{max} \times y_{max} \times h(p_1) \times \dots \times h(p_k)}$. Although there is no theoretical limit to the number of filter parameters $k$, this value is usually 1 or 2, depending on the desired filter functionality. The filtering operation performed over the input image can be written as follows:

$$V = F(I). \tag{6.1}$$

A lateral operation $L: \mathbb{N}^{x_{max} \times y_{max} \times h(p_1) \times \dots \times h(p_k)} \to \mathbb{N}^{x_{max} \times y_{max} \times h(p_1) \times \dots \times h(p_k)}$ is performed on the data array $V$ obtained in equation 6.1. This operation does not affect the dimensions or size of the data array, but allows the implementation of lateral connection based functionalities. The lateral operation performed on the data array $V$ places its output into the same data array, denoted using the time parameter $t$. As such, the operation performed can be written as follows:

$$V^{(t)} = L(V^{(t-1)}). \tag{6.2}$$

The result of a lateral operation is rewritten into the input array, which yields an iterative process. The iterations can be interrupted if a predefined number of iterations $\max(t)$ is reached, or if a stationary $V$ is obtained. The latter can be checked by testing whether the $L_1$ norm of the difference between two consecutive data arrays is less than a preset limit $\varepsilon$:

$$\|V^{(t)} - V^{(t-1)}\|_1 < \varepsilon. \tag{6.3}$$

A projective operation in the VFA model projects the data array $V$ along the dimension corresponding to the filter parameter $p_l$, using a commutative operator. The operation can be defined as

$$P: \mathbb{N}^{x_{max} \times y_{max} \times h(p_1) \times \dots \times h(p_k)} \to \mathbb{N}^{x_{max} \times y_{max} \times h(p_1) \times \dots \times h(p_{l-1}) \times h(p_{l+1}) \times \dots \times h(p_k)}. \tag{6.4}$$

It takes the data array $V$ as its input, and its result is stored in the data array $C \in \mathbb{N}^{x_{max} \times y_{max} \times h(p_1) \times \dots \times h(p_{l-1}) \times h(p_{l+1}) \times \dots \times h(p_k)}$.
The projective operation in the VFA model can be written as:

$$C = P(V). \tag{6.5}$$

The output of the proposed VFA model can be $V$ or $C$, or their subspaces. These data arrays hold values that represent features of the input image, readily applicable by other cognitive informatics models. The VFA model is deliberately under-specified to allow very flexible adaptation to the given task and to the connecting applications. Specifying the actual size of the data structures and the exact functionality of the operations yields different VFA model variations. In the following sections several versions of the VFA model will be discussed, all of them based on the cognitive functionalities of the primary visual cortex.
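The whole pipeline can be sketched in numpy: a filtering operation F adds a parameter dimension, a lateral operation L is iterated until the $L_1$ change drops below $\varepsilon$ (equation 6.3) or $\max(t)$ is reached, and a projective operation P removes the parameter dimension with a commutative operator (here: max). The concrete F and L below are toy stand-ins; only the operation signatures and the stopping rule follow the model.

```python
import numpy as np

def F(I, n_theta=4):
    """Toy 'orientation filter': absolute differences along 4 directions,
    adding one parameter dimension of size n_theta."""
    V = np.zeros(I.shape + (n_theta,))
    shifts = [(0, 1), (1, 1), (1, 0), (1, -1)]
    for k, (dy, dx) in enumerate(shifts):
        V[..., k] = np.abs(I - np.roll(np.roll(I, dy, 0), dx, 1))
    return V

def L(V):
    """Toy lateral operation: mild smoothing within each layer;
    dimensions and size are unchanged."""
    return 0.5 * V + 0.25 * (np.roll(V, 1, 0) + np.roll(V, -1, 0))

def P(V):
    """Projective operation: commutative max over the parameter axis."""
    return V.max(axis=-1)

I = np.random.default_rng(2).random((16, 16))
V = F(I)
eps, t, max_t = 1e-6, 0, 100
while t < max_t:
    V_new = L(V)
    t += 1
    change = np.abs(V_new - V).sum()   # L1 norm of the update, eq. (6.3)
    V = V_new
    if change < eps:                   # stationary V reached
        break
C = P(V)
```

The shapes trace the model exactly: $I$ is $16 \times 16$, $V$ is $16 \times 16 \times 4$ (one extra filter-parameter dimension), and the projection $C$ is $16 \times 16$ again.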

Figure 6.2: The Visual Feature Array model. Rectangular shapes represent data arrays, elliptical shapes represent operations.

6.2.2 Simple cognitive units in the VFA model

Simple functional units of V1 are represented by the data array $V$ in the VFA model. This aspect of the VFA model is analogous to the ice-cube model of Hubel and Wiesel [45], but in this case the model is monocular, i.e. it considers input from only one image sensor. However, a new dimension can easily be added to the model to achieve binocular representations. Features that compose objects are connected in space and time. Spatial connectivity is represented by retinotopy. The temporal representation of visual information processing lies outside the scope of the proposed model. Retinotopic representation is thus important, but functional maps may be disregarded in the VFA model, based on [43]. The first two dimensions $x$ and $y$ of $V$ correspond to horizontal and vertical image coordinates. Fixing a value for the parameters $p_1 \dots p_k$ in $V$ yields a 2D subspace which contains a retinotopic representation of the visual features defined by the parameters $p_1 \dots p_k$. The value contained in $V_{x,y,p_1,\dots,p_k}$ can be a binary or an 8-bit grey-scale value, and represents the output of a filter kernel applied to the input data at position $x,y$ using the parameters $p_1, \dots, p_k$. The values contained in the data array $V$ are sparse, because at many positions the filters give results close to zero.
The sparseness of the input data is a basic requirement for most object recognition algorithms [9, 62], and as such the VFA model provides a suitable input for other cognitive informatics models of higher level vision.

6.2.3 Complex cognitive units in the VFA model

Complex functional units as defined by Hubel receive their inputs from simple functional units in V1, and represent complex features such as corners, crossings and other vertices. In the VFA model this relation is modeled by the connection between the data arrays $V$ and $C$. Elements of $C$ contain the output of complex cognitive units, and can be used for corner, crossing and vertex representations. Data array $C$ is also retinotopic, with its first two coordinates

corresponding to image coordinates $x$ and $y$. Data array $C$ is also a sparse array of either binary or 8-bit grey-scale values, and is suitable for application in other cognitive informatics models of higher vision.

6.3 Input filter operations in the VFA model

The functional and data structure of the VFA model was discussed in the preceding sections. The actual values contained in the data arrays of the VFA model are obtained through the operations between them. The operations described in this section are input filtering operations, responsible for setting the values in the data array $V$. The presented operations are all different implementations of the input filtering operation $F$ of the VFA model. The proposed implementations differ in function, cognitive and biological plausibility, computational complexity, robustness and hardware implementability. A comparison between them will be given in a later section.

6.3.1 End stopping filtering

The first filtering operation presented in this dissertation is inspired by the end-stopping functionality of V1, described earlier. The proposed end-stopping filter operation is formally referred to as $F_e$, and is composed of two parts. First an edge detection operation is performed, followed by the extraction of length and orientation selective line segments. The operation $F_e$ receives an image on its input, which is immediately subjected to an edge detection filter. This filter is based on the receptive field characteristics of the retinal ganglion cells. In the small region of the visual field centered around the position of the receptive field of the ganglion cell the afferent connections have a relatively high positive weight, while in the surrounding regions the synaptic weights are inhibitory. The receptive field is modeled with a matrix $M \in \mathbb{R}^{3 \times 3}$ in equation 6.6, with a higher positive input weight value in the middle and small negative values in the surrounding regions.
Note that $M$ is the widely accepted Laplace edge detection kernel:

$$M = \begin{pmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{pmatrix}. \tag{6.6}$$

The input image $I$ is convolved with the filter $M$ and subjected to a threshold operation using the threshold value $q \in [\min(I * M), \max(I * M)]$ to obtain the data array $E$:

$$E_{x,y} = \begin{cases} 1 & \text{if } (I * M)_{x,y} > q \\ 0 & \text{otherwise.} \end{cases} \tag{6.7}$$

Going further along the visual pathway we find that the receptive fields in the LGN are similar to those in the retina, so their effect can also be covered by the same filter matrix $M$ in the proposed end-stopping filtering operation,

without losing the cognitive functionality. The LGN in turn projects into the visual cortex, where further cognitive feature extraction takes place. This process is also considered part of the proposed filtering operation of the VFA model. After the edge detection discussed above, an edge detected image is available in the data array $E$. Similarly to the visual cortex, several different features can be extracted from the edge detected image $E$. In the present input filter operation the features are line segments of varying length and orientation. A line segment of orientation $\theta$ and length $l$ is represented by a binary matrix $R^{(\theta,l)}$, with 1 in the positions belonging to the feature and 0 in the points that are not part of the feature. The possible values of $\theta$ can be expressed as $\theta_k = k\pi/h(\theta)$, $k \in \mathbb{N}$. Binary matrices were chosen for detecting visual features because they allow an acceptable approximation of the sought features, and binary operations are rather easy to implement in digital hardware, such as an FPGA. A series of mask matrices for all the possible five-pixel-long lines is shown in Figure 6.3.

Figure 6.3: The binary filter matrices of the VFA model that represent the orientation tuned end-inhibited functionality of V1. The above matrices are referred to in the model as $R^{(0,5)}$, $R^{(22.5,5)}$, $R^{(45,5)}$, etc.

The extraction of the features begins with the longest line segments: those spanning the largest angle in the visual field, and thus causing activation in the largest number of ganglion cells, or pixels in the context of a CCD image sensor. The extraction of a feature represented by $R^{(\theta,l)}$ is achieved by matching between this latter and $E$, considering all possible translations of $R^{(\theta,l)}$ over $E$. The matching is done between overlapping pixels only where the matrix $R^{(\theta,l)} = 1$; e.g. supposing a translation $r = (x_0,y_0)$, pixel matching in $E_{x,y}$ is only performed if $R^{(\theta,l)}_{x-x_0,y-y_0} = 1$.
Such matching is necessary to avoid a low matching output when several lines pass through the area covered by $R^{(\theta,l)}$. Since the matching is performed on an edge detected image, there will be no filled spots of ones that could be detected as multiple parallel line segments. In the proposed filtering operation the image features are extracted one after another, and each extracted feature is removed from the image, so that subsequently applied features do not find it again. Let us suppose that the image we currently extract a feature from is $E^{(i)}$, and that after the removal of the feature $E^{(i+1)}$ is obtained. The removal means that pixels in $E^{(i)}$ are set to zero in the positions where the overlaid feature matrix $R^{(\theta,l)}$ is one. In the beginning $E^{(i)} = E$, which is obtained in equation 6.7.

The extracted features are collected in a four dimensional data array defined as $V \in \{0,1\}^{x_{max} \times y_{max} \times h(\theta) \times h(l)}$, where $h(\theta)$ is the number of possible orientation values $\theta$, and $h(l)$ is the number of possible length values $l$. If a feature was extracted and removed from $E^{(i)}$ by $R^{(\theta_k,l)}$ with a translation $r = (x_0,y_0)$, it is placed in the data array $V$ such that

$$V_{x,y,k,l} = 1 \quad \text{if} \quad R^{(\theta_k,l)}_{x-x_0,y-y_0} = 1. \tag{6.8}$$

Because of the gradual removal of the features, it is important that the matrices used for feature extraction are ordered according to the length of the feature they represent. This means that $R^{(\theta,l_1)}$ is applied before $R^{(\theta,l_2)}$ if and only if $l_1 \geq l_2$. This method causes a long line segment to exert an inhibitory effect on shorter ones. The features to be extracted have orientations uniformly distributed with a specified angular resolution. The angles $\theta$ represented in the VFA are between 0 and 180 degrees. If the number of different orientations is $h(\theta)$, the orientations can be written as:

$$\theta \in [0,\pi] \subset \mathbb{R}, \quad \theta_k = k\frac{\pi}{h(\theta)}, \quad k \in [0, h(\theta)-1] \subset \mathbb{N}. \tag{6.9}$$

The value $h(\theta)$ can be determined according to the actual requirements. For maximal cognitive functional plausibility $h(\theta) = 18$, which gives an angular resolution of 10°, as found in the visual cortex according to section 5.1. The lengths of the features can also be uniformly distributed on an interval between the shortest and the longest line segment, but the application of a discrete number of lengths is also possible. The number of different line segments is $h(l)$. The projection of the data array $V$ along the dimensions of $\theta$ and $l$ yields a top-down reconstruction of the edge map of the original image $I$ using the detected line segments. The reconstruction excludes the edge elements detected as noise, i.e. those not recognized as a visual feature (a line segment of a certain length and orientation).
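The greedy, longest-first extraction described above can be sketched as follows. Only three orientations (0, 45 and 90 degrees) and two lengths are used, and all names are illustrative; the full operation would iterate over all $h(\theta) \times h(l)$ binary masks.

```python
import numpy as np

def masks(length):
    """A few binary feature matrices R^(theta, l)."""
    return {0: np.ones((1, length), dtype=int),        # horizontal
            45: np.eye(length, dtype=int)[::-1],       # 45-degree diagonal
            90: np.ones((length, 1), dtype=int)}       # vertical

def extract(E, lengths=(5, 3)):
    """Greedy feature extraction; returns {(theta, l): [(y, x), ...]}."""
    E = E.copy()                                       # work on a copy of E
    found = {}
    for l in lengths:                                  # longest first
        for theta, R in masks(l).items():
            rh, rw = R.shape
            for y in range(E.shape[0] - rh + 1):
                for x in range(E.shape[1] - rw + 1):
                    window = E[y:y + rh, x:x + rw]
                    # match only where R == 1, as described in the text
                    if (window[R == 1] == 1).all():
                        found.setdefault((theta, l), []).append((y, x))
                        window[R == 1] = 0             # remove the feature
    return found

# synthetic edge map containing a single horizontal five-pixel line
E = np.zeros((9, 9), dtype=int)
E[4, 2:7] = 1
features = extract(E)
```

Because the five-pixel horizontal mask matches and removes the line first, no shorter mask matches it afterwards: the result contains exactly one feature, the long segment inhibiting the short ones as intended.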
Examples of the output of the end stopping filtering can be found in appendix A.

6.3.2 Gabor function based filtering

The second possible implementation of the input filtering operation is referred to as $F_g$. It is called the Gabor function based filtering operation, since it uses the most biologically plausible receptive field descriptions, the Gabor functions. As discussed in section 5.2.1, the receptive field of a simple functional unit in the visual cortex can be modeled by a Gabor function, introduced in equation 5.1 with an example plot in Figure 5.2. In the case of Gabor function based filtering, a three dimensional data array is used, defined as $V \in \mathbb{R}^{x_{max} \times y_{max} \times h(\theta)}$. The first two dimensions correspond to image coordinates, while the orientations are organized along the third dimension $\theta$. The possible angles are equally distributed along the third dimension

$\theta$, and are computed as

$$\theta_k = k\frac{\pi}{h(\theta)}, \quad k \in \mathbb{N}. \tag{6.10}$$

Using the Gabor function based filtering operation, the data array $V$ can be considered a stack of retinotopic layers, with layer $k$ being responsive to the orientation $\theta_k$. Let layer $k$ of $V$ be $r_{\lambda,\sigma,\theta_k,\phi}(x,y)$, in which the contour elements are oriented at the angle $\theta_k$. The layer $r_{\lambda,\sigma,\theta_k,\phi}(x,y)$ is calculated as the convolution of the input image $I$ with a Gabor function-like kernel $G^{(\theta_k)}$. The kernel $G^{(\theta_k)}$ is obtained from a Gabor function such that $g_{\lambda,\sigma,\theta_k,\phi}(0,0)$ provides the kernel center, and the kernel size is fixed to cover all non-zero values of the Gabor function. The kernel is also normalized so that its elements sum to zero. Layer $k$ of the data array $V$ is thus obtained as

$$r_{\lambda,\sigma,\theta_k,\phi}(x,y) = I * G^{(\theta_k)}. \tag{6.11}$$

Stacking the layers to obtain the data array $V$ is written as:

$$V_{x,y,k} = r_{\lambda,\sigma,\theta_k,\phi}(x,y), \quad k \in \mathbb{N}, \quad k < h(\theta). \tag{6.12}$$

The obtained data array $V$ contains the contour elements organized in iso-orientation layers, where neighboring layers contain neighboring orientations (Figure A.3). The Gabor function based filtering is functionally identical to the ice-cube model. The segment extraction is done by an implementation of classical receptive fields, modeling the input weight distribution of cortical simple functional units. The values stored in the data array $V$ are quantized on the interval $[0, 255]$, which gives a higher accuracy and representation power than the end-stopping based filtering operation. The application of Gabor functions allows the detection of contours of different scales. Some test images obtained using the Gabor function based input filtering operation are shown in appendix A.

6.3.3 Foveated filtering

The input filtering operation proposed in the previous section uses Gabor functions to model the receptive fields. The only running parameter of the filtering operation is the orientation of the Gabor function.
The size of the applied Gabor functions is constant in that operation, which results in uniform accuracy in the data array $V$: the whole field of view is represented with a constant, high information density. If we consider the mammalian retina and visual cortex, the information density of the visual field representation decreases towards the peripheral regions, as discussed earlier. The foveated input filtering operation, referred to as $F_f$, is proposed in this section in order to obtain an operation that is biologically plausible from the point of view of foveation. The kernels of the foveated filtering operation increase in size towards the peripheral regions, keeping the orientation as a

running parameter. The obtained data array $V$ will again have iso-orientation layers with a foveated image representation. The Gabor functions that determine the kernels applied in $F_f$ are calculated using the Gabor function $g_{\lambda f(d), \sigma f(d), \theta, \phi}(x,y)$, $(x,y) \in \Omega \subset \mathbb{R}^2$, where $f(d)$ is a function of the distance $d$ between the center of foveation and the point $(x,y)$. The function $f(d)$ can be linear:

$$f(d) = a \cdot d + b, \tag{6.13}$$

or quadratic:

$$f(d) = a \cdot d^2 + b \cdot d + c. \tag{6.14}$$

The constants $a$, $b$ and $c$ should be chosen such that $f(0) = 1$ is satisfied. The foveated input filtering operation has the advantage of positional ambiguity towards the peripheral regions. When comparing two subsequent fixation points for the purpose of matching, an exact similarity is required in regions close to the fixation points, while a much less exact similarity is required towards the peripheral regions. A similar technique is used in the geometric blur, discussed earlier. The foveated input filter can also be used for data compression: the output of this operation has a high information density in the center of foveation, which is considered to contain the important details. Using foveated input filtering, the biological plausibility of the VFA model is increased, which is a generic goal of the dissertation. Some test images obtained using the foveated input filtering operation are shown in appendix A.

6.3.4 Summary of filtering operations

Three input filtering operations have been proposed within the VFA model. The proposed operations offer different advantages, and the choice of which to use depends on the practical application of the VFA model they are embedded in. The function of the three filtering operations is quite similar: they sort the contour elements of the image according to their parameters, namely angle and length. The foveated implementation is the more plausible according to the methods of the cognitive sciences.
The end-stopping filter yields a binary contour map, while the outputs of the other two are greyscale. The binary output may be poor in detail for certain applications; however, its outstanding hardware implementation possibilities may increase its potential. The Gabor function based filtering operation provides a compromise between computational complexity and cognitive plausibility, and therefore this operation will dominate in the rest of the dissertation. The comparison of the filtering operations with other methods is not given here; a detailed discussion follows in a later section.

Figure 6.4: Intensity map (left) and 3D surface (right) of a Gabor function based filter kernel for contour integration.

6.4 Lateral operations in the VFA model

Lateral operations were defined by equation 6.2. These operations can model the lateral connections between cortical neurons, which are responsible for their mutual interactions. A lateral operation L defines the way an element of the data array V^(t) is calculated from the previous values of the same data array V^(t−1). Since the result of the operation is stored in the input data structure, an iterative algorithm is obtained, and the stability properties of these operations need to be evaluated. In the VFA model two lateral operations are proposed. These operations are responsible for the lateral inhibition and contour integration functionalities of the visual cortex, discussed earlier.

6.4.1 Contour integration in the VFA model

Let us define the lateral operation L_c in the VFA model to implement the contour integration functionality. The operation implements excitatory lateral connections between iso-oriented elements of the data array V:

V^(t) = L_c(V^(t−1)). (6.15)

The input weight of each element in the relative position (x,y) to the actual element is defined by an elongated Gabor function, using the parameters γ = 0.3, λ = 12σ and φ = 0 in equation 5.1. The parameter σ gives the size of the Gabor function, while θ is used to adjust the orientation. The gratings of the Gabor functions are not used in the case of contour integration; for this reason the applied Gabor function is set to zero where 2π(3(x cos(θ) + y sin(θ))/λ + φ) > 0.5π. The values provided by the above function are again used to define a filter kernel G(θ_k), with its center corresponding to the position (0,0) of the function. An example of the filter kernel so obtained, with parameters σ = 30 and θ = π/6, is shown in figure 6.4.
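A minimal sketch of such a kernel, assuming the standard Gabor form for eq. (5.1) (the exact formula is not reproduced in this chapter) and suppressing the gratings by zeroing the kernel where the phase argument exceeds 0.5π:

```python
import numpy as np

def contour_kernel(sigma=30.0, theta=np.pi / 6, gamma=0.3, phi=0.0):
    """Elongated Gabor-based contour-integration kernel (cf. Figure 6.4).

    Assumes the conventional Gabor form for eq. (5.1), with lambda = 12*sigma
    as given in the text; the phase cutoff removes the grating side lobes."""
    lam = 12.0 * sigma
    half = int(3 * sigma)                          # kernel radius (assumed)
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)     # coordinate along theta
    yr = -x * np.sin(theta) + y * np.cos(theta)    # coordinate across theta
    g = (np.exp(-(xr**2 + (gamma * yr) ** 2) / (2 * sigma**2))
         * np.cos(2 * np.pi * xr / lam + phi))
    g[np.abs(2 * np.pi * (xr / lam + phi)) > 0.5 * np.pi] = 0.0  # no gratings
    return g
```

The result is a single elongated positive lobe oriented along θ, which is what the excitatory lateral connections between iso-oriented elements require.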
Each orientation layer of the data array V is iteratively filtered using the contour integration filter defined above:

V^(t)_{x,y,k} = V^(t−1)_{x,y,k} ∗ G(θ_k), k ∈ ℕ, k < h(θ). (6.16)

In the next step, values in the data array that are smaller than a threshold ratio c ∈ [0,1] of the maximum are set to zero:

V_{x,y,k} = V_{x,y,k} if V_{x,y,k} > c · max(V), and 0 otherwise. (6.17)

This operation yields a dynamic behavior of the data array V over the iterations. The contour integration functionality is thus implemented in the VFA model. If a small part of a line segment is missing, or there is a gap between two collinear line segments, the iterative application of the above described filter will connect the two ends of the line segments, eliminating the gap. If the gap is larger than the size of the filter, it will not be eliminated. An example of the contour integration is shown in Figure 6.5.

Figure 6.5: An example of the contour integration; result obtained after 30 iterations.

The contour integration algorithm proposed in 2007 by Ma and Wang [55] is quite similar in concept to the method proposed in the VFA model. The algorithm of Ma and Wang is, however, much simpler at the implementation level: it performs the contour integration pixel by pixel, and lacks any real cognitive plausibility.

The connection of contour endpoints based on Gestalt rules gives closed loops. Such a functionality belongs to higher-order cortical processing, and as such is out of the scope of this dissertation.
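One iteration of eqs (6.16)–(6.17) can be sketched as below. The convolution helper is a plain zero-padded implementation (it computes a correlation, which equals convolution for symmetric kernels); any 2-D array can stand in for the Gabor kernel G(θ_k):

```python
import numpy as np

def convolve2d_same(img, ker):
    """'Same'-size 2-D correlation via zero padding (no SciPy dependency);
    identical to convolution when the kernel is symmetric."""
    kh, kw = ker.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    res = np.zeros_like(img, dtype=float)
    for i in range(kh):
        for j in range(kw):
            res += ker[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return res

def contour_integration_step(V, kernels, c=0.15):
    """One iteration of eqs (6.16)-(6.17) on the data array V of shape
    (x, y, orientation): filter each layer with its kernel, then cut values
    not exceeding c * max(V)."""
    out = np.empty_like(V, dtype=float)
    for k in range(V.shape[2]):
        out[:, :, k] = convolve2d_same(V[:, :, k], kernels[k])
    out[out <= c * out.max()] = 0.0
    return out
```

With a toy 1×3 averaging kernel in place of G(θ_k) (an illustrative stand-in), a one-pixel gap in a horizontal line receives support from both sides after a single iteration, which is the gap-filling behavior described above.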

Stability of the contour integration

The iterative application of the defined contour integration filter poses the question of stability: the data array V should converge to a stable state. The stability of the operation L_c can be adjusted by the constant c, used to cut the recurrent values in V. The operation L_c has four different stability states, obtained by simulations:

1. divergence to infinity
2. "broad" convergence
3. "narrow" convergence
4. convergence to zero

If c is too small, the overall activation of the elements in V increases and diverges to infinity. This is shown in figure A.5 (a). The original test image and the content of V after 100 iterations are shown in figure A.6 (a) and (b). If c is too large, only a very small number of elements remains non-zero in the data array V. This leads to a convergence to zero of the overall activation of the elements. The overall activation is shown in Figure A.5 (b), as a function of the iterations. After the tenth iteration no more positive values exist in V. There is a suitable choice of c such that the overall activation of V converges to a constant non-zero value. The threshold c can be chosen to induce a "broad" convergence or a "narrow" convergence. In the case of "broad" convergence, the stable state includes patches that are thicker than in the case of "narrow" convergence. Furthermore, the overall activation of the data array V shows different dynamics. Figures A.5 (c) and A.6 (c) show the convergence with the threshold c = 0.15, while figures A.5 (d) and A.6 (d) show the convergence using a larger value of c. There is a sharp border between the four stability states depending on the value of c and the number of connected pixels in a line. The appropriate adjustment of c makes it possible to reach the desired stable state during the iterative process using the contour integration operation L_c. The desired minimal length of line segments can also be adjusted using the threshold c.
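A toy one-dimensional analogue can reproduce the qualitative regimes. All constants here (the 3-tap kernel with gain 1.2, the segment layout, the array size) are illustrative assumptions, not values from the text; the point is only that a small c lets the overall activation grow, while a large c leaves a sparse pattern whose activation decays towards zero:

```python
import numpy as np

def activation_trace(c, iters=20, n=32):
    """Overall activation sum(V) over iterations of a 1-D analogue of the
    contour integration step: smooth with an unnormalised 3-tap kernel
    (total gain 1.2), then cut values not exceeding c * max(V)."""
    V = np.zeros(n)
    V[10:20] = 1.0          # a line segment ...
    V[14:16] = 0.0          # ... with a gap
    trace = []
    for _ in range(iters):
        V = np.convolve(V, [0.4, 0.4, 0.4], mode="same")
        V[V <= c * V.max()] = 0.0
        trace.append(V.sum())
    return trace
```

With c = 0.05 the activation grows monotonically (the "divergent"/"broad" side of the spectrum); with c = 0.9 only isolated survivors remain and the activation decays towards zero.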
The required number of iterations can be read from the figures: about 15 in the case of "broad" convergence and 40 in the case of "narrow" convergence. Furthermore, when working with "broad" convergence, the final patch will overlap the endpoints of the original line segments, which will be an advantage in the endpoint detection described in the following section.

6.4.2 Lateral inhibition in the VFA model

A lateral operation L_i has been defined in the VFA model that is responsible for the implementation of the lateral inhibition functionality between layers of the data array V. Non-zero values for a given contour might appear in multiple neighboring layers of the data array V. A lateral connection between the layers is therefore defined in the data array, intended to clear the elements that belong to a wrong layer.

The lateral inhibition operation in the VFA model is defined over elements with matching image coordinates in three neighboring layers. The output V_{x,y,k} is calculated from the previous value of itself and its neighbors as follows:

V^(t)_{x,y,k} = V^(t−1)_{x,y,k} − α V^(t−1)_{x,y,(k−1) mod h(θ)} − α V^(t−1)_{x,y,(k+1) mod h(θ)}. (6.18)

The values in V are cut using a threshold ratio c ∈ [0,1] ⊂ ℝ of the maximal value of V:

V_{x,y,k} = V_{x,y,k} if V_{x,y,k} > c · max{V}, and 0 otherwise. (6.19)

According to experimental results, fixing c = 0.1 gives acceptable results. This operation is performed on the data array V iteratively, similarly to the contour integration operation discussed above. The parameter α is used to adjust the strength of the negative feedback. The results obtained using the lateral inhibition are shown in appendix A.2.

Stability of the lateral inhibition

The iterative application of equations 6.18 and 6.19 again poses stability questions. The desired outcome is a stationary state with the local maxima remaining positive and all other elements being zero. Simulations have shown that setting the parameter α too small yields many local maxima in the stationary state, while setting it too large causes all the elements to become zero. The input to the simulation was a vector with local maxima. The results of the simulation are shown in Figure 6.6 for two strong local maxima, and in Figure 6.7 for one strong and one weak local maximum. Setting α > 0.5 yields an instant global convergence to zero, which is not desirable. The results shown above converge to a stationary state, which justifies that the proposed lateral operation for lateral inhibition can be used iteratively. In general, setting α = 0.2 gives acceptable results.

Lateral inhibition using the max operator

The lateral inhibition functionality can be obtained using the max operator as well.
Doing so keeps only the local maxima in each image coordinate along the orientation dimension of the data array V. The filtering of the local maxima can be done using the following equation:

V^(t)_{x,y,k} = V^(t−1)_{x,y,k} if max{V^(t−1)_{x,y,k−1}, V^(t−1)_{x,y,k}, V^(t−1)_{x,y,k+1}} = V^(t−1)_{x,y,k}, and 0 otherwise. (6.20)

The advantage of using the maximum function is that the result of the lateral operation is obtained in a single iteration. There is, however, a loss in biological plausibility, the preservation of which is a general goal of the dissertation.
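Both variants can be sketched compactly. The cyclic treatment of the orientation axis in the max-operator variant is an assumption here, carried over from the mod h(θ) indexing of eq. (6.18):

```python
import numpy as np

def lateral_inhibition_step(V, alpha=0.2, c=0.1):
    """One iteration of eqs (6.18)-(6.19): each orientation layer is inhibited
    by its two cyclic neighbours, then values not exceeding c*max(V) are cut."""
    up = np.roll(V, -1, axis=2)     # layer (k+1) mod h(theta)
    down = np.roll(V, 1, axis=2)    # layer (k-1) mod h(theta)
    out = V - alpha * up - alpha * down
    out[out <= c * out.max()] = 0.0
    return out

def lateral_inhibition_max(V):
    """Single-pass variant, eq. (6.20): keep an element only if it is the
    maximum among itself and its two neighbouring orientation layers."""
    neigh = np.stack([np.roll(V, 1, axis=2), V, np.roll(V, -1, axis=2)])
    return np.where(neigh.max(axis=0) == V, V, 0.0)
```

On a single pixel with orientation responses (1.0, 0.5, 0.1, 0.0), one iterative step already suppresses the two weakest layers, while the max variant reduces any response profile to its local maxima in one pass.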

Figure 6.6: A simulation of lateral inhibition with α = 0.1 (a), α = 0.2 (b), α = 0.3 (c), α = 0.35 (d).

Figure 6.7: A simulation of lateral inhibition with α = 0.1 (a), α = 0.2 (b), α = 0.3 (c), α = 0.35 (d).

6.5 Projective operations in the VFA model

In the VFA model projective operations have been defined, which allow the computation of line segment crossings, vertices and endpoints.

6.5.1 Projections

Projective operations can be combinations of other operations of the VFA model. The general property of projective operations is that they reduce the dimensionality of their input data array. In general a projective operation P takes the data array V as input. A set of dimensions of the input data array is chosen, along which a commutative operator Op is performed on the elements. The projective operation can be written as in equation 6.5, repeated below:

C = P_Op(V). (6.21)

Suppose that the projective operation performs the operator along the third dimension (corresponding to orientations) of the data array V. In such a case the output of the operation can be written as:

C_{x,y} = Op{V_{x,y,1}, ..., V_{x,y,h(θ)}}. (6.22)

A binary masking vector m can be defined that makes it possible to choose the set of elements taking part in the projective operation. Let m ∈ {0,1}^{h(θ)}; equation 6.22 then becomes

C_{x,y} = Op{m_1 V_{x,y,1}, ..., m_{h(θ)} V_{x,y,h(θ)}}. (6.23)

The operator Op can be defined, for example, as a summation followed by a division by h(θ) to yield an average. It can also be defined as a max or min operator, or a logical or operator when working with the binary VFA model.

6.5.2 Corner and vertex detection

As discussed in section 4.2, corners and crossings are of particular importance in object recognition. For this reason the VFA model includes a corner, crossing and endpoint detection functionality (for brevity, these features will be called vertices), identical to that of complex functional units in the primary visual cortex, as discussed earlier. The vertex detection is built on the data array V of the VFA model. Projective operations are used to implement the desired functionality.
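The generic masked projection of eqs (6.22)–(6.23) is a one-liner in array form; the function name and its default operator are illustrative choices:

```python
import numpy as np

def project(V, op=np.max, mask=None):
    """Projective operation P_Op along the orientation axis, eqs (6.22)-(6.23).

    `mask` is a binary vector of length h(theta) selecting which layers take
    part; None means all layers participate (eq. 6.22)."""
    if mask is not None:
        V = V * np.asarray(mask, dtype=float).reshape(1, 1, -1)
    return op(V, axis=2)   # Op is any commutative reduction (max, min, sum, ...)
```

For example, `project(V, np.max)` collapses the orientation stack to a single orientation-independent contour map, while a mask such as [1, 1, 0, ..., 0] restricts the projection to selected orientation layers.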
Two compound projective operations are defined:

- Intersection extracting projective operation: P_X,
- Endpoint extracting projective operation: P_E.

The operation P_X extracts the crossings in the data array V and outputs them into a two-dimensional binary data array X. The operation P_E finds the endpoints in the data array V without respect to their orientations, and outputs them into a two-dimensional binary data array E:

X = P_X(V), E = P_E(V), (6.24)

with X_{x,y} = 1 and E_{x,y} = 1 meaning the presence of a crossing or an endpoint in the image location (x,y), respectively. The presence of a corner in the image location (x,y) can be obtained by a projection of X and E using an ∧ operator:

C = P_C(X, E). (6.25)

For a given image location (x,y), equation 6.25 becomes

C_{x,y} = X_{x,y} ∧ E_{x,y}. (6.26)

The meaning of equation 6.26 is that there is a corner in image position (x,y) if there is both a crossing and an endpoint in (x,y). The structure of the vertex detection VFA sub-model is shown in Figure 6.8. The vertex detection incorporates the detection of crossings (intersections) and endpoints, which is discussed below.

Detecting intersections

As presented in section 3.2.3, biological evidence shows that there are crossing-tuned complex functional units in V1 that respond depending on the orientation of the two lines composing the intersection. Such functional units receive input from orientation selective simple functional units in V1, modeled by the cognitive units in the data array V of the VFA model. Projective operations for the detection of orientation selective crossings have been introduced in the VFA model as part of the projective operation P_X, to find the intersections in V. Neighboring orientations are not considered, because the lateral inhibition suppresses the values in one of them. A series of operations P_I1, ..., P_In is defined for all such orientation pairs, resulting in a three dimensional binary data array T ∈ {0,1}^{x_max × y_max × n}, such that

T_{x,y,m} = (P_Im(V))_{x,y}, m ∈ [1,n] ⊂ ℕ. (6.27)

The operation P_Im represents two non-neighboring orientations θ_k1 and θ_k2 along the third dimension of the data array V, with k_i ∈ [1..h(θ)] ⊂ ℕ. The operation P_Im in image position (x,y) can be written as

T_{x,y,m} = (V_{x,y,k1} > t) ∧ (V_{x,y,k2} > t), (6.28)

where t is a threshold above which an edge is considered by the operation.
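The pairwise test of eq. (6.28) combined with the final projection over all pairs can be sketched as below. The `gap` parameter, encoding how far apart two orientation indices must be to count as "non-neighbouring" (including cyclically), is an assumption; the text only states that direct neighbours are excluded:

```python
import numpy as np

def detect_intersections(V, t=0.5, gap=2):
    """Compound projective operation P_X, eqs (6.27)-(6.29): a crossing is
    signalled wherever two non-neighbouring orientation layers both exceed
    the threshold t."""
    h = V.shape[2]
    X = np.zeros(V.shape[:2], dtype=bool)
    for k1 in range(h):
        for k2 in range(k1 + gap, h):
            if h - (k2 - k1) < gap:     # k1 and k2 are cyclic neighbours
                continue
            X |= (V[:, :, k1] > t) & (V[:, :, k2] > t)   # eq. (6.28), OR-ed
    return X
```

On a toy array with a horizontal line in one layer and a vertical line in a non-neighbouring layer, only their common pixel is reported as a crossing.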

Figure 6.8: Projective operations in the VFA model. The projections make it possible to extract complex visual features, such as crossings, endpoints, corners and vertices.

The data array T is in turn given to a simple projective operation P_u that applies an ∨ operator along the third dimension of T:

X_{x,y} = ⋁_{k=1}^{n} T_{x,y,k}. (6.29)

The output is written into the data array X of the VFA model, and is considered the output of the compound projective operation P_X. The intersections can also be obtained by using a projective operation with "+" as the internal operator, followed by a threshold operator. The threshold has to be set to give positive values for the image locations where the sum of the edge values in different orientation layers indicates the presence of an intersection. At this point the lateral inhibition presented earlier becomes important: thanks to it, neighboring orientations do not add up in the projective operation, which would otherwise cause false corner signals.

Detecting endpoints

Endpoint detection is the other important functionality of the vertex detection VFA sub-model, implemented by the compound projective operation P_E. The endpoint detection is done by a series of oriented endpoint sensitive projective operations P_ek, one for each orientation layer in V. The output of the operation is a boolean value obtained using a threshold value t. The operation P_ek can be written in the form of a convolution as

P_ek(V) = V_{·,·,k} ∗ F(θ_k, σ), (6.30)

where F(θ_k, σ) is a filter kernel obtained from the function f in equation 6.31 by setting the kernel center to f(0,0). The kernel is shown in figure 6.9.

f_{θ_k,σ}(x,y) = e^{−(x² + y²)/(2σ²)} (x cos(θ_k + π/2) + y sin(θ_k + π/2)). (6.31)

The endpoint detection functionality calculates the first-order directional derivative in terms of the directional selectivity of each θ_k-oriented layer. Values close to zero are obtained when there is no endpoint in the receptive field, and a larger value (in absolute value) otherwise. A temporary data array T is obtained after applying a threshold on the projection in equation 6.30:

T_{x,y,k} = 1 if (P_ek(V))_{x,y} > t, and 0 otherwise. (6.32)
The data array T is in turn given to a simple projective operation P_u that applies an ∨ operator along the third dimension of T:

E_{x,y} = ⋁_{k=1}^{h(θ)} T_{x,y,k}. (6.33)
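A sketch of the endpoint pipeline of eqs (6.30)–(6.33). The kernel radius, the default orientation sampling, and the use of the absolute response in the threshold (motivated by the "larger value in absolute value" remark above) are assumptions of this sketch:

```python
import numpy as np

def endpoint_kernel(theta, sigma=3.0):
    """Endpoint-sensitive kernel F(theta_k, sigma) of eq. (6.31): a Gaussian
    window times a linear ramp along the direction theta + pi/2."""
    half = int(3 * sigma)
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    ramp = x * np.cos(theta + np.pi / 2) + y * np.sin(theta + np.pi / 2)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * ramp

def convolve_same(img, ker):
    """'Same'-size zero-padded 2-D correlation helper."""
    kh, kw = ker.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(img, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += ker[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def detect_endpoints(V, sigma=3.0, t=0.5, thetas=None):
    """Eqs (6.30), (6.32)-(6.33): filter each orientation layer with its
    endpoint kernel, threshold, then OR along the orientation axis."""
    h = V.shape[2]
    if thetas is None:
        thetas = [k * np.pi / h for k in range(h)]   # assumed sampling
    E = np.zeros(V.shape[:2], dtype=bool)
    for k in range(h):
        resp = convolve_same(V[:, :, k], endpoint_kernel(thetas[k], sigma))
        E |= np.abs(resp) > t     # endpoints give large |response|
    return E
```

The resulting binary map E can then be combined with the crossing map X via eq. (6.26) to obtain the corner map C.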

Figure 6.9: The receptive field model used for endpoint detection.

The result is the data array E, which can directly be interpreted as the endpoints on the image, and can be used for the corner detection along with the data array X containing the intersections obtained above. The result of the corner detecting projective operations, compared with the fast corner detection algorithm proposed in [76, 77], is shown in appendix B.

6.6 Discussion

6.6.1 Comparison with edge detection operators

Because of the difficulty of obtaining ground truth for real images, the traditional technique for comparing low-level vision algorithms is to present image results side by side and to let the reader subjectively judge the quality. Scientifically this is not always a satisfactory strategy; in obvious cases, however, it can be convincing. A more rigorous comparison strategy, based on human rating experiments and the analysis of variance (ANOVA) technique, has been proposed by Heath [41]. Following this technique, 8 images were used in the experiment, listed in appendix B.1. The judges were shown the 8 series, each of them containing the original image and four contour maps in a 2×2 grid, obtained using the Canny algorithm [19], the Sobel operator, the Laplace operator, and the VFA model with the Gabor function based input filtering and lateral operations. Each algorithm had a random position on all the sheets containing the 8 series. The judges were asked to rate the images from 1 (low) to 5 (high). In the experiment two questions were asked. The first question was about the correlation of the edges and the contours, with the rating 1 meaning "edges seem to be without coherent organization into an object", and the rating 5 meaning "all relevant edge information for recognizing an object with no distracting edges". The second question concerned the cognitive relevance of the image processing results: the judges were asked to determine whether the images were pleasurable or comfortable to look at. Again the possible rating was from 1 (disturbing) to 5 (pleasurable).

Results

In the experiment, 17 people were asked to rate the 4 methods from 1 to 5 according to the 2 questions on 8 different images. In the first step the 8 images were pairwise compared using the ANOVA technique to see whether they give similar results for all the 4 methods, which would mean that the images have no effect on the results of the evaluation. It was revealed that the evaluation results can be divided into several groups according to the images, which suggests that the ratings obtained depend on the images as well, not only on the algorithms. Accordingly, the 4 methods cannot be evaluated with respect to all the images at the same time. This is why the evaluation has been performed on every image–question pair separately.

Using the ANOVA technique, the first step was to perform the F-test for all image–question pairs in order to see if there is a significant difference between the four methods. The results are shown in appendix B.1 in Tables B.1 and B.2. In the case of the first question all the images yield a significant difference between the methods at the 95% confidence level, while in the case of the second question the Synthetic house image does not provide a significant difference at the 95% confidence level. For this reason this image has been removed from the evaluation of the second question. The average ratings for the algorithms are shown in Tables 6.1 and 6.2 for the two questions, respectively. The rank of the VFA is also displayed to show its position compared to the other methods.

                  Canny   Laplace   Sobel   VFA   VFA rank
Drawn house         …        …        …      …       …
Fort                …        …        …      …       …
Synthetic house     …        …        …      …       …
Balls               …        …        …      …       …
Trash can           …        …        …      …       …
Plane               …        …        …      …       …
Camcorder           …        …        …      …       …
Stairs              …        …        …      …       …
Average             …        …        …      …       …

Table 6.1: Average rating to the first question for individual images and methods.
According to the results given to the first question, the conclusion is that the VFA model performs just as well as the Laplace operator, better than the Canny method, and worse than the Sobel operator. The differences, however, are not substantial. Another conclusion, looking at the results obtained for each image, is that the rating of each method highly depends on the image used for the rating. This means that there is something about each method that makes it better suited to one image or another.

                  Canny   Laplace   Sobel   VFA   VFA rank
Drawn house         …        …        …      …       …
Fort                …        …        …      …       …
Synthetic house     …        …        …      …       …
Balls               …        …        …      …       …
Trash can           …        …        …      …       …
Plane               …        …        …      …       …
Camcorder           …        …        …      …       …
Stairs              …        …        …      …       …
Average             …        …        …      …       …

Table 6.2: Average rating to the second question for individual images and methods. The total average excludes the insignificant image of the Synthetic house.

The results obtained for the second question measure the quality of the four methods from a different point of view: cognitive relevance. According to the results, the VFA model performs better than the Canny and the Laplace operators, and is comparable to the Sobel operator. One conclusion is that the output obtained from the VFA model is rather pleasurable to look at, which may be due to its cognitive relevance, causing only a small amount of cognitive collision in the brain. The other conclusion, again, is that the results strongly depend on the image used for the evaluation.

The experimental results discussed above show that the VFA model is comparable in output quality with other widely accepted methods. The other conclusion obtained for both questions, the high dependence on the image used for the evaluation, suggests that none of the methods follow exactly the natural cognitive processes used for contour detection. If such a method were conceived, the subjective human experiments would yield good, image-invariant ratings.

6.6.2 Evaluation of corner detection in the VFA model

In order to assess the quality of the corner detection functionality of the VFA model, a comparison with other corner detection methods is necessary. The most widely used and well known corner detection method was proposed by Harris. In the tests performed, however, the Harris operator performed poorly compared to the fast corner detection algorithm proposed by Rosten [76, 77]. For this reason, the latter has been used for the comparison with the VFA model. Four images have been chosen for the evaluation; the results of both algorithms are plotted side by side in Figure B.2 in appendix B.2, according to the Heath method.
The corners chosen by the algorithms are indicated by red dots. On the first image, of a hand-drawn house, the VFA model finds all the necessary corners, and 3 false corners. The Rosten algorithm misses 3 corners, and finds 2 false ones. On the second image, of the six balls, the Rosten algorithm finds 2 points in the center of the balls as corners, which is obviously false. On the other hand, the VFA model finds many points on the periphery of the balls. This can be considered either a success or a failure. From the architecture of the VFA, however, such a result is not surprising, and is even useful when the detected points are used as nodes in model acquisition for object recognition tasks. On the third image, of a wooden house, the two algorithms perform similarly; however, the Rosten algorithm finds many false and double corners, while the VFA model fails to find some of the corners. Finally, on the fourth image, of the wooden stairs, the Rosten algorithm again finds many false corners and misses quite a lot too, unlike the VFA model, which finds more real corners without any false detections. The obtained results show that the VFA model can be used in corner detection tasks, with comparable or even better results than other corner detection algorithms.

6.6.3 Computational complexity of the VFA model

The VFA model has been proposed to provide a cognitive computational model performing the basic functions of the primary visual cortex. As any model in informatics, the VFA model is intended to be implemented using computational tools; the most obvious tools are von Neumann-type computers. This necessitates the discussion of the computational complexity of the operations defined in the VFA model. As before, the image size is denoted by x_max × y_max, the size of the filter is p × q, and h(θ) denotes the number of different orientations. The number of iterations is denoted by i, where applicable.

Input filtering

The input filtering operation using Gabor functions is done by the application of 2D filters on the 2D input image, and yields a computational complexity of

O(x_max · y_max · p · q · h(θ)). (6.34)

The memory requirement of the data array V is x_max · y_max · h(θ) data units.

Lateral operations

Contour integration

The contour integration lateral operation performs a convolution of the contour integration filter kernel with the orientation layers of the data array V.
Suppose that the number of iterations necessary to reach the steady state is i and the size of the filter is p × q; the complexity of L_c then becomes

O(x_max · y_max · p · q · h(θ) · i). (6.35)

It is necessary to know the value of i in order to calculate the complexity. According to the graphs plotted in Figure A.5, the limit i ≤ 30 can be fixed to guarantee reliable convergence.

No extra memory is required for this operation.

Lateral inhibition

The lateral inhibition lateral operation L_i is performed iteratively between a pixel and its two neighbors in the neighboring orientation layers. The complexity of the operation before the steady state is reached becomes

O(x_max · y_max · h(θ) · i). (6.36)

It is necessary to know the value of i in order to calculate the complexity. According to the graphs plotted in Figures 6.6 and 6.7, the limit i ≤ 10 can be fixed to guarantee reliable convergence. No extra memory is required for this operation.

Projective operations

Intersection extraction

The projective operation to extract intersections, P_X, is composed of the projective operations P_Ik and P_u. The complexity of P_Ik is O(x_max · y_max), and k_max = h(θ)(h(θ) − 3)/2. The complexity of P_u is O(x_max · y_max · h²(θ)). The total complexity of P_X is thus

O(x_max · y_max · h²(θ)). (6.37)

The number h(θ) of orientation layers in the data array V is normally fixed at h(θ) = 18, which makes the complexity in equation 6.37 simplify to

O(x_max · y_max). (6.38)

The projective operation P_X requires extra temporary memory of x_max · y_max · h²(θ) data units.

Endpoint extraction

The endpoint extraction in the projective operation P_E is done by the application of an endpoint sensitive filter to each of the orientation layers. Supposing a filter size of p × q, the complexity of P_E becomes

O(x_max · y_max · p · q · h(θ)). (6.39)

Extra memory of x_max · y_max data units is required for this operation.

Corner extraction

The corner extraction is done by the projective operation P_C, which applies a logical operator between the corresponding elements of the data arrays X and E. Such an operation is of complexity

O(x_max · y_max). (6.40)

Extra memory of x_max · y_max data units is required for this operation.

The summary of the complexities of the operations implemented in the VFA model is shown in Table 6.3. For more compact expressions, solely in Table 6.3 x_max and y_max are replaced by x and y.

Name of operation         Sign   Computational complexity   Memory requirement
Input filtering           F_G    O(x·y·p·q·h(θ))            x·y·h(θ)
Contour integration       L_c    O(x·y·p·q·h(θ)·i)          0
Lateral inhibition        L_I    O(x·y·h(θ)·i)              0
Intersection extraction   P_X    O(x·y·h²(θ))               x·y·h²(θ)
Endpoint extraction       P_E    O(x·y·p·q·h(θ))            x·y
Corner extraction         P_C    O(x·y)                     x·y

Table 6.3: The computational complexities and memory requirements of operations in the VFA model.

6.6.4 Cognitive plausibility of the VFA model

One purpose of the proposed VFA model is to be functionally as relevant to the cognitive functions of the primary visual cortex as possible. There has been a great deal of disagreement about the unitary versus modular nature of the cerebral cortex. As described in section 4.1, the evolutionarily old parts of the cerebral cortex are modular, while new parts, such as the neocortex, are unitary. The visual cortex belongs to the older part of the cortex and, as such, shows a high level of modularity. This supports the cognitive plausibility of the modular VFA model.

The proposed VFA model is based on data arrays that are organized along orthogonal dimensions. This implies that the locations of the data array elements do not follow the functional maps observed in the primary visual cortex, described earlier. According to [43], however, a highly visual mammal, the grey squirrel, does not have orientation maps in its visual cortex. This supports the view that a model that is functionally identical with V1, but does not show a mapped organization, is correct with respect to the cognitive functions. In the following points the cognitive plausibility of the components of the VFA model will be discussed.

Data arrays

Data arrays in the VFA model represent the output values of the basic functional units in the visual cortex. These units are anatomically organized in a sponge-like network, but their output characteristics can be organized into orthogonal dimensions on a functional basis; data arrays represent such an organization. Since data arrays do not necessarily represent the locality of the basic functional units in the cortex, they are bio-functionally plausible. The lack of feature maps in the data arrays is also supported by the grey squirrel, a highly visual mammal without an orientation map in its visual cortex.

Filtering operations

Filtering operations determine how the basic visual features find their way into the data array of the VFA model. According to the research results of Daugman, the input weight distribution of V1 simple functional units (their receptive fields) can be described by Gabor functions. The size of the receptive fields increases towards the peripheral regions of the visual field. An earlier section gave an input filtering in which Gabor functions are used, while another gave the model with increasing receptive field sizes.

Lateral operations

The functional units in the visual cortex interact with each other, creating an information link between them. This functional link is implemented by means of the lateral operations of the VFA model. As a result, the contour integration and lateral inhibition functions observed in the visual cortex are performed by the VFA model as well.

Projective operations

Projective operations can implement cognitive functions that summarize the properties of cortical columns. They are functional models of the input characteristics of columns composed of complex functional units in the visual cortex.

Chapter 7

Opto-mechatronical computation of the VFA model

The VFA model lends itself to implementation on a von Neumann-type computer, by virtue of its data and operation structure. With the emergence of high complexity computer systems, the computational capacity required by the VFA model can be satisfied. A real-time, scalable execution of the implemented VFA model is, however, not possible even with the fastest computers. The visual cortex solves problems of a complexity similar to the VFA model in a fraction of a second, even though the response time of neurons is in the range of several milliseconds. The fast processing ability of the brain is provided by the vastly parallel network of simple computational units, the neurons. It is not only the large number of neurons that allows fast computation, but also the large number of connections between them: the output signals of a set of neurons reach a set of neurons that is three orders of magnitude larger. Such a high, simultaneous connectivity is not possible in today's computer systems, and has to be replaced by temporally successive operations, at the expense of speed.

7.1 Oriented motion blur filtering

In this section a filtering based on motion blur is proposed, which logically links hundreds of pixels, and calculates the data array V of the VFA model, including contour integration. A hardware implementation of the motion blur based filter is also proposed, allowing a real-time, scalable computation. The proposed hardware implements the logical link between pixels in a simultaneous and analog way.

7.1.1 Filter model

The goal of the proposed filter is to use oriented motion blur to filter oriented edges from an image. The main idea is that an image blurred along an orientation will preserve the edges parallel to the blur orientation, while other edges are attenuated by the blur and disappear from the image.

A simple edge detection algorithm applied to a motion blurred image will thus contain only edges whose orientation is close to the orientation of the blur. This phenomenon is used for orientation selective edge detection.

Let the projected image be represented by the two-dimensional light intensity function f_i(x,y) such that

    x, y ∈ ℝ,  f_i(x,y) ∈ ℝ⁺.    (7.1)

The motion blur can be defined by the path along which the projected image is translated relative to the projection surface. Since the translation can be done in two independent dimensions, such a motion is described by two time-dependent functions s_x(t) and s_y(t):

    s(t) = ( s_x(t), s_y(t) ),    (7.2)

where t is the time. The motion blurred image function m(x,y) can in turn be defined for each point (x,y) = p as

    m(p) = ∫₀^{t_m} f_i(p + s(t)) dt,    (7.3)

where t_m denotes the time limit of the integral. Let us suppose from here on that s̈(t) = 0, i.e. s(t) is a steady straight motion. The motion with an orientation θ is denoted by s_θ(t). Given this assumption, equation 7.3 can be written as

    m_θ(p) = ∫₀^{t_m} f_i(p + s_θ(t)) dt.    (7.4)

Consider an image with a light and a dark side divided by a straight horizontal contour. A pair of points, p₁ and p₂, is chosen from the image such that their distance d = |p₁ − p₂| is infinitesimally small. The motion s_θ(t) is performed starting from the two points, such that the points p₁ + s_θ(t_m/2) and p₂ + s_θ(t_m/2) are found on the dark and light sides of the contour line respectively (Figure 7.1). The two values m_θ(p₁) and m_θ(p₂) represent the corresponding intensity values on the motion blurred image, and are obtained using equation 7.4. As a function of θ, m_θ(p₁) is strictly monotonically increasing on the interval θ ∈ [0°, 180°], while m_θ(p₂) is strictly monotonically decreasing on the same interval. When the motion is parallel with the contour, i.e. θ = 0° or θ = 180°, the two points have different values: m_θ(p₁) > m_θ(p₂) = 0 or 0 = m_θ(p₁) < m_θ(p₂).
When the motion is perpendicular to the contour, i.e. θ = 90°, the two points have the same value: m_θ(p₁) = m_θ(p₂). The absolute difference in illumination between the two points can be written as |m_θ(p₁) − m_θ(p₂)|, and it yields the magnitude of the gradient on the motion blurred image. This function has a maximum when θ → 0° or θ → 180°, and has a minimum of 0 when θ → 90°.

A CCD or CMOS image sensor can be used to perform the integral in equation 7.4. The output of the image sensor is a matrix M^(θ) that contains

Figure 7.1: The motion blur at two different angles causes the ratios between the integrals of p₁ and p₂ to be different.

the blurred image:

    M^(θ)_{x,y} = ∫_{p ∈ A_{x,y}} m_θ(p) dp,    (7.5)

where x, y denote the pixel coordinates, and A_{x,y} denotes the set of points covered by the area of the corresponding pixel. Let the absolute gradient of the captured image M^(θ) be denoted by E^(θ), which is obtained from M^(θ) using the quadratic mean of the two directional differentials of M^(θ):

    E^(θ) = √( (∂M^(θ)/∂x)² + (∂M^(θ)/∂y)² ).    (7.6)

Note that equation 7.6 can be replaced by any classical linear filtering based edge detection operator, such as the Prewitt or the Sobel operator. The function E^(θ)_{x,y} has high values in those x, y positions where the original image I, captured without motion blur, would contain high contrasts oriented parallel with θ.

The length of the blur d = |s(0) − s(t_m)| allows a tradeoff to be adjusted between the orientation accuracy and the contour length sensitivity of the filter. Choosing d to be short yields ambiguous orientation detection but allows the detection of short contours, while a large d gives accurate orientation detection but cuts off short contours, even those parallel with the blur orientation. The accuracy should be adjusted empirically so that neighboring orientations slightly overlap.

Contour integration

The use of the motion blur filter naturally implements the cognitive functionality of contour integration, and also facilitates crossing detection. Consider the case where two collinear contours are separated by a gap with a span smaller than the span d of the motion blur. In such a case the blur parallel with the two contours will create an edge between them, which will emerge as a contour in the edge detection process.
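As a rough software sketch of the pipeline just described, the oriented blur of equation 7.4 can be approximated by averaging integer-shifted copies of the image, followed by the gradient magnitude of equation 7.6. The whole-pixel shifts and wrap-around borders of np.roll are simplifications not present in the continuous formulation.

```python
import numpy as np

def oriented_motion_blur(img, theta_deg, length):
    """Approximate m_theta by averaging copies of the image shifted along
    the orientation theta; a discrete stand-in for the integral in
    equation 7.4 (shifts snap to whole pixels, borders wrap around)."""
    theta = np.radians(theta_deg)
    acc = np.zeros_like(img, dtype=float)
    for t in range(length):
        dx = int(round(t * np.cos(theta)))
        dy = int(round(t * np.sin(theta)))
        acc += np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    return acc / length

def gradient_magnitude(m):
    """Quadratic mean of the two directional differentials (equation 7.6)."""
    gy, gx = np.gradient(m)
    return np.sqrt(gx**2 + gy**2)

# A horizontal step edge survives a blur parallel with it (0 degrees)
# but is attenuated by a perpendicular (90 degree) blur.
img = np.zeros((32, 32)); img[16:, :] = 1.0
e_parallel = gradient_magnitude(oriented_motion_blur(img, 0, 8))
e_perpendicular = gradient_magnitude(oriented_motion_blur(img, 90, 8))
print(e_parallel.max() > e_perpendicular.max())  # True
```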

Contours at endpoints will be prolonged by the motion blur, which converts corners into crossings. Also, at endpoints, a gradual degradation of the contour intensity can be experienced. This makes endpoint detection possible using the endpoint detecting projective operation presented in section 6.5.1, by adding a bias to the endpoint detection filter in equation . Examples obtained using motion blur for oriented edge detection and contour integration are shown in Figure A.8 in appendix A.

Computation in the VFA model using the oriented motion blur filter

In section the orientation selective filtering operation of the VFA model was proposed. The filtering uses Gabor functions according to equation . The Gabor function was chosen because of its strong biological plausibility. The computational complexity of applying Gabor filters of all orientations to the input image to obtain the data array V of the VFA model has led the author to search for faster solutions, even at the expense of a loss in quality and biological plausibility. The application of oriented motion blur in place of Gabor filtering is possible if the data array V is made up from the data arrays E^(θ) obtained in equation 7.6:

    V_{x,y,k} = E^(θ_k)_{x,y},  k ∈ [1, h(θ)] ⊂ ℕ⁺.    (7.7)

This solution not only substitutes the Gabor filtering in equation 6.11, but also makes the iterative contour integrating lateral operation L_c in equations 6.15 and 6.16 unnecessary, all of them being computationally complex, as discussed in section .

7.2 Opto-mechatronical device for motion blur filtering

In this section a hardware implementation is proposed, which reduces the computational load by several orders of magnitude. The oriented motion blur can be calculated by a computer, which requires x_max · y_max · p additions, where x_max, y_max are the size of the image in pixels, and p is the length of the motion blur in pixels.
In the case of large images of several megapixels and motions through several tens or hundreds of pixels, the computation of the motion blur requires a large amount of computational capacity. The idea of the hardware proposed in this section is that the light integration ability of the CCD pixels is used instead of the arithmetic unit of the CPU. The projected image is translated relative to the image sensor according to some motion s(t). The afferent light on each CCD pixel is simultaneously integrated according to equations 7.4 and 7.5, with t_m being equal to the exposure time. The captured image of the image sensor (the output image) will

Figure 7.2: The mirror can be rotated around axis a₁, causing a translation of the projected image on the image sensor. The mirror and axis a₁ can be rotated around axis a₂, which modifies the orientation of the translation caused by the rotation around a₁.

thus be M^(θ_k). A Sobel filter based edge magnitude detection is performed on the output image M^(θ_k), the result of which contains the contours parallel with the motion s(t).

Motion blur hardware principle

A hardware tool has been conceived to perform the above described blurring operation, inspired by the small vibrations (tremors) of the eyes. A mirror is attached to a platform using high precision ball bearings, which allow a rotational motion around the axis a₁, perpendicular to the normal vector of the mirror plane. The platform is in turn attached to a mount using the same high precision bearings as the mirror. The platform, along with axis a₁, can be rotated around axis a₂, which is parallel with the normal vector of the mirror plane (Figure 7.2). The rotation of the mirror around a₁ causes the image projected on the image sensor to be translated in a certain direction. The direction of the image translation can be adjusted by rotating the orientation of a₁ together with the mirror platform around a₂.

Implementation of the hardware

A mechanical frame has been constructed to implement the previously presented principle (Figure 7.3). Both axes are rotated using stepping motors controlled from a PC. This allows direct operation and integration of the frame with further image processing done on the PC. The motion s(t) causing the motion blur should in practice start and end at t values that coincide with the beginning and end of the image sensor's exposure. A synchronization of the motion and the exposure is thus necessary. This is rather difficult and requires sophisticated solutions.
A much better solution is to make the mirror perform a periodic motion s_p(t), with a time period several times shorter than the exposure time of the camera. The integration of s_p(t) is carried out over a time interval that is an exact multiple of its own period,

Figure 7.3: The mechanical hardware that moves the mirror.

and thus the result will be equal to the motion s(t) integrated between the two endpoints of s_p(t). This is true only if the motion s_p(t) has a constant speed between its endpoints, which requires infinite acceleration at the turnarounds. In practice this can only be approximated, not achieved. In the implementation the stepping motor connected to axis a₁ is controlled to step back and forth, which makes the mirror tilt back and forth in a fixed orientation. The exact motion characteristics of the mirror depend on the motors and their control circuits. The other motor is used to turn the whole platform around axis a₂, which also turns the orientation in which the mirror is vibrating.

7.3 Experimental results

Hardware evaluation

The characteristics of the periodic motion performed by the mirror are very important. A sinusoidal motion of the mirror yields undesirable results. This is caused by the slow motion around the extremities of the span of the vibration, which produces high values in the integral of equation 7.4 at the endpoints of the vibration. The subsequent edge detection of equation 7.6 will then detect false contours, which is undesirable. Based on the problem described above, the amount of time spent at the extremities of the periodic motion should be rather short. This requires high acceleration at the endpoints, which is a challenge in the construction of the mechanical hardware. The motion of the mirror depends largely on the mechanical characteristics of the stepping motor and its control, which is rather difficult to implement properly. It is, however, possible to measure the motion and decide whether it is a satisfactory replacement for the original motion function s(t), necessary for an acceptable motion blur quality.

To measure the motion characteristics of the mirror, a laser beam was reflected from the mirror onto a white canvas. A camera was used to capture the periodically moving red point on the canvas. The optical center of the camera was shifted perpendicular to the vibration direction during the exposure. The result is a time function of the mirror motion, which allows the analysis of the system. There are two main requirements for the vibration:

- the period should be an integer fraction of the image sensor's exposure time;
- the time spent at the extremities should be low compared to the center.

Figure 7.4: The time function of the mirror vibration with a time period of 62.5 ms.

In Figure 7.4 the image of the vibrating laser beam is shown. The exposure time was 500 ms, during which 8 periods were captured. The frequency of the vibration is thus 16 Hz. This requires the exposure time to be chosen as t_m = n · 62.5 ms, n ∈ ℕ. The second requirement is also satisfied by the vibration in Figure 7.4. In Figure 7.5 the percentage of the period spent at each position is shown for the measured vibration, along with the percentage of total weight at each position of the Gabor function based filter kernel used for contour integration earlier in the dissertation. Apart from some noise in the mirror vibration, the two curves are similar. This can be considered convincing evidence that the vibration of the proposed hardware is appropriate and fulfills the second requirement. A better vibration would, however, yield even better results in the oriented contour detection.

Oriented contour detection

Images were taken while the mirror was vibrated with the characteristics discussed above. Images were reflected into and captured by a camera as shown in Figure 7.6. The mirror platform was rotated around a₂, stopping for an exposure every 10 degrees. The captured images were transferred to the PC, where a gradient detection was performed on them using a Sobel filter. The edge map was also converted to a binary image using a threshold. Some test images obtained from the experiments are shown in appendix A.3 in Figures A.9 and A.10.
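The effect behind the second requirement can be illustrated numerically: a sinusoidal vibration spends a disproportionate share of its period near the extremities of the span, while a constant-speed (triangular) motion spends its time uniformly. A small sketch, with an arbitrarily chosen 10% band around the endpoints:

```python
import numpy as np

# Fraction of one vibration period spent near the endpoints of the span,
# for a sinusoidal versus a constant-speed (triangular) mirror motion.
t = np.linspace(0, 1, 100000, endpoint=False)
sinusoidal = np.sin(2 * np.pi * t)          # position in [-1, 1]
triangular = 2 * np.abs(2 * t - 1) - 1      # constant-speed sweep in [-1, 1]

near_end_sin = np.mean(np.abs(sinusoidal) > 0.9)
near_end_tri = np.mean(np.abs(triangular) > 0.9)
print(near_end_sin > 2 * near_end_tri)  # True: the sinusoid lingers at the extremes
```

This dwell-time excess at the extremities is what produces the false contours described above.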

Figure 7.5: The percentage of the period spent at each position is shown in the graph below the time function of the vibration. The same is shown for the contour integration filter. There is a similarity between them; however, there is an undesirable noise on the time function of the vibration.

Figure 7.6: The test setup to capture motion blurred images. The image is reflected into the camera by the vibrating mirror.

As for the quality of the orientation selective contour filter, it can be compared to other methods by a subjective evaluation. Based on the images shown in Figures A.9 and A.10, it can be stated that they contain most of the target contours and only a small amount of non-target contours.

The computational requirement of the processing is lower than that of convolution based methods. To arrive from a scene at an orientation selective contour map, three phases have to be completed: image acquisition, which takes t_m time; motion blur, which takes t_blur time; and edge detection, which takes t_edge time. The total time required can be written as t_all = t_m + t_blur + t_edge. In the experiments a 6 megapixel camera was used, and the motion blur spanned 80 pixels. The exposure time was set to the smallest possible value, t_m = 62.5 ms. The edge detection took approximately t_edge = 100 ms. The mirror based motion blur is done simultaneously with the exposure, which means that in this case t_blur = 0 ms, and the total time required is t_all = t_m + t_blur + t_edge ≈ 162.5 ms. The calculation of the motion blur on the 6 megapixel image through 80 pixels took t_blur = 5000 ms on a PC. This yields a total time of t_all = t_m + t_blur + t_edge ≈ 5162.5 ms, which is about 30 times longer than that obtained using the vibrating mirror. These results are summarized in Table 7.1.

            t_m       t_blur    t_edge    t_all
    Mirror  62.5 ms   0 ms      100 ms    162.5 ms
    PC      62.5 ms   5000 ms   100 ms    5162.5 ms

Table 7.1: The time required for orientation selective edge detection.

Note that a simple, fast motion blur calculated by the PC will contain numerical errors. This is because discrete pixel values are added, whereas the actual motion is continuous and does not pass through pixel centers. The mirror based motion blur, on the other hand, is independent of the pixels of the image sensor, and the image is digitized only after the blur is available.
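The timing comparison reduces to trivial arithmetic; the values below are taken directly from the experiments reported above.

```python
# Total processing time t_all = t_m + t_blur + t_edge for the two setups
# (values in milliseconds, as reported in the experiments).
t_m, t_edge = 62.5, 100.0
t_all_mirror = t_m + 0.0 + t_edge    # blur happens during the exposure itself
t_all_pc = t_m + 5000.0 + t_edge     # blur computed in software on the PC
print(t_all_mirror, t_all_pc, round(t_all_pc / t_all_mirror, 1))
# 162.5 5162.5 31.8
```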
7.4 Discussion

Computational complexity and quality

Computational complexity

The proposed opto-mechatronical filtering method implemented using the developed hardware device can perform the input filtering operation F of the

VFA model at O(x_max · y_max · h(θ)) complexity. This follows from the two components of the oriented motion blur based method: motion blur and Sobel filtering. The first component is performed in constant time, while the second component requires O(x_max · y_max · h(θ)), which is thus the overall complexity of the opto-mechatronical filtering method. Comparing this with the complexity of the Gabor function based VFA model, a considerable reduction has been achieved. The two computationally most expensive operations in the VFA model, the input filtering and the contour integration, are replaced by the oriented motion blur filtering implemented in the vibrating mirror based hardware. This reduction can be expressed as

    O(x_max · y_max · p · q · h(θ) · i)  →  O(x_max · y_max · h(θ)).

Image quality

In order to assess the quality of the proposed opto-mechatronical method, its results are plotted next to those obtained by the VFA model using Gabor functions and lateral operations. The comparison of the latter with widely accepted methods is discussed in section 6.6. The results are obtained from three gray scale input images using both methods. The comparison is done by a widely used subjective method [41]: it is up to the reader to judge the quality of the two approaches. The author would, however, add some notes. The images for the comparison are shown in appendix B.3 in Figure B.3, where it is visible that the major edges are found by both approaches; however, the opto-mechatronical approach gives a less smooth result. Corners are represented as line segment crossings in both cases. Some smaller contour segments are missing in the opto-mechatronical approach. The contour lines are thicker in the VFA based results.

Robustness against noise

The vibrating mirror has the effect of a directional low-pass filter, which is very advantageous for noise reduction. Test simulations of the opto-mechatronical filtering were performed with noise-added input images.
The noise models used were Gaussian, Poisson, salt & pepper, and speckle. The contour maps obtained from the experiments did not show any significant differences from the noiseless case. The contour maps were then subjected to the corner detection operations of the VFA model in order to evaluate the effect of noise on corner detection using the opto-mechatronical filtering. The results were again quite promising: the same corners were recognized as in the noiseless case. In some cases even more correct corners were recognized than in the noiseless case, as shown in Figure B.4. The conclusion is that the opto-mechatronical filtering also eliminates noise, and performs robust contour and corner detection. However, it should be noted that in a real application the input is not an image, but a real scene that is projected onto the image sensor of a camera. As such, the opto-mechatronical filtering can eliminate noise caused by the image sensor and the optics of the image acquisition system.
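The noise-suppressing effect of the directional blur can be illustrated in simulation: averaging along the blur direction leaves a parallel edge intact while reducing the noise variance roughly in proportion to the blur length. A minimal sketch with additive Gaussian noise (the image, noise level and blur length are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
img = np.zeros((64, 64)); img[32:, :] = 1.0
noisy = img + rng.normal(0.0, 0.2, img.shape)

# Averaging L shifted copies along the horizontal blur direction acts as
# a directional low-pass filter: the horizontal edge is preserved while
# the noise variance shrinks roughly by a factor of L.
L = 16
blurred = np.mean([np.roll(noisy, s, axis=1) for s in range(L)], axis=0)

noise_before = np.std(noisy[:16] - img[:16])    # flat region, noise only
noise_after = np.std(blurred[:16] - img[:16])
print(noise_after < noise_before / 2)  # True
```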

7.4.2 Comparison with other image processing hardware solutions

In this section two solutions are considered that aim at very fast edge detection, and the proposed opto-mechatronical solution is compared with them.

Correlation Image Sensor

The correlation image sensor (CIS) has been developed at the University of Tokyo by Shigeru Ando [2]. It performs analog operations simultaneously on each pixel, allowing certain image processing tasks to be performed at a very high speed. The CIS is based on the hardware realization of the correlation integral of two functions f(t) and g(t) within the time interval t_m:

    R_fg(t) = ∫_{t−t_m}^{t} f(t₁) g(t₁) dt₁ ≡ ⟨f(t) g(t)⟩.    (7.8)

The integral is calculated in every pixel by an electronic circuit. The circuit belongs to one pixel and calculates the correlation integral between a voltage signal f(t) received from a CCD cell and a reference signal g(t). The correlation image sensor is a device that performs correlation integrals in a general sense, without sampling the incident light intensity or the reference signals. All inputs except the clocks are analog and continuous.

The correlation image sensor can be applied to feature extraction from images. Rotating the viewing direction around a small circle, and introducing sinusoidal functions as reference signals, makes the CIS detect differently oriented contours in the image [3]. Let ω and ε be the angular velocity and radius of the sensor motion respectively. Using cos ωt and sin ωt as reference signals, 0° and 90° edge operators are obtained:

    ⟨f_{x,y}(t) cos ωt⟩ = (ε/2) o_x,  ⟨f_{x,y}(t) sin ωt⟩ = (ε/2) o_y,    (7.9)

where o_x and o_y are the directional derivatives of the input image I. Shifting the phase of the reference signal causes the CIS to detect edges of other orientations.

Comparison with the opto-mechatronical filtering device

A vibrating mirror has also been applied by S. Ando with his correlation image sensor.
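A discrete numerical sketch of the readout in equations 7.8 and 7.9: for a pixel viewing a locally linear intensity ramp while the viewing direction circles with radius ε, correlating the pixel signal with cos ωt and sin ωt recovers the scaled directional derivatives. The ramp slopes below are arbitrary illustrative values, and the discrete averaging stands in for the analog integral.

```python
import numpy as np

# Pixel signal for a linear ramp I(x, y) = a*x + b*y viewed along a small
# circle of radius eps (DC term dropped): f(t) = a*eps*cos(t) + b*eps*sin(t).
a, b, eps = 3.0, -2.0, 0.01
t = np.linspace(0, 2 * np.pi, 10000, endpoint=False)
f = a * eps * np.cos(t) + b * eps * np.sin(t)

corr_x = np.mean(f * np.cos(t))   # ~ (eps/2) * dI/dx
corr_y = np.mean(f * np.sin(t))   # ~ (eps/2) * dI/dy
print(round(corr_x / (eps / 2), 3), round(corr_y / (eps / 2), 3))  # 3.0 -2.0
```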
In his work the vibration is used to induce a periodic motion of image regions over the CMOS pixels, in turn producing electronic signals with which a reference signal is correlated. The main difference between the proposed opto-mechatronical filtering device and the CIS system of Ando is the purpose of the vibration. In the case of the CIS the vibration is necessary to induce an alternating signal in each pixel, while in the proposed hardware of this dissertation the vibration causes an oriented blur.

The other difference is the underlying image sensor: in the case of the CIS, a special analog image sensor is required, while in the proposed opto-mechatronical filtering hardware any commercial camera works well. In this respect the proposed system is simpler and more cost effective. In terms of the logical link between pixels achieved by the mirror, the CIS links a few pixels in a continuous way, which is required for the correlation in the sensor. It is an important difference that in the CIS the vibration has to be perpendicular to the edge to be detected, while in the opto-mechatronical hardware the vibration is parallel with the edges that are detected. Also, the CIS cannot perform contour integration, which is a major drawback compared to the solution of this dissertation. In terms of speed it is hard to compare the two systems, since the CIS performs analog edge detection, while in the proposed system this is done by the computer. One main factor of the speed at which they operate is the period of the mirror vibration, which leads to a mechanical discussion that is out of the scope of this dissertation.

The CNN Universal Chip

The Cellular Nonlinear/Neural Network (CNN) was invented by Leon O. Chua and Lin Yang at the University of California, Berkeley. The CNN Universal Machine is the computer architecture of the CNN, which is physically implemented in the CNN Universal Chip. The CNN Universal Chip is the first operational, fully programmable dynamic array computer. This computer consists of an array of micron-scale CMOS cell processors, each of them having a direct optical input, communication and control circuits, and analog and logic memories. Each CNN cell is connected with its nearest 8 neighbors and with the output interface. This parallel focal-plane array computer is able to process 3 trillion equivalent logical operations per second [21].
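The 8-neighbor connectivity of a CNN cell array can be sketched as a 3×3 template operation. The snippet below shows one discrete step of such an array; the feedback template values are illustrative assumptions, not a published CNN gene, and the continuous cell dynamics of the real chip are not modeled.

```python
import numpy as np

def cnn_step(state, A):
    """One discrete step of a CNN-style cell array: every cell combines
    its own state with its 8 neighbours through the 3x3 template A
    (toroidal borders via np.roll, for brevity)."""
    out = np.zeros_like(state)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            shifted = np.roll(np.roll(state, dy, axis=0), dx, axis=1)
            out += A[dy + 1, dx + 1] * shifted
    return out

# Illustrative template: average of the 4-connected neighbours.
A = np.array([[0.0, 0.25, 0.0],
              [0.25, 0.0, 0.25],
              [0.0, 0.25, 0.0]])
state = np.zeros((8, 8)); state[4, 4] = 1.0
out = cnn_step(state, A)
print(out[4, 3], out[4, 4])  # 0.25 0.0
```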
Discussion of the CNN and the opto-mechatronical filtering device

The goals of the CNN and of the proposed opto-mechatronical device are very different. The CNN can perform a wide range of operations, while the proposed hardware only performs oriented edge detection. It is, however, interesting to make a comparison in terms of the pixels logically linked in one step. The CNN hardware logically links the 9 neighboring pixels to the output of one pixel. This number is much larger in the opto-mechatronical filtering device, being determined by the amplitude of the vibration. The large number of collinearly linked pixels explains the contour integration ability of the proposed device.

7.4.3 Cognitive plausibility of the opto-mechatronical filtering

At first sight, the proposed opto-mechatronical filtering method is far removed from cognitive science. Functionally, however, it performs the operations done by the simple functional units of the primary visual cortex. A more important thing to note is that the outstanding computational performance of the brain is the result of the high number of connections between the basic processing elements. Artificial parallel computational tools with a low number of information links between their processing elements are unlikely to show brain-like performance. The proposed opto-mechatronical filtering implements an information link between hundreds of pixels, and in this respect it shows a high cognitive plausibility.

Chapter 8

Cognitive informatics model of the eye-retina system

In this chapter a novel edge-filtering method is proposed, in accordance with the latest findings in cognitive research. One thing the majority of classical (Gaussian, Sobel, Prewitt, Laplace, etc.) and biology inspired [38] [88] image filtering methods used in image processing have in common is that the output image is computed by convolution based linear filtering of the input image with a filtering mask (or kernel). This has two important implications as regards the output image:

- the input and output images have the same spatial resolution;
- a given pixel value in the input image influences several pixel values in the output image.

According to findings in retinal research, the convolution-based computational philosophy is biologically inadequate. This is well demonstrated by the fact that the number of retinal cones (photoreceptors sensitive to color) is about 6 million and the number of rods (photoreceptors sensitive to light regardless of its color) is about 125 million, while at the same time the number of ganglion cells sending the visual information to the brain is only about 1.2 million, as discussed in section . This fact is in contradiction with the first implication stated above, and suggests that image processing as well as massive image compression (also referred to as convergence) already takes place in the retina [6] [51]. As discussed in section 3.1.1, Packer and Dacey [66] have shown that the receptive fields of ganglion cells of the same type do not overlap in the central fovea. The centers of these receptive fields are located at a distance of one diameter from each other, which suggests a mosaic-like arrangement. This finding is in accordance with the convergence described earlier in [6] [51]. The results of Packer and Dacey seem to be in conceptual contradiction with the classical convolution based linear filtering methods, also indicating that the latter are biologically inadequate.
The goal of this chapter is to provide a model of cognitive informatics that fills the conceptual gap described above.

Figure 8.1: Overlapping and non-overlapping filtering architecture.

8.1 Non-overlapped receptive field model

Based on the findings concerning the mosaic arrangement of foveal ganglion cell receptive fields, a cognitive informatics model is proposed which uses no overlaps between filter matrices (Figure 8.1). A filter kernel F ∈ ℝ^{3×3} is defined, which is responsible for the actual operation performed by the modeled ganglion cell. The model poses no constraint on the exact content of the kernel, but if a retina-like functionality is required, an edge detection kernel (Sobel, Prewitt, etc.) should be applied. The input image I ∈ ℕ^{x_max × y_max} is tiled in a mosaic arrangement by the filter kernel F in a non-overlapped manner. The center of each instance of F is at a distance of three pixels from each of the four neighboring filter kernel centers. The image pixel values are multiplied by the corresponding values of F, and a weighted sum is calculated for each filter kernel center. The matrix E ∈ ℕ^{x_max/3 × y_max/3}, composed of the weighted sum outputs of all filter kernel centers, provides the output of the non-overlapped filtering operation. The tiled arrangement of the filters and the processing structure is shown in Figure 8.2. Since the size of the filter F is 3×3 pixels, the output image E obtained by non-overlapped filtering will be 3 times smaller in both dimensions than the original image I. Such a shrink in size means that the number of pixels in E is 9 times smaller than in the original image I. This reduction in the output image size corresponds to the law of convergence, and supports the biological plausibility of the proposed filtering model.
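The non-overlapped filtering described above can be sketched directly in NumPy: the image is reshaped into disjoint 3×3 tiles and each tile is reduced to a single weighted sum. The Laplacian-style kernel is only a stand-in for the edge detection kernel mentioned above. Note that in this example a step edge lying exactly on a tile boundary produces no response at all, which is precisely the failure mode discussed next.

```python
import numpy as np

def nonoverlapped_filter(img, F):
    """Tile the image with the 3x3 kernel F in a mosaic arrangement:
    each output value is the weighted sum over one disjoint 3x3 tile,
    so the output is 3x smaller in each dimension."""
    h, w = img.shape
    h3, w3 = h - h % 3, w - w % 3
    tiles = img[:h3, :w3].reshape(h3 // 3, 3, w3 // 3, 3)
    return np.einsum('iajb,ab->ij', tiles, F)

# Laplacian-style 3x3 kernel as a stand-in edge detector
F = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=float)
img = np.zeros((12, 12)); img[:, 6:] = 1.0   # step edge on a tile boundary
E = nonoverlapped_filter(img, F)
print(E.shape, np.abs(E).max())  # (4, 4) 0.0  -- the edge goes undetected
```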
In the non-overlapped filtering model it can easily occur that a change in contrast falls precisely in between two neighboring filter matrices, thus remaining undetected, as shown in Figure 8.3.

8.2 Optical overlap of receptive fields

The non-overlapping model used to generate the images in Figure 8.3 performs too poorly to be considered biologically valid: our visual perception gives us a much better picture of our environment. The literature does not explain how the quality of images can be guaranteed by the retina, given the difficulties caused by this non-overlapping architecture. Research results of Hammett [34] report that moving or flickering percepts of blurred edges

Figure 8.2: Non-overlapping processing structure. The input image is tiled using a mosaic arrangement of the filter matrix F in a non-overlapped manner.

Figure 8.3: The convolution based edge detection (left) and the non-overlapped filtering (right).

seem less blurred to human observers, but this is caused by the temporal differences between the magnocellular and parvocellular pathways, and not by the spatial arrangement of ganglion receptive fields.

The starting point of the model is that high-quality edge detection can be achieved only when assuming that receptive fields overlap. Taking into consideration the fact that ganglion cells of the same type do not physically overlap in the fovea, but also considering that such overlaps would otherwise be required, it is plausible to assume that the necessary overlaps are "computed" by means of optics, as the light travels through the eye lens. The visual information projected in between two non-overlapping receptive fields remains undetected. This information has to be directed towards the centers of the surrounding receptive fields. The lens of the eye is far from perfect: it has optical aberrations and performs a fluctuation around the mean accommodation during steady viewing [84], as discussed in section . These two effects cause the image of a spatial point P to be projected onto a small spot. The spot can provide the necessary information link between the center and periphery of the non-overlapped filter kernels. In order to support this hypothesis, the RMS spot size is calculated using a schematic eye model, described below.

Schematic eye model

The eyeball is a very complex, non-rigid optical system. Several different schematic optical eye models have been proposed in the scientific literature [63] [31] [27] [32]. All of these models have their own merit in describing the optical and physiological properties of the human eye. Figure 8.4 shows the layout of the analyzed accommodation-dependent schematic human eye model, which is a simplified version of the up-to-date and comprehensive Liou-Brennan model [52]. In the model that we use, there are a number of limitations that simplify the practical applicability:

- The image quality is limited by geometrical aberrations.
- The field of view is small, and off-axis aberrations [31] are outside the scope of our investigation.
- The refractive index of the lens is a function only of the accommodation; the gradient index (GRIN) crystalline lens model [72] [5] has not been used.

Effects of eye-lens dynamics and aberrations on retinal blur

In order to calculate the effect of the spot on the performance of non-overlapped filtering, the spot size of an image point projected on the retina has to be calculated, based on the accommodation dynamics and the optical model described above.

Figure 8.4: Accommodation-dependent schematic human eye model (cornea, iris, lens, retina, optical axis).

The spot size on the retina was calculated using optical engineering functions, such as the Airy disc size, the geometric spot size and the root mean square (RMS) spot size [83]. The theoretical (diffraction) limit is the diameter of the equivalent Airy disk for the system, and this value can be compared to the RMS spot size as an indicator of image quality. If the diffraction limit is much smaller than the RMS spot size, then the performance of the system will be limited by geometrical aberrations. The spot diagram is a two dimensional distribution of ray intersections at the image plane. When the optical system aperture is small, the diffraction effect is more significant than the geometrical aberrations in limiting the image quality. The distribution of ray intersections calculated by geometrical optical ray-tracing methods [20] depends on the above mentioned geometrical aberrations of the system. The RMS spot diameter is computed as the root mean square of all distances between each marginal ray intersection with the image plane and a reference point generated by the chief ray intersection. The geometric spot is a region that contains the outermost ray intersection in the pattern relative to the reference point. As mentioned above, the goal of this section is to present the optical image forming properties of the human eye during the accommodation process. The calculated RMS spot size (4 mm entrance pupil diameter), obtained using optical parameters [64] [36] [90] [63] for different accommodation fluctuation stimulus levels [42] [84] at constant image distance, is presented in Figure 8.5. The change of the RMS spot size of the eye model presented here was simulated using calculations based on geometrical optics. The results are in good accordance with data presented in the literature [27].
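The RMS and geometric spot measures described above reduce to a few lines once the ray intersections are known; the intersection coordinates below are made-up illustrative values, not output of the eye model.

```python
import numpy as np

# RMS spot radius: root mean square of the distances between each
# marginal-ray intersection with the image plane and the chief-ray
# reference point. The intersection points are illustrative (mm).
chief = np.array([0.0, 0.0])
marginal = np.array([[0.01, 0.00], [-0.01, 0.00],
                     [0.00, 0.01], [0.00, -0.01]])

distances = np.linalg.norm(marginal - chief, axis=1)
rms_radius = np.sqrt(np.mean(distances**2))
geometric_radius = distances.max()   # outermost intersection in the pattern
print(round(rms_radius, 6), round(geometric_radius, 6))  # 0.01 0.01
```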
The diameter of ganglion cell receptive fields in the fovea is about 20 µm; the RMS spot size is slightly larger, with a fluctuation of about 5 µm. This result means that the visual information is spread over the retina to an extent that is necessary to create an information link between the centers and peripheries of ganglion cell receptive fields, and as such, to compensate the information loss due to non-overlapped filtering.

8.3 Lens blur based retina model

In this section a cognitive informatics model is proposed with the purpose of proving that the image processing quality of a non-overlapped filtering method can be enhanced by the lens blur.

Figure 8.5: Calculated RMS spot size obtained using optical parameters for different accommodation fluctuation stimulus levels at constant image distance.

The proposed model incorporates the non-overlapped filtering presented in section 8.1, and the effect caused by the optical and dynamical characteristics of the eye-lens, discussed above in section 8.2. The model is implemented in the form of a software simulation, and later in a software-hardware system.

Software simulation

In the software based simulation of the proposed model the lens blur is achieved by linear filtering of the input image with a Gaussian blur filter kernel G. A single point on the image sensor will represent the sum of all contributing disks of confusion. This effect can be computed using a standard convolution filter which is representative of the aperture shape and visibility. Shape is not the only defining factor for blur; the light distribution within the aperture is important and a key performance parameter. Light falls off across the lens and is usually darker at the edges. Gaussian convolution filters provide the smoothest blur, so the light falloff across a lens should approximate a Gaussian. The filter kernel G of the lens blur can thus be a disk, or simply a box with intensity degradation toward its peripheral regions. The convolution of the input image with the lens blur filter kernel imitates the effect of the optical aberrations and the effect of fluctuations in accommodation on the retina, and allows the non-overlapped filtering method to detect previously undetected contour features. The size of the filter kernel G has to be determined. In order to keep the biological plausibility, according to the results in the previous section the size of the kernel should be in the range of the receptive field size.
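The two building blocks of the simulation described above, the Gaussian lens-blur convolution and the non-overlapped (mosaic) filtering, can be sketched as follows. This is a minimal illustration, assuming a grayscale image as nested lists and an edge kernel applied on disjoint kernel-sized tiles; the function names and parameter choices are assumptions of this sketch:

```python
import math

def gaussian_kernel(size=3, sigma=1.0):
    """Normalized size x size Gaussian kernel G approximating the lens blur."""
    c = size // 2
    k = [[math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2 * sigma ** 2))
          for x in range(size)] for y in range(size)]
    s = sum(sum(row) for row in k)
    return [[v / s for v in row] for row in k]

def convolve(img, kernel):
    """Valid-mode 2D convolution (no padding) of a grayscale image."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(len(img) - kh + 1):
        row = []
        for x in range(len(img[0]) - kw + 1):
            row.append(sum(kernel[j][i] * img[y + j][x + i]
                           for j in range(kh) for i in range(kw)))
        out.append(row)
    return out

def non_overlapped_filter(img, kernel):
    """Apply the kernel on disjoint blocks only: one output value per
    non-overlapping kernel-sized tile, mimicking mosaic receptive fields."""
    k = len(kernel)
    return [[sum(kernel[j][i] * img[y + j][x + i]
                 for j in range(k) for i in range(k))
             for x in range(0, len(img[0]) - k + 1, k)]
            for y in range(0, len(img) - k + 1, k)]
```

The simulation would first convolve the input with `gaussian_kernel()` and then run `non_overlapped_filter` with a 3x3 edge detection kernel, so each output value covers one disjoint receptive field.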
In case of the model this equals the size of the edge detection kernel, 3x3 pixels.

Laboratory experiment

In order to show the effect of optical aberrations on the performance of ganglion receptive fields, a software simulation is one choice; however, a hardware implementation and laboratory experiments are more convincing. The lens blur filtering can also be achieved by the adjustment of the lens mounted

in front of the image sensor. The focus should be adjusted slightly away from the setting where a sharp image is obtained. Such a setup naturally computes the lens blur without any computational effort, as is done in the eye-retina system. A laboratory experiment was elaborated to achieve a lens blur by adjusting the focus of the lens of a camera away from the sharp setting. The image so obtained was in turn subjected to the non-overlapped filtering method.

In the experiment a camera was used whose image sensor has a pixel size of 12 µm x 12 µm. The focal length of the applied lens was set to f = 28 mm. The aperture of the lens was set to f/3.6, which means that the aperture radius was r_a = 6 mm. The object was at a distance of about s = 3 m from the camera, which determines the distance s' between the lens and the CCD sensor at which a sharp image is obtained. The radius r_s of the required spot size is about 2.5 pixels, which yields r_s = 30 µm. Such a blur can be achieved by moving the lens away from the image sensor by Δs. This value can be obtained from equation 8.1:

r_s / r_a = Δs / s',    (8.1)

which can be transformed to

Δs = s' · r_s / r_a,    (8.2)

with r_s = 0.03 mm and r_a = 6 mm. The new lens-CCD distance is then s_2 = s' + Δs. Such camera parameters allow sharp focusing at a distance of about s = 2 m from the camera. Using the settings obtained above, four images were taken: one with a sharp focus (s = 3 m), a less sharp focus (s = 2.5 m), an optimal focus causing a blur of 5 pixels (s = 2 m), and a larger focus deviation (s = 1.5 m). The captured images are shown in Figure 8.6. The images taken by the camera were subjected to the non-overlapped filtering as discussed in section 8.1. The results obtained are shown in Figure 8.7. It is up to the reader to subjectively judge the quality of the four images, since there is no exact method to decide the answer to this question [41].
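The defocus computation above can be reproduced numerically. The following is a minimal sketch, assuming the thin-lens equation 1/f = 1/s + 1/s' for the sharp image distance and the similar-triangles relation of equation 8.1 for the required lens shift; the values are recomputed from the parameters stated in the text (f = 28 mm, s = 3 m, r_a = 6 mm, r_s = 30 µm):

```python
def image_distance(f_mm, s_mm):
    """Thin-lens equation: 1/f = 1/s + 1/s'  ->  s' = f*s / (s - f)."""
    return f_mm * s_mm / (s_mm - f_mm)

def defocus_shift(s_img_mm, r_spot_mm, r_aperture_mm):
    """Similar triangles across the aperture cone (eq. 8.1):
    r_s / r_a = ds / s', so the lens must move ds = s' * r_s / r_a."""
    return s_img_mm * r_spot_mm / r_aperture_mm

s_img = image_distance(28.0, 3000.0)   # lens-to-CCD distance for a sharp image, ~28.26 mm
ds = defocus_shift(s_img, 0.030, 6.0)  # 30 um spot radius, 6 mm aperture radius, ~0.14 mm
```

The resulting shift Δs of roughly a seventh of a millimetre is what moves the sharp-focus plane from about 3 m to about 2 m in the experiment.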
By looking at the images it is visible that the non-overlapped filtering of a sharp image yields poor results: some edge elements are missing from the image. The seemingly best output is obtained with the input image that was optically blurred to the optimal extent.

8.4 Discussion

Based on recent findings in biology and the experimental results presented in the previous section, the following hypotheses are postulated.

Figure 8.6: The input images taken using a sharp focus, s = 3 m (a), a less sharp focus, s = 2.5 m (b), an optimal focus causing a blur diameter of 5 pixels, s = 2 m (c), and a larger focus deviation, s = 1.5 m (d).

Figure 8.7: The results obtained using a sharp focus, s = 3 m (a), a less sharp focus, s = 2.5 m (b), an optimal focus causing a blur diameter of 5 pixels, s = 2 m (c), and a larger focus deviation, s = 1.5 m (d).

Hypotheses

Hypothesis 1: The role of eye accommodation during fixation has information theoretical implications. This hypothesis is supported by recent findings in biology [66], which claim that receptive fields cover the surface of the fovea in a mosaic arrangement with minimal overlaps, and also by the fact that the output of retinal edge filtering on the axons of the ganglion cell layer has a

much lower resolution than the input resolution on the photoreceptors.

Hypothesis 2: From an information processing point of view, the functional role of eye accommodation during fixation extends to more than just the maintenance of a sharp image on the retina. Eye accommodation may also be responsible for the compensation of information losses caused by the non-overlapped receptive field architecture. This hypothesis is supported by the experimental results presented in section 8.3, which show the difference between the edge detected images obtained by using non-overlapped filtering with and without the lens blur. The experiments show that the application of the lens blur considerably enhances contrast perception. This can be explained by the integrative and compressive effects of lens blur within a certain locality, in contrast with the compression principle of the sharply focused, non-overlapped filtering method, which completely disregards certain aspects of locality.

Hypothesis 3: The inverse proportionality of the pupil size and the amplitude of fluctuations in accommodation yields a quasi-constant lens blur on the retina, which makes the compensation of information losses caused by the non-overlapped filtering quasi-constant, the size of which is tuned to retinal receptive field sizes. This hypothesis is supported by the calculations in section 8.2, which indicate that the size of the blur is in the range of 3-5 receptive fields, depending on the pupil size.

Part III

Applications

Chapter 9

Object recognition system based on the VFA model

The VFA model proposed in Chapter 6 was intended to provide a cognitive informatical model of the functions of the primary visual cortex, with the purpose of being utilizable by other, higher level cognition and neurobiology inspired models. The present chapter discusses how the VFA model is attached to amend an existing object recognition engine [9], aiming to form a fully cognitive informatical object recognition system. The goal of the system is to provide an efficient tool of technical informatics for the solution of practical problems in the field of object recognition and image classification.

9.1 Specification of the object recognition system

The object recognition system discussed in this chapter aims to classify the input image according to the object the image was taken of. The system includes an object epitome library (OEL), in which every object epitome is compared to the input image during the recognition process. The object with the highest similarity value is chosen by the recognition system as the one suspected to be on the input image. The recognition system implements an automatic object epitome acquisition function: when an image is presented with the purpose of teaching, the system creates the object's epitome and stores it in its OEL.

9.2 Design of the object recognition system

The proposed object recognition system is composed of two basic components:

- a VFA model based visual information processing component,
- the object recognition engine published by Alex Berg [9].

The data array of complex cognitive units in the VFA model is used in the object epitome construction, while that of simple cognitive units is used

Figure 9.1: The VFA model with the attached object recognition engine. The input image passes through input filtering to the simple cognitive units, then through lateral and projective operations to the complex cognitive units; both data arrays feed the epitome acquisition and the geometric blur based matching algorithm, which uses the object epitome library and outputs the name of the recognized object.

in both object epitome construction and object matching. The values stored in the VFA data arrays are directly fed into the object recognition engine of the system. The structure of the object recognition system is shown in Figure 9.1.

VFA component

The VFA model is used in the object recognition system as a primary visual information processing component. The input to the VFA is an image representing an object (hand drawn or photograph). The outputs of the component are the two data arrays, representing the simple and complex cognitive units of the primary visual cortex. These data arrays directly compose the input of the object recognition engine. The VFA component has an extra input variable h(θ) that defines the number of orientations extracted by the input filtering operation. As measured and published by Hubel, in the real cognitive system h(θ) = 18, but other values may also be used in the system: the lower h(θ), the faster the system. The VFA component can be software based, using Gabor function based input filtering, or hardware based, using the opto-mechatronical input filtering. The semantics of the input and output data of the model are the same in both cases; the differences are in quality and computational complexity.
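The software based variant of the input filtering can be illustrated with a bank of h(θ) oriented Gabor kernels, one per orientation layer. The following is a minimal sketch, assuming the real part of a Gabor function; the kernel size, sigma and wavelength values are illustrative assumptions, not parameters taken from the dissertation:

```python
import math

def gabor_kernel(theta, size=9, sigma=2.0, wavelength=4.0):
    """Real part of a Gabor function at orientation theta (radians)."""
    c = size // 2
    kernel = []
    for y in range(size):
        row = []
        for x in range(size):
            # rotate coordinates into the kernel's orientation
            xr = (x - c) * math.cos(theta) + (y - c) * math.sin(theta)
            yr = -(x - c) * math.sin(theta) + (y - c) * math.cos(theta)
            row.append(math.exp(-(xr * xr + yr * yr) / (2 * sigma ** 2))
                       * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel

def gabor_bank(h_theta):
    """One kernel per orientation layer; h_theta = 18 in Hubel's data."""
    return [gabor_kernel(i * math.pi / h_theta) for i in range(h_theta)]
```

Filtering the input image with each kernel of `gabor_bank(h_theta)` fills the h(θ) orientation layers of the data array of simple cognitive units.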

Figure 9.2: The system proposed by Berg in [9] (Sobel operator based edge detection feeding the geometric blur based object recognition algorithm), and its amended version using the VFA model.

Object recognition engine

The object recognition engine of the proposed system is implemented according to the algorithm of Alex Berg [9], discussed in section 5.3. The algorithm is designed to work with sparse signals as primary input. In the implementation of Berg a horizontally and vertically edge detected image is used as the sparse input signal. In the proposed system the geometric blur based algorithm is applied on the data arrays of the VFA model, which form a sparse input signal. No modification to the original recognition engine has been made. Figure 9.2 shows the original and the VFA-amended implementations.

Automatic model acquisition

The proposed system features an automatic object epitome acquisition functionality. An object epitome in Berg's recognition engine is stored as the neighborhood description of nodes, and their relative positions. When acquiring a new epitome to be placed in the OEL, highly informative points on the object have to be chosen as nodes. The model is in turn acquired based on the chosen nodes. Since corners and crossings play an important role in cognitive object recognition, as discussed in section 4.2, the corners extracted and stored in the data array of complex cognitive units of the VFA model are used as nodes in the automatic epitome acquisition. In order to decrease the computational complexity of the recognition engine, a limited number of corners and crossings are randomly chosen from those found by the projective operations of the VFA model. When the list of nodes is available, the engine generates their neighborhood descriptions and their relative distances based on the content of the data array of simple cognitive units of the VFA model. This data is placed in a vector, and stored in the OEL.

Object model matching

When the recognition system is presented an image for recognizing an object, it performs the recognition process to find out which object stored in the OEL, if any, is represented on the input image. The recognition process goes through the following steps:

1. Feed the input image into the VFA model to obtain the data array of simple cognitive units.
2. For each object model stored in the OEL, perform the matching algorithm, and obtain a matching cost.
3. Make a decision according to the matching costs obtained in step 2.

The computation of the data array of simple cognitive units of the VFA model in the first step produces the sparse input signal necessary for the recognition engine. The node neighborhoods stored in the object epitome are compared with the actual input, and a minimal matching cost is sought for each node. The rest is done by the recognition engine, which outputs the supposed object name and the matching costs for all objects in the OEL. This helps to assess the quality of the matching result.
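The per-node cost minimization in the steps above can be sketched as a nearest-neighbor search over node descriptors. This is a deliberately simplified stand-in for Berg's geometric blur matching (it ignores the relative-distance terms), with hypothetical names, shown only to illustrate the control flow:

```python
def descriptor_distance(d1, d2):
    """Sum of squared differences between two node-neighborhood descriptors."""
    return sum((a - b) ** 2 for a, b in zip(d1, d2))

def matching_cost(epitome_nodes, input_nodes):
    """For each stored node descriptor, find the closest descriptor sampled
    from the input and accumulate the minimal distances into one cost."""
    return sum(min(descriptor_distance(e, i) for i in input_nodes)
               for e in epitome_nodes)

def recognize(oel, input_nodes):
    """Return the name of the OEL entry with the lowest matching cost."""
    return min(oel, key=lambda name: matching_cost(oel[name], input_nodes))
```

In the real engine the descriptors are the geometric-blur samples of the VFA simple-unit layers, and the pairwise node distances also enter the cost.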

Chapter 10

Abstract image categorization

10.1 Formulation of the image categorization problem

Content based image classification is one of the key technical problems in the era of Internet based image databases. This section proposes a solution for classifying abstract images drawn by hand in a wire-line manner. The object recognition system proposed in Chapter 9 is applied to solve this problem. The OEL of the object recognition system is filled with examples of hand drawn images, which are used to classify images that contain similar objects.

10.2 Test images

The test images used in the example are photographs of hand drawn objects (Figure 10.1). The object categories are: Chair, Face, House.

10.3 Model acquisition

In order to populate the object model library, the images Chair2, Face1 and House2 from Figure 10.1 were chosen to extract the object epitomes from. The complex cognitive units of the VFA model have been used to find the corners on the images, which in turn are used as nodes to generate the OEL. Each layer of the data array of simple cognitive units of the VFA model is blurred according to the geometric blur algorithm. In the current implementation 6 different blurs are used. The neighborhoods of the nodes suggested by the VFA model are sampled from the blurred data. The geometric blur pattern used in the present implementation contains 61 sampling points, shown in Figure 10.2, where the actual node is in the center of the figure. The sampling is done in every orientation layer of the VFA model. If the number of layers is h(θ), one node surrounding is described by s = 61 h(θ) values. The samples of object node neighborhoods are stored in a vector along with the relative distances between all point pairs. In order to reduce the computational complexity of the recognition engine, a limited number n of random nodes are

Figure 10.1: The images used in the experimental object recognition task (Chair1-Chair4, Face1-Face4, House1-House4).

Figure 10.2: The pixel neighborhood sampling pattern used in the recognition engine in the context of the image categorization problem. The pattern is composed of 61 pixels.

chosen from the suggested ones. This requires n(n-1)/2 distance values to be stored for each object model. The total number of elements d of the vector m ∈ R^d that stores one object model can be calculated as follows:

d = 61 · n · h(θ) + n(n-1)/2 = n · (61 · h(θ) + (n-1)/2).    (10.1)

Typical values for n are 8 and 20, while typical values for h(θ) are 6, 10 and 18. The numbers of elements of object model vectors are shown in Table 10.1.

h(θ)    dim(m)
6       2956
10      4908
18      8812

Table 10.1: The lengths of object model vectors as a function of the number of orientation layers, supposing 8 nodes.

Recognition experiments for these values will be given in the next section. In case of h(θ) = 6, the number of nodes detected was not enough for object model generation. This is explained by the sparseness of the orientation resolution, which causes many edges to be excluded from the data array of simple cognitive units. The crossing detection in turn does not find the nodes necessary for model acquisition. Similar results were obtained with h(θ) = 10 using the opto-mechatronical filtering. The nodes suggested by both the Gabor function based and the opto-mechatronical filtering based VFA implementations with h(θ) = 18 are shown in Figure 10.3.
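Equation 10.1 can be written directly as code; the following one-liner (names are this sketch's own) reproduces the Table 10.1 values for n = 8:

```python
def model_vector_length(n, h_theta, samples=61):
    """Equation (10.1): 61*n*h(theta) neighborhood samples per node,
    plus n*(n-1)/2 pairwise node distances."""
    return samples * n * h_theta + n * (n - 1) // 2
```

For example, n = 8 and h(θ) = 18 yield 8812 elements, while n = 20 and h(θ) = 18 yield 22150, which shows why increasing n dominates the matching cost.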

Figure 10.3: The nodes in the data array of complex cognitive units detected by the VFA model, using Gabor function based filtering (top) and opto-mechatronical filtering (bottom). The number of nodes is n = 20 and the number of orientation layers is h(θ) = 18.

Table 10.2: The parameters of the model libraries created for the recognition experiments (library number, number of orientations, number of nodes, and filtering type). GF stands for Gabor filtering, OM stands for opto-mechatronical filtering; libraries 1-4 are GF based, library 5 is OM based.

10.4 Recognition results

The object recognition system has been tested with the images in Figure 10.1, excluding those used to build the object models. Five different OELs were created using the VFA model to test the differences they induce in the recognition results. The parameters of the epitome libraries are shown in Table 10.2. The opto-mechatronical filtering based OELs of low orientation numbers are missing from the table, because they could not be generated due to the reasons discussed earlier. In the experiment every previously unseen image in Figure 10.1 was presented to the recognition system for every OEL. The recognition engine returned a matching cost value for each object-image pair, among which the minimal is considered as the found object. The summary of the results obtained with each OEL is shown in Table 10.3. The objects in images Chair4 and Face3 were not recognized in any case. This is not surprising because they are very distorted compared to the model epitome, and may also cause confusion for humans trying to recognize them. If these two images are considered as outliers and removed from the image

Lib. number    Success    Robustness    VFA time    Recog. time
1              66%                      1.2s        5.5s
2              44%                                  8.7s
3              44%                                  10.6s
4              78%                                  17.4s
5              57%                                  13.5s

Table 10.3: Statistics of the recognition results. Robustness is proportional to the variance of the costs in cases where the recognition was successful.

set, the recognition success rates in the five experiments become 80%, 57%, 57%, 91% and 66%, respectively.

10.5 Discussion

The different parameter settings of the VFA model yield different success rates and require different computational times. The quality of the automatic object model acquisition also varied with the settings. Setting the number of VFA layers to h(θ) = 6 yields the lowest computational load, but the results are not acceptable; neither does the object epitome acquisition work properly for all the objects. Setting h(θ) = 10 gives good recognition results while keeping the computational load at an acceptable level. Automatic object model acquisition is possible for all objects, but false recognition appears in some cases. Setting h(θ) = 18 is the most plausible setting respecting the real cognitive process, and yields a very good success rate, but its computational load is rather heavy. Increasing the number of nodes n in the object epitomes can significantly increase the success rates, but the computational complexity increases at the order of O(n²). Using the opto-mechatronical filtering based VFA implementation reduces the computational time necessary to compute the VFA model, which is 2.2 s, supposing that the image sizes are in the range of pixels. Recognition results are however not so outstanding with this version of the model. Finally, the results obtained above show that the VFA-model-improved geometric blur based object recognition engine yields an object recognition system that is able to cluster objects on images into three object categories at a success rate of 80-90%.

Chapter 11

Robot guiding in industrial environments

This chapter presents the application of the VFA-model-amended object recognition system in the assistance of the intelligent navigation system of a mobile robot in a structured environment.

11.1 Formulation of the robot guiding problem

In production engineering, the higher the degree of automation of a production system, the more cost-effective it usually is. One problem with fully automated systems, however, is that their reconfiguration is time consuming and therefore costly. Another common concern with such systems is that their full automation imposes constraints on the size and dimensions of the shop floor on which they are installed. Automatically guided vehicles (AGVs) are considered to be a viable building block in solutions to such problems. With the use of AGVs it is possible, for instance, to transport goods halfway through the production line from one production belt to another. Because of the transportation flexibility of the AGV, the relative location of these belts is not as constrained as it would otherwise be. Such an AGV is being developed at the Dept. of Production and Quality Engineering at NTNU. An important goal in the project is that the AGV be built using widely available low-cost elements (therefore, expensive sensory equipment is out of the question). To this end, the AGVs are equipped with bright LEDs, and a low-cost camera system installed on the ceiling is used to track their motions. There are many complications that needed to be addressed by the developers of the system (such as what to do when two AGVs are so close that the system cannot determine which of the two LED blobs belongs to which, as well as problems with reflections on the shop floor); however, the discussion of such issues lies outside the scope of this dissertation. The problem addressed in this chapter is automatic calibration.
If an AGV suddenly loses all of its positional history (due to an unexpected restart forced by e.g. a watchdog system, which is typical of such mission-critical systems), or if an AGV is newly installed on the shop floor, it would hardly

be cost-effective to hire an employee to guide the calibration of an otherwise automated AGV system, which was automated in the first place with the purpose of saving expenses. To solve this problem, the AGV sends a "where am I" message to the shop server, accompanied by snapshots of its environment taken in four known directions. The VFA based object recognition system tries to identify objects of known (x, y) position on the images, which in turn allows the computation of the robot's position. The planar position of the robot can be calculated if the azimuths of at least three external points of known position are known. Since the images of the objects are taken in fixed directions, the robot has to be able to reliably recognize and localize three or more objects. If the robot can recognize four objects, redundant information is available for localization, which also allows error detection in case one or more objects were misidentified. The geometry of the robot localization problem is shown in Figure 11.1.

Figure 11.1: The geometry of the robot localization problem. Three objects determine the position of the robot; the fourth object adds redundancy and allows error detection.

The problem, although formulated somewhat differently, conforms to the basic problem for which a technique called SLAM (Simultaneous Localization and Mapping) was developed, which uses sets of orthogonal features to determine whether or not a robot is on the right track. SLAM aims to construct maps of an agent's environment using sensory equipment such as laser range finders and sonar sensors, without receiving any particular feedback from the environment (other than the sensory data) [82].
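The position computation from azimuths can be sketched as a small least-squares problem: each landmark of known (x, y) position, seen at azimuth θ, constrains the robot to one line, and three or more such lines determine the position. This sketch (names and the normal-equation formulation are this example's own, not the dissertation's implementation; the sign of the ray direction is not enforced) solves the 2x2 normal equations directly:

```python
import math

def localize(landmarks, azimuths):
    """Least-squares robot position from azimuths (radians, world frame)
    to landmarks at known (x, y). Each sighting gives one line constraint:
    sin(t)*x - cos(t)*y = sin(t)*xi - cos(t)*yi."""
    A11 = A12 = A22 = b1 = b2 = 0.0
    for (xi, yi), t in zip(landmarks, azimuths):
        a, b = math.sin(t), -math.cos(t)
        c = a * xi + b * yi
        # accumulate the 2x2 normal equations of the overdetermined system
        A11 += a * a; A12 += a * b; A22 += b * b
        b1 += a * c; b2 += b * c
    det = A11 * A22 - A12 * A12
    return ((A22 * b1 - A12 * b2) / det, (A11 * b2 - A12 * b1) / det)
```

With four sightings the system is overdetermined, so a misidentified object shows up as a large residual, which is exactly the error-detection redundancy described above.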
In this work a somewhat different approach is adopted, in that cheap cameras are used and features are generated using a VFA representation.

11.2 Robot localization using the VFA based object recognition system

The object recognition system presented in Chapter 9 is used to support the AGV localization problem sketched in section 11.1. The system was intended

to be used in the AGV's industrial environment; however, it can be generalized to any well structured environment. For this reason the tests were elaborated in a laboratory set up as an office-like environment. The localization task was constrained to a small area, where four objects were placed in known positions. The task was to take four photographs of the surroundings from one position in known directions, and to determine on each image whether any of the objects in the OEL is present, and if yes, in which direction (horizontal position on the image). The localization task was considered successful if three objects were correctly recognized, or four objects were correctly recognized without any geometrical localization conflicts.

Object recognition system setup

The parameters of the object recognition system presented in Chapter 9 were determined such that the recognition results are satisfactory and the computational load is also acceptable. The best results were obtained by using 10 different orientations in the VFA model, and 15 nodes automatically determined by the corner detection functionality of the VFA model. In the object model acquisition process the background of the objects was manually erased.

Recognition process

Four images are taken by the camera on the robot in known directions around the robot. The images are fed into the object recognition system, which returns a 4x4 matrix A ∈ R^{o×i}, where o = 4 is the number of objects in the OEL, and i = 4 is the number of images. A contains the distortion cost for every object-image pair; for example, A_{i,j} contains the distortion cost of object j on image i. The minimal cost is calculated for each column of A. The rows that correspond to the minimal cost in each column are considered the rows of the objects found on each image. After the recognition process three cases are distinguished. In the first case each row of A contains only one minimal column value, meaning that the recognition was successful.
The localization process is then performed according to the geometry in Figure 11.1. If a geometrical conflict occurs (the four circles do not intersect in one point), it means that one of the images did not contain the object, although it was mistakenly recognized. In such a case the object with the highest minimum is discarded, and the localization is performed again, using only the remaining three objects. In the second case one row of A contains two minimal column values, meaning that two images have been recognized to contain the same object. This is impossible, because the images are disjunctive, and only one entity of each object is placed in the scene. In this case the object with the higher minimum value is discarded, and the localization is performed using the remaining three objects. In the third case two rows contain two, or one row contains three or four minimal column values. In such a situation the localization is not reliable, and the robot tries to move on, take a new set of images, and repeat the recognition process. In this case the localization is not successful, but this is detected and no false localization occurs.

Figure 11.2: The images used in the three examples. Rows 1 to 3 contain the images taken in the three scenarios. Horizontal image positions correspond to columns of A.

Experimental results

The above localization method has been performed in several positions. Three examples are presented here, which demonstrate three successful localization scenarios. The images taken for the three examples are shown in Figure 11.2. The recognition cost matrices A^(r), r ∈ {1, 2, 3} for the three cases were computed, with the minimal column values considered for each image. Based on the images in Figure 11.2, the matrix A^(1) contains a successful recognition scenario. Since the objects are also correctly recognized (as can be seen from Figure 11.2), the localization does not cause any geometrical conflict. The matrix A^(2) contains a recognition scenario with two images in conflict: two entries in one row of A^(2) are minimal column values. The conflict can be solved by discarding the image of the higher value, which is the third column. By looking at the image set, it is visible that the image with the open door does not contain any object. Finally, the matrix A^(3) looks correct at first sight. The localization will however mistakenly include the coordinates of an object which is actually not seen by the robot, and cause a geometrical conflict in the localization process. This causes the algorithm to discard the image with the chair, having the highest columnar minimum A^(3)_{3,4}. The localization is performed again, which in this case is successful.
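The three-case classification of the cost matrix described above can be sketched as follows, using the rows-are-objects, columns-are-images convention of the text; the function name and the case labels are this sketch's own:

```python
def classify_recognition(A):
    """Classify a cost matrix (rows = objects, columns = images).
    For each image (column) pick the cheapest object (row), then count
    how many columns each object row has won."""
    n_rows, n_cols = len(A), len(A[0])
    winners = [min(range(n_rows), key=lambda r: A[r][c]) for c in range(n_cols)]
    counts = [winners.count(r) for r in range(n_rows)]
    worst = max(counts)
    if worst <= 1:
        return "unique"           # case 1: localize with all objects
    if worst == 2 and counts.count(2) == 1:
        return "single-conflict"  # case 2: drop the costlier sighting
    return "unreliable"           # case 3: retake the images
```

In the "single-conflict" case the costlier of the two winning entries is discarded and localization proceeds with three objects, exactly as in the second scenario above.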

11.3 Discussion

Based on the above experiments, the localization is successful if three or four objects are recognized successfully. The system is thus tolerant of up to a 25% object recognition failure rate. If a higher failure rate occurs, the localization is unsuccessful, which in many cases can be detected by having two pairs of minimal column values, or three or four minimal column values in one row of A. If this is detected, a new localization process is performed from a slightly different position. In the experiments an 80% object recognition success rate was generally achieved. This was sufficient for a successful localization in most of the cases. Repeated localizations occurred in a few cases, and no mistaken localizations were detected.

Part IV

Conclusion

Chapter 12

Theses

The results achieved in the dissertation are summarized in four thesis groups. The author's publications in which the respective thesis groups were published are indicated in square brackets. Other publications of the author related to the dissertation are [P22-39].

Thesis 1: The Visual Feature Array concept [P1-2, 6-15, 19]

I proposed the Visual Feature Array (VFA) concept, which provides a uniform framework for informatical models of cognitive functions and the representation of low level perceptual modalities, with special respect to vision. Embedded in the VFA concept I conceived the VFA model, in which I defined and developed simple and complex operations of filtering, and lateral and projective cognitive functions of the primary visual cortex, using a finite element orthogonal hyper-grid and SIMD (Single Instruction Multiple Data) operators, as a model of the primary visual cortex. The model builds on modern modeling tools of informatics and transforms cognitive functions into systematic finite element orthogonal grids and Single Instruction Multiple Data operations. The proposed model also uniformly integrates well known methods from the wide range of earlier cognitive solutions. I validated the efficiency and quality of the proposed cognitive informatical models using the Heath method, and compared the results with those of classical operators (Sobel, Laplace, Canny), according to which I proved that the proposed cognitive operations are applicable to a broad field of problems of technical informatics.

Thesis 1.1

I defined end stopping, Gabor-function based and foveated filtering operations in the VFA model. I showed that these operations implement the monocular functions of Hubel's Ice Cube model, and that they map their results into an n-dimensional data array representing simple functional units of the primary visual cortex.

Thesis 1.2

I integrated two cognitive processes of the primary visual cortex, contour integration and lateral inhibition, into the VFA model in the form of iterative lateral operations. I determined the parameters of the lateral operations at which both contour integration and lateral inhibition are implemented by the VFA model in a way and with an efficiency comparable to human perception. In the case of lateral inhibition I showed that the winner-take-all functionality of cognitive functional units is fulfilled, and consequently the overlap between layers along the orientation dimension of the VFA model is eliminated.

Thesis 1.3

I defined basic and complex projective operations in the VFA model, which represent the information processing between simple and complex cognitive units. I showed that the defined projective operations are efficiently applicable to the detection of complex image features (crossings and corners).

Thesis 2: Opto-mechatronical implementation of cognitive informatical operations [P3, 4, 20]

I designed and implemented an opto-mechatronical hardware device in the form of a laboratory prototype, which performs the Gabor-function-based filtering and contour integration operations elaborated in Thesis 1.

Thesis 2.1

I implemented an oriented motion blur filtering operation in an opto-mechatronical system, based on the characteristics of a vibrating mirror and the Sobel operator. Using the subjective techniques of the Heath comparison method I showed that the filtering operation thus obtained is equivalent to the union of the Gabor-function-based filtering and contour integration operations described in Thesis 1.

Thesis 2.2

I proved that the computational complexity of the implemented opto-mechatronical system is O(x_max · y_max · h(θ)), while those of the Gabor-function-based filtering and contour integration operations are O(x_max · y_max · p · q · h(θ)) and O(x_max · y_max · p · q · i · h(θ)), respectively.
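The orientation selectivity of motion blur filtering can be illustrated with a toy sketch. The box blur and kernel length stand in for the vibrating-mirror optics and are illustrative assumptions, not the laboratory device:

```python
def hblur(img, k):
    """Box motion blur along x: edges parallel to the blur direction
    are unchanged, while perpendicular edges are smeared into ramps
    with reduced local contrast."""
    return [[sum(row[max(0, x - k + 1): x + 1]) / k for x in range(len(row))]
            for row in img]

def max_horizontal_step(img):
    """Largest horizontal intensity difference, a crude edge strength."""
    return max(abs(row[x + 1] - row[x])
               for row in img for x in range(len(row) - 1))
```

A subsequent edge detector therefore responds mainly to edges parallel to the blur direction, which is the selectivity exploited above.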
Thesis 3: The role of optical aberrations and fluctuations of accommodation, and their effect on cognitive informatical operations [P5, 17, 18, 21]

I investigated the contradiction between the non-overlapped arrangement of ganglion receptive fields published by Packer and Dacey in 2002 and the classical convolution based linear filtering operations widely adopted in informatical solutions.

Thesis 3.1

I showed that if conventional filtering operators are used in analogy with this recent discovery, the information loss exceeds the level acceptable for such problems. With this I pointed out that linear filtering methods have poor biological relevance.

Thesis 3.2

I proved that the information loss can be compensated by optical aberrations and fluctuations of accommodation. Based on the optical model of the eye and the size of the foveal receptive fields, I showed that the amount of aberration and fluctuation in the eye is of the magnitude necessary to compensate for such information loss.

Thesis 3.3

Based on the above results I elaborated a cognitive informatical model that resolves the contradiction while keeping the analogy with the discovery of Packer and Dacey and including the effects of optical aberrations and the fluctuations of accommodation. This new cognitive informatical model can provide a cognitively relevant answer to the question of why the information loss over ganglion receptive fields is undetectable to the brain.

Comment

Besides the above theses, I also postulated three hypotheses concerning the new discovery and supported them by laboratory experiments. Concerning these results, we have carried out studies together with David Hubel's 1 research group dealing with cognitive vision. The first hypothesis states that the role of eye accommodation during fixation has information theoretical implications. The second hypothesis states that fluctuations of eye accommodation may be responsible for the compensation of the information loss caused by the non-overlapped receptive field architecture. The third hypothesis states that the spot size on the retina is tuned to create an optimal logical link between retinal receptive fields.
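The information loss argument of Thesis 3.1 can be illustrated with a 1-D toy; the signals and the box "receptive field" are hypothetical, not the dissertation's retinal model. Two stimuli that densely overlapping filters distinguish become indistinguishable under a non-overlapped arrangement:

```python
def pool(signal, width, stride):
    """Average box 'receptive fields' over a 1-D signal. stride == width
    gives the non-overlapped arrangement; stride == 1 gives the classical
    densely overlapping, convolution-style sampling."""
    return [sum(signal[i:i + width]) / width
            for i in range(0, len(signal) - width + 1, stride)]

# Two stimuli whose single bright point falls at different positions
# inside the same non-overlapped block:
a = [0, 1, 0, 0, 0, 0]
b = [1, 0, 0, 0, 0, 0]
```

Non-overlapped pooling maps both stimuli to the same response, so the position of the point within a block is lost, which is exactly the loss that the aberration and accommodation mechanisms of Theses 3.2–3.3 are argued to compensate.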
Thesis 4: Application of the VFA concept in intelligent engineering systems [P16]

I used the VFA model to amend the object recognition engine of Alex Berg, developed at the University of California, Berkeley. I used the VFA model to populate the object epitome library of the engine and to detect the characteristic image features necessary for object classification. I applied the resulting system to solve two problems of technical informatics.

Thesis 4.1

I applied the amended object recognition engine to classify images of abstract (hand-drawn, simple wire-line schematic) objects. I showed that this technique is able to classify abstract objects of three categories at success rates of 75–85%, depending on the abstractness of the information.

1 David H. Hubel, along with Roger W. Sperry and Torsten N. Wiesel, was awarded the 1981 Nobel Prize in Physiology or Medicine.

Based on the test results I showed that the amended object recognition engine reaches its highest success rate when, in accordance with the real cognitive process, the orientation dimensionality is close to the value measured by Hubel and the Gabor-function-based filtering operation is applied.

Thesis 4.2

I applied the amended object recognition engine to the localization necessary for the navigation of an NTNU-laboratory-based mobile robot seeing four segments of the panoramic view. I showed that the object recognition engine, using the VFA-model-based representations of the images recorded by the on-board cameras, is able to recognize key objects of the surrounding structured environment at a success rate of 80%, which exceeds the 75% necessary for a successful localization.

Appendix A

Output images from the VFA model

A.1 Filtering operations

Here some test images obtained from the data array of simple cognitive units V are shown, using the three different input filtering operations and the lateral operations.

A.1.1 End stopping filtering

The input test image used to evaluate the model is shown in Figure A.1a. This image is subjected to a primary edge detection according to the first step of the end stopping filtering operation. The result is a binary image of edge elements, with white dots representing high-contrast points of the original image. This edge-detected image is shown in Figure A.1b. In the present example 5 different line lengths were used with the possible orientations to calculate the output of the filtering operation. These lengths were 3, 5, 9, 17 and 33 pixels. The matrices obtained by projecting the matrices with a given length and

Figure A.1: Original test image (a) and the result of the primary edge detection (b)

all the angles together, for lengths of 3, 9 and 33 pixels, are shown in Figure A.2.

Figure A.2: The reconstruction of the edge-detected image from line segments of 3 pixels (a), 9 pixels (b), 33 pixels (c), and the reconstructed image (d)

A.1.2 Gabor function based filtering

An example of the data array V obtained using the Gabor function based filtering operations is shown in Figure A.3.

A.1.3 Foveated filtering operation

Figure A.4 presents the data array V obtained using the foveated input filtering operation with 6 discrete distances. The original image is shown in Figure A.4a; the center of foveation is placed on the lower corner of the laptop mounted on the robot. To show the difference between the Gabor function based and the foveated input filtering operations, both are shown in Figures A.4b and A.4c, respectively. Since the data array V is three-dimensional in both cases, its projection along the orientation dimension is shown in the figures for a better visual experience.
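The discrete distance bands behind the foveated filtering can be sketched as follows; the band width and the use of a simple Euclidean distance are illustrative assumptions, while the count of 6 bands matches the discrete distances mentioned above:

```python
import math

def distance_band(x, y, cx, cy, n_bands=6, band_width=20.0):
    """Assign a pixel to one of n_bands discrete distance bands around
    the foveation centre (cx, cy). The filtering scale then grows with
    the band index, coarsening resolution away from the centre."""
    d = math.hypot(x - cx, y - cy)
    return min(n_bands - 1, int(d // band_width))
```

Pixels in band 0 would be filtered at full resolution, and each farther band at a progressively coarser scale.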

Figure A.3: The iso-orientation layers of the data array V (panels: original image; θ = 0°, 18°, 36°, 54°, 72°, 90°, 108°, 126°, 144°, 162°; superposition). The values of θ are shown under each image, and the superposition of the layers is shown in the lower-right corner. Red dots on the original image indicate the detected corners; for details see section

Figure A.4: The original image of a mobile robot (a), the data array V obtained using the Gabor function based filtering operation (b), and that obtained using the foveated input filtering operation with the lower corner of the laptop as the center of foveation (c).

A.2 Lateral operations

A.2.1 Contour integration

Figures about the stability of the lateral operation of contour integration are shown below in Figure A.5.

A.2.2 The content of V after the application of lateral operations

The results obtained using the lateral operations are demonstrated in Figure A.7. For the differences, please compare with Figure A.3.

A.3 Opto-mechatronical computation of the VFA model

A.3.1 Simulation results

Examples obtained using motion blur for oriented edge detection and contour integration are shown in Figure A.8.

A.3.2 Experimental results

Two test scenes were used, both containing vertical and horizontal edges; they are shown in Figures A.9 and A.10.

Figure A.5: The overall activation of V with c = 0 (a), c = 0.3 (b), c = 0.15 (c) and c = 0.16 (d), through 100 iterations.
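The thresholded iterative behaviour examined in Figure A.5 can be sketched in 1-D; the facilitation weight of 0.25 and the saturation at 1.0 are hypothetical choices, and the actual lateral operations act on the oriented 3-D array V:

```python
def lateral_iteration(act, c, steps):
    """Each unit is facilitated by its neighbours, saturated at 1.0,
    and then thresholded by c; units below c are silenced, so too high
    a threshold suppresses the filling-in of contour gaps."""
    for _ in range(steps):
        nxt = []
        for i, v in enumerate(act):
            left = act[i - 1] if i > 0 else 0.0
            right = act[i + 1] if i < len(act) - 1 else 0.0
            s = min(1.0, v + 0.25 * (left + right))  # lateral facilitation
            nxt.append(s if s > c else 0.0)          # threshold c
        act = nxt
    return act
```

With a moderate threshold a gap between two active units is filled in after one iteration, while a slightly larger threshold keeps it silent, mirroring the sensitivity to c seen in the figure.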

Figure A.6: The content of the data array V before the first iteration (a), and after the 100th iteration using a threshold of c = 0 (b), c = 0.15 (c), and c = 0.16 (d).

Figure A.7: The iso-orientation layers of the data array V at different orientations (panels: original image; θ = 0°, 18°, 36°, 54°, 72°, 90°, 108°, 126°, 144°, 162°; superposition). The values of θ are indicated under each image, and the superposition of the layers is shown in the lower-right corner. Red dots on the original image indicate the detected corners; for details see section

Figure A.8: The original image (e) is blurred horizontally (a) and vertically (b). Their edge-detected versions are shown in (c) and (d), respectively. The overlapped projection of (c) and (d) yields (f), which shows how the corners compose an intersection, useful in the vertex detection of the VFA sub-model. The contour integration ability of the motion blur filter is demonstrated by the small gap in the rectangle: the top edge is considered to be continuous.

Figure A.9: Horizontally (a) and vertically (c) blurred images, and their edge-detected counterparts (b) and (d), respectively.

Figure A.10: Horizontally (a) and vertically (c) blurred images, and their edge-detected counterparts (b) and (d), respectively.

Appendix B

Discussion and comparison of the results of the dissertation

B.1 Comparison of contour detection

The images used in the subjective ranking experiments on the quality of the VFA model are listed in Figure B.1. The F-test results of the same experiment are shown in Tables B.1 and B.2.

Table B.1: ANOVA results for methods by images, considering the first question (columns: Source, SS, DF, σ², F-value, P-value; Methods and Error rows for each of the images Drawn house, Fort, Synthetic house, Balls, Trash can, Plane, Camcorder and Stairs).

Table B.2: ANOVA results for methods by images, considering the second question (columns: Source, SS, DF, σ², F-value, P-value; Methods and Error rows for each of the images Drawn house, Fort, Synthetic house, Balls, Trash can, Plane, Camcorder and Stairs).

B.2 Comparison of corner detection

B.3 Comparison of the Gabor function based and opto-mechatronical VFA implementations

The data array V of the VFA model obtained by Gabor functions and by opto-mechatronical filtering is shown in Figure B.3.

B.4 Robustness to noise of the opto-mechatronical filtering

The corners were found using the opto-mechatronical filtering and the corner detection functionality of the VFA model. The results obtained for the original image and the four noise-added images are shown in Figure B.4.
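For reference, the F-values reported in Tables B.1 and B.2 are one-way ANOVA statistics. A minimal sketch of the computation follows; the sample groups in the test are invented, not the experiment's ranking data:

```python
def anova_f(groups):
    """One-way ANOVA F-statistic: (SS_between / df_between) divided by
    (SS_within / df_within) over the given sample groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    # between-group and within-group sums of squares
    ss_b = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_w = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_b / (k - 1)) / (ss_w / (n - k))
```

A large F-value relative to the corresponding F-distribution quantile indicates that the methods differ more between themselves than the rating noise within each method would explain.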

Figure B.1: Images used in the subjective evaluation of the VFA model.

Figure B.2: Corner detection results compared: the VFA-based corner detection results (left) and the results of the Rosten algorithm [76, 77] (right).

Figure B.3: Results obtained by the Gabor function based input filtering and contour integration (left) and by the opto-mechatronical filtering (right).

Figure B.4: The corner detection results obtained using the opto-mechatronical filtering (a), and the same operation on images with four different kinds of noise: Gaussian (b), Poisson (c), Speckle (d) and Salt & Pepper (e).

Author's publications

[P 1] B. Reskó, Á. Csapó, and P. Baranyi. Cognitive vision inspired contour and vertex detection. Journal of Advanced Computational Intelligence and Intelligent Informatics, 10(4): ,

[P 2] B. Reskó, Z. Petres, A. Róka, and P. Baranyi. Visual cortex inspired intelligent contour detection. Journal of Advanced Computational Intelligence and Intelligent Informatics, 10(5): ,

[P 3] B. Reskó, P. Baranyi, P. Korondi, and H. Hashimoto. Opto-mechanical filtering applied for orientation and length selective contour detection. In Proc. of the 33rd Annual Conference of the IEEE Industrial Electronics Society (IECON), pages , Taipei, Taiwan, November

[P 4] B. Reskó, P. Baranyi, and P. Korondi. Orientation selective contour detection using oriented motion blur. In Proc. of Intl. Conf. on Artificial Intelligence and Pattern Recognition, pages , Orlando FL, USA, July

[P 5] B. Reskó, K. Kouhei, S. Zheng, and H. Hashimoto. Retina inspired edge detection for robot vision. In Proceedings of the 2007 JSME Conference on Robotics and Mechatronics (ROBOMEC 2007), pages 2A2 K05(1 2), Akita, Japan, May

[P 6] B. Reskó, P. Baranyi, and H. Hashimoto. Foveated visual feature array - model of the visual cortex. In 7th SICE System Integration Division (SI 2006), pages , Sapporo, Japan, December

[P 7] B. Reskó, A. Róka, A. Csapó, P. Baranyi, and H. Hashimoto. Cognitive informatics based DIND for corner and crossing detection in intelligent space. In Proc. SICE-ICCAS International Joint Conference, pages , Busan, Korea, October

[P 8] B. Reskó, D. Tikk, H. Hashimoto, and P. Baranyi. Visual feature array based cognitive polygon recognition using the UFEX text categorizer. In International Conference on Mechatronics, pages , July

[P 9] B. Reskó, Z. Petres, and H. Hashimoto. Cognitive vision inspired feature understanding in intelligent space. In Proceedings of

JSME Conference on Robotics and Mechatronics, pages 2P1 E14(1 4), Tokyo, Japan, May

[P 10] Z. Petres, B. Reskó, and H. Hashimoto. Cognitive contour detection for negative filtering. In 11th International Symposium on Artificial Life and Robotics (AROB 2006), pages , Beppu, Japan, January

[P 11] B. Reskó, Z. Petres, and P. Baranyi. Two feature extraction methods in the construction of a visual feature array. In 6th International Symposium of Hungarian Researchers on Computational Intelligence, pages , Budapest, Hungary, November

[P 12] Z. Petres, B. Reskó, P. Baranyi, and H. Hashimoto. Biology Inspired Intelligent Contouring Vision Device in Intelligent Space. In Proceedings of the 6th International Symposium on Advanced Intelligent Systems (ISIS 2005), pages , Yeosu, Korea, Sept 28 Oct

[P 13] Z. Petres, B. Reskó, P. Baranyi, and H. Hashimoto. Human Vision Inspired DIND in Intelligent Space. In Proceedings of 23rd Annual Conference of the Robotics Society of Japan, page 2B23, Yokohama, Japan, Sept

[P 14] B. Reskó, Z. Petres, A. Róka, and P. Baranyi. Visual cortex inspired intelligent contouring. In IEEE Proceedings of Intelligent Engineering Systems (INES 2005), pages 47 51, Athens, Greece, September

[P 15] Z. Petres, B. Reskó, P. Baranyi, and H. Hashimoto. Cognitive Psychology Inspired Distributed Intelligent Network Devices. In Proceedings of International Conference on Instrumentation, Control and Information Technology (SICE 2005), pages , Okayama, Japan, August

[P 16] B. Reskó. Demonstration of the VFA based object recognition system at the Industrial Open Day of the Institute of Industrial Sciences of the University of Tokyo.

[P 17] A. Róka, A. Csapó, B. Reskó, and P. Baranyi. Edge detection model based on involuntary tremors and drifts of the eye. Journal of Advanced Computational Intelligence and Intelligent Informatics, 11(6): ,

[P 18] B. Reskó, A. Róka, A. Csapó, and P. Baranyi.
Edge detection model based on involuntary eye movements of the eye-retina system. In 5th Slovakian-Hungarian Joint Symposium on Applied Machine Intelligence and Informatics (SAMI 2007), pages , Poprad, Slovakia, January

[P 19] B. Reskó and P. Baranyi. Lateral operations in the VFA model. In Proc. of 6th Intl. Symposium on Applied Machine Intelligence and Informatics, pages , Herľany, Slovakia, January

[P 20] B. Reskó, P. Baranyi, and P. Korondi. Opto-Mechanical Oriented Edge Filtering. In Proc. of Canadian Conference on Computer and Robot Vision, pages , Windsor ON, Canada, May

[P 21] B. Reskó, A. Antal, and P. Baranyi. Cognitive Informatics Model for Non-overlapped Image Filtering based on the Optical Aberrations of the Eye. Journal of Advanced Computational Intelligence and Intelligent Informatics, (Accepted),

[P 22] B. Reskó, P. Baranyi, A. Róka, and A. Csapó. Edge detection model based on involuntary eye movements of the eye-retina system. Journal of Applied Sciences at Budapest Tech, 4(1):31 46,

[P 23] A. Gaudia, B. Reskó, T. Thomessen, and P. Korondi. Robot programming based on ubiquitous sensory intelligence. In IEEE International Conference on Intelligent Engineering Systems, pages , Cluj-Napoca, Romania, September

[P 24] B. Reskó and P. Baranyi. Stereo camera alignment based on disparity selective cells in the visual cortex. In IEEE International Conference on Computational Cybernetics, pages , Mauritius, April

[P 25] B. Reskó, P. Baranyi, and H. Hashimoto. Camera control with disparity matching in stereo vision by artificial neural networks. In Workshop on Intelligent Solutions in Embedded Systems, pages , Vienna, Austria, June

[P 26] B. Reskó, P. Baranyi, and H. Hashimoto. Artificial neural network based stereo matching in stereo vision system. In RAAD Workshop, Brno, Czech Republic,

[P 27] B. Reskó, P. Baranyi, P. Korondi, P. T. Szemes, and H. Hashimoto. Stereo matching in robot vision by artificial neural networks. In IEEE International Conference on Industrial Technologies, pages , Maribor, Slovenia, December

[P 28] B. Reskó, J. F. Bourges, P. Baranyi, and H. Hashimoto. Panoramic picture attachment with genetic algorithms.
In IEEE International Conference on Computational Cybernetics, pages , Siófok, Hungary, August

[P 29] B. Reskó, J. F. Bourges, P. Korondi, H. Hashimoto, and Z. Petres. Image attachment using fuzzy-genetic algorithms. In IEEE International Conference on Fuzzy Systems, volume 2, pages , Budapest, Hungary, July

[P 30] B. Reskó, A. Gaudia, P. Baranyi, and T. Thomessen. Ubiquitous sensory intelligence in industrial robot programming. In Proc. of 5th International Symposium of Hungarian Researchers on Computational Intelligence, pages , Budapest, Hungary, November

[P 31] B. Reskó, D. Herbay, P. Korondi, and P. Baranyi. 3D image sensor based on opto-mechanical filtering. In Proceedings of the 8th International Symposium of Hungarian Researchers on Computational Intelligence and Informatics, pages 17 27, November

[P 32] B. Reskó, D. Herbay, P. Krasznai, and P. Korondi. 3D image sensor based on parallax motion. Journal of Applied Sciences at Budapest Tech, 4(4):37 53,

[P 33] B. Reskó, P. T. Szemes, P. Baranyi, P. Korondi, and H. Hashimoto. Artificial neural network based object tracking. In SICE International Conference, pages , Sapporo, Japan, August

[P 34] B. Reskó, P. T. Szemes, P. Korondi, P. Baranyi, and H. Hashimoto. Artificial neural network based object tracking. Transaction on Automatic Control and Computer Science, Scientific Bulletin of "Polytechnica" University of Timisoara, 4: , May

[P 35] B. Reskó, P. T. Szemes, P. Korondi, P. Baranyi, and H. Hashimoto. Camera motion control based on ubiquitous computing. In EPE Power Electronics and Motion Control Conference, Riga, Latvia,

[P 36] B. Reskó, M. Niitsuma, P. Baranyi, and P. Korondi. Intelligens tér és alkalmazásai. Acta Agraria Kaposváriensis, 11(2):41 52,

[P 37] A. Róka, Á. Csapó, B. Reskó, P. Baranyi, and H. Hashimoto. A cognitive computational model to demonstrate the importance of eye-movements in edge detection. In 3rd International Conference on Soft Computing and Intelligent Systems and 7th International Symposium on Advanced Intelligent Systems (SCIS and ISIS 2006), pages , September

[P 38] B. Solvang, B. Reskó, P. Korondi, and G. Sziebig. Distributed image processing system using the RT-middleware framework. In Proc.
of the 33rd Annual Conference of the IEEE Industrial Electronics Society (IECON), pages , Taipei, Taiwan, November

[P 39] K. Wang, B. Reskó, Y. Wang, M. Boldin, and O. R. Hjelmervik. Application of artificial neural networks to predict shale content. In Advances in Neural Networks - Proceedings of Second International Symposium on Neural Networks, volume 3498, pages , Chongqing, China, May 30 June

Bibliography

[1] P. Adorjan. Dynamics and Representation in the Primary Visual Cortex. PhD thesis, Technical University Berlin,

[2] S. Ando and A. Kimachi. Correlation image sensor: Two-dimensional matched detection of amplitude-modulated light. IEEE Transactions on Electron Devices, 50(10): , October

[3] S. Ando, T. Nakamura, and T. Sakaguchi. Ultrafast correlation image sensor: Concept, design, and applications. In IEEE International Conference on Solid-State Sensors and Actuators, pages , Chicago, June

[4] I. Aradi and P. Érdi. Signal generation and propagation in the olfactory bulb: multicompartmental modeling. Computers and Mathematics with Application, 32:1 27,

[5] A. Atchison and G. Smith. Continuous gradient index and shell models of the human lens. Vision Res., 35: ,

[6] H. B. Barlow. Summation and inhibition of the frog's retina. J. Physiology, 119:69 88,

[7] P. Bayerl and H. Neumann. A fast biologically inspired algorithm for recurrent motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2): ,

[8] J. A. Bednar and R. Miikkulainen. Joint maps for orientation, eye, and direction preference in a self-organizing model of V1. Neurocomputing, 69(10 12): ,

[9] A. C. Berg, T. L. Berg, and J. Malik. Shape matching and object recognition using low distortion correspondence. IEEE Computer Vision and Pattern Recognition (CVPR),

[10] A. C. Berg and J. Malik. Geometric blur for template matching. Computer Vision and Pattern Recognition, 1: ,

[11] I. Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94: ,

[12] I. Biederman and E. E. Cooper. Priming contour-deleted images: Evidence for intermediate representations in visual object recognition. Cognitive Psychology, 23: ,

[13] J. Bilotta and I. Ambrov. Spatial properties of goldfish ganglion cells. J. Gen. Physiol., 93(6): ,

[14] R. T. Born and D. C. Bradley. Structure and function of visual area MT. Annu. Rev. Neurosci., 28: ,

[15] W. Bosking, Y. Zhang, B. Schofield, and D. Fitzpatrick. Orientation selectivity and the arrangement of horizontal connections in tree shrew striate cortex. Journal of Neuroscience, 17(6): ,

[16] T. Bosse, P. P. van Maanen, and J. Treur. A cognitive model for visual attention and its application. In Proceedings of the IEEE/WIC/ACM International Conference on Intelligent Agent Technology, pages , Washington, DC, USA,

[17] O. J. Braddick, J. M. D. O'Brien, J. Wattam-Bell, J. Atkinson, T. Hartley, and R. Turner. Brain areas sensitive to visual motion. Perception, 30(1):61 72,

[18] J. Braun. On detection of salient contours. Spatial Vision, 12(2): ,

[19] J. Canny. A computational approach to edge detection. IEEE Trans. Pattern Analysis and Machine Intelligence, 8(6): , June

[20] L. A. Carvalho. A simple mathematical model for simulation of the human optical system based on in vivo corneal data. Brazilian Journal of Physics, 19(1):29 37,

[21] L. O. Chua and T. Roska. Cellular Neural Networks and Visual Computing: Foundations and Applications. Cambridge University Press,

[22] D. Valentin, H. Abdi, and B. Edelman. What represents a face: A computational approach for the integration of physiological and psychological data. Perception, 26: ,

[23] D. Dacey. Primate retina: cell types, circuits and color opponency. Prog Retin Eye Res., 18(6): ,

[24] H. J. A. Dartnall, J. K. Bowmaker, and J. D. Mollon. Human visual pigments: Microspectrophotometric results from the eyes of seven persons. Proc. of Royal Society of London, B., 220: ,

[25] J. G. Daugman.
Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Am. A, 2,

[26] J. Davies, A. K. Nersessian, and N. J. Nersessian. A cognitive model of visual analogical problem-solving transfer. In Nineteenth Annual International Joint Conference on Artificial Intelligence, pages ,

[27] M. S. de Almeida and L. A. Carvalho. Different schematic eyes and their accuracy to the in vivo eye: A quantitative comparison study. Brazilian Journal of Physics, 37(2A): ,

[28] H. Deubel and W. X. Schneider. Saccade target selection and object recognition: Evidence for a common attentional mechanism. Vision Research, 36,

[29] S. Devries and D. Baylor. Mosaic arrangement of ganglion cell receptive fields in rabbit retina. J. Neurophysiology, 78(4): , October

[30] R. W. Ditchburn and B. L. Ginsborg. Vision with a stabilized retinal image. Nature, 170:36 37,

[31] I. Escudero-Sanz and R. Navarro. Off-axis aberrations of a wide-angle schematic eye model. J. Opt. Soc. Am. A, 16(8): ,

[32] E. J. Fernández, A. Unterhuber, B. Považay, B. Hermann, P. Artal, and W. Drexler. Chromatic aberration correction of the human eye for retinal imaging in the near infrared. Optics Express, 14(13): ,

[33] D. J. Field, A. Hayes, and R. F. Hess. Contour integration by the human visual system: evidence for a local "association field". Vision Research, 33(2): ,

[34] M. A. Georgeson and S. T. Hammett. Seeing blur: motion sharpening without motion. Proc. R. Soc. Lond. B., 269: ,

[35] S. J. Gislason. The Book of Brain.

[36] A. V. Goncharov, M. Nowakowski, M. T. Sheehan, and C. Dainty. Reconstruction of the optical system of the human eye with reverse ray-tracing. Optics Express, 16(3): ,

[37] E. Grassi and S. A. Shamma. A biologically inspired, learning, sound localization algorithm. In Conference on Information Sciences and Systems, Johns Hopkins University, pages ,

[38] C. Grigorescu, N. Petkov, and M. A. Westenberg. Contour and boundary detection improved by surround suppression of texture edges. Image and Vision Computing.

[39] C. Grigorescu, N. Petkov, and M. A. Westenberg.
Contour detection based on nonclassical receptive field inhibition. IEEE Transactions on Image Processing, 12(7): ,

[40] J. Hawkins and S. Blakeslee. On Intelligence. Times Books, Henry Holt and Co.,

[41] M. Heath, S. Sarkar, T. Sanocki, and K. Bowyer. Comparison of edge detectors: a methodology and initial study. Journal of Computer Vision and Image Understanding, 69(1):38 54, January

[42] A. Ho, P. Erickson, F. Manns, T. Pham, and J-M. Parel. Theoretical analysis of accommodation and ametropia correction by varying refractive index in phaco-ersatz. Optometry and Vision Science, 78(6): ,

[43] S. D. Van Hooser, J. A. F. Heimel, S. Chung, S. B. Nelson, and L. J. Toth. Orientation selectivity without orientation maps in visual cortex of a highly visual mammal. The Journal of Neuroscience, 25(1):19 28,

[44] J. C. Horton and D. R. Hocking. Intrinsic variability of ocular dominance column periodicity in normal macaque monkeys. Journal of Neuroscience, 16(22): ,

[45] D. Hubel. Eye, Brain and Vision. W.H. Freeman & Company,

[46] D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiology, 160: ,

[47] D. H. Hubel and T. N. Wiesel. Receptive fields and functional architecture of monkey striate cortex. J. Physiology, 195: ,

[48] M. Hubener, D. Shoham, A. Grinvald, and T. Bonhoeffer. Spatial relationships among three columnar systems in cat area 17. The Journal of Neuroscience, 17: ,

[49] B. Julesz. Dialógusok az észlelésről. Tipotex Kiadó,

[50] I. Kovacs and B. Julesz. A closed curve is much more than an incomplete one: effect of closure in figure-ground segmentation. PNAS, 90: ,

[51] S. W. Kuffler. Discharge patterns and functional organization of mammalian retina. J. Neurophysiology, 16:37 68,

[52] H. L. Liou and N. A. Brennan. Anatomically accurate, finite model eye for optical modeling. J. Opt. Soc. Am. A, 14: ,

[53] L. L. Lui, J. A. Bourne, and M. G. P. Rosa. Functional response properties of neurons in the dorsomedial visual area of New World monkeys (Callithrix jacchus).
Cerebral Cortex, 16(2): ,

[54] M. Tarr and H. Bülthoff. Image-based object recognition in man, monkey and machine. Cognition, 67(1-2):1 20,

[55] Y. Ma, X. Gu, and Y. Wang. Contour integration based on the characteristics of edge elements. International Congress Series, 1301:97 101,

[56] S. Martinez-Conde, S. L. Macknik, and D. H. Hubel. The role of fixational eye movements in visual perception. Nature Reviews Neuroscience, 5(3): ,

[57] G. Mather. The lateral geniculate nucleus. Pages/ Physiol/LGN.html.

[58] J. Maunsell and D. Van Essen. Functional properties of neurons in middle temporal visual area of the macaque monkey. I. Selectivity for stimulus direction, speed, and orientation. J Neurophysiol, 49(5): ,

[59] B. A. McGuire, J. K. Stevens, and P. Sterling. Microcircuitry of beta ganglion cells in cat retina. J. Neurosci., 6(4): ,

[60] M. Meister. Multineural codes in retinal signaling. In Proc Natl Acad Sci USA, volume 93, pages , January

[61] J. Moran and R. Desimone. Selective attention gates visual processing in the extrastriate cortex. Science, 229(4715): ,

[62] T. N. Mundhenk and L. Itti. Computational modeling and exploration of contour integration for visual saliency. Biological Cybernetics, 93(3): ,

[63] R. Navarro, J. Santamaria, and J. Bescos. Accommodation-dependent model of the human eye with aspherics. J. Opt. Soc. Am. A, 2(8): ,

[64] S. Norrby, P. Piers, C. Campbell, and M. van der Mooren. Model eyes for evaluation of intraocular lenses. Applied Optics, 46(26): ,

[65] D. H. O'Connor, M. M. Fukui, M. A. Pinsk, and S. Kastner. Attention modulates responses in the human lateral geniculate nucleus. Nature Neuroscience, 5: ,

[66] O. S. Packer and D. M. Dacey. Receptive field structure of H1 horizontal cells in macaque monkey retina. J. Vision, 2(4): ,

[67] N. J. Parikh, J. D. Weiland, M. S. Humayun, S. S. Shah, and G. S. Mohile. DSP based image processing for retinal prosthesis. In Proceedings of the 26th IEEE Annual International Conference of the Engineering in Medicine and Biology Society, volume 1, pages ,

[68] A. Polans, W. Baehr, and K. Palczewski. Turned on by Ca²⁺! The physiology and pathology of Ca²⁺-binding proteins in the retina. Trends in Neurosciences, 19(12): ,

[69] U. Polat and D. Sagi. The architecture of perceptual spatial interactions. Vision Research, 34:73 78,

[70] U. Polat and D. Sagi. Lateral interactions between spatial channels: suppression and facilitation revealed by lateral masking experiments. Vision Research, 33: ,

[71] F. T. Qiu and R. von der Heydt. Figure and ground in the visual cortex: V2 combines stereoscopic cues with gestalt rules. Neuron, 47: ,

[72] M. A. Rama, M. V. Pérez, C. Bao, M. T. Flores-Arias, and C. Gómez-Reino. Gradient-index crystalline lens model: A new method for determining the paraxial properties by the axial and field rays. Optics Communications, 249: ,

[73] M. Riesenhuber. How a Part of the Brain Might or Might not Work: A New Hierarchical Model of Object Recognition. PhD thesis, M.I.T.,

[74] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2: ,

[75] L. A. Riggs and F. Ratliff. The effects of counteracting the normal movements of the eye. J. Opt. Soc. Am., 42: ,

[76] E. Rosten and T. Drummond. Fusing points and lines for high performance tracking. In Proceedings of the Tenth IEEE International Conference on Computer Vision, volume 2, pages ,

[77] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In 9th European Conference on Computer Vision, pages ,

[78] A. Saudargiene, B. Porr, and F. Wörgötter. Biologically inspired artificial neural network algorithm which implements local learning rules. ISCAS, Vancouver,

[79] I. A. Shevelev. Second order features extraction in the cat visual cortex: Selective and invariant sensitivity of neurons to the shape and orientation of crosses and corners. Biosystems, 48(13): ,

[80] I. A. Shevelev, N. A. Lazareva, G. A. Sharaev, R. V. Novikova, and A. S. Tikhomirov.
Selective and invariant sensitivity to crosses and corners in cat striate neurons. Neuroscience, 84(3): ,

[81] I. A. Shevelev, N. A. Lazareva, G. A. Sharaev, R. V. Novikova, and A. S. Tikhomirov. Interrelation of tuning characteristics to bar, cross and corner in striate neurons. Neuroscience, 88(1):17–25.
[82] R. C. Smith and P. Cheeseman. On the representation and estimation of spatial uncertainty. International Journal of Robotics Research, 5(4):56–68.
[83] W. J. Smith. Modern Optical Engineering: The Design of Optical Systems. McGraw-Hill, New York.
[84] L. R. Stark and D. A. Atchison. Pupil size, mean accommodation response and the fluctuations of accommodation. Ophthalmic and Physiological Optics, 17(4).
[85] N. W. Swindale. A model for the formation of ocular dominance stripes. Proc. of the Royal Society of London, Biol. Sciences, 208(1171).
[86] N. W. Swindale. A model for the formation of orientation columns. Proc. of the Royal Society of London, Biol. Sciences, 215(1199).
[87] N. W. Swindale. The development of topography in the visual cortex: a review of models. Computation in Neural Systems, 7.
[88] J. Thiem, C. Wolff, and G. Hartmann. Biology-inspired early vision system for a spike processing neurocomputer. Biologically Motivated Computer Vision.
[89] R. B. Tootell, M. S. Silverman, E. Switkes, and R. L. De Valois. Deoxyglucose analysis of retinotopic organization in primate striate cortex. Science, 218(4575).
[90] M. von Waldkirch, P. Lukowicz, and G. Tröster. Defocusing simulation on a retinal scanning display for quasi accommodation-free viewing. Optics Express, 11(24).
[91] Y. Wang. On cognitive informatics. Brain and Mind: A Transdisciplinary Journal of Neuroscience and Neurophilosophy, 4(2).
[92] Y. Wang. The OAR model of neural informatics for internal knowledge representation in the brain. The International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 1(3):64–75.
[93] Y. Wang, D. Liu, and Y. Wang. Discovering the capacity of human memory. Brain and Mind: A Transdisciplinary Journal of Neuroscience and Neurophilosophy, 4(2).
[94] A. L. Yarbus. Eye Movements and Vision. New York: Plenum Press.


More information

Human vs. Robotic Tactile Sensing: Detecting Lumps in Soft Tissue

Human vs. Robotic Tactile Sensing: Detecting Lumps in Soft Tissue Human vs. Robotic Tactile Sensing: Detecting Lumps in Soft Tissue James C. Gwilliam1,2, Zachary Pezzementi 1, Erica Jantho 1, Allison M. Okamura 1, and Steven Hsiao 2 1 Laboratory for Computational Sensing

More information

How To Use Neural Networks In Data Mining

How To Use Neural Networks In Data Mining International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and

More information

Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

More information

INTRODUCTION TO RENDERING TECHNIQUES

INTRODUCTION TO RENDERING TECHNIQUES INTRODUCTION TO RENDERING TECHNIQUES 22 Mar. 212 Yanir Kleiman What is 3D Graphics? Why 3D? Draw one frame at a time Model only once X 24 frames per second Color / texture only once 15, frames for a feature

More information

Vibrations can have an adverse effect on the accuracy of the end effector of a

Vibrations can have an adverse effect on the accuracy of the end effector of a EGR 315 Design Project - 1 - Executive Summary Vibrations can have an adverse effect on the accuracy of the end effector of a multiple-link robot. The ability of the machine to move to precise points scattered

More information