Computer Vision for Autonomous Mobile Robots
Computer Vision for Autonomous Mobile Robots

Dissertation submitted for the doctoral degree in natural sciences (Doktor der Naturwissenschaften) at the Department of Mathematics and Computer Science of the Freie Universität Berlin, Institut für Informatik, Artificial Intelligence Group (Arbeitsgruppe Künstliche Intelligenz), Takustr. 9, 14195 Berlin.

Submitted by Felix von Hundelshausen, Berlin 2004.

Reviewers (Gutachter): Prof. Dr. Raúl Rojas and Prof. Dr. E. D. Dickmanns.
Date of the oral defense (Disputation):
2 Preface When starting my thesis and my work in the Middle Size league of RoboCup, I always intended to use RoboCup as an example domain for methods that should later work somewhere else. I am glad that I was able to find an algorithm that was very useful for our soccer robots, but which will also be of great use for other systems. The algorithm is able to efficiently track large homogeneous regions in image sequences. My second aim was to try to use shape information for robot localization and navigation. Here, I developed a method that is specialized to the RoboCup scenario in that it tries to recognize a palette of features in the very specific RoboCup field lines. However, the way in which detection is performed can be transferred to other scenarios. It has always been difficult to achieve the balance between a reliable, working system in practice and the aim for inventing new methods. Often, simple methods work better in practice. In the end, I am glad that it was not a balancing act but rather a convergence of the two aspects. All the methods that are proposed in the thesis were used in our final six robots participating at the world championships in Lisbon 2004, where we placed fourth. Acknowledgements First, I want to thank Prof. Dr. Raúl Rojas, who always had time for me. Over all years he always slept in the same youth hostels with us when we participated in competitions and he never went to sleep before the last of us stopped working. Often it was Mark Simon who worked all the night. Then, I want to thank our team: Ketill Gunnarsson who removed all the compiler warnings from our source code, Fabian Wiesel, Slav Petrov, Fabian Ruff, Oliver Tenchio, Michael Schreiber, Wolf Lindstrot, Henning Heinold, Holger Freyther and Achim Liers who designed our electronics. Everything was only possible with their help. Also, I want to thank Detlef Müller and his group at the precision mechanics section of our physics department who accepted some last-minute orders and were able 1
3 to produce these parts in a very short time period. Thanks also to Toni Zollikofer from Motorola, whose company provided us with micro-controllers, and to JVC sponsoring us with seven laptops. Then I want to thank Derek Daniel, who corrected my English. Finally, I want to thank Prof. Dr. E. D. Dickmanns. His explanations and advice greatly helped me to understand the control theoretic backgrounds that are important in today computer vision. 2
Abstract

This thesis has been written in conjunction with our engagement in the Middle Size league of RoboCup, where autonomous mobile robots play soccer. In particular, it is about the computer vision system of the robots, which supplies the necessary visual information. The main contribution is a new image processing technique that allows efficient tracking of large regions. The method yields the precise shape of the regions and is the basis for several other methods described in this thesis. They comprise a new localization method enabling the robots to determine their precise position by perceiving the white field lines. In particular, the robots are able to perform real-time recognition of a whole palette of features, including the center circle, T-junctions and corners. If a situation occurs where no feature can be recognized, another new method, the MATRIX method, is applied. It uses a pre-computed force field to match the perceived field lines to the corresponding lines in a model. Overall localization is then performed in a three-level fusion process, which precisely takes into account the different time delays in the system. The approach has been demonstrated to work in practice: the robots played over 10 games at the 2004 world championship in Lisbon, where the team achieved fourth place. Although the system was conceived for participation in RoboCup, the region tracking method in particular will be of great use for many other applications.
Contents

Preface

1 Introduction
   Motivation
   The Middle Size League
   The FU-Fighters' Middle Size Robot
   Localizing the Robot by the Field Lines
   Organization of the Thesis

2 Related Work
   Navigation Using Laser Range Scanners
   Navigation Using GPS and DGPS
   Navigation Using Radar
   Navigation Using Infrared Proximity Sensors
   Navigation Using Ultrasonic Sensors
   Navigation by Vision
   Typical System Architectures in RoboCup
   Existing Methods for Field Line Extraction
      Applying Thresholding and Thinning to Extract the Lines
      Using the Canny Edge Detector to Extract the Lines
      Using the Radial Scan Method to Extract the Lines
      Using a Model to Extract the Lines
   Existing Methods for Robot Self-Localization Using the Field Lines
      Monte Carlo Localization
      Global Localization by Matching Straight Lines
      Relative Localization
   Methods for Feature Detection

3 A new Algorithm: Tracking Regions
   Extending the Region Growing Paradigm
   Region Growing by Pixel Aggregation
   The Key Observation
   Shrinking Regions
   Alternating Shrinking and Growing
   Applicability
   Running Time
   Controlling the Tracking
   Homogeneity Criterion
   Tracking Several Regions
   Boundary Extraction
   Extracting the Field Lines by Tracking Regions
   Results

4 A new Localization Method Using Shape Information
   Three Layers for Robot Self-Localization
   The Robot's System State
   Coordinate Systems and Transformations
   Relationship between Wheel Rotations and the Robot's Movement
   The Dynamic Model
   Using a Kalman Filter to Fuse the Three Layers
      Fusing Delayed Measurements
      Splitting the Kalman Cycle
      Explicit Representation of Time
   Layer 1: Odometric Information
   The Observation Model
      The Omni-Directional Vision System
      The Distance Function
      Predicting the Location of Objects in the Image
      Transformation of Two-Dimensional Points on the Field
      Transformation of Arbitrary 3D Points
      Transforming the Contours into World Space
   Modelling the Field Lines
   Layer 2: Relative Visual Localization
      MATRIX: A Force Field Pattern Approach
      Adapting the System Dynamics Approach
   Layer 3: Feature Recognition
      Representation of the Line Contours
      Quality and Quantity of Features
      Direct Pose Inference by High-Level Features
      Smoothing the Lines
      Splitting the Lines
      Corner Detection
      Classification
      Constructing Arcs and Straight Lines
      Grouping Arcs and Detecting the Center Circle
      Refining the Initial Solution of the Circle
      Determining the Principal Directions
      Discarding Unreliable and Grouping Collinear Lines
      Detecting the Corners of the Penalty Area
      Results of the Feature Detection
   Results of the Overall Localization

5 Conclusions and Future Work
   Considering the System Dynamics Approach
   Automatic Modeling
   Automatic Learning of Feature Detectors
   The Problem of Feature Selection
   Top-Down Versus Bottom-Up Methods
   Criticizing the Proposed Feature Recognition Approach
   The Problem of Light

6 Summary of Contributions

A Pseudo-code of Described Algorithms
   A.1 Thinning
   A.2 Smoothing the Line Contours
   A.3 Calculation of a Curvature Measure
   A.4 Extracting Local Maxima in Curvature

B Source Code of the Region Tracking Algorithm
   B.1 The Homogeneity Criterion
   B.2 The Don't-touch Array
   B.3 Initializing the Region Tracker
   B.4 FU_RegionTracker.h
   B.5 FU_RegionTracker.cpp
   B.6 FV2.h
   B.7 FV2.cpp
   B.8 FU_Contours.h
   B.9 FU_Contours.cpp
   B.10 UTIL_HEAP.h
   B.11 UTIL_TOPFX.h

Bibliography
9 Chapter 1 Introduction This thesis is about a computer vision system for autonomous mobile robots. Building robust vision systems is important in order to allow a mobile robot to perceive its environment and to navigate and operate in the real world using visual information. Since the real world is very complex, we chose the RoboCup Middle Size league as a test scenario, where fully autonomous robots play soccer. This thesis describes a new image segmentation algorithm that is useful for a large domain of applications. When dealing with a sequence of images, existing methods compute the segmentation for each image from scratch. In contrast, the new algorithm adjusts the old segmentation to the new, only accessing a small portion of the image data and is thus more efficient. Furthermore, a new method for robot self-localization by perceiving the field lines in the RoboCup environment is presented. The method for extracting the field lines builds upon the new image segmentation method. The field lines are represented in a form that allows the efficient recognition of features like the center circle or corners at the penalty area. Robust, precise and efficient localization is then achieved in a three-layered process that combines dead-reckoning, relative registration and feature recognition. Here, relative registration is a method that allows the correction of small localization errors without the need for recognizing features. 5
1.1 Motivation

One of the dreams of Artificial Intelligence researchers is the construction of cognitive robots that can flexibly operate in the real world. However, for a long time research in Artificial Intelligence was very theoretical, having brought forth IBM's supercomputer Deep Blue, which checkmated Garry Kasparov in 1997, but not having achieved a cognitive system that would be able to perceive, reason and act in our world. Therefore, a new paradigm emerged: the situated agent approach. According to this approach, in order to make real advances in Artificial Intelligence, one has to build real embodied systems that perceive, think and act in the real world. "Here and now" is the key phrase that emphasizes an abandonment of the unrealistic assumptions made before. Our world is dynamic. A lot of unknown and unpredictable situations occur, and one has to react immediately without the luxury of infinite time to plan or think about a situation. However, when dealing with real systems, the danger is that science gets lost in engineering details, and this danger is increased if all researchers work on specialized, incompatible applications. The need for a benchmark problem to bundle research efforts was obvious, and in 1992 Alan Mackworth proposed the idea of soccer-playing robots to the Artificial Intelligence community [56]. Such robots must be able to perceive and control the ball, avoid obstacles, and communicate and cooperate with other robots, all in real-time. The idea was widely accepted, and after five years of discussion and feasibility studies, it finally led to the foundation of the RoboCup research initiative in 1997. Since then, world-wide robotic soccer competitions have been held annually, with over 150 teams participating in the meantime. The competitions are organized in different leagues in order to promote individual system aspects as much as possible: the Small Size League, with small robots playing on a field having approximately the size of a ping-pong table; the Middle Size League, with bigger robots playing on a 12 × 8 m field; the Sony Legged League, based on Sony's quadruped dog-like robots; the Simulation League, where soccer agents are simulated; and finally the Humanoid League, with the focus on walking, human-like robots. Our research group participates in the Small Size League in the form of the FU-Fighters, who succeeded in placing second at the world championships several times since 1999 and achieved a first place in 2004. In 2002 we started our engagement in the Middle Size League. The main difference between the two leagues is that in the Small Size
League, the robots have no integrated perception. Instead, an external camera observes the game from above and an image processing system retrieves the necessary information to calculate the behavior of the robots. Thus, all the calculations are carried out by a central system external to the robots, and the robots simply execute motor control commands sent through a radio link. In contrast, the robots in the Middle Size League are fully autonomous. They have a built-in color camera and an embedded computer for visual perception and behavior computation. The visual perception is the most critical part of the system because all the other sub-systems depend on it. If an obstacle is not recognized, collision avoidance fails, and if the position and the velocity of the ball are miscalculated, ball control becomes impossible.

1.2 The Middle Size League

In the Middle Size League, the soccer robots are fully autonomous. They integrate perception, behavior and motor control. Figure 1.1 shows a kick-off at the championships in Padova 2003.

Figure 1.1: The Middle Size League in Padova 2003.
The regulations of the Middle Size League change continuously. The size of the playing field, for instance, has changed from 10 × 5 m in 2003 to 12 × 8 m in 2004. Also, the structure of the field lines has changed, as illustrated in figure 1.2 and figure 1.3.

Figure 1.2: Model of the field lines in Padova (2003).

Figure 1.3: Model of the field lines in Lisbon (2004).

The field lines consist of the border lines, having a width of 12.5 cm, and internal lines, having a width of 5 cm. The internal lines consist of the center line, the center circle (radius 1 m), the lines marking the goal and penalty areas and the quarter circles (radius 40 cm) at the corners. In 2003 a 10 cm barrier around the playing field was introduced in order to prevent the robots from leaving the field. In the following chapters, some figures have been made in 2003 and therefore refer to the field model of 2003, while others, in particular those describing the feature recognition process, are based on the field lines of 2004. Since 2004, each team consists of up to six robots, one being the goal keeper. The robots must not exceed the maximum size prescribed by the rules and are fully autonomous, i.e. sensors, vision and motion control are on-board. The robots use wireless communication to synchronize their team play. The match lasts two equal periods of 10 minutes, and no external intervention by humans is allowed, except to insert or remove robots from the field. At present, relevant landmarks on the field are color-marked: the ball is orange, the goals are yellow and blue, and the corner posts have yellow and blue stripes. The robots are almost black and carry color markers in magenta or light blue.
1.3 The FU-Fighters' Middle Size Robot

Soccer-playing robots must have flexible moving capabilities in order to be able to rapidly move behind the ball or dribble the ball around obstacles. Therefore, our robots have an omni-directional drive which consists of three geared 90 Watt DC motors. They drive three independent special wheels, allowing the robot to instantaneously move and rotate in any direction from any configuration.

Figure 1.4: The Middle Size robot of the FU-Fighters team.

The wheel design is illustrated in figure 1.5. The original idea, the Mecanum wheel, was invented in 1973 by Bengt Ilon, an engineer with the Swedish company Mecanum AB. Our design consists of a central wheel with 12 air-cushioned rollers placed perpendicularly around the periphery of the wheel. The small rollers allow the wheel to move freely orthogonally to the wheel direction (the passive direction), while a force can be transmitted to the ground along the rotational direction of the large central wheel (the active direction). The three wheels are configured as two side wheels and one back wheel, as can be seen in figure 1.6. By arranging them at different angles and driving them with individual motors, the forces can be combined to produce a total translational force vector and a rotational force in any desired direction. Figure 1.6 illustrates the relationship between the individual wheel movements and the resulting movement of the robot by some examples.
Figure 1.6: Some examples illustrating the wheel movements when the robot (a) moves forward, (b) moves to the right, (c) moves aslant forward, or (d) rotates to the right.

The profile of our robots was designed to produce a low center of gravity. This allows our robots to move at high speed and prevents over-balancing. The base of the robot is 40 by 40 cm, but the height of the base is just 10 cm. On top of this base we placed a 1.3 kg laptop. The laptop is connected through the serial interface to the control electronics for the wheels. The control electronics are based on a microcontroller of Motorola's HC family. It runs a program that reads the control commands, which consist primarily of a desired speed vector and a rotational velocity. The microcontroller calculates the required speeds of the individual wheels and regulates the necessary power. Each motor has an impulse generator that generates 64 impulses per revolution. Since each motor has a 12:1 gear, 768 impulses arise per wheel revolution. This information about the wheel rotations is used for feedback control, and it is also sent back to the laptop in order to supply the computer vision component with the available motion data.
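To make the relationship between the commanded robot motion and the individual wheel speeds concrete, the following minimal C++ sketch converts a desired translational velocity and rotational velocity into the surface speed of each wheel. The wheel mounting angles and the wheel-base radius are illustrative assumptions, not the actual parameters of the FU-Fighters robot or its firmware.

// Minimal sketch of the inverse kinematics of a three-wheel omni-directional
// drive: the commanded robot motion (vx, vy, omega) is converted into the
// surface speed of each wheel. Wheel angles and base radius are assumed values.
#include <cmath>
#include <cstdio>

int main() {
    const double kPi = 3.14159265358979323846;
    // Two side wheels and one back wheel (angles of the wheel positions
    // around the robot centre, measured from the robot's forward axis).
    const double wheelAngle[3] = { kPi / 6.0, 5.0 * kPi / 6.0, 3.0 * kPi / 2.0 };
    const double baseRadius = 0.20;   // distance from centre to each wheel [m] (assumed)

    // Desired robot motion: translation (vx, vy) [m/s] and rotation omega [rad/s].
    const double vx = 1.0, vy = 0.5, omega = 0.8;

    for (int i = 0; i < 3; ++i) {
        // Project the translational command onto the wheel's active (driven)
        // direction and add the contribution of the rotation.
        double v = -std::sin(wheelAngle[i]) * vx
                 +  std::cos(wheelAngle[i]) * vy
                 +  baseRadius * omega;
        std::printf("wheel %d: surface speed %.3f m/s\n", i, v);
    }
    return 0;
}

The same relation, read in reverse, is what turns the measured encoder impulses of the three wheels into the odometric motion estimate mentioned below.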
This odometric information is valuable for the visual processing, because the location of objects and their features in the images can be predicted more rapidly and precisely. The total weight of the robot, with laptop, is below 7 kg. The low weight of the robots makes it possible to use a single pack of ten NiMH batteries (12 V / 8000 mAh) for supplying the motors and control electronics. The laptop provides the necessary power for the FireWire camera, which yields up to 30 frames per second and supports a higher maximum resolution than we need. We determined that a resolution of 320 × 240 pixels is sufficient, and we use this resolution at 30 frames per second.

Figure 1.5: The omnidirectional wheel, with its active direction (driven by the motor) and its passive direction (free-rolling rollers).

A camera was placed on top of the robot, at 40 cm height, looking onto an omni-directional mirror that is supported by three acrylic glass bars. The convex mirror above the camera reflects the light from all 360 degrees around the robot, and thus the robot has an omni-directional view. A typical image as seen from the robot's perspective is depicted in figure 1.7.

Figure 1.7: A typical image as seen from the robot's perspective: the reflection within the mirror above the camera. The centre of the mirror reflects parts of the robot itself (self-reflection), for example the plate where the camera is mounted.
17 1.4 Localizing the Robot by the Field Lines In the first years of the Middle Size League, walls had been placed around the playing field and the most successful teams used laser range scanners as the primary sensor, yielding range scans in which the surrounding walls were clearly represented. Localization was achieved by matching the range scans to a model of the walls [38]. However, in 2002 the walls were removed and laser range scanners became difficult to use, because the rotating laser beam often hit unpredictable objects outside the playing field, for instance spectators. Therefore, many teams removed their laser range scanners and replaced them with omni-directional vision systems, the currently most-used system setup. Together with the laser range scanners, the localization based on matching range scans vanished. Instead, methods based on landmark detection and triangulation were mainly used. However, using visual information, it should be possible to detect the white field lines and to match them to a model for initial pose estimation, similarly to what was previously done with the laser range scans. For robot self-localization in the Middle Size league, it would seem to be much easier and efficient to use the goals and corner posts for localization, and indeed, many teams use them as landmarks for triangulation or Monte Carlo localization methods. However, using the field lines has some advantages. The first are of practical nature: Localization will be more precise, since there are always some lines close to the robot. Although occlusion by other robots happens, it will almost never occur that all potentially visible lines are hidden. Furthermore, landmarks like the colored corner posts and goals will be removed in the future. However, the most important reason is the aim to research methods that can not only be used for RoboCup, but also for other applications. The field lines contain a lot of shape information, so developing methods that can extract this shape information in real-time and allow the recognition of objects by this shape information will be of great use for other applications. The field lines also allow the investigation of an interesting question: For initial localization, the extracted lines have to be matched to a pre-defined model, and the question is whether the recognition of sub-structures is useful to accomplish this task. Leaping ahead to the conclusions of this thesis, the question can clearly be answered in the affirmative. 13
18 1.5 Organization of the Thesis The remainder of this thesis is organized as follows: In chapter 2 we review the state of the art concerning mobile robot navigation and relevant computer vision methods. In chapter 3, a new method for tracking large homogeneous regions is proposed, which is the base of the entire computer vision system and the main contribution of this thesis. It is able to compute the segmentation of a new image from the old accessing only a small fraction of the image data (about 10 percent on average). Running the system at 30 frames per second, the method needs only 5-10 percent of the processing power of a Pentium III 800 MHz processor. Nevertheless, the method tracks large image regions and is able to compute their precise boundary contours. In our application, the method is used to track the green regions of the playing field and the white field lines are extracted by searching along the boundaries of the green regions. In chapter 4, we describe a new localization technique. It consists of three layers: Dead reckoning, relative registration and feature recognition. Dead reckoning allows to estimate the movement of the robot by measuring the wheel rotations. Relative registration assumes the initial knowledge of the robot s pose and is able to correct small positional errors by matching the extracted field lines to a model. Relative registration works without recognizing structural information. Finally, we present an efficient method to recognize features like the center circle and corners at the penalty area. This feature recognition process constitutes the third level in our localization technique and is the reason for its robustness. Even when odometry information is incorrect or when the robot is suddenly manually placed in another location, the method is able to find the robot s true position after a short time period (typically below 1s). We conclude and discuss this thesis in chapter 5. As a result of the discussion, directions of future work are indicated. Finally, chapter 6 summarizes the contributions of this thesis. 14
19 Chapter 2 Related Work The following chapter is organized into two parts: In the first part, which consists of sections one to five, we describe current research from an application-based point of view. Here, we do not restrict the focus on computer vision, but ask more generally for perceptional robot navigation techniques including laser range scanners, sonars, etc. In the second part, sections 2.5 to 2.10, we change to an algorithmic point of view and concentrate on computer vision techniques, in particular techniques that relate to our specific RoboCup problem, namely robot self-localization by the field lines. Navigation is an essential and old problem that has led to the founding of several institutions (i.e. Deutsche Gesellschaft für Ortung und Navigation 1 (German Institute for Navigation) or the US Institute of Navigation 2 ). Navigation essentially works by combining some kind of perception with prior knowledge of the environment, such as natural or artificial landmarks or properties of the environment like the magnetic field of the earth. In RoboCup, the environment is the playing field and it is well-known in advance. In different scenarios, i.e a robot exploring a planet, the robot must gain the knowledge itself, that is, it has to build a map that is then used as a frame of reference for localization. This problem of Simultaneous Localization and Map Building (SLAM) is a difficult and important issue in current mobile robotics research. Interestingly, the SLAM problem is, to a high degree, independent of the sensors that are used, and it is helpful to consider not only current research that is based on computer vision but also the work which is related to other sensor systems. Often, methods and means to solve
problems in one sensor modality can be transferred to other sensor systems. For instance, techniques that were originally applied using laser range scanners can be transferred to omni-directional vision systems, as is the case with the radial scan method, which will be described later.

2.1 Navigation Using Laser Range Scanners

Using a laser range scanner, a laser beam rotates in a plane and measures distances to objects in the environment. The distance is measured by determining the time between emission and detection of the reflected light. The most common laser range scanners are from SICK AG and RIEGL GmbH. Some of their products are shown in figure 2.1.

Figure 2.1: The most common laser range scanners are from SICK AG and RIEGL GmbH. Different products for indoor and outdoor applications exist: the SICK LMS indoor scanner (180°, range 80 m, statistical error 5 mm), the SICK LMS outdoor scanner (180°, range 80 m, statistical error 1 cm) and the RIEGL Q140i outdoor scanner (60°/80°, range 450 m, statistical error 2.5 cm).

A laser range scan consists of a list of points naturally represented in polar coordinates. Each point specifies the angle and the corresponding range measurement at that angle. Since the laser beam rotates continuously, the points are sorted by their angles from the very beginning, which simplifies some important operations on the scans, e.g. when trying to extract straight line segments from a scan. Currently, laser range scanners are employed in many robotic systems. At the Australian Center for Field Robotics (ACFR), a utility vehicle (see figure 2.2) has been equipped with a scanning laser range finder, and navigation and map-building algorithms are being investigated in a park-like setting [37, 46].

Figure 2.2: At the Australian Center for Field Robotics (ACFR) a utility vehicle is equipped with a laser range scanner for navigation in a park-like environment. Here, the focus of research is on Simultaneous Localization and Map Building (SLAM) algorithms. (With kind permission from Dr. E. M. Nebot.)

Most of the teams competing in DARPA's Grand Challenge use laser range scanners for obstacle avoidance, see for instance [80]. Here, the team whose robot completed the 143-mile course across the Mojave Desert from Barstow, California to Primm, Nevada (the course in 2004) in the least amount of time, and in at most 10 hours, would win one million dollars. Figure 2.3 shows the race vehicle of Carnegie Mellon University's Red Team, which performed best among all teams in 2004. The Grand Challenge has revealed some disadvantages of laser range scanners: they can have difficulties with dusty environments. Also, vibrations of the vehicle, tilting the laser scanner, can produce serious problems. In non-planar environments, small angular disturbances can lead to largely varying intersections between the laser beam and the environment, producing highly unstable, varying range scans.

Figure 2.3: The Carnegie Mellon University team's race vehicle, Sandstorm, is an M998 HMMWV vehicle modified for autonomous race driving. It uses three SICK and one RIEGL laser range scanner, primarily for obstacle detection. (With kind permission from Prof. William L. "Red" Whittaker.)

Laser range scanners have been successfully used for indoor applications. In [33], laser range scanners were applied to localize robots within a museum. Also, laser range scanners were successfully applied in the Middle Size league of RoboCup. In particular, they were useful at the time when the playing field was surrounded by walls. The profile of the walls and the goals could easily and efficiently be extracted from the scans and matched with a model. The work of Steffen Gutmann [38] provides a good overview of existing methods to process the range scans and match them to simple, polygonal models. Another nice approach was taken in [55]. Here, the idea was to register successive range scans against each other to correct for errors in odometry. The approach is similar to the well-known iterative closest point algorithm (ICP) [9]; however, it combines two different heuristics to obtain a more precise matching. The approach also works in curved environments.

2.2 Navigation Using GPS and DGPS

The global positioning system (GPS), also called NAVSTAR, was developed and realized by the US Department of Defense. Its mission is defined as: "Provide the US Armed Forces and Allies with GPS user equipment for precise worldwide navigation anytime, anyplace, anywhere." [1]
Despite its military origins, GPS is currently used by an enormous number of civilian applications. The current worldwide GPS industry is estimated at approximately $8 billion, and there are about four million GPS users worldwide. Today, GPS operates with 24 satellites orbiting the earth. Their orbits are steadily observed by a master control station located in Colorado Springs and corrected if necessary. The idea was to arrange the orbits of the satellites into a constellation such that the signals of at least three, better four, satellites can be received anywhere on earth. However, this satellite coverage has not been completely achieved; gaps in coverage are consistently reported. Moreover, orbits are changed during military conflicts to improve signal reception in specific regions and to make reception difficult for the enemy. Since GPS is a military system, it was not intended to provide full precision for all potential users. Thus, the satellites send signals on two different frequency bands, one for civilian users (f1 = 1575.42 MHz) and one for authorized users (f2 = 1227.60 MHz). The frequency f1 is modulated with a C/A code (Coarse Acquisition), while f2 is modulated with a P code (Precise Code). The non-authorized user only has access to the C/A code, not to the P code. The C/A signal is a PRN code (Pseudo Random Noise code). It consists of a sequence of 1023 chips that are generated by a simple state machine. At first glance, the sequence seems to be random, but it is not, as it precisely depends on the cycles of the internal clocks, hence the name "pseudo random". The satellites and the GPS receivers have synchronized clocks. In the case of the satellites, atomic clocks are used, which are synchronized to so-called GPS Time (offset by a constant from International Atomic Time (TAI)). International clock comparisons are routinely performed via GPS with an accuracy on the order of 50 nanoseconds. Both the satellites and the receivers generate the same PRN code signal, which depends on the clock cycle. When the receiver receives the signal of a satellite, it can calculate its time-of-flight by determining the shift between the received and the self-generated signal. This shift can easily be determined by cross-correlation. Having calculated the time-of-flight, the distance to the satellite is given. That is, with a single satellite, the position of the receiver can be limited to a sphere centered at the position of the satellite. Intersecting the spheres from three satellites, the position on earth can be determined. A fourth satellite allows the receiver's clock offset to be resolved as well, so that even the elevation of the GPS receiver can be determined precisely.
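As an illustration of the shift determination, the following C++ sketch correlates a receiver-generated replica code against a delayed copy and converts the best-matching shift into a range. It is a toy example under simplifying assumptions: it uses a generic pseudo-random ±1 sequence rather than a real C/A Gold code, ignores noise, Doppler and the receiver clock offset, and the numbers are only meant to show the principle of cross-correlation.

// Toy illustration of estimating the code shift of a PRN sequence by
// circular cross-correlation, as a GPS receiver does conceptually.
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
    const int kChips = 1023;               // length of one C/A code period
    const double kChipRate = 1.023e6;      // chips per second
    const double kSpeedOfLight = 299792458.0;

    // Receiver-side replica of the code (deterministic pseudo-random +/-1).
    std::vector<int> replica(kChips);
    std::srand(42);
    for (int i = 0; i < kChips; ++i) replica[i] = (std::rand() & 1) ? 1 : -1;

    // "Received" signal: the same code, circularly delayed by 300 chips.
    const int trueShift = 300;
    std::vector<int> received(kChips);
    for (int i = 0; i < kChips; ++i)
        received[i] = replica[(i + kChips - trueShift) % kChips];

    // Find the shift that maximizes the circular cross-correlation.
    int bestShift = 0;
    long bestCorr = -kChips - 1;
    for (int s = 0; s < kChips; ++s) {
        long corr = 0;
        for (int i = 0; i < kChips; ++i)
            corr += replica[i] * received[(i + s) % kChips];
        if (corr > bestCorr) { bestCorr = corr; bestShift = s; }
    }

    // The shift corresponds to a signal travel time and hence a distance.
    double tof = bestShift / kChipRate;
    std::printf("estimated shift: %d chips, corresponding range: %.0f km\n",
                bestShift, tof * kSpeedOfLight / 1000.0);
    return 0;
}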
Although the C/A signal does not allow full precision in localization, it was still believed to be too precise by the US military. Hence, the Selective Availability (SA) feature was introduced, which artificially scrambled the signals of the satellites. However, in February 2000, SA was ordered to be turned off by then-president Bill Clinton. The reason was that GPS had evolved into an important commercial factor, and competing products like the Russian GLONASS should not be favored by potential users. Moreover, the military claimed to have the ability to selectively turn SA back on in well-defined regions. Within the last years, a method called Differential GPS (DGPS) has come into favor. The idea is simple: since the errors of GPS are correlated within the same region, a fixed position on land, whose correct geographical location is known, is chosen as a reference point. A GPS receiver at this position determines its position as indicated by the GPS satellites, and then calculates the distortion by comparing the GPS-determined position against its known position. The land-based reference point then broadcasts the difference. Users use the broadcast difference to improve the calculation of their own positions. As a result, DGPS accuracy and integrity are better than those of GPS. Typically, a precision of approximately 10 m is achieved. Today, a world-wide net of DGPS reference stations exists, which provides the respective correction data. In programs like WAAS, EGNOS or MSAS, the data from different stations is collected and rebroadcast via satellite. Interestingly, the correction data is broadcast in the same frequency and format as GPS. Thus, in principle, any GPS receiver can receive the differential information, but only modern receivers are able to use the information for their calculations.

2.3 Navigation Using Radar

Short-range millimeter-wave radar is emerging in the field of mobile robotics [32, 19, 18]. The Red Team, participating in the Grand Challenge 2004, used the NavTech DSC200 continuous-wave frequency-modulated radar, which has a range of up to 200 meters. It provides 360° scanning at 2.5 Hz and an Ethernet interface to the range measurements. In contrast to vision and laser range scans, radar operates at a wavelength that penetrates dust and other visual obscurants, but it provides data that is more difficult to interpret. Objects are detected by finding amplitude peaks in the frequency-shifted return signal. The amplitude of the signal can be used to estimate the size, and thus the significance, of an object. However, this process can be confounded by surface properties of objects and the orientation of the object relative to the transceiver.
25 2.4 Navigation Using Infrared Proximity Sensors Infrared distance sensors typically can resolve distances between 10 and 80 centimeters. The most popular application is ROOMBA, a small robot cleaner from irobot 5 that uses the sensors for navigation. Some teams in the MidSize league of Robocup (i.e. the team Attempto Tübingen) have experimented with these sensors. However the lights illuminating the playing field have an infrared contribution that corrupts the range measurements. 2.5 Navigation Using Ultrasonic Sensors Ultrasonic sensors have been widely used in indoor applications [28], but they are not adequate for most outdoor applications due to range limitations and bearing uncertainties. In underwater robotics, sonar sensors are often used [22]. Current work on undersea vehicles concentrates on the development of accurate position and attitude estimation, as well as control methods using scanning sonar to provide terrain-aiding information from either a man-made structure or from the sea bed. Key elements of this work include the development of sonar feature models, the tracking and use of these models in mapping and position estimation [84], and the development of low-speed platform models used in vehicle control. Ultrasonic sensors are subject to noise and sporadic false readings, just like any other sensor system. In most operating environments (except for some shop floors), environmental ultrasonic noise is fairly rare. However, robots with multiple ultrasonic sensors may introduce their own noise, a phenomenon known as crosstalk. Crosstalk differs from other environmental noise because it introduces a systematic error that will repeatedly cause similar erroneous readings. A related kind of semi-systematic error is crosstalk that may occur when multiple mobile robots with multiple ultrasonic sensors operate in the same environment. In most indoor applications, crosstalk is much more likely to occur than environmental ultrasonic noise. In principle, sonar can be a basis for powerful sensing systems, as evidenced by certain animals such as bats or dolphins. However, the sonar systems used for mobile robots are usually rather simple ones, their simplicity and low cost being the very reason for choosing
26 Figure 2.4: The submersible, Oberon of the Australian Center for Field Robotics (ACFR) sonar as a sensing modality. It is not surprising that such systems are severely limited in their performance by low resolution, specular reflections, insufficient dynamic range and other effects. 2.6 Navigation by Vision A vision system perceives the light of the environment using a two-dimensional field of receptors. The human retina consists of approximately 120 million rods for blackwhite perception and 6 to 7 million cones for color perception. Current video cameras still are restricted to approximately 1 million receptors. Although higher resolutions are possible (photo cameras that can yield about 10 million pixels already exist), the primary restriction is the data bandwidth that is necessary to transfer video information at 25 to 30 Hertz for real-time vision. With this huge amount of information, the visual sense provides the richest information on the environment from all sensors. Such information is valuable for recognizing objects, understanding situations and controlling dynamic processes in real-time. When a human drives a vehicle, he mostly relies on his eyes. He uses his sense of vision not only for finding the path on which to drive and for estimating its condition, but also for detecting and classifying external objects 22
27 such as other vehicles and obstacles, and for estimating their state of motion. Entire situations may thus be recognized, and expectations as to their further development in the near future may be formed. The same is true for almost all animals. Many use vision as the main sensing modality for controlling their motions. Observing animals, for instance when they are pursuing prey or trying to escape a predator, may give an impression of organic vision systems performance in motion control. Some animals, like the Octopus, even use their visual sense to camouflage, changing their skin color or texture similar to their surroundings in order to hide (see figure 2.5). (a) (b) (c) Figure 2.5: Many animals rely on their visual sense. The cheetah (a) uses vision for hunting. The octopus (b) uses its visual sense to camouflage and the eagle(c) is known to have one of the sharpest vision systems (with kind permission from John Forsythe, Copyright John Forsythe) One apparent difficulty in implementing vision as a sensor modality for robots is the huge amount of data generated by a video camera: As mentioned, about 10 million pixels per second, depending on the vision system used. Nevertheless, it has been shown that even modest computational resources are sufficient for realizing real-time vision systems if a suitable architecture is implemented. In the late 1970s the idea of motion recognition and vehicle control by computer vision emerged in the context of planetary rover vehicles [78]. Typically, the required computers needed a quarter of an hour for image analysis. The vehicles moved a little and had to wait, standing still as long as the image was analyzed. Each image was interpreted from scratch. At that time, a different approach was taken by Dickmanns, coming from the field of 23
28 control engineering. His approach emphasized the importance of a high image frequency (larger than 10Hz) to exploit temporal and visual continuity in the scene. Modeling the change in camera location would allow prediction of the feature positions in the images and search space would be reduced drastically. Extending the well-established methods of the 1960s for recursive estimation of incomplete measurement data, he put forth the 4-D approach, the full state reconstruction in 3-D space and time, which was finally widely accepted in the community dealing with autonomous vehicles [8, 50, 59, 83, 90]. (a) (b) Figure 2.6: (a) The VaMP, Mercedes 500 SEL of the University of the Federal Armed Forces Munich, Germany. It is equipped with four video cameras, a millimeter wave RADAR and eight microprocessors for object recognition and autonomous vehicle guidance. (b) Linking profiles of similar gray values, hypothesis for objects are created. Once the hypotheses exist, tracking is based on the system dynamics approach using motion models of the objects. (with kind permission from Prof. E. D. Dickmanns) The approach is based on the famous Kalman filter [58]. It combines a motion model with measurements and yields least-squares estimates of the system state. For motion control applications, the measurements are typically taken from sensors such as position encoders, inertial sensors, GPS, etc. Dickmanns was the first who applied the Kalman filter framework with visual sensors [26]. What at the first glance simply looks like an application of the Kalman filter evolved to one of the most promising and elaborated frameworks world-wide. The primary issue is that instead of processing the whole image in order to retrieve some positional information, the observational model is used to predict the location and 24
29 appearance of features in the image. In this way, specialized, highly-efficient detectors that investigate only very small portions of the images can be applied. With this approach, real-time state estimation using visual information was possible as early as the 1980s, although computational power was relatively low. Another benefit of the approach is that it avoids the inverse projection problem. A point in the image plane corresponds to a ray in the 3D world and it is not possible to identify the depth at which the light was reflected by an object using only a single image. This problem results in singular matrices in many approaches. Using Dickmanns approach, this problem is avoided because the discrepancy between predicted and measured features is directly fed back using a filter matrix, which is based on the motion and observation models of the system. The work of Wünsche [85, 86] was another important milestone, since it showed how features from which the system state can best be estimated can be selected. The system dynamics approach requires the existence of an initial estimate of the system state, which typically includes the pose of the system. If this estimate is not available or not precise enough, the approach will fail. For this reason, an initial phase is run at the very beginning with the goal of yielding an initial estimate. The real-time phase follows, which is essentially the system dynamics approach. Initial and real-time phases don t have to be processed one after another; they can also be processed in parallel. Here, the real-time phase is executed in each frame while the initial phase can run at a lower rate. In more complex scenarios, as in the case of figure 2.6, the initial phase also has to yield hypotheses for other objects. The efficiency of the approach lies in the fact that it uses as much prior information about the application as possible. This information includes a model of the environment, the dynamics of the system and the observation model, which describes the geometric relationship between objects in the world and their appearance in the image. It is of immense importance to use this prior knowledge. When a change in the images is detected, this change might be caused by sensing noise, by a movement of the system, or by a change of the environment. Use of prior knowledge enables us to identify which of the components most likely produced the change. Systems not using this information typically try to reduce noise in their position estimates by some smoothing operations. However, smoothing without taking a motion model into account, automatically produces a delay, which is poison for a control application. 25
The Dynamic Model

The motions of real mass bodies obey physical laws, which, since Newton (1643-1727), have been formulated in the form of differential equations. Each set of differential equations can be converted into a system of first-order differential equations. In this form, the equations simply express how the change of a state value depends on the values of the other states at time t:

\dot{x} = f(x(t), u(t), z(t), p) \quad (2.1)

Here, the state variables are summarized in the state vector x. The number of states depends on the system. An aircraft will typically have at least 12 components, for example. In the case of the soccer-playing robot, which will be investigated later, the vector will comprise 6 components. The state vector also includes the position of the system. The vector u describes the control input. This vector is important since the dynamics of the system can be predicted by considering the last control commands. Vector z takes unknown noise into account. Finally, p is a vector of constant parameters. Using the dynamic model, it is possible to predict the state over a time period T:

x^{*}(k) = x^{*}(k-1) + \int_{t_{k-1}}^{t_k} f(x(\tau), u(\tau), p)\, d\tau \quad (2.2)

However, the predicted system state x^{*}(k) will be subject to errors due to errors in modeling and the unknown noise z.

The Model of the Environment

The goal of using the observation model and the model of the environment is to be able to predict the location of features in the images. For instance, the environment can be modeled as a 3D wireframe model. The model has to include information about the material of the objects so that a prediction of their appearance in the images can be made. This model can be regarded as a 3D feature space.

The Observation Model

The observation model describes how the features p_i = (x_{p_i}, y_{p_i}, z_{p_i}) of the feature space are mapped onto the image plane. Here, x_{p_i}, y_{p_i}, z_{p_i} are the 3D coordinates of the i-th feature.
This mapping depends on the actual system state x and other known constants k_c (focal length, camera mounting position, etc.). The mapping g is typically nonlinear (e.g. in the case of a perspective camera):

y = g(p_i, x, k_c) \quad (2.3)

Here, y is the vector that describes how the feature appears in the image.

Correcting the Estimate of the System State

Given the predicted system state x^{*}(k), we can now predict the appearance y^{*}(k) of a specific feature in the image by using equation 2.3, which describes the optical mapping:

y^{*}(k) = g(p_i, x^{*}(k), k_c) \quad (2.4)

Since x^{*}(k) will be subject to errors as described above, y^{*}(k) will also be incorrect. However, the error is typically small, since the prediction is only made for a small time period T (T \approx 1/30 s). In order to correct the system state, we first determine the real vector y from the image. One of the essential advantages of the overall approach lies in this step: instead of processing the whole image, a specialized detector can be applied to only a small part given by y^{*}(k). This notion is illustrated in figure 2.7. Depending on the feature, the detector can, for instance, search for an edge along a small line fragment. The type of detector can also be predicted. Then the appropriate detector can be selected out of a pre-computed array of detectors, and the detector finally measures y(k). Now we consider the discrepancy between the predicted and the real measurement,

\Delta y(k) = y(k) - y^{*}(k), \quad (2.5)

and we feed back the discrepancy in the image plane onto the estimate of the system state:

\hat{x}(k) = x^{*}(k) + K(k)\, \Delta y(k). \quad (2.6)

Here, K(k) is the so-called filter matrix. The Kalman filter theory provides a method to determine this matrix in order to get a new least-squares estimate of the system state. For an in-depth introduction, the book [11] can be recommended. The filter matrix is calculated by:

K = P C^{T} (C P C^{T} + R)^{-1} \quad (2.7)
Figure 2.7: An aircraft is equipped with a video camera for an automated landing approach. The basic concept of the system dynamics approach is to use a few reliable features to update the estimate of the system state. The rectangular frame at the side of the runway shows the area that corresponds to the image region which is inspected by a feature detector. While typically more than one feature is used, the approach does not demand the presence of all features covering the degrees of freedom at the same time.

Here, P is the actual error noise covariance matrix. The number of columns and rows of this matrix equals the number of system states. R is the measurement noise covariance matrix. Its number of rows and columns corresponds to the dimension of y. Finally, the matrix C is the Jacobian matrix, which specifies how the appearance of the currently-used features changes to first order when the system state changes. More formally, C contains the partial derivatives of y with respect to the system state x. With y = (y_1, y_2, \ldots, y_m)^T and x = (x_1, x_2, \ldots, x_n)^T,

C := \begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\
\vdots & \vdots & & \vdots \\
\frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix} \quad (2.8)
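To make the prediction/correction cycle of equations 2.2 and 2.5-2.7 concrete, here is a minimal C++ sketch for a two-dimensional state (position and velocity along one axis) and a single scalar image measurement. The time step, noise covariances, the linearized observation row C and the measured value are illustrative assumptions, not parameters of the system described in this thesis, and the update of P after the correction is omitted for brevity.

// One prediction/correction cycle for state x = (position, velocity) and a
// scalar measurement y = C * x. All numbers are made up for the illustration.
#include <cstdio>

int main() {
    const double T = 1.0 / 30.0;            // frame period [s]
    double x[2] = { 0.0, 1.0 };             // state estimate: 1 m/s forward
    double P[2][2] = { {0.10, 0.0},         // state error covariance
                       {0.0,  0.10} };

    // --- Prediction (eq. 2.2 with a simple Euler step, constant velocity) ---
    x[0] += T * x[1];
    // P <- A P A^T + Q with A = [[1, T], [0, 1]] and diagonal process noise Q.
    const double Q[2] = { 1e-4, 1e-3 };
    double P00 = P[0][0] + T * (P[1][0] + P[0][1]) + T * T * P[1][1] + Q[0];
    double P01 = P[0][1] + T * P[1][1];
    double P10 = P[1][0] + T * P[1][1];
    double P11 = P[1][1] + Q[1];
    P[0][0] = P00; P[0][1] = P01; P[1][0] = P10; P[1][1] = P11;

    // --- Correction (eqs. 2.5 - 2.7) ---
    const double C[2] = { 50.0, 0.0 };      // Jacobian row: pixels per metre (assumed)
    const double R = 4.0;                   // measurement noise [pixel^2]
    double y_pred = C[0] * x[0] + C[1] * x[1];
    double y_meas = 2.1;                    // value returned by the feature detector (made up)
    double innovation = y_meas - y_pred;    // eq. 2.5

    // K = P C^T (C P C^T + R)^-1, which is a simple division for a scalar measurement.
    double S = C[0] * (P[0][0] * C[0] + P[0][1] * C[1])
             + C[1] * (P[1][0] * C[0] + P[1][1] * C[1]) + R;
    double K[2] = { (P[0][0] * C[0] + P[0][1] * C[1]) / S,
                    (P[1][0] * C[0] + P[1][1] * C[1]) / S };

    x[0] += K[0] * innovation;              // eq. 2.6
    x[1] += K[1] * innovation;

    std::printf("corrected state: position %.3f m, velocity %.3f m/s\n", x[0], x[1]);
    return 0;
}

Even though the measurement is a single pixel coordinate, the gain K spreads the correction over both state components according to their coupling in C and P, which is exactly the feedback mechanism the text describes.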
Wünsche's work [85, 86] shows how the Jacobian matrix can be used to decide which features to track. The underlying idea is that the column vectors should be linearly independent so that different features open up separate dimensions. Moreover, features with large partial derivatives should be preferred, since they indicate a close coupling between a component of the system state and the respective feature component. Wünsche developed a method that successively swaps currently-used features with unused prominent features in order to find an optimal set.

2.7 Typical System Architectures in RoboCup

Most RoboCup teams do not implement the system dynamics approach. The main reason for this is probably the complexity of the approach. Remember that this approach allows the restriction of image analysis to small parts of the images by making use of a dynamic model of the robot and the environment. There are a few systems that restrict image analysis to parts of the image. For instance, in [47], a collection of 1200 jump points is used. They are distributed over the image, and object detection is carried out based on these points. Similarly, in [77], a method has been adopted that samples pixels from the image, adapting the sampling resolution to the expected size of objects. These methods do not access all pixels, but they still process all areas of the image, not concentrating on specific locations where features are expected. Another example is [69], where prior knowledge about the orientation of the camera is used to predict the orientation of the horizon and respective perpendicular scan lines, which are applied for feature detection. However, they do not use a dynamic model of the robot and environment to feed back the measured discrepancy between prediction and measurement. In contrast to the above-mentioned systems, most systems in RoboCup are based on color segmentation of the whole image in each frame. This is done either in hardware (e.g. by the teams [7, 35]) or in software (e.g. by the teams [5, 12, 16, 4]). A popular software method is described in [15]. Most existing systems in RoboCup do not distinguish between an initial phase and a real-time phase, but rather run an initial phase in each frame. The general idea of the two phases is that first hypotheses, for instance for the robot's position, are generated in a computationally more expensive initial phase, and that the change of the state of the hypotheses is then kept track of in an efficient real-time phase. The real-time phase uses the hypotheses to predict the locations of features in the images and, therefore, only a small fraction of the image data has to be evaluated.

2.8 Existing Methods for Field Line Extraction

This section will review different methods to extract the field lines from the images. To illustrate the methods, we will use the same input image, which is depicted in figure 2.8.

Figure 2.8: This image will be used to review existing methods of line extraction.
Figure 2.9: Pixels whose intensities are below the threshold t have been drawn black ((a) t = 140, (b) t = 160, (c) t = 200). Applying different intensity thresholds demonstrates that a global threshold does not exist. In image (a) the threshold is too small to separate the centre circle from its surroundings. On the other hand, in image (b) the field line on the bottom right is missed. In image (c) false candidates like the wall of the goal are not present, but at the cost of missing field lines.

2.8.1 Applying Thresholding and Thinning to Extract the Lines

The field markings appear very bright in the images. Therefore, one of the simplest approaches for extraction is to use a threshold t to discriminate between field-line and non-field-line pixels. However, there are other bright objects, like the side walls of the goals for instance, and a pixel whose intensity exceeds t does not necessarily belong to a field line. Figure 2.9 shows the results of applying different thresholds and demonstrates that finding a global threshold that separates the field lines from the carpet is not possible. The reason is that the lighting changes across the playing field. In the worst case, the intensity of the green carpet at a bright location can exceed the intensity of a field line at a dark location. Moreover, the thresholds will vary depending on the robot's position and orientation on the playing field. We will ignore this problem for the moment and solve it later. Although the method touches all the pixel data of the image, it is very fast. The average running time was 1.3 milliseconds on our test platform (Pentium III, 800 MHz, image resolution 320 × 240; C++ code, no usage of MMX or SIMD instructions). Having extracted candidates for field lines, it is important to reduce the amount of data in order to reduce the computational load for further processing. One possibility is to employ a thinning algorithm as described in [25]. The pseudo-code and a short description are provided in appendix A.
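Before turning to the thinning step, the following C++ sketch illustrates the thresholding classification itself: every pixel of a luminance image is compared against a global threshold t. The image buffer and the threshold value are made up for the example; in the real setup the intensities would come from the camera image (e.g. the Y channel of a YUV frame).

// Minimal sketch of global intensity thresholding for field-line candidates.
#include <cstdio>
#include <vector>

int main() {
    const int W = 320, H = 240;
    const unsigned char t = 160;                  // global threshold (illustrative)
    std::vector<unsigned char> lum(W * H, 80);    // dark green carpet
    for (int x = 0; x < W; ++x) lum[120 * W + x] = 220;  // one bright horizontal line

    std::vector<unsigned char> mask(W * H);
    int candidates = 0;
    for (int i = 0; i < W * H; ++i) {
        mask[i] = (lum[i] > t) ? 255 : 0;         // field-line candidate or not
        candidates += (mask[i] != 0);
    }
    std::printf("%d candidate pixels above t = %d\n", candidates, t);
    return 0;
}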
In a simple, non-optimized implementation, the thinning takes 4.4 milliseconds. In particular, when using a video format where the intensity is represented separately from the color components, like YUV for instance, the method can be implemented very efficiently. Figure 2.10 shows the result of thinning the lines of figure 2.9b).

Figure 2.10: Result after thinning the lines of figure 2.9b).

The advantage of thresholding and thinning is that it is simple and fast. The main disadvantages are that not all lines will be detected, due to variations of the lighting, and that false lines will be extracted, due to bright objects like white walls outside the playing field. Experience has shown that the method worked well in the setup at the competitions in Padova 2003, because the field was brightly illuminated (1000 lux) from above and the lines were brighter than other white objects outside the playing field (with the exception of the white bars and side walls of the goals). However, the artificial illumination of the playing field is going to be removed by the RoboCup organization within the next few years (in 2004 it was already reduced), and the problem of detecting false candidates will become more serious. On the other hand, the problem of false candidates can be coped with when the position of the robot is known, because the position of the lines in the image can then be predicted, and detected lines can be discarded if they differ too much from the predicted ones. Note that the techniques described in this section are not applied in the real system; they are only described as a review. The techniques actually applied will be described beginning with chapter 3.

2.8.2 Using the Canny Edge Detector to Extract the Lines

Using the thresholding method, it is not possible to find a global threshold to extract the field lines, because of the different lighting conditions. However, each line should produce two edges, one at each side of the line. Detecting these edges should be more independent of the lighting, since for edge detection we compare the intensities of neighboring pixels, which should be affected similarly by the lighting. Extensive literature on edge detection is available. The most popular method is the Canny Edge Detector, which is due to [17]. The Canny operator performs several steps that are demonstrated in figure 2.11. First, the image is smoothed with a Gaussian filter. A typical example is the following 5 × 5 filter:
37 Figure 2.10: Result after having thinned the lines of figure 2.9b) G = Then a simple 2D first derivative operator is applied to the smoothed image to detect regions of the image with high first spatial derivatives. This process requires the convolution of the image with two filters, one for the x and one for the y gradient component, respectively. A common filter pair is the Sobel operator. S x = S y = From the gradient components, we compute the gradient magnitude, which is often approximated by the sum of absolute values (instead of the Euclidean distance) to speed up the computation. Edges give rise to ridges in the gradient magnitude image. The algorithm then tracks along the tops of these ridges and sets to zero all pixels that are 33
38 not actually on the ridge top so as to produce a thin line in the output, a process known as non-maximal suppression. The tracking process exhibits hysteresis controlled by two thresholds: T 1 and T 2, with T 1 > T 2. Tracking can only begin at a point on a ridge higher than T 1. Tracking then continues in both directions along a line from that point until the height of the ridge falls below T 2. This hysteresis helps to ensure that noisy edges are not broken up into multiple edge fragments. Figure 2.12 shows the final results for some different thresholds. Running the detector on our platform took 54 milliseconds. This slow performance is due to the many multiplications that have to be carried out during the three convolutions (5 5 smoothing filter, 3 3 x-sobel filter, 3 3 y-sobel filter). To detect the lines, different detectors might be applied. Possible solutions could be based on the Roberts Cross Edge Detector [70], the Sobel Edge Detector [66], the compass operator [71], edge detectors using the Laplacian of Gaussian, Gabor filters [62] or wavelet based solutions [57]. However, applying such a scheme to the whole image is time consuming. In particular, when processing 15 to 30 frames per second for real-time vision, it is important to restrict the area of the image to which a detector is applied. 34
39 (a) The original intensity values (b) Smoothed with a 5x5 Gaussian filter (c) x-component of the gradient (d) y-component of the gradient (e) Magnitude of the gradient (f) Nonmaximum suppression Figure 2.11: The Canny Edge Detector. The original image (a) is smoothed (b) and the components of the gradient are computed by convolution with a Sobel mask ((c) and (d)). The gradient magnitude is calculated (e) and nonmaximum suppression is performed (f). The result of edge extraction is shown in figure 2.12 for different thresholds. 35
40 T 1 = 160 T = T = 120 T = T 1 = 80 T 2 = 80 T 1 = 80 T 2 = 40 T 1 = 50 T 2 = 30 T 1 = 50 T 2 = 10 Figure 2.12: Results of the Canny Edge Detector using different thresholds. Edge tracking can only begin at intensities greater than T 1 and continues as long as intensities exceed T 2. 36
41 2.8.3 Using the Radial Scan Method to Extract the Lines The idea of the radial scan method was first described in my diploma thesis [81]. The idea is to employ feature detectors along rays starting at the center of the image. For omnidirectional vision systems, we use the center of the mirror in the image, which is identical to the center of the image in the ideal case. It is also possible to let the rays start at some distance from the center, rather that at the exact center. In our omni-directional images, for instance, the robot is projected to the center of the images, and when searching for the field lines, we want to exclude the area of the robot. The advantages of the rays are first, that only a small fraction of the image data is accessed, and second that a direction is given. Therefore, direction-specific feature detectors can be applied. This is an advantage, because having a given direction, we can look for feature patterns along that direction. Here, feature pattern means that we can look for a specific order of features. To detect the field lines for instance, we can search for a color transition from green to white followed by a transition from white to green. Here, the order of transitions is along the direction of the ray. If this direction was not given, we would have to apply detectors for all possible directions, which would be computationally very expensive. An efficient method of how to find the color transitions is proposed in [29]. Figure 2.13a) shows a radial scan used for the purpose of detecting the white lines. In a first approach, we use the thresholding technique as a detection scheme. Figure 2.13b) shows the detected points. Since the pixel locations of the rays can be pre-calculated, the method can be efficiently implemented. The thresholding technique as applied in the example of figure 2.13, which uses 200 rays of approximately 150 pixels each, runs in 0.62 milliseconds 7, which is more than twice as fast as applying the thresholding technique to the whole image. One disadvantage of the method is that features that do not intersect with the rays are not detected. For instance, a field line that is nearly parallel to the rays will only be intersected a few times. In chapter 3, a new method that does not have this limitation will be proposed. Although the radial scan is no longer applied in our system, it has been adapted by several other teams (i.e Brainstormer Tribots). 7 Pentium III, 800 MHz, image resolution of
42 (a) (b) Figure 2.13: The Radial Scan method evaluates image information along rays. (a) Accessed pixels are drawn black. (b) Pixels on the rays that are brighter than t = 160 are extracted Using a Model to Extract the Lines Another possibility is to predict the geometry of the field lines in the images, assuming that the distance calibration and position of the robot are known. This approach is similar to the system dynamics approach in that the internal model is projected onto the image plane. However, no direct coupling between measurements and the motion model exists. It is important to be aware that due to the mirror, even straight lines become curved in the images (except the lines that appear radially). But there are also some originally curved lines as in the case of the center circle and quarter circles at the corners. To approximate this curvature, the field lines could be divided into a set of small, straight line segments. To predict the positions of the lines in the image, the endpoints of each line segment can be transformed from world to image space. This only requires image inspection in the neighborhood of the predicted line segments. The lines can be detected if the difference between predicted and existing lines is small enough. To inspect the image in the neighborhood of the predicted positions, one method is to search along lines perpendicular to the line segments. This has the same advantage as the radial scan, that the search direction is given. Therefore, one can search for green-whitegreen transitions, for instance. Figure 2.14 illustrates the overall idea. The advantage of this method is that only a small fraction of the image data must be accessed. This 38
43 (a) (b) Figure 2.14: Using a model for line detection. (a) The model consists of small line segments whose endpoints are marked by dots. The robot position is indicated by the black disk with the arrow showing the robot s orientation. (b) Knowing the robot s position and having a sufficiently precise distance calibration, the line segments can be projected onto the image and the lines in the neighborhood of these segments can be searched. spatial selectivity also reduces the number of false candidates. When the robot moves, odometric information can be used to predict the new positions of the lines. As in the system dynamics approach, a disadvantage is that the initial robot position and distance calibration must be known. Instead of a rigid model as described above, one could adopt the approach of a deformable template as described in [89] where a flexible model consisting of eleven parameters is used to detected the eyes in a face recognition system. The parameters control the position and orientation of the eye, the position and size of the iris and the bounding contour of the eye, which was approximated by two parabolic sections. The deformable template interacts with the image in a dynamic manner in which the parameters are updated by steepest descent. The philosophy behind using models is to incorporate prior knowledge about the domain of application to accomplish the perception. However, with the long term goal of building flexible, general purpose vision systems, it is obvious that the system itself must learn the models rather than have a fixed number of hard-coded models. A general vision system should be able to perceive even unknown objects. 39
44 2.9 Existing Methods for Robot Self-Localization Using the Field lines The localization problem, that is, determining the robot s pose (position and orientation) relative to its environment, is a key problem in mobile robotics. Its solution plays a fundamental role in various successful mobile robot systems (see e.g. [21, 34, 41, 54, 68, 75, 82] and various chapters in [10, 51]). Knowing the position of the robot is important for navigation and the exchange of information about the environment. In RoboCup, for instance, a robot can inform another robot about the position of the ball if the second robot is not able to see it. However, both robots have to know their respective position on the field. In general, one can distinguish between global and relative localization. In global localization, the problem is to find the pose of the robot without prior information. In relative localization, which is also known as position tracking, the position some time t ago is known and the task is to update the position due to the movement of the robot within this time interval. In many cases, odometric information of this movement is available and hence, only small errors have to be corrected Monte Carlo Localization For global localization, the robot s observation at a single point in time often is not sufficient to uniquely resolve the robot s position. Therefore, methods that propagate a possibility distribution of the robot s pose have become very popular. In particular the Monte Carlo Localization(MCL) [79], which approximates the possibility distribution by using a set of particles, has become the method of choice for global localization. We experimented with MCL in an initial version of our localization system. However, when the feature recognition approach was developed later, we discovered that MCL was no longer necessary. Nevertheless, we shortly describe how we used MCL together with the field lines. We used 200 particles to represent the probability distribution for the localization of the robot. Each particle has cartesian x, y-coordinates, a heading direction φ and represents a hypothesis of the robot s position on the playing field. The probability that the robot is within a certain area is proportional to the density of samples within the 40
45 area. Initially, all particles are distributed uniformly over the field (see figure 2.15). Figure 2.15: Initially the particles are distributed uniformly over the field. Two kinds of update operations are applied to the sample set, movement updates and sensor updates. In our implementation, movement updates were based on odometric information. When the robot moves within two points in time, odometric information yields a relative translation T and a relative rotation φ, which approximate the real movement of the robot. Performing a movement update, all particles are translated and rotated according to T and φ plus additional noise that is modeled by two onedimensional Gaussians. The first Gaussian G T, models the noise along the translational direction and G φ models the noise of the rotation. The variances of the Gaussians depend 41
46 on the length of translation and the amount of rotation. They have been modelled by σ 2 T := 0.005cm T + 4cm 2 (2.9) σ 2 φ := φ , (2.10) where T is measured in centimeters and φ in radians. Sensor updates are based on the Figure 2.16: Due to movement and sensor updates, the particles begin to cluster. When the robot is localized, a single cluster represents the position of the robot. line contours. Later, we describe how they are extracted from the images. At this point, it suffices to think of a set of approximately 400 points, each originating from a white field line. Note, that two types of points exist: On the one hand, we have the particles or samples, which represent hypothetical robot positions, on the other hand we have the points, which represent the contours of the field lines that are extracted from the images. Each sample is given a weight equal to the probability that the line contours have 42
47 been perceived from the samples pose. Here, one might think that the computational load is too high to compute the probability for all 200 samples at a real-time video frame rate. However, approximating the probability by a matching distance based on closest point relationships, a lookup table can be pre-computed, which allows an efficient evaluation. This method will be described in more detail later, where it is also used for relative localization (see page 107). Having weighted the particles, a new set of unweighted particles is generated by randomly sampling from the old set. Here, the probability of a sample being selected is proportional to its weight. By repeating movement and sensor updates, the particles begin to form clusters where the robot s pose is likely to be, and finally a single cluster remains (figure 2.16). Due to the symmetry of the playing field, two symmetric clusters should always emerge. However, in most cases the number of samples within the two clusters differ slightly due to the random sampling. Since we use relatively few particles (only 200), this introduces a bias for the bigger cluster. Thus, one of the clusters tends to attract more and more particles while the other vanishes. Due to this effect, only one cluster will always remain in practice. Localization tends to fail if no particle close enough to the robot s position exists. Therefore, a common approach is to take 10 percent of the particles (the ones with the lowest weights) and distribute them uniformly over the field in each step. We experimented with MCL, also performing a feature recognition in each step, which will be described later. We are able to recognize the center circle and specific corners at the penalty area. Detecting such features gives a direct hint for the robot s position, and we integrated these hints into the Monte Carlo Localization by directly placing particles at the corresponding positions, following the idea of [53]. Later, we discovered that the feature detection allowed us to implement a more efficient global localization method without the need for multiple hypotheses as in MCL. We will describe this method in chapter 4. Using MCL together with the field lines is not a new idea. In [48], a collection of points on field lines is extracted from the images and MCL is used for localization. However, no attempt has been made to recognize features like the center circle in the field lines. Recognizing such features improves localization, because it allows insertion of particles at the corresponding positions, an idea which is due to [53] and which has been elaborated 43
48 by [79] in form of the Mixture MCL. The idea of inserting particles is important, because MCL fails if no particle is present at the true position of the robot. In contrast, MCL together with the recognition of line features has been adopted in [39]. Here, the Hough transform [44] is used for line and circle detection. After having accumulated the evidences, the particles are weighted by predicting the position of the expected features in the Hough spaces and looking up the accumulated values in the parameter space. The approach is interesting because it avoids extracting maxima in the Hough spaces. That is, the features are not extracted explicitly. However, one disadvantage of the approach is that it requires a separate Hough space for each feature, and that a single detected point on a line gives rise to entries in all spaces. Thus, the method is computationally expensive. In contrast to the above approach, it will be shown later that the field lines can be detected and represented in a way that efficiently allows recognition of different features without using an expensive method like the Hough transform Global Localization by Matching Straight Lines Recently, a localization method based on the field lines but not using MCL has been proposed in [72]. Here, a method that was originally used for localization by laser range scans has been adopted. First, points on field lines are extracted from omni-directional images using the radial scan method (see page 37). The points are represented similarly to laser range scans. However, while a laser range scan has only one point at an angle, the points extracted from the images are organized in different layers. The first layer represents the closest points, the second layer represents the points which follow thereafter, etc. Next, straight line segments are extracted in each layer using an efficient divide and conquer algorithm, and collinear line segments are grouped to form single straight lines. Then, the LineMatch algorithm [38] is used to match the lines to a model that also consists of straight lines. The LineMatch algorithm essentially tests all combinations recursively, but stops a recursion when two orthogonal lines describe the mapping uniquely. In this approach, only straight lines have been considered. The center circle has been modelled as a rectangle and aslant lines that could emerge from the circle are excluded by detecting that the angle does not correspond to any of the two main components of line directions. However, if only a few lines are visible, the two main components cannot be determined reliably and the method will fail in such cases. 44
49 2.9.3 Relative Localization Once the position of the robot is known, efficient methods can be used to keep track of its position. The most efficient approach is probably the system dynamics approach that was already described at the beginning of this chapter. In chapter 4, we will sketch how this approach can be implemented for the RoboCup scenario. However, although the approach is efficient, we did not include it in our final system. The reason is that we are performing a feature recognition at 10 frames per second. This feature recognition is of utmost importance for the system for initial pose estimation and recovery from catastrophic localization errors. In RoboCup, such errors can occur, since collisions can happen and robots are sometimes manually placed in other positions. The feature recognition approach that will be proposed later requires the field contours as input, and we will develop a new method in the next section, which is able to efficiently extract the field lines from the images. This method represents the field lines as chains of points, which is the input format of the later feature recognition approach. However, even without the recognition of features like the center circle it is possible to use the extracted field lines to correct small errors of the robot s position. For this purpose we implemented a method that is similar to the approach in [55], which was designed for the use of laser range scanners. Here, the task was to recover the relative translation and rotation between two laser range scans, taken at consecutive time points. The relative movement is recovered by minimizing a distance function, which is defined through point-to-point correspondences between the two scans. In chapter 4, we will propose a new method that is more efficient than the approach of [55]. The increase in efficiency is made possible by exploiting the fact that the playing field of RoboCup is static (the field without the moving objects). It is then possible to pre-compute a force field which is used for registering the field lines to the model. The method will be described in detail later Methods for Feature Detection In RoboCup, the detection of features or sub-structures such as the center circle are of great value for robust robot self-localization. Once the robot is localized, it is easier to keep track of its position by more efficient methods without trying to recognize features, but for initial localization, features are of immense value. Even when running Monte 45
50 Carlo Localization (MCL) [79], features speed up convergence by inserting particles at plausible locations that are indicated by the features [53]. When the distance function, which relates distances in the world to distances in the image, is calibrated badly, uniform methods fail while feature recognition is still possible. Thus, features are an entry point for automatic calibration. Despite of these practical issues, trying to detect features is of intuitive beauty, because it means understanding the data, not just applying a uniform method to a uniform set of data points. In RoboCup, several approaches to detect features in the field lines have been made. In [72], straight lines are detected, however, no other features, in particular no curved features, are extracted. In [39], straight lines and circles are recognized using the Hough transform [44], but no other features like the quater circles, corners or rectangles are detected. While robust and conceptually nice, the Hough transform is inefficient and for each type of feature, a separate two-dimensional parameter space has to be initialized, evidence has to be accumulated, and the main peeks have to be found. The last step has been optimized in [39] by combining the method with MCL; however, it is still computationally expensive. Aside from RoboCup, feature detection has been addressed in numerous papers. The simplest approach to feature detection is to directly derive parameters of the feature from the data. For instance, given three non-collinear points, one can easily derive the parameters of a circle containing the points. However, one has to know that the points belong to a circle. If one point is wrong, a false circle will be constructed. The next class of methods are least-squares fitting methods. Instead of deriving the parameters by the minimum number of required points, they use more points and minimize the overall squared error. Fitting methods exist for lines, circles and ellipses [31, 45], or more general for conic sections. However, as with the first method, it has to be known a priori which points emerge from the feature. Outliers drastically influence the result. Therefore, attempts for robust detection methods have been made. The most robust methods are probably those based on the Hough transform [44]. Numerous variants for the detection of lines, circles, ellipses and general polygonal shapes exist. Each input item (i.e a point) votes for all possible parameters of the desired feature and the votes are accumulated on a grid. The parameters with the most votes finally determine the feature. This last step is difficult. It requires a decision about the number of votes necessary to consider a feature to be present. Also, if the input data deviates slightly 46
51 from the perfect feature, accumulation spreads over neighboring cells, and peek detection becomes complex. However, the advantage of Hough transform based methods is that the membership of input data to the feature does not need to be known a priori. The counterpart to Hough transform methods are methods that probe the parameter space. Instead of beginning with the input data and deriving the parameters in a bottom up fashion, the parameter space is searched in a top down way [64, 13]. Other approaches rely on the initial presence or generation of hypotheses. RANSAC [30] and clustering algorithms such as fuzzy shell clustering (FCS) [23, 24] fall into this category. The advantage of initial hypotheses is that, if there is strong a priori evidence for a hypothesis, outliers can be detected easily by rejecting input data that are too distant. On the other hand, a hypothesis that produces too many outliers can be considered erroneous. A completely different approach is UpWrite [3, 60], which iteratively builds up small line fragments from points, and higher features like circles and ellipses from the line fragments. In [61], the method has been compared with the Hough transform and comparable robustness has been reported. An approach similar to the UpWrite has been reported in [49] where ellipses emerge from arcs, the arcs from line fragments and the line fragments from points. Later, we will propose a framework for the detection of many different geometric features within the field lines in the RoboCup domain. We describe how to detect straight lines, corners, T-junctions and the center circle of the playing field. Here, we follow the principle of constructing higher geometric features, such as circles and rectangles from smaller components such as lines, arcs and corners. Typically, different types of higher features are composed of the same kind of lower features. For instance, a rectangle is composed of straight lines and corners, and a T-junction is also composed of two straight lines. This hierarchical organization in which higher-order features share common components, makes the overall recognition process more efficient than approaches that try to detect the individual complex features separately. 47
52 Chapter 3 A new Algorithm: Tracking Regions Consider a system for a mobile robot, perceiving, reasoning and acting in real-time. Such a system might be divided into many parts, for instance vision, behavior, motor control, reasoning and communication. However, all parts share a common problem when performing calculation over time: Should all calculations be made from scratch or should the results of recent calculations be adapted to the current situation? For instance, when the robot has planned a path and a moment later an object has moved, blocking the planned path, should the path planner calculate a completely new path or should it try to adjust the old path? A system able to adapt calculations instead of calculating everything from scratch takes advantage of the fact that the solutions to the current problem will smoothly vary in time. A frequent problem is, that robots create unsteady, oscillating data structures and plans. Moreover, the overall solution will typically be more efficient, since it uses the results of recent computations. In the field of computational geometry, a new area of research has emerged due to this consideration. It is called kinetic data structures and it was introduced by Leonidas J. Guibas. For a review see [36]. Here, the question is how structures like the Voronoidiagram, for instance, can be computed in a scenario where the underlying situation is continuously changing. The goal is to adapt an existing solution to the new scenario and not to compute the solution from scratch. In real-time computer vision, the same problem exists. It would be wasteful to process the images independent of one another and to run image processing techniques each time from scratch. Essentially, use of techniques like the system dynamics approach, which was described 48
53 at the beginning of chapter 2, exploits the continuity of the world underlying the images. However, one weakness of the method is that it needs a model of the environment. It would not be a weakness if methods were available that could automatically create the models themselves, but this would require methods that can start without a user-defined model. In RoboCup, if a model and the initial estimate of the robot s pose are available, the location and appearance of the field lines in the image can be predicted. This allows to access only a fraction of the image data. On the other hand, as described in the previous section, there exist methods that can detect the field lines without the initial knowledge of the robot s pose, but at the cost of accessing more image data. It would be nice to have a method that combined the advantages of both, accessing only a small fraction of the data without the need of a user defined-model and without having to know the robot position and distance calibration. In this section, we will develop such an algorithm, the complete C++ source code of which is enclosed in appendix B. In order to develop the method, an important characteristic of natural images has to be exploited, the tendency of regions to contain similar colors or textures. Natural images do not look like figure 3.1a) but instead contain regions of similar color (figure 3.1b)). Particularly in the RoboCup domain, the images contain large green regions caused by (a) (b) Figure 3.1: (a) This image only contains noisy points and does not exhibit regions of similar color. (b) As a contrast, a natural image usually contains homogeneous regions. the green carpet and we will use these regions to detect the field lines. 49
54 The idea is to determine the boundary shape of the regions and to apply a local detector for the field lines around the boundaries (see figure 3.2). Thus, we use the boundaries as an index structure for feature searching in the image. This procedure is similar to that of using a model in that we limit the search space, but the method will be more flexible, since the geometry of the boundaries solely emerges from the content of the image. Compared with a model, additional computation is necessary to determine the regions and their boundaries. However, we will see that homogenous regions have a nice property that can be exploited to develop an efficient method for that purpose: When working with a video stream, successive images are similar in most cases and when a region changes position from one frame to the next, these regions overlap in many cases. This observation can be exploited in a way that the results from processing a frame are used to speed up the treatment of the next frame. (a) (b) (c) Figure 3.2: (a) The omni-directional image of a robot in the middle of the soccer playing field. (b) The regions between the lines as obtained by the tracking algorithm. (c) One region as an example: A detector is applied along lines perpendicular to its boundary. The algorithm is based on the region growing paradigm [14][91][74][2][42][65]. However, in the following section, we extend the paradigm to track regions over time. 3.1 Extending the Region Growing Paradigm In this section, we extend the region growing paradigm to track regions over time. Therefore, we first give a short review of a specific region growing algorithm that will be extended 50
55 later Region Growing by Pixel Aggregation In Region Growing by Pixel Aggregation one starts with a small region (i.e. a single pixel) and consecutively adjoins pixels of the region s neighborhood as long as some homogeneity criterion is fulfilled. This method can be implemented in the following way: We reserve a two-dimensional array V that has the same size as the image and that has a boolean entry for each pixel that indicates whether the respective pixel has been visited yet. Further, we maintain a queue Q that stores the spreading boundary pixels (their coordinates) of the region during the execution of the algorithm. We refer to the stored elements in Q as drops following the idea that we pour out a glass of water at the seed pixel and that the drops of this water spread over the region. Q stores the boundary drops at each time during the execution of the algorithm. Initially, if we start with a single pixel, Q stores a single drop corresponding to the pixel that is also marked as visited in V. The algorithm continues with a loop that is performed as long as any drops are stored in Q. In each pass of the loop, one drop is extracted from the queue and the neighboring pixels are investigated. In the case of a four-connected neighborhood, we inspect the top, right, bottom and left pixels. After insuring that the pixels have not yet been visited, we determine whether they hold for a specific homogeneity criterion, color similarity for instance. For each pixel that conforms to this condition, a new drop is instantiated and stored in Q. After the loop terminates, Q is empty and the region is marked in V. Figure 3.3 shows an example of a growing process that finds the region of the yellow goal The Key Observation Our goal is to efficiently track regions over time. To give a specific example, we want to track the yellow goal while the robot moves. Note, that later we will not track the yellow goal, but the green regions. However, the algorithm can more easily be explained with the example of the yellow goal. Assume that the robot starts at pose P A and takes an image I A. Then the robot moves a little to pose P B and takes another image I B. Assume further that we have determined the region A of the yellow goal in I A and the corresponding region B in I B by the above region growing algorithm. If the video frequency is high enough with respect to the movement of the robot, the two regions will overlap as depicted in 51
56 Figure 3.3: The array V is visualized by the dark gray surface (visited elements). The white circles represent the drops. Here, the algorithm does not work on single pixels, but on blocks of 4 4 pixels. figure 3.4. If we apply the region growing algorithm separately to image A and B, the A B B A A B (a) (b) (c) Figure 3.4: The regions of the yellow goal are determined (a) before and (b) after the robot has moved. (c) The two regions overlap. running time for each is linear with respect to the number of pixels (or blocks of pixels) within the respective regions. 52
57 However, we want to develop a more efficient method. Assuming that we have extracted region A, then roughly speaking, we want to use the drops of A s region growing to somehow flow to the region of B by making use of the overlap. The drops of region A should first shrink to the intersection S of A and B and then grow to the boundary of B. Thus, in order to find region B, we don t start with a seed pixel, but with a whole seed region, the intersection of A and B. Figure 3.5 sketches this idea. To realize the algo- S A A B B (a) (b) (c) Figure 3.5: (a) Seed pixel and initial growing to the boundary of A. (b) Shrinking to the intersection of A and B. (c) Growing to the boundary of B. rithm, a method to shrink region A to the intersection of A and B has to be developed. This will be done in the following subsection Shrinking Regions We will develop the shrinking method in two steps. First, we consider the case of shrinking a region without stopping. That is, the region shrinks until it vanishes. Next, we modify the method so that shrinking stops at the intersection of A and B. In the first step, we don t need any image information, just the array V in which region A is marked, and a queue of drops at the boundary of that region. As we will need two different queues for growing and shrinking later, we denote the queue used here as Q to avoid confusion. We apply the same operations as in region growing with one exception: We reverse the meaning of V. As a result, the drops can only spread within the region marked in V and since they are initially placed at the boundary, their only means of escape is to move from the outer to the inner part of the region. Instead of marking 53
58 Figure 3.6: The region shrinks until it vanishes. No image information is required for this process. elements in V as in region growing, entries in V are cleared while the drops spread. At the end, Q is empty and V is cleared completely. This process is illustrated in figure 3.6. In the second step, we use image I B to determine when shrinking should stop. We emphasize that the initial region (marked in V ) comes from image I A while image I B is referenced during the method. Each time a drop has been extracted from Q, we verify the homogeneity criterion for the corresponding pixel in image I B. If the pixel belongs to the region, the drop is no longer allowed to spread. As a result, the region shrinks to the intersection of region A and B as depicted in figure 3.7. This is exactly inverse to the growing, where drops only spread to neighbors that fulfill the homogeneity criterion Alternating Shrinking and Growing We want to alternate shrinking and growing in order to track regions. However, some problems concerning the interface between the two processes must be solved. The first problem is that after growing, the queue of drops is empty, but shrinking initially requires 54
59 Figure 3.7: The region shrinks to the intersection area, that is, until all drops are within the region of the new image. a list of drops at the boundary of the region. In the same way, shrinking ends with an empty list of drops but growing requires seed drops. To solve this problem, each of the two processes, growing and shrinking, has to build the initial queue for the other procedure. We accomplish this by using two queues, Q and Q. Growing assumes the initial drops to be in Q and after execution Q is empty and Q has been built up, which stores the initial drops for shrinking. Shrinking runs with these drops and initializes Q for the next growing. When the neighbors of an extracted drop are inspected during growing, the drop is inserted into Q as an initial drop for shrinking if any of its neighbors do not belong to the region. Vice versa, a drop that is extracted from Q during shrinking is inserted to Q as an initial drop for growing if shrinking stops for this drop. A complete listing of the source code in C++ is available in Appendix B. Here, we give the pseudo code for the growing and 55
60 shrinking procedure: Procedure 1 Grow() Input: array V marks whether a pixel has been visited yet queue Q (FIFO) maintains the drops of the growing process image Ia Output: queue Q (FIFO) the initial drops for shrinking WHILE (NOT Q.IsEmpty()) BEGIN d:=q.extractdrop(); atborder:=false; //a temporary boolean variable FOR (all neighbors b of d) BEGIN IF (b fulfills the homogeneity criterion) BEGIN IF (V is not marked at the position of b) BEGIN Mark V at the position of b; Q.Append(b); ELSE atborder:=true END END IF(atBorder) Q.Append(d); END Procedure 2 Shrink() Input: array V marks whether a pixel has been visited yet queue Q (FIFO) maintains the drops of the shrinking process image Ib the next image Output: queue Q(FIFO) the initial drops for growing WHILE (NOT Q.IsEmpty()) BEGIN d:=q.extractdrop(); IF (the pixel in Ib corresponding to d does not fulfill the homogeneity criterion) BEGIN FOR (all neighbors b of d) BEGIN IF (V is marked at the position of b) BEGIN Clear V at the position of b; Q.Append(b); END END ELSE Q.Append(d); END 56
61 (a) (b) (c) (d) (e) (f) (g) (h) Figure 3.8: The drops of Q and Q are painted white and gray respectively throughout the following steps: (a) Image I A after growing. (b) The new image I B (c) After clearing V at the drops of Q. (d) During shrinking. (e) After shrinking. (f) Start of growing. (g) During growing. (h) After growing. In order to start the shrinking process after the growing process, one last supplementation has to be done. Between growing and shrinking, elements in V which correspond to the positions of the initial drops for shrinking in Q have to be cleared. This is because during shrinking the drops should always be on unmarked fields of V. Figure 3.8 illustrates the combination of growing and shrinking and shows how a region is tracked over two consecutive images. The drops of both queues, Q (white) and Q (gray), are depicted. By consecutively alternating shrinking and growing, a region can be tracked through a sequence of frames. 57
62 3.1.5 Applicability Consider an image with only one region being tracked. Assume that outside the region there is an object that moves very quickly towards and into the region. As a result, the object will be within the region in the next image. Thus, the object is an island within the region. This island will not be detected by the algorithm, since due to the fast movement, the object never penetrates the boundary of the region but jumps over it from frame to frame. As a consequence, the drops will not detect the object in the shrinking phase, since the drops are never next to the object. Therefore, the algorithm should not be used to track small and fast regions. Here small and fast means that the regions, which emerge from the object, do not overlap in two consecutive frames. Interestingly, a high frame rate supports the applicability of the algorithm, since regions will more likely overlap Running Time The efficiency of the algorithm is caused by two facts: First, a region can be tracked without touching pixels not concerned with the region. This is possible because the drops act as indices to the image and the method is able to update the indices without accessing image information outside the regions. Second, the computational load of tracking the region is only a fraction of the load required to process the entire region, because only the non-overlapping parts are processed. In the following, we investigate the running time more precisely. When tracking the region over the two images I A and I B, the running time of the algorithm is primarily determined by the number of drops that are inserted and extracted from the lists, and the number of memory accesses to the image and array V. However, the number of memory accesses to the image and array V is up to a constant equal to the number of drop insertions. Three processes contribute to the running time: Clearing Here the elements in V that correspond to the drops in Q after growing are cleared. Thus, the running time is O(k A ), where k A is the number of drops at the boundary of region A. Shrinking The main loop within the shrinking step (procedure 2) executes as long as drops are extracted from queue Q. Initially, there are k A drops in the queue. During execution, drops are also inserted into Q and since Q is empty at the end, they 58
63 also contribute to the number of main loop executions. The number of drops is equal to the size of the shrinking area, which is the size of region A without the part overlapping with region B. We denote this number with n A\B. Furthermore, the list Q is built by inserting k A B drops, the number of drops at the boundary of the intersection area of region A and B. Therefore, the running time of shrinking is O(k A + n A\B + k A B ). Growing Analogously, the running time of growing is O(k A B + n B\A + k B ). Thus, the total running time is O(k A ) + O(k A + n A\B + k A B ) + O(k A B + n B\A + n B\A + k B ) = O(k A + k B + k A B + n A\B + n B\A ) (3.1) Hence, the running time depends on two types of terms: In the first type, (k A, k B, k A B ) the contributions depend on the length of the region boundaries. In the second type, (n A\B, n B\A ) the contributions are proportional to the non-overlapping parts of the regions A and B (see figure 3.4c), which is a fraction of the size of the entire regions (in the worst case equal to the size of the regions) Controlling the Tracking When regions of two successive images do not overlap, there is no intersection area and the shrinking step will delete the region. Hence, if the movement between two consecutive images is too large, a control mechanism has to be integrated into the loop of shrinking and growing. The control mechanism checks if tracking has lost the region. In this case, an initial search has to be started. Depending on the application, this procedure might search within the whole or a part of the image for a pixel or a block of pixels having a certain color or texture that satisfies the homogeneity criterion and maybe some other detection criterion. Figure 3.9 sketches how the controlling mechanism is integrated into the algorithm. Also, regions can be discarded from tracking. In this way, the computational power can be concentrated on regions of interest. To exclude a region from being tracked, one simply has to delete its entries in V and the corresponding queues. This can be accomplished by shrinking without a stopping criterion as illustrated in figure
64 Boundary extraction Growing no Region lost? Insert new seed yes Shrinking Figure 3.9: The controlling mechanism detects when a region gets lost. In this case, an initial search is started, re-initializing the tracker with new seeds. Shape information is retrieved by boundary extraction after the growing phase Homogeneity Criterion The homogeneity criterion specifies whether two neighboring pixels or blocks of pixels belong to the same region. There are many possible ways of defining such a criterion. It can be based on lightness, color or texture. It can be formulated either relatively by specifying, for instance, that two pixels are homogeneous if the difference in intensity is below a certain threshold, or absolutely, by classifying each pixel individually and comparing whether their classes are equal. An example is separation into color classes. In our RoboCup application, for instance, we classify the membership of each pixel (block of 4 4 pixels) in the color classes green, white, yellow, blue, orange and black and we define two blocks of pixels as homogeneous if they belong to the same color class Tracking Several Regions Extending the algorithm to track several regions, we assume that the homogeneity criterion is based on classification. We first consider the case that all tracked regions are of the same class. The green surfaces between the marking lines of the soccer playing field depicted in figure 3.2 represent such a case. The algorithm can be applied without change. However, the initialization differs, since seed drops have to be placed into the list Q for each region. Thus, Q and Q store the drops of several disjunct regions. Next, we want to track several regions of different classes. One possible approach would be to use only the one pair of lists Q and Q for all drops and to append the class information to each drop. But following this approach, difficulties arise when we want to extract the boundaries of the regions later, since the drops of different classes cannot be 60
65 accessed separately. Therefore, we decided to use a separate pair of queues for each class and perform growing and shrinking for each pair individually. In contrast to the separate lists, all classes use the same array V whose elements now store the class information instead of boolean values. A special value is reserved to mark unvisited fields. 61
66 3.2 Boundary Extraction The boundary of each region consists of a chain of small edge vectors (see figure 3.10(a)). Each edge vector represents one of the four sides of a pixel (or block of pixels) and the last edge vector ends at the beginning of the first. (a) (b) Figure 3.10: (a) The boundary of a region is composed of a sequence of edge vectors. (b) The directions of the edge vectors depend on the direction of the spreading when they are detected during the growing phase. The extraction of the boundary incorporates two steps. First, we determine the edge vectors and represent them in a specific data structure, the connectivity grid. Second, we extract the boundary by following the sequence of edge vectors. The first step is integrated in the growing phase. Remember that during growing, the drops spread until they reach the boundary of the region. Consider a drop that is next to the region boundary. When the drop spreads, its neighboring positions are investigated and since the drop is next to the boundary, at least one of its neighbors will not fulfill the homogeneity criterion. Each time such an event occurs, we represent the corresponding edge vector in the connectivity grid. The edge vectors can be considered as small line segments that separate the drops within the region from their neighboring positions outside the region. However, the line segment is directed, allowing us later to follow the boundary contour. Figure 3.10(b) illustrates the relationship between the edge 62
67 vector s direction and the direction from the drop within the region to the position outside the region. The connectivity grid corresponds to the image grid, except that the cells of the grid do not correspond to the pixels, but to their corner points. Therefore, the width and height of the grid exceeds those of the image by one, respectively. We reserve four bits for each grid point, expressing in which of the four possible directions an edge vector points. In Figure 3.11a) a small region is shown. Figure 3.11b) illustrates the corresponding connectivity grid of its boundary contour. The four bits of each cell are depicted as quarters of a circle to demonstrate their relation to the direction of the edge vectors. A black quarter means that an edge vector is present. Figure 3.11c) shows the superposition of the image and the connectivity grid with the edge vectors also inserted. Figure 3.11d) demonstrates that two bits can be set at the same time when neighboring regions touch at a corner. This approach is similar to [14], but our method occupies less memory, since [14] employs a supergrid of size (2n + 1) (2m + 1) and we use a grid of size (n + 1) (m + 1), where n m is the size of the image. (a) (b) (c) (d) Figure 3.11: Representing regions and their boundaries: (a) Example image with a gray region; (b) The connectivity grid, whose points have four bits to indicate the direction of outgoing edges; (c) The grid superimposed to the image. The flags indicate the boundary contour. (d) Several flags within a grid point will be set, if different regions touch each other. Boundary extraction can be accomplished by following the sequence of edge vectors. But since some grid points might offer several continuations, it is important to apply a rule in order to guarantee that each boundary is closed. One should always decide for the clockwise or counterclockwise possibility first. Depending on the direction of 63
68 rotation by which the edge vectors have been inserted into the grid, these rules yield the smallest/largest possible region boundary. Each time an edge vector is followed, we delete the edge vector from the grid. Hence, after all boundaries have been extracted, the grid is completely cleared. As mentioned above, the insertion of edge vectors into the connectivity grid takes place during the growing phase. When a drop is detected to be at the boundary of a region, the corresponding edge vector is inserted. An initial grid point must be present, to follow the contour. Therefore, after growing, we use the first drop in Q to obtain this starting point. Note, that the position of the drop itself is not the starting position, but one of the corresponding corner points represented in the grid. More than one region and corresponding boundary contours might exist. Therefore, after having extracted the first contour, we scan Q for the next starting point. While passing through the drops, we look at the connectivity grid whether an edge vector is present next to the actual drop. Four possible grid points have to be verified for each drop. The first edge vector found specifies the starting point of the next boundary contour. To avoid accessing drops in Q several times while searching for the start positions of successive boundary contours, we use a pointer that points to the first drop in Q that has to be inspected when we search for the starting point of the next contour. At the beginning, the pointer points to the first drop in Q. With each boundary extracted, we move the pointer ahead. In this way, when searching for the next starting point, we can skip the elements in front of the pointer. Unfortunately, moving the pointer ahead can not be carried out while extracting the contour, because there is no link between the grid points and the position of the corresponding drops in the queue Q. Note, that even if only one region is tracked, the drops in Q are not necessarily ordered along its contour. This becomes apparent in the case depicted in figure 3.12a)-c). Here, a region is shown that has an island, which means that within the region there is an area that does not satisfy the homogeneity criterion with respect to the region. The region has an outer and inner boundary contour. A seed drop is placed next to the outer boundary contour and growing is started. In each of the figures 3.12a)-c), the queues Q an Q are shown at the top and bottom of the figure. Figure 3.12b) shows the state of the algorithm after the first loop of the growing procedure and figure 3.12c) shows the state after the second loop. Note that in 3.12c), the boundary drops in Q are not stored in the order of the course of the contour. Indeed, the drops 64
69 even change between inner and outer boundaries. Q outer boundary inner boundary isle region 0 seed drop Q' (a) (b) (c) Figure 3.12: This figure illustrates that the order of the drops in Q might not correspond to the course of the boundary contour. Even if only one region is tracked, there might be more than one boundary contour if the region has an island as depicted here. The queues Q and Q are shown (a) after seeding, (b) after the first and (c) after the second loop of the growing procedure. Thus, the pointer cannot be updated while extracting the boundary. Instead, we pass through Q after boundary extraction and search for the first drop that is next to an existing edge vector. Remember that the edge vectors are cleared during extraction. Therefore, the found edge must be a starting point of a new contour and we move the pointer to the corresponding drop. 65
70 3.3 Extracting the Field Lines By Tracking Regions Many vision systems rely on the detection of discontinuities in the image. We use the term discontinuity instead of edges to emphasize that edges are not the only features of interest. Lowe [73] uses SIFT (Scale Invariant Feature Transform) features for stereo matching, ego-motion estimation and map construction. Harris s 3D vision system DROID [40] uses image corner features for 3D reconstruction. In [76] edges are extracted for motion segmentation. However, in all these applications, the features are detected by accessing all pixels in each frame. Since the features represent inhomogeneities in the image, homogeneous regions will surround them. In many cases, these regions will be large enough to make the application of our region tracking algorithm meaningful. Thus, touching the whole image can be avoided by applying the respective detector at the boundary positions of the regions. Moreover, more specialized detectors can be used. For instance, when searching for edges, a direction selective detector can be applied, because the normal direction of the boundary contour of the region can be determined. This can be done during the extraction of the contour. While following the edge vectors, the local tangent and thus, also the normal direction, can be calculated. However, since the boundary contour is a sequence of small edge vectors that have only four different perpendicular directions, several edge vectors in the neighborhood of the actual point should be used for tangent calculation. The simplest method is to use two constantly spaced points on the contour, which are iteratively moved one edge vector ahead. Then, the line connecting the two points can be considered as the tangent of the actual point (see figure 3.13 for illustration). Another tangent Figure 3.13: The tangent at a point on the contour can be calculated by connecting two points in the neighborhood of the actual point. To obtain the tangents of successive points, the two endpoints are shifted stepwise along the contour. 66
71 method to calculate the tangent directions is to smooth the boundary contour before computing the vectorial difference between successive points. In our RoboCup application, we use this way of proceeding to detect the white field markings. We track the green regions of the playing field and apply a detector along line segments that are perpendicular to the contour. The detector searches for color transitions from green to white and back to green. In this way, we obtain sequences of points that represent the field lines as illustrated in figure 3.14b). (a) (b) Figure 3.14: (a) The boundaries of the tracked regions are smoothed and a local line detector is applied. (b) The resulting line contours consist of a set of lines where each line is represented by a sequence of points, which are marked by small circles. Using this method, the field lines are represented twice, once from each of the two neighboring regions. At first glance, this seems to be a disadvantage, but we will see later that feature detection benefits from this fact. Although the method works well, one consideration needs to be addressed. Two regions adjoin at each field line. The fact that they are only separated by relatively thin lines seemingly produces a problem. One might think that a similar effect as described on page 58, where an object intrudes into a region without being detected, can occur with the lines. In fact, if the robot moves fast, a line can intrude into the region without meeting a drop. This scenario is illustrated in figure Indeed, the line would get lost from tracking, if not for the drops at the intersection 67
Figure 3.15: On the left, a region has been grown and the resulting drops are shown. On the right, the robot has moved; the new image and the old region represent the initial state before shrinking. None of the drops will shrink, except the ones at the intersection of the line and the region, since all the others lie on the playing field.

These intersection points are labelled in figure 3.15b. Starting at these locations, the drops spread along the intruded line during the following shrinking phase, as depicted in figure 3.16.

Figure 3.16: During the shrinking phase the drops uncover the line.

In our RoboCup application, intersections of the field lines with the region boundaries will always occur, since all field lines are connected with each other and their extent is greater than any region. Thus, a line cannot intrude into a region without cutting its boundary.
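To make the perpendicular line scan of figure 3.14 concrete, the following sketch walks outward along the contour normal and reports a field-line point when it finds a green-to-white-to-green transition. The function name, the color-test interface and the scan length are illustrative assumptions, not the original implementation.

```python
def scan_for_line_point(image, start, normal, max_len=20):
    """Search along the contour normal for a green-white-green transition.

    image:  object assumed to expose is_green(x, y) and is_white(x, y),
            i.e. a per-pixel color classification.
    start:  (x, y) contour point on the boundary of the green region.
    normal: unit normal pointing out of the green region.
    Returns the midpoint of the white run as a field-line point, or None.
    """
    x0, y0 = start
    nx, ny = normal
    white_run = []
    for step in range(1, max_len + 1):
        x = int(round(x0 + step * nx))
        y = int(round(y0 + step * ny))
        if image.is_white(x, y):
            white_run.append((x, y))
        elif image.is_green(x, y):
            if white_run:                  # green -> white -> green found
                mx = sum(p[0] for p in white_run) / len(white_run)
                my = sum(p[1] for p in white_run) / len(white_run)
                return (mx, my)
            # still on green: keep scanning outward
        else:
            break                          # other color class: not a field line
    return None
```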
3.4 Results

The advantage of the proposed region tracking algorithm is that only the non-overlapping parts of corresponding regions are processed in successive images. However, if the regions do not overlap, the algorithm has to launch initial searches in each frame and degenerates. The question is therefore: how often do the regions overlap? The answer, of course, depends on the application. In the following we present results from our RoboCup application. Here, we tracked three different types of regions: the green playing field and the blue and yellow regions (goal and post markings). While the robot moved through the environment, we computed the fraction of the image area that was processed and the percentage of overlap of the tracked regions. We determined these values for each pair of successive frames and averaged them over a longer time span (see table 3.1). We also counted the number of initial searches and repeated the experiment for different movements (rotation/translation) and speeds of the robot.

Rotation      processed fraction of the image   overlapping   frames   initial searches
 56°/s                    8.0%                      80%
                          8.4%                      77%
                          8.8%                      71%
                         10.0%                      65%
                         11.0%                      60%

Translation   processed fraction of the image   overlapping   frames   initial searches
 0 m/s                    6.3%                      93%
                          5.9%                      89%
                          6.8%                      84%
                          7.0%                      82%
                          7.2%                      80%
                          6.9%                      81%
                          7.9%                      75%
                          7.8%                      73%
                          7.7%                      79%          24            2

Table 3.1: The average fraction of the processed area with respect to the size of the image and the percentage of the overlapping area with respect to the area of the tracked regions, determined for different movements and speeds of the robot. The last two columns give the number of frames and the number of initial searches that had to be started.
The tracked regions overlap significantly and only a small fraction of the image data is accessed (below 10%). As expected, rotation is more problematic than translation. This is because objects that are far away from the robot, such as the goals, move fast in the image when the robot rotates. Therefore the object regions might not overlap, and initial searches need to be executed. With a rotational speed of 200° per second, 25 initial searches had to be done within 82 frames. Even assuming that an initial search requires accessing all the pixels in the image, the algorithm still performs well compared with common methods, which access all the pixels in every frame.

Although the theoretical running time of the proposed tracking algorithm is evidently lower than that of any algorithm that touches all the pixels in each image, the question is whether this also holds for the practical running time, since the maintenance of the drops, queues and connectivity grid might be more time consuming than a straightforward implementation. Therefore, we decided to compare the algorithm with the color segmentation algorithm proposed in [15], which is believed to be the fastest color tracking algorithm at the moment and which is used by most teams in RoboCup. That algorithm classifies each pixel into one or more of up to 32 color classes using logical AND operations and employs a tree-based union-find with path compression to group runs of pixels having the same color class. We calibrated both algorithms to track green, blue and yellow regions. We did not track the orange ball and black obstacles, because our system uses methods other than color class tracking to recognize them. The following table gives the absolute running times of the two algorithms over 1000 frames with the robot moving. The images are in YUV 4:2:2 format and we used a Pentium III 800 MHz processor.

    CMU algorithm    Our algorithm
           s                s

Thus, the proposed algorithm is significantly faster. Moreover, the proposed algorithm also extracts the contour curves of all regions in each frame. However, it is not claimed that the algorithm performs better in all possible cases. There might be applications where CMU's algorithm performs better, typically cases where many small and fast moving regions are to be tracked.
Chapter 4

A new Localization Method Using Shape Information

In this chapter, we develop a new method for robot self-localization in the RoboCup environment using the field lines.

4.1 Three Layers for Robot Self-Localization

Localization is achieved in three layers, which are summarized in table 4.1. At the bottom layer we have dead-reckoning. Here, odometry supplies information about the wheel rotations, yielding an estimate of the robot's movement. The sensor delay is approximately 8 ms, which is quite small. For small distances, odometric information is very precise: when the robot moves only 1 centimeter, this movement can typically be resolved. However, errors in odometric information accumulate, in particular when the robot rotates. Even if the rotation is measured with a very small error, this error has a large influence. For instance, when the robot moves straight ahead after a rotation, the initial error in the robot's heading direction will result in a completely wrong position estimate. Thus, odometric information is fast and precise for small distances, but accumulates an unbounded positional error in the long term.

The second layer, relative correction, compensates for this problem. It assumes that an initial estimate of the robot's pose is available and is able to correct small errors using visual information. As visual information we use the field lines as extracted by the previously described region tracking method.
Layer 1   Odometric information by measuring the wheel rotations
Layer 2   Relative correction by matching the field lines to a model
Layer 3   Feature recognition by detecting features such as the center circle

Table 4.1: The three layers of localization.

When the initial pose estimate is slightly wrong, the extracted field lines do not fit precisely to a model of the lines. The second layer calculates how to correct the pose in order to achieve the best fit. The delay of the visual information is typically between 60 and 100 milliseconds. Thus, when the world changes, the visual input will not reflect the change before 60 milliseconds have passed, which is quite long compared to the odometric delay. So while the first layer is fast but accumulates errors, the second layer is less reactive but can correct positional errors.

The first two layers are sufficient to localize the robot correctly over a long time period, provided that an initial correct pose is available. However, when the robot is manually placed at another position (the kidnapped robot problem), when wheel slippage is too large (e.g. because of a collision with the ball or another robot), or at the very beginning, when the system is started, this initial position is not available. Therefore, when the initial estimate is wrong, localization fails if it relies solely on the first two layers. Thus, a third layer exists. The third layer, recognition, performs feature detection on the extracted line contours. The method is able to detect features which we refer to as high-level features, because recognizing such a feature immediately yields the unique position of the robot in a global coordinate system. Here, unique is not completely true, because the line structures on the playing field are symmetric and each high-level feature yields two symmetric positions on the respective sides of the field. However, the correct position can be selected by considering recent valid positions of the robot or, if these are not available, by considering the goal colors. It is this last layer which makes the proposed localization method robust.

When all the visual tasks, including region tracking, line extraction, and the three levels of localization, are performed in each frame, 24 frames per second can be processed. However, we discovered that it is not necessary to process all tasks at the same frame rate. We use the following setup: the green regions are tracked at 30 frames per second, because their boundaries are important for ball detection and the ball should be tracked at a high frame rate. This occupies only 2-8 percent of the processing power (Pentium III 800 MHz).
The field lines are extracted only every third frame, resulting in a rate of 10 frames per second. Layers 2 and 3 of the localization also run at this frame rate. Layer 1 uses only odometric information, which is triggered 50 times per second, independently of the visual processing.

The remainder of this chapter is organized as follows: before we start to describe the individual layers, we first describe the general framework for fusing their results. This fusion is done by a Kalman filtering approach. It requires the definition of a system state, a dynamic model and a measurement model of the robot. In this context, we also describe specific coordinate transformations that will be used by all three layers. The data fusion is not trivial, since the sensor measurements of the individual layers are subject to different time delays. We explain how these delays are handled using an explicit modeling of time. We begin with layer 1, the process of dead-reckoning. Before describing layers 2 and 3, we describe the visual input for these layers: the line contours as extracted by the region tracking algorithm. The line contours are distorted due to the catadioptric mapping, and we describe how this distortion is corrected and how the contours are transformed into world coordinates. Then we describe layer 2, relative correction, and layer 3, the recognition process.

4.2 The Robot's System State

We specify the robot's position in a Cartesian coordinate system whose origin is located at the center of the playing field (see figure 4.1). The Cartesian x, y-position together with the orientation φ is referred to as the robot's pose. The orientation φ ∈ [0, 2π[ is the angle between the x-axis and the heading direction of the robot. Later, when we use a dynamic model of the robot to fuse the results of the three localization layers, three further variables specifying the translational (v_x, v_y) and rotational (ω) velocities become important. The robot's pose together with these velocities is referred to as the robot's system state x:

x = (x, y, φ, v_x, v_y, ω)^T    (4.1)

However, when describing the three layers of localization, we only consider the first three components of x, which represent the robot's pose.
We will also refer to this vector as the reduced system state x̄:

x̄ = (x, y, φ)^T    (4.2)

Figure 4.1: The robot's position is represented in a global coordinate system whose origin is at the center of the playing field. The Cartesian x, y-position together with the orientation φ is referred to as the robot's pose. The orientation φ ∈ [0, 2π[ is the angle between the x-axis and the heading direction of the robot.

4.3 Coordinate Systems and Transformations

The vector x̄ is the robot's pose in the global coordinate system. A corresponding local coordinate system, whose relationship to the robot's geometry is illustrated in figure 4.2, can be uniquely derived from this pose. The local coordinate system of the robot corresponding to x̄ = (x, y, φ)^T is given by the triple (P, u, v), where P = (x, y)^T is the Cartesian position in the global system, and u, v, the local x- and y-axes, are given by:

u = (sin φ, −cos φ)^T    (4.3)

v = (cos φ, sin φ)^T    (4.4)
Figure 4.2: The x-axis of the local coordinate system of the robot is directed towards vector u, the y-axis towards vector v, which is the heading direction of the robot. P specifies the Cartesian position of the robot in the global system.

There are four types of entities which can be represented in either coordinate system, the local or the global: points, vectors, poses and movements. Both points and vectors are specified as elements p ∈ IR². Poses and movements are three-dimensional vectors whose third component is an angle. A pose p specifies a position together with an orientation, given as a polar angle, and it uniquely defines a local coordinate system whose y-axis is directed towards the specified orientation:

p = (x, y, φ)^T    (4.5)

A movement is the difference between two poses, typically between two poses of the same robot, and we write a movement m ∈ IR³ in the form

m = (Δx, Δy, Δφ)^T.    (4.6)
However, movements will also occur later in layer 3, where they describe how a feature has to be translated and rotated in order to match a corresponding model feature. When the movement is considered as a change of the robot's pose, Δφ is the change of the robot's heading direction. We allow Δφ to have any value in IR, that is, we do not restrict Δφ to [−π, π[.

If an entity is given in the local system, we denote it with the superscript r (for robot). If an entity is specified in the global system, no superscript is used. Next, we describe how to transform points, vectors, poses and movements from one system to the other. Using homogeneous coordinates, the transformations for points and vectors could be expressed by matrix multiplications. However, poses and movements, which have an angle in their last component, cannot easily be expressed by them. Moreover, the matrices would be sparse, and it is more efficient to implement the transformations directly, without matrix multiplications. For these reasons, we have chosen not to use homogeneous coordinates. Instead, we use a set of transformation functions. In order to specify them correctly, we first define P := IR × IR × [0, 2π[, the set of robot poses. With x̄ being the robot's pose, table 4.2 defines the transformations from the local into the global system.

point:     P × IR² → IR²,   point(x̄, p^r) = p
vector:    P × IR² → IR²,   vector(x̄, t^r) = t
pose:      P × IR³ → IR³,   pose(x̄, p^r) = p
movement:  P × IR³ → IR³,   movement(x̄, m^r) = m

Table 4.2: Transforming the point p^r, the vector t^r, the pose p^r and the movement m^r from the local system at x̄ into the global coordinate system.

point⁻¹:     P × IR² → IR²,   point⁻¹(x̄, p) = p^r
vector⁻¹:    P × IR² → IR²,   vector⁻¹(x̄, t) = t^r
pose⁻¹:      P × IR³ → IR³,   pose⁻¹(x̄, p) = p^r
movement⁻¹:  P × IR³ → IR³,   movement⁻¹(x̄, m) = m^r

Table 4.3: Transforming the point p, the vector t, the pose p and the movement m from the global system into the local system at x̄.
With (P, u, v) being the local system corresponding to the pose x̄ and p^r = (p^r_x, p^r_y)^T, the mappings of table 4.2 are given as follows:

point(x̄, p^r) := P + p^r_x u + p^r_y v    (4.7)

vector(x̄, t^r) := t^r_x u + t^r_y v    (4.8)

pose(x̄, p^r) := x̄ + (p^r_x u_x + p^r_y v_x,  p^r_x u_y + p^r_y v_y,  φ^r)^T    (4.9)

movement(x̄, m^r) := (m^r_x u_x + m^r_y v_x,  m^r_x u_y + m^r_y v_y,  Δφ^r)^T    (4.10)

Correspondingly, we define transformations for mapping entities from the global into the local system (table 4.3). The mappings point, vector, pose and movement are bijective, and point⁻¹, vector⁻¹, pose⁻¹ and movement⁻¹ are the respective inverse functions. They are defined as¹:

point⁻¹(x̄, p) := ((p − P)^T u, (p − P)^T v)^T    (4.11)

vector⁻¹(x̄, t) := (t^T u, t^T v)^T    (4.12)

pose⁻¹(x̄, p) := ((p_x − x, p_y − y)^T u, (p_x − x, p_y − y)^T v, φ_p − φ)^T    (4.13)

movement⁻¹(x̄, m) := ((m_x, m_y)^T u, (m_x, m_y)^T v, Δφ)^T    (4.14)

where φ_p denotes the angular component of the pose p.

¹ Note that (a_x, a_y)^T (b_x, b_y)^T = a_x b_x + a_y b_y.
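As a sanity check of these transformations, the following sketch implements the point and pose mappings and their inverses for a pose x̄ = (x, y, φ), using the axis convention of equations (4.3) and (4.4) above. It is an illustrative re-implementation, not code from the thesis.

```python
import math

def frame_axes(phi):
    """Local axes for heading angle phi: u = local x-axis, v = heading (eqs. 4.3/4.4)."""
    u = (math.sin(phi), -math.cos(phi))
    v = (math.cos(phi),  math.sin(phi))
    return u, v

def point_to_global(pose, p_r):
    """Eq. (4.7): transform a local point into the global system."""
    x, y, phi = pose
    u, v = frame_axes(phi)
    return (x + p_r[0]*u[0] + p_r[1]*v[0],
            y + p_r[0]*u[1] + p_r[1]*v[1])

def point_to_local(pose, p):
    """Eq. (4.11): transform a global point into the local system."""
    x, y, phi = pose
    u, v = frame_axes(phi)
    dx, dy = p[0] - x, p[1] - y
    return (dx*u[0] + dy*u[1], dx*v[0] + dy*v[1])

def pose_to_global(pose, p_r):
    """Eq. (4.9): transform a local pose into the global system."""
    gx, gy = point_to_global(pose, (p_r[0], p_r[1]))
    return (gx, gy, (pose[2] + p_r[2]) % (2.0 * math.pi))

# Round trip: transforming a point to local coordinates and back recovers it.
robot = (1.0, 2.0, 0.5)
p = (3.0, -1.0)
back = point_to_global(robot, point_to_local(robot, p))
assert all(abs(a - b) < 1e-9 for a, b in zip(back, p))
```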
4.4 Relationship between Wheel Rotations and the Robot's Movement

Figure 4.3 depicts the local coordinate system of the robot and the geometric relationships that are used to derive equations relating the velocities of the individual wheels to the translational and rotational velocity of the robot.

Figure 4.3: In (a) the robot is viewed from above and the symbols that are used in the text are illustrated. To avoid confusion, not all symbols are drawn in (a); the geometry is repeated in (b) with additional symbols.

First of all, we consider the contact points P_0i, i = 1, 2, 3, of the wheels, specified in the local system of the robot. In order to calculate these points easily, we define

R(ψ) := (cos ψ, −sin ψ; sin ψ, cos ψ),    (4.15)
to be the rotation matrix which rotates a vector by the angle ψ (rows separated by semicolons). Then the contact points are

P_0i = L R(γ_i) (1, 0)^T    (4.16)

with

θ := (π − α)/2    (4.17)

and the definitions

γ_1 := θ    (4.18)
γ_2 := θ + α    (4.19)
γ_3 := θ + α + β.    (4.20)

Note that the contact points are specified in the local coordinate system of the robot. Thus, P_0i also expresses the direction of the respective wheel axis. Consequently, the normalized direction vectors of the wheels are

D_i = R(π/2 + γ_i) (1, 0)^T.    (4.21)

Within the global coordinate system, the velocity vectors of the contact points are

v_i = (ẋ, ẏ)^T + Ṙ(φ − π/2) P_0i    (4.22)

where (ẋ, ẏ)^T is the translational velocity and φ is the heading direction of the robot. Note that ẋ, ẏ and φ are part of the system state. The individual (scalar) wheel velocities along the active directions D_i are then

v_i = v_i^T R(φ − π/2) D_i.    (4.23)

Substituting equation (4.22) into equation (4.23) results in

v_i = ((ẋ, ẏ)^T + Ṙ(φ − π/2) P_0i)^T R(φ − π/2) D_i
    = (ẋ, ẏ)^T · (R(φ − π/2) D_i) + P_0i^T Ṙ(φ − π/2)^T R(φ − π/2) D_i    (4.24)

where we denote the second term by v_rot.
The second term

v_rot = P_0i^T Ṙ(φ − π/2)^T R(φ − π/2) D_i    (4.25)

can be simplified. Defining τ := φ − π/2 and M := Ṙ(τ)^T R(τ), the matrix M evaluates to

M = Ṙ(τ)^T R(τ)
  = φ̇ (−sin τ, −cos τ; cos τ, −sin τ)^T (cos τ, −sin τ; sin τ, cos τ)
  = φ̇ (−sin τ, cos τ; −cos τ, −sin τ) (cos τ, −sin τ; sin τ, cos τ)
  = φ̇ (0, sin²τ + cos²τ; −(cos²τ + sin²τ), 0)
  = φ̇ (0, 1; −1, 0).    (4.26)

Substituting this result back into equation (4.25) we have:

v_rot = φ̇ P_0i^T (0, 1; −1, 0) D_i
      = φ̇ L (R(γ_i)(1, 0)^T)^T (0, 1; −1, 0) R(π/2 + γ_i)(1, 0)^T
      = φ̇ L (cos γ_i, sin γ_i) (0, 1; −1, 0) (−sin γ_i, cos γ_i)^T
      = φ̇ L (cos γ_i, sin γ_i) (cos γ_i, sin γ_i)^T
      = φ̇ L    (4.27)
Thus, equation (4.23) evolves to

v_i = (ẋ, ẏ)^T · (R(φ − π/2) D_i) + L φ̇
    = (ẋ, ẏ)^T · (R(φ − π/2) R(π/2 + γ_i) (1, 0)^T) + L φ̇
    = (ẋ, ẏ)^T · (R(φ + γ_i) (1, 0)^T) + L φ̇
    = (ẋ, ẏ)^T · (cos(φ + γ_i), sin(φ + γ_i))^T + L φ̇
    = cos(φ + γ_i) ẋ + sin(φ + γ_i) ẏ + L φ̇    (4.28)

The drive velocities are thus non-linear functions of the translational and the angular velocity of the robot:

v_i = cos(φ + γ_i) ẋ + sin(φ + γ_i) ẏ + L φ̇    (4.29)

Summarizing v_1, v_2 and v_3 in a vector, this relationship can be written in matrix form:

( v_1 )   ( cos(φ + γ_1)   sin(φ + γ_1)   L ) ( ẋ )
( v_2 ) = ( cos(φ + γ_2)   sin(φ + γ_2)   L ) ( ẏ )    (4.30)
( v_3 )   ( cos(φ + γ_3)   sin(φ + γ_3)   L ) ( φ̇ )

Substituting γ_1, γ_2 and γ_3 yields

( v_1 )   ( cos(φ + θ)           sin(φ + θ)           L ) ( ẋ )
( v_2 ) = ( cos(φ + θ + α)       sin(φ + θ + α)       L ) ( ẏ ).    (4.31)
( v_3 )   ( cos(φ + θ + α + β)   sin(φ + θ + α + β)   L ) ( φ̇ )
Since θ = (π − α)/2 we have

( v_1 )   ( cos(φ + (π−α)/2)           sin(φ + (π−α)/2)           L ) ( ẋ )
( v_2 ) = ( cos(φ + (π−α)/2 + α)       sin(φ + (π−α)/2 + α)       L ) ( ẏ )    (4.32)
( v_3 )   ( cos(φ + (π−α)/2 + α + β)   sin(φ + (π−α)/2 + α + β)   L ) ( φ̇ )

          ( cos(φ + (π−α)/2)       sin(φ + (π−α)/2)       L ) ( ẋ )
        = ( cos(φ + (π+α)/2)       sin(φ + (π+α)/2)       L ) ( ẏ )    (4.33)
          ( cos(φ + (π+α)/2 + β)   sin(φ + (π+α)/2 + β)   L ) ( φ̇ )

where we denote the 3×3 matrix in (4.33) by W(φ). Thus, we obtain the relationship

(v_1, v_2, v_3)^T = W(φ) (ẋ, ẏ, φ̇)^T    (4.34)

and, vice versa,

(ẋ, ẏ, φ̇)^T = W⁻¹(φ) (v_1, v_2, v_3)^T.    (4.35)

Note that v_1, v_2 and v_3 are scalar values that specify the velocities of the respective contact points in the direction of the corresponding wheel, given in [cm/s]. Dividing these velocities by the effective wheel radius r_e, we obtain the angular velocities ω_1, ω_2 and ω_3 of the respective axes:

ω_i = v_i / r_e    (4.36)

Solving this equation for v_i and substituting v_1, v_2 and v_3 in equation (4.34), we obtain

(ω_1, ω_2, ω_3)^T = (1/r_e) W(φ) (ẋ, ẏ, φ̇)^T.    (4.37)

When considering odometric information later, this relationship will play an important role (see section 4.8).
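The coupling matrix W(φ) and its inverse are easy to exercise in code. The following sketch builds W(φ) from the wheel angles γ_i and converts between robot velocities and wheel speeds; the numeric values for L, α and β are placeholders, not the robot's actual geometry.

```python
import numpy as np

def coupling_matrix(phi, gammas, L):
    """W(phi) from eq. (4.30): maps (x_dot, y_dot, phi_dot) to wheel speeds."""
    return np.array([[np.cos(phi + g), np.sin(phi + g), L] for g in gammas])

# Placeholder geometry: wheel offset L and wheel placement angles alpha, beta.
L = 0.18                                   # [m], assumed
alpha, beta = 2*np.pi/3, 2*np.pi/3         # assumed symmetric three-wheel layout
theta = (np.pi - alpha) / 2
gammas = [theta, theta + alpha, theta + alpha + beta]

phi = 0.3                                  # current heading [rad]
robot_vel = np.array([0.5, 0.2, 0.1])      # (x_dot, y_dot, phi_dot)

W = coupling_matrix(phi, gammas, L)
wheel_speeds = W @ robot_vel               # eq. (4.34)
recovered = np.linalg.solve(W, wheel_speeds)   # eq. (4.35): apply W^{-1}
assert np.allclose(recovered, robot_vel)
```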
4.5 The Dynamic Model

The process model of a system describes how the system states change as a function of time and is usually written as a first-order non-linear vector differential equation, or state model, of the form

ẋ(t) = f(x(t), u(t), t) + q(t)    (4.38)

where x(t) ∈ IR^n is a vector of the states of interest at time t, u(t) ∈ IR^r is a known control input, f(·,·,·) is a model of the rate of change of the system state as a function of time, and q(t) is a random vector describing both dynamic driving noise and uncertainties in the state model itself. We represent the state x(t) of the robot at time t by its Cartesian x, y-position (in [m]), its orientation φ [rad], its translational velocities v_x, v_y [m/s] and its rotational velocity ω [rad/s]:

x(t) = (x(t), y(t), φ(t), v_x(t), v_y(t), ω(t))^T    (4.39)

The differential equations of the model are then

ẋ(t) = v_x(t) + n_x(t)
ẏ(t) = v_y(t) + n_y(t)
φ̇(t) = ω(t) + n_φ(t)    (4.40)
v̇_x(t) = a_x(t) + n_vx(t)
v̇_y(t) = a_y(t) + n_vy(t)
ω̇(t) = a_ω(t) + n_ω(t).

Here, n_x, n_y, n_φ, n_vx, n_vy and n_ω are assumed to be zero-mean, temporally uncorrelated Gaussian process noise terms with variances σ_x², σ_y², σ_φ², σ_vx², σ_vy² and σ_ω², respectively. The terms a_x(t), a_y(t) and a_ω(t) are the accelerations that result from the control input u(t); they will be derived in the following.
Low-level motor control is done by a dedicated electronic system embedded in the robot. Communication with the main computer, which performs the computer vision and behavior calculations, is established via a serial interface. The electronic subsystem receives a command for a desired translational and rotational velocity and automatically controls the power of the three motors in order to reach the desired values. The delay from sending the control input to the electronics until it has an effect on the motor speeds is approximately eight microseconds. We denote the control commands at time t by

u(t) = (u_vx(t), u_vy(t), u_ω(t))^T    (4.41)

where u_vx(t), u_vy(t) and u_ω(t) are normalized command values (v/v_max) without physical units, lying within the interval [−1.0, 1.0]. The values specify the desired translational and rotational velocities at time t; a value of 1.0 corresponds to the maximum reachable speed of the respective component. The values u_vx(t), u_vy(t) are specified from the perspective of the robot: the y-direction corresponds to the robot's heading and the x-direction to its right-hand direction. For instance, to let the robot move at full speed towards its heading direction, u(t) should be set to

u(t) = (0, 1, 0)^T.    (4.42)

To let the robot rotate slowly to the left at a fixed position, u(t) should be set to

u(t) = (0, 0, ε)^T    with a small ε > 0.    (4.43)

The control electronics that receives the commands implements a PID controller. Since the 90 W motors with their 12:1 gears produce a high torque compared to the relatively low weight of the robot (about 10 kg), we do not consider the precise dynamics of the controller, but assume the following simplified closed-loop model of its behavior.
The torque T_i(t) of the i-th (i = 1, 2, 3) direct-current (DC) motor that arises from the control input depends on the current speed of the motor shaft. A reasonably accurate model capturing this relationship is

T_i(t) = ᾱ U(t) − β̄ ω_i(t)    (4.44)

where U(t) [V] is the voltage applied to the motor and ω_i(t) [rad/s] is the angular velocity of the shaft of motor i. The constants ᾱ [Nm/V] and β̄ [Nm/(rad/s)] characterize the motor. The salient feature of this model is that the amount of torque available for acceleration is a function of the speed of the motor. Neglecting wheel slippage, the force f_i(t) generated by the respective motor-driven wheel is simply

f_i(t) = α U(t) − β v_i(t)    (4.45)

where v_i(t) [m/s] is the ground velocity of the wheel mounted on the shaft of motor i. The constants α [N/V] and β [kg/s] can readily be determined from ᾱ, β̄ and the geometry of the robot.² The translational acceleration of wheel i is

a_i(t) = f_i(t)/m    (4.46)

where m is the mass that has to be accelerated. Now we use the matrix W(φ) defined in equation (4.33) to project the accelerations of the wheels onto the resulting acceleration of the robot:

(a_x(t), a_y(t), a_ω(t))^T := W(φ)⁻¹ (a_1(t), a_2(t), a_3(t))^T    (4.47)

4.6 Using a Kalman Filter to Fuse the Three Layers

We use a Kalman filtering approach to fuse the resulting positions of the different layers. For an introduction to the filter see, for instance, [58][11]. We begin with a very general and simplified description of the filter and successively remove the simplifications.

² Note that 1 N = 1 kg·m/s².
A discrete Kalman filter estimates the state x ∈ IR^n of a discrete-time controlled process that is governed by the linear stochastic difference equation

x_{k+1} = A x_k + B u_k + q_k.    (4.48)

The state is observed by a measurement vector z ∈ IR^m that is linearly coupled to the system state:

z_k = C_k x_k + r_k    (4.49)

The n×n matrix A relates the state at time step k to the state at step k+1. The n×l matrix B relates the control input u ∈ IR^l to the state x. The m×n matrix C in the measurement equation relates the state to the measurement z_k. The process noise is represented by the random variable q_k, which is normally distributed with zero mean and covariance matrix Q. Accordingly, r_k models the measurement noise with covariance matrix R.

In our system we represent the state x of the robot by its Cartesian x, y-position, its orientation φ, its translational velocities v_x, v_y and its rotational velocity ω = φ̇:

x = (x, y, φ, v_x, v_y, ω)^T    (4.50)

The measurement z consists of the measurements from all three layers of localization:

z := (z_odo, z_relvis, z_rec)^T    (4.51)

Here z_odo, z_relvis and z_rec are the respective measurement vectors of the three layers. The notation used in equation (4.51) should indicate that the compound measurement vector z consists of the components of the three vectors, rather than of the vectors themselves. The measurement vector z_odo of layer 1 consists of the angles Δα_1, Δα_2 and Δα_3 specifying the wheel rotations between two successive time steps. These angles are directly measured by the tick counters of the motors, which provide 768 impulses per wheel revolution:

z_odo = (Δα_1, Δα_2, Δα_3)^T    (4.52)
The measurement vectors of layers 2 and 3 are defined so as to reflect the direct measurement of the robot's pose:

z_relvis = (x_2, y_2, φ_2)^T    (4.53)

z_rec = (x_3, y_3, φ_3)^T    (4.54)

Although layer 2 supplies only a relative movement, the measurement of an absolute pose can be simulated by adding this relative information to the most recent estimated pose. Layer 3 directly measures the robot's pose by recognizing high-level features that allow the unique inference of the robot's pose.

The measurement vector of each layer has to be related to the system state x. Typically, not all components of the system state are required to define the measurement model of an individual layer. For instance, the odometric measurement vector depends only on the last three components of the system state, the translational and rotational velocities of the robot. In contrast, the measurements of layers 2 and 3 depend only on the first three components of the system state, namely the robot's pose. For layer 1, the relationship between system state and measurement vector is not linear. A detailed description will follow in section 4.8, where we begin to describe the layers separately. There, the Jacobian matrix C_odo will be derived, linearizing the measurement model around the current estimate of the robot's pose:

z_odo = C_odo (ẋ, ẏ, φ̇)^T + r_odo    (4.55)

Here r_odo is the odometric measurement noise with covariance matrix R_odo. For layers 2 and 3 the measurement model is linear and can be expressed by the following equations:

z_relvis = C_relvis (x, y, φ)^T + r_relvis    (4.56)

z_rec = C_rec (x, y, φ)^T + r_rec    (4.57)
Here, both C_relvis and C_rec are the identity, since these layers are designed to measure the robot's pose directly. R_relvis and R_rec are the corresponding measurement covariance matrices. Summarizing the measurement equations of all three layers, we have

(Δα_1, Δα_2, Δα_3, x_2, y_2, φ_2, x_3, y_3, φ_3)_k^T = C_k (x, y, φ, v_x, v_y, ω)_k^T + r_k    (4.58)

with the compound measurement matrix

C_k = ( 0          C_odo )
      ( C_relvis   0     )
      ( C_rec      0     )

where each block has size 3×3 and 0 denotes the zero matrix. Assuming that the measurements of the individual layers are independent of one another, the compound measurement covariance matrix R_k is block-diagonal:

R_k = ( R_odo   0          0     )
      ( 0       R_relvis   0     )    (4.59)
      ( 0       0          R_rec )

Note that the assumption of independent measurements is not completely valid, because the measurements of layer 2 and layer 3 are based on the same field contours. The measurement covariance matrices R_odo, R_relvis and R_rec are not static, but depend on the system state and other influences. Odometry, for instance, has larger variances when the robot is moving fast or accelerating. When using features in the second layer to obtain a relative visual correction of the robot's pose, the variances depend, for example, on the distance and on the orientation of the features with respect to the robot. However, the dynamic adaptation of the measurement covariance matrices is not dealt with in this thesis.

The Kalman filter iteratively processes two steps, the measurement update and the time update. The measurement update combines the last estimate of the system state with the current measurement and produces a new estimate of the state. The time update uses the dynamic model to predict the state for the next time step.
The calculations that have to be carried out are illustrated in figure 4.4. Here it is assumed that the measurements of the three layers have the same time delay; this simplification will be removed later.

Figure 4.4: The Kalman filter iteratively processes two steps, the measurement update and the time update, starting from initial estimates for x*_k and P*_k. Measurement update: (1) compute the Kalman gain K_k = P*_k C_k^T (C_k P*_k C_k^T + R_k)^{-1}; (2) update the estimate with the measurement z_k: x̂_k = x*_k + K_k (z_k − C_k x*_k); (3) update the error covariance: P_k = (I − K_k C_k) P*_k. Time update: (1) project the state ahead: x*_{k+1} = A_k x̂_k + B_k u_k; (2) project the error covariance ahead: P*_{k+1} = A_k P_k A_k^T + Q_k. The measurement update combines the last estimate of the system state with the current measurement and produces a new estimate of the state; the time update uses the dynamic model to predict the state for the next time step.

Figure 4.5: The flow of computation while executing the filter steps of figure 4.4 over several time steps: each measurement update fuses the prediction x*_k with the measurement z_k into the estimate x̂_k, and each time update propagates x̂_k to the next prediction x*_{k+1}.
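For concreteness, the following sketch implements the measurement update and the time update summarized in figure 4.4 with generic matrices; it is an illustrative re-implementation, not the code used on the robots. In the robot, C and R would be the compound matrices of equations (4.58) and (4.59), assembled from whichever layers delivered a measurement in the current cycle.

```python
import numpy as np

def measurement_update(x_pred, P_pred, z, C, R):
    """Fuse the prediction (x_pred, P_pred) with measurement z (figure 4.4, left)."""
    S = C @ P_pred @ C.T + R                      # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)           # Kalman gain
    x_est = x_pred + K @ (z - C @ x_pred)         # corrected state estimate
    P_est = (np.eye(len(x_pred)) - K @ C) @ P_pred
    return x_est, P_est

def time_update(x_est, P_est, A, B, u, Q):
    """Project the state and its covariance ahead (figure 4.4, right)."""
    x_pred = A @ x_est + B @ u
    P_pred = A @ P_est @ A.T + Q
    return x_pred, P_pred
```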
4.7 Fusing Delayed Measurements

In the above description of using a Kalman filter to fuse the measurements of the three layers, a simplification was made: it was assumed that the measurements have the same time delays. Of course, this is not true. Odometric information typically enters the system 8 ms after the corresponding physical event (wheel rotation). In contrast, images have a typical delay of approximately 60 ms. Consider what happens when this time shift is not taken into account. Assume that just the last incoming data is fused and that the robot is initially standing still on the playing field. Assume further that the robot is detecting the field lines and tries to match the lines onto an internal model in order to correct small positional errors (layer 2). Then, when the robot starts rotating, odometric information will reflect the rotation after 8 ms. At that time, however, the images still reflect the state with the robot not moving, because of the larger time delay. The images will not show a rotation until 60 ms after the robot started rotating. Hence, it is incorrect to fuse just the last incoming data; one has to account for the correspondence of the different signals in time. This consideration is illustrated in figure 4.6. For the sake of simplicity, the observed signal is considered to be continuous and only the fusion of the first two layers (odometry and relative visual correction) is considered.
Figure 4.6: A physical state (e.g. the orientation of the robot) is observed by two different sensor systems, odometry (layer 1) and relative visual correction (layer 2). A physical event (t_0, a) is observed with different time delays. The corresponding event in layer 1, (t_0 + Δt_global, a_1), has a delay of Δt_global. In layer 2 the corresponding event is (t_0 + Δt_global + Δt_{2-1}, a_2), that is, the delay is Δt_global + Δt_{2-1}. Here Δt_global is the delay of the most reactive sensor (layer 1) and Δt_{2-1} denotes the time shift between layers 1 and 2. Only two layers are depicted for the sake of simplicity, but the scheme extends to all three layers; the higher the layer, the greater the delay. The time shift between two layers, e.g. Δt_{2-1}, can be determined by the system by matching the signals to each other. However, it is not possible to determine Δt_global without another reference sensor. In principle, such a reference sensor is just another layer (layer 0) with a time delay that is so small that it can be neglected. For fusing the signals of the different layers correctly, it is sufficient to determine the relative time shifts between the layers. However, when incorporating the motion model, Δt_global also needs to be known. In our case Δt_global is the odometric delay, which is approximately 8 milliseconds.
4.7.1 Splitting the Kalman Cycle

When considering the most recent data entering the system through the different sensor layers, there exists odometric information for which no corresponding visual information is available yet. Likewise, there are situations where layer 3 has not yet detected a feature, but relative visual information (layer 2) is already available. To keep the description simple, we concentrate on the first two layers only; the scheme can then easily be adapted to all three layers.

We split the Kalman cycle into two phases, phase 0 and phase 1. (Considering all three layers, the process would be split into three phases.) Each phase consists of a time and a measurement update. While the dynamic model is the same for both phases, each phase has its own measurement model. During phase 0, corresponding visual and odometric information is fused. During phase 1, the most recent odometric information is used, for which no corresponding visual information is yet available (see figure 4.7). The measurement model for phase 0 is

z_phase0,k = (Δα_1, Δα_2, Δα_3, x_2, y_2, φ_2)^T_phase0,k = (0, C_odo; C_relvis, 0) (x, y, φ, v_x, v_y, ω)^T_k + r_phase0,k,    (4.60)

and the measurement model for phase 1 is

z_phase1,k = (Δα_1, Δα_2, Δα_3)^T_phase1,k = (0, C_odo) (x, y, φ, v_x, v_y, ω)^T_phase1,k + r_phase1,k,    (4.61)

where the blocks 0, C_odo, C_relvis are of size 3×3. Figure 4.8 shows the overall flow of computation.
Figure 4.7: When considering the most recent information gathered by the different sensor systems, there exists odometric data for which no corresponding visual information is yet available. This is due to the fact that odometric information has a shorter delay (8 ms) than visual information (60 ms). Therefore, when considering only the first two layers, the fusion process is divided into two phases. During phase 0, corresponding odometric and visual information is fused, while during phase 1 the most recent odometric data (the odometric tail), for which no correlating visual data exists yet, is used to update the estimate of the robot's system state. The two phases use different measurement equations and each phase uses its own filter matrix K_i, i = 0, 1.

4.7.2 Explicit Representation of Time

When dealing with measurement delays, one major problem has to be solved: the correspondence in time of the different signals observing the same system state has to be resolved. Assume for a moment that we have two discrete signals ..., a_{t−3}, a_{t−2}, a_{t−1}, a_t and ..., b_{t−3}, b_{t−2}, b_{t−1}, b_t, with a_t and b_t being the signal values at time step t. Assume further that the signals are sampled with the same frequency. Then, if we know the relative time shift Δt between the signals, with b being the signal with the larger measurement delay, b_t corresponds to a_{t−Δt}. Under these simplified conditions it would suffice to store the incoming data in two separate queues and to use shifted indices to address corresponding data.
Figure 4.8: The bottom part of this figure corresponds to figure 4.5, where no measurement delays are considered. Beginning with the last estimate x̂_k of the system state, a phase 0 time update propagates the motion model over the time interval in which both odometric and relative visual information is available (yielding x*_k); then a phase 0 measurement update is performed, which fuses the corresponding odometric and visual information (yielding x̂_k). A phase 1 time update is performed in parallel, propagating the motion model over the time interval in which only odometric information is available (yielding x̃*_k). Finally, a phase 1 measurement update is processed, which integrates the odometric tail and produces the final estimate x̃̂_k of the system state. This system state is used to control trajectory execution (after a further prediction step has been applied).

However, reality is more complicated. First of all, the different signals are typically not sampled with the same frequency. Layer 1, odometric information, provides data at a rate of 50 samples per second, while layers 2 and 3 yield only 10 measurements per second. Moreover, the sampling intervals within one signal typically vary. While this is not the case for odometric information, whose sampling intervals are very precise, the time gaps between two successive image frames are not always the same. This can be due to the operating system, the video acquisition drivers, or the varying computational load while processing the images. In the extreme case, the processing is so expensive that a frame has to be dropped. Thus, using a simple shift of indices to account for the measurement delays is not adequate. Instead, time has to be represented explicitly.

With recent processors (e.g. the Pentium III and later), a so-called performance counter is available: a 64-bit counter that is incremented with each clock cycle.
With such a counter, time can be measured with a precision below a microsecond (10⁻⁶ s). When the signals of the three different layers enter the system, each data item is provided with such a 64-bit time stamp, and the data is stored in a queue. Each layer has its own queue, which we refer to as a data channel in the following. The data channels of the different layers have a length of approximately 100 items; the odometric data channel has 500 items due to the higher sampling frequency. That is, each channel stores the last 500 or 100 incoming data items, together with their time stamps. Note that the time stamps reflect the time when the signal enters the system, not the time of the corresponding physical event.

Now assume that two successive images are grabbed with time stamps t_1 and t_2, and that we know the relative time shift Δt between odometric and visual information, which is 52 milliseconds in our system. We will describe later how this time shift can be determined automatically by the system. In order to find the odometric data corresponding to the time interval of the two images, we consider the shifted time stamps t_1 − Δt and t_2 − Δt. We use each shifted time stamp to find the index i of the data item in the odometric data channel whose time stamp matches most closely. This search can be implemented efficiently (in O(log n)) by binary search, since the data in each channel are ordered by their time stamps. In this way, we find the corresponding odometric data and fuse the data as described in the previous sections.

To find the relative time shift Δt between odometry and vision, one can simply observe a landmark, start the robot rotating and stop the rotation after a short time period. Then both odometry and vision will reflect the change of the angle, and the time shift can be recovered by determining how the data in the different channels have to be shifted in order to produce the highest correlation. In our system we use the ball lying still on the field as a landmark. It typically suffices to determine the time shift once; however, one could consider techniques that adapt to a changing time shift online.
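The data-channel bookkeeping described above can be sketched as follows: each channel is a bounded, time-stamped queue, and corresponding items are looked up by binary search over the time stamps. The class name, its parameters and the helper at the end are illustrative assumptions, not the original implementation.

```python
import bisect

class DataChannel:
    """Bounded queue of time-stamped sensor data, ordered by arrival time."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.stamps = []   # arrival time stamps, ascending
        self.values = []   # corresponding data items

    def push(self, timestamp, value):
        self.stamps.append(timestamp)
        self.values.append(value)
        if len(self.stamps) > self.capacity:   # drop the oldest item
            self.stamps.pop(0)
            self.values.pop(0)

    def closest(self, timestamp):
        """Item whose time stamp is closest to `timestamp` (binary search)."""
        if not self.stamps:
            return None
        i = bisect.bisect_left(self.stamps, timestamp)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(self.stamps)]
        best = min(candidates, key=lambda j: abs(self.stamps[j] - timestamp))
        return self.stamps[best], self.values[best]

# Illustrative channels: 500 odometry items (50 Hz), 100 vision items (10 Hz).
odometry = DataChannel(500)
vision = DataChannel(100)

# To find the odometry matching an image stamped t_img, shift by the measured
# odometry/vision time offset (52 ms in this system) and search the channel.
def matching_odometry(t_img, dt_shift=0.052):
    return odometry.closest(t_img - dt_shift)
```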
4.8 Layer 1: Odometric Information

The robot's motors are equipped with tick counters generating 64 impulses per revolution. Since each motor has a 12:1 gear, n_ticks = 768 impulses are generated per wheel revolution. The counters carry a sign specifying the direction of rotation. The ticks of each wheel are summed up by the control electronics and periodically (50 times per second) transmitted to the computer. Let λ_{k,i}, i = 1, 2, 3, denote these accumulated ticks of motor i at time t_k. By taking the difference of two successive accumulated tick values, the relative ticks between two time steps can be computed. That is, we have the relative tick vector

Δλ_k = (Δλ_{k,1}, Δλ_{k,2}, Δλ_{k,3})^T = (λ_{k,1} − λ_{k−1,1}, λ_{k,2} − λ_{k−1,2}, λ_{k,3} − λ_{k−1,3})^T.    (4.62)

From the relative tick vector Δλ_k, corresponding to the time interval Δt = t_k − t_{k−1}, we can compute the corresponding wheel rotations:

(Δα_{k,1}, Δα_{k,2}, Δα_{k,3})^T = (2π / n_ticks) (Δλ_{k,1}, Δλ_{k,2}, Δλ_{k,3})^T    (4.63)

This vector (Δα_{k,1}, Δα_{k,2}, Δα_{k,3})^T is the measurement vector of layer 1 at time step k. In section 4.4, equation (4.37) was derived, describing the relationship between the robot's translational and rotational velocities and the angular velocities (ω_1, ω_2, ω_3)^T of the wheels:

(ω_1(t), ω_2(t), ω_3(t))^T = (1/r_e) W(φ(t)) (ẋ(t), ẏ(t), φ̇(t))^T.    (4.64)

Integrating the angular wheel velocities over the time interval [t_{k−1}, t_k], we obtain the relationship

(Δα_{k,1}, Δα_{k,2}, Δα_{k,3})^T = g_odo(x(t_{k−1}), ..., x(t_k), r_e) := ∫_{t_{k−1}}^{t_k} (1/r_e) W(φ(t)) (ẋ(t), ẏ(t), φ̇(t))^T dt    (4.65)

Linearizing g_odo around the current estimate x* of the system state, we obtain the approximation

(Δα_1, Δα_2, Δα_3)^T ≈ (1/r_e) W(φ*) (ẋ, ẏ, φ̇)^T Δt.    (4.66)

Thus, the matrix C_odo, which was already used in equation (4.60), is given by

C_odo = (1/r_e) W(φ*) Δt.    (4.67)
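A minimal sketch of this layer, assuming the coupling-matrix helper from the earlier kinematics sketch: it converts accumulated tick counts into wheel-rotation increments and, via W(φ)⁻¹, into an estimated pose increment. The effective wheel radius below is an assumed placeholder.

```python
import numpy as np

N_TICKS = 768          # impulses per wheel revolution (64 ticks * 12:1 gear)
R_EFF = 0.04           # effective wheel radius [m], an assumed placeholder

def wheel_rotations(ticks_now, ticks_prev):
    """Eqs. (4.62)/(4.63): accumulated ticks -> wheel rotation increments [rad]."""
    dl = np.asarray(ticks_now) - np.asarray(ticks_prev)
    return 2.0 * np.pi / N_TICKS * dl

def odometry_increment(d_alpha, W):
    """Invert eq. (4.66): wheel rotations -> (dx, dy, dphi) over the interval.
    W is the coupling matrix W(phi*) built by coupling_matrix() in the earlier
    kinematics sketch, evaluated at the current heading estimate."""
    wheel_arc = R_EFF * d_alpha              # distance travelled by each wheel
    return np.linalg.solve(W, wheel_arc)     # (dx, dy, dphi)
```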
101 Linearizing g odo around the current estimate x of the system state, we obtain the approximation α 1 α 2 1 ẋ W(φ ) r e ẏ t. (4.66) α 3 φ Thus, the matrix C odo, which was already used in equation 4.60 is given by C odo = 1 r e W(φ ) t. (4.67) 4.9 The Observation Model In layer 1 we solely considered odometric information. The next two layers will use visual information. In order to describe these layers we have to know the relationship between the robot s system state and the location where things can be expected in the image. To establish this relationship is the purpose of the following section. After having described the relationship we will continue with the description of layer 2 and The Omni-Directional Vision System Standard cameras have a limited visual field of about degrees. To enhance the field of view, the idea of using a mirror in combination with a conventional imaging system to obtain omni-directional images has been proposed by Rees in a U.S. patent in 1970 [67]. The general approach of combining a mirror with a lens is referred to as catadioptric image formation. The term is composed of catoptrics, the science of reflecting, and dioptrics, the science of refracting surfaces. Different types of setups have been examined. Yagi and Kawato used a conic mirror [87]. Hong and others used a spherical mirror [43]. Yamazawa and others used a hyperboloidal mirror [88] and finally, Nayar and Baker [63] developed a setup using a parabola mirror and a telecentric lens. The latter two approaches generate images taken by a single center of projection, which is particularly important when aiming to transform the omni-directional images into perspective images. 97
Figure 4.9: The optical mapping. In figure (b) three points are marked by small black dots. Each of these points corresponds to a ray which is reflected by the mirror (figure (a)) and intersects the ground plane, yielding corresponding points on the floor. Thus, each distance from the mirror's center to a pixel in the image corresponds to a real-world distance from the robot. This relationship is described by the distance function (figure (c)).

4.9.2 The Distance Function

For our soccer robots we have designed a specific mirror. Here, the design principle was not a single center of projection, but a desired distance function, i.e. the function which relates distances in the image to distances in the world. If a convex, rotationally symmetric mirror is placed directly above the camera, then the center of the mirror is mapped to the center of the image.
A pixel that is at a distance d from the center of the image corresponds to a ray that has an angle α to the viewing direction of the camera. The farther a pixel is from the center, the greater the angle of the corresponding ray. For the analysis it is easier to reverse the direction of the light, thinking of a ray as coming out of the camera and tracing it back to its origin. Adopting this notion, the rays leave the camera and are reflected by the mirror. After reflection, rays corresponding to outer pixels have a lower inclination towards the ground plane, and the distance from the robot at which they intersect the ground plane increases (see figure 4.9a). The distance function describes this relationship: it maps a pixel distance to a world distance. The pixel distance is measured from the location of the pixel to the location of the mirror's center in the image (figure 4.9b). We designed our mirrors to yield a distance function which consists of a linear and an exponential portion, as illustrated in figure 4.9c.

4.9.3 Predicting the Location of Objects in the Image

Given the pose of the robot on the playing field, we can predict the location of objects in the image. Before we describe the projection of arbitrary 3D points, we first consider the special case that the point lies on the two-dimensional playing field. Then we will reduce the general 3D case to the two-dimensional mapping.

Transformation of Two-Dimensional Points on the Field

Consider a point on the field, for instance a point a belonging to the field lines, as illustrated in figure 4.10. In the following we describe how this point is mapped to the image. The mapping takes place in three stages, transforming the point into different two-dimensional Cartesian coordinate systems. The origin of the global world coordinate system is at the center of the playing field, as illustrated in figure 4.10. All existing objects are initially modelled in this system, including the field lines. The robot's local coordinate system is given by its reduced system state x̄.
Figure 4.10: The world coordinate systems. The global system is located at the center of the playing field. The local system is located below the center of the robot's camera and mirror; its y-axis is directed towards the robot's heading direction, its x-axis to the right.

The point a in the global system is then transformed into the local system by the following equation³:

a^r = point⁻¹(x̄, a)    (4.68)

³ The transformation point⁻¹ is defined in section 4.3.

The catadioptric mapping preserves angles between directions seen from the robot's position. Consider two different points in the local world coordinate system. These points form an angle α at the origin of the system, the robot's position. When projecting these points into the image, they form the same angle with respect to the center of the mirror in the image. The reason is that the whole projection of a light ray entering the CCD array of the camera takes place in a plane orthogonal to the ground. Essentially, the image directly corresponds to the local world coordinate system after a pointwise application of the distance function. Here, applying the distance function means to measure the length of the point-vector a^r, to transform that length by the distance function, and to create a new point-vector a* pointing in the same direction but scaled to the transformed length:
a* = (a^r / ‖a^r‖) · d(‖a^r‖)    (4.69)

Here d(·) denotes the distance function. However, the image appears rotated, because the camera is mounted on the robot such that the diagonal of the CCD array is directed towards the robot's heading direction. The reason is that the diagonal direction offers a greater range of sight, and we wanted the robot's heading direction to benefit from this advantage. Therefore, after applying the distance function to the point a^r, the point is represented in a coordinate system which appears rotated in the image. We refer to this system as the local image coordinate system; its origin C* is located at the center of the mirror in the image. In the image, the y-axis (the vector v*) of this system points in the direction that corresponds to the heading direction of the robot (see figure 4.11). Finally, the point is transformed out of the local image coordinate system into the global image coordinate system, whose origin is located at the top left of the image and whose coordinates specify the pixel position:

a* ↦ C* + a*_x u* + a*_y v*    (4.70)
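Under the conventions above, the full 2D prediction chain (global point → local point → distance scaling → rotated image frame → pixel) can be sketched as follows. The distance scaling is passed in as a callable, since the real curve comes from the mirror design; the function name and parameter names are illustrative.

```python
import math

def project_field_point(a_global, robot_pose, mirror_center, u_img, v_img,
                        pixel_dist_of):
    """Predict the pixel position of a point on the field (eqs. 4.68-4.70).

    a_global:      (x, y) point in the global field frame.
    robot_pose:    (x, y, phi), the reduced system state of the robot.
    mirror_center: C*, pixel position of the mirror center in the image.
    u_img, v_img:  u*, v*, unit vectors of the local image frame
                   (v* corresponds to the robot's heading direction).
    pixel_dist_of: callable mapping a world distance to a pixel distance
                   (the inverse of the mirror's distance function).
    """
    # (4.68): transform into the robot's local frame (axes as in eqs. 4.3/4.4).
    x0, y0, phi = robot_pose
    dx, dy = a_global[0] - x0, a_global[1] - y0
    ar = (dx * math.sin(phi) - dy * math.cos(phi),   # component along u
          dx * math.cos(phi) + dy * math.sin(phi))   # component along v
    length = math.hypot(*ar)
    if length == 0.0:
        return mirror_center
    # (4.69): keep the direction, rescale the length into pixel units.
    scale = pixel_dist_of(length) / length
    ax, ay = ar[0] * scale, ar[1] * scale
    # (4.70): express the scaled point in the rotated image frame and add C*.
    return (mirror_center[0] + ax * u_img[0] + ay * v_img[0],
            mirror_center[1] + ax * u_img[1] + ay * v_img[1])
```

In practice, pixel_dist_of and the image-frame vectors u*, v* would come from the mirror and camera calibration of the concrete robot.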
Figure 4.11: The image coordinate systems. The origin of the global system is located at the top left corner of the image. The local system is located at the center of the mirror in the image; its y-axis (vector v*) is directed towards the robot's heading direction.

Transformation of Arbitrary 3D Points

In order to transform arbitrary 3D points, we observe that a 3D point B is mapped to the same image pixel as the ground point b that lies on the same ray reflected into the camera. To calculate this corresponding point b on the floor, we just have to determine the ray and intersect it with the ground plane. To determine the ray exactly, however, we would have to calculate the point R where the ray is reflected on the mirror. Although the precise calculation is possible, involving the shape of the mirror and the location of the focal point of the camera, it can be simplified by choosing an approximate reference point R located at the center of the mirror, according to figure 4.12. For catadioptric systems with a single center of projection the calculation is exact if R is placed at the center of projection; for other setups it is just an approximation. However, especially for distant objects the error is small.

Figure 4.12: In order to project the 3D point B to the image, we can calculate a corresponding point b on the ground and apply the two-dimensional mapping.

4.10 Transforming the Contours into World Space

Independent of the method that is used to localize the robot, the lines extracted from the images are distorted by the mirror. We account for this distortion by transforming each point from image space to world space.
This step makes use of the distance function described in section 4.9.2. Figure 4.13 shows the distorted line contours as they are extracted by the region tracking algorithm.

Figure 4.13: The distorted line contours as extracted by the region tracking algorithm.

Given the robot's pose x̄, the lines can be transformed out of the corresponding local coordinate system. If the robot's pose is incorrect, the lines will not fit a model of the field lines, as illustrated in figure 4.14. In contrast, when the robot's pose is correct, the lines closely match the model (see figure 4.15). In order to verify whether the lines fit the model or not, we have to establish a matching distance. Before we define this function, we first describe how the field lines are represented in the model.
Figure 4.14: The distortion has been removed by applying the distance function, and the contours have been transformed out of the local coordinate system. In this example the estimate of the robot's pose is incorrect, and as a result the line contours do not fit the model.

Figure 4.15: When the estimate of the robot's pose is correct, the image lines fit a model of the field lines.
4.11 Modelling the Field Lines

In order to define a matching distance, we have to define a model of the lines. The field lines consist of curved and straight line segments (see figure 4.16).

Figure 4.16: The model of the field lines consists of 17 straight line segments and five circular arcs, one of which is the center circle.

Curved lines occur in the form of the center circle and the quarter circles at the corners. We represent straight line segments by the triple

(A, v, l),    (4.71)

where A is the two-dimensional start point of the line segment, v is the normalized direction of the line segment and l is its length. The circular elements are defined by

(M, r, φ_0, φ_1),    (4.72)
where M is the center point, r is the radius and φ_0, φ_1 define the enclosing angles of the respective circular arc. In this way, the center circle can be represented by setting φ_0 = 0, φ_1 = 2π, and the quarter circles can be defined by setting the difference of the angles to π/2. The whole model then consists of 17 straight line segments and five circular arcs, which are listed in table 4.4.

Nr.             Element
1, 2            straight segment (A, v, l) = ((−w, ±h), (1, 0), 2w)  (the two touch lines)
3, 4            straight segment ((±w, −h), (0, 1), 2h)  (the two goal lines)
5               straight segment ((0, −h), (0, 1), 2h)  (the center line)
6, 7            straight segment ((∓(w−p), −q), (0, 1), 2q)  (the front lines of the penalty areas)
8, 9            straight segment ((±(w−g), −s), (0, 1), 2s)  (the front lines of the goal areas)
10, 11, 16, 17  straight segments of length p at y = ±q, connecting the goal lines x = ±w with the penalty area front lines
12, 13, 14, 15  straight segments of length g at y = ±s, connecting the goal lines x = ±w with the goal area front lines
18              circular arc (M, r, φ_0, φ_1) = ((−w, h), a, 3π/2, 2π)  (corner arc)
19              circular arc ((w, h), a, π, 3π/2)  (corner arc)
20              circular arc ((w, −h), a, π/2, π)  (corner arc)
21              circular arc ((−w, −h), a, 0, π/2)  (corner arc)
22              circular arc ((0, 0), r, 0, 2π)  (the center circle)

Table 4.4: The elements of the line model. For the world championship in Lisbon 2004 the constants are: w = 600 cm, h = 300 cm, p = s = 150 cm, g = 50 cm, q = 250 cm, r = 100 cm and a = 40 cm.
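A direct way to encode this model is as two small lists of records, one for straight segments and one for arcs; the closest-point queries of the next section then iterate over them. The field constants are the Lisbon 2004 values of table 4.4, but the per-segment start points and directions are the author's assumption based on the field symmetry, not a verbatim copy of the original table.

```python
from dataclasses import dataclass
from math import pi

@dataclass
class Segment:            # straight line segment (A, v, l), eq. (4.71)
    ax: float; ay: float  # start point A
    vx: float; vy: float  # normalized direction v
    length: float         # l

@dataclass
class Arc:                # circular arc (M, r, phi0, phi1), eq. (4.72)
    mx: float; my: float  # center M
    radius: float
    phi0: float; phi1: float

# Field constants (Lisbon 2004, table 4.4), in cm.
w, h, p, s, g, q, r, a = 600, 300, 150, 150, 50, 250, 100, 40

SEGMENTS = [
    Segment(-w, -h, 1, 0, 2*w), Segment(-w,  h, 1, 0, 2*w),       # touch lines
    Segment(-w, -h, 0, 1, 2*h), Segment( w, -h, 0, 1, 2*h),       # goal lines
    Segment(  0, -h, 0, 1, 2*h),                                  # center line
    Segment(-(w-p), -q, 0, 1, 2*q), Segment(w-p, -q, 0, 1, 2*q),  # penalty fronts
    Segment(-(w-g), -s, 0, 1, 2*s), Segment(w-g, -s, 0, 1, 2*s),  # goal-area fronts
]
# Penalty-area and goal-area side segments (lengths p and g, at y = +-q and +-s).
for y, ln, inner in ((q, p, w - p), (-q, p, w - p), (s, g, w - g), (-s, g, w - g)):
    SEGMENTS.append(Segment(-w, y, 1, 0, ln))      # left end of the field
    SEGMENTS.append(Segment(inner, y, 1, 0, ln))   # right end of the field

ARCS = [
    Arc(-w,  h, a, 1.5*pi, 2.0*pi), Arc( w,  h, a, pi, 1.5*pi),   # corner arcs
    Arc( w, -h, a, 0.5*pi, pi),     Arc(-w, -h, a, 0.0, 0.5*pi),
    Arc(0, 0, r, 0.0, 2.0*pi),                                    # center circle
]
```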
4.12 Layer 2: Relative Visual Localization

Relative localization assumes that an initial estimate of the robot's pose is available and keeps track of the pose while the robot moves. Assume that initially the robot has been determined to be at pose A. Then, within a small time interval, the robot moves to pose P. However, we do not know this new position. Instead, we have an estimate P_odo of the position due to odometric information. In the long term, odometric information accumulates an unbounded positional error. Thus, we have to correct P_odo with visual information.

In the following, we consider two different methods that achieve this correction. The first method is new, and we refer to it as the MATRIX method. It takes a cloud of points representing the field lines, transforms the points into the global coordinate system corresponding to P_odo and calculates a small correcting movement m which makes the points fit better to the model of the lines. The idea behind the method is that the points and the lines attract each other, resulting in an overall force and torque which iteratively makes the cloud adjust towards the line model. A force field is pre-calculated in order to achieve real-time computation. The second method is an already known method, referred to as the system dynamics approach. It is due to Dickmanns [26] and Wünsche [85, 86] and was already described in chapter 2. Although this method is more efficient when considered in isolation, we have not adopted it in our overall approach, since the third layer of localization, which performs the feature recognition, requires the same input as the MATRIX method. However, we sketch how to adapt the system dynamics approach, since it is elegant, extremely efficient and a prospective base for further research.

4.12.1 MATRIX: A Force Field Pattern Approach

Matching Distance Based on Closest Point Relationships

Having a hypothesis for the robot's pose on the playing field, it is possible to transform the perceived line contours into the global coordinate system. In order to evaluate the quality of a hypothesis, we want to define a matching distance function between the perceived lines and a model of the lines. Defining this function is difficult, because the function not only needs to be appropriate, but must also be calculable in real time.
of the perceived line contours the closest point of the model and to calculate the average of the corresponding distances. In order to be more robust against outliers, we sum up a constant cost c_outlier for distances which exceed a threshold t_outlier (in our implementation c_outlier = t_outlier = 50 cm). With A being the set of line contour points given in the local coordinate system of the robot, and the robot considered at pose x̂, the distance function is defined as:

d(x̂, A) := (1/n) ( Σ_{p_r ∈ A_valid} ‖x̂ ∘ p_r − p*‖ + |A_outlier| · c_outlier ).   (4.73)

Here n := |A| denotes the total number of points and p* denotes the closest model point of the perceived line point p_r ∈ A, subject to the estimated pose x̂ of the robot. The sets A_valid and A_outlier divide A into valid points and outliers. A similar definition of a matching distance function can be found in [55], where a distance function between two laser range scans has been defined.

The computationally most expensive part in 4.73 is the determination of the point correspondence. Given a point p := x̂ ∘ p_r, we have to iterate through all elements (straight lines and circular arcs) of the line model, calculate the respective closest point, and finally take the point p* with the overall smallest distance. For a straight line segment (A, v, l) the respective closest point p* can be calculated as

s := (p − A) · v   (4.74)
t := 0 if s < 0,  l if s > l,  s if 0 <= s <= l   (4.75)
p* = A + t v.   (4.76)

For a circular arc (M, r, φ_0, φ_1), the calculation is

v := p − M   (4.77)
p* = M + r v/‖v‖  if the polar angle of v is enclosed by φ_0 and φ_1;  the closest endpoint of the arc otherwise.   (4.78)

These calculations are illustrated in figure 4.17.
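A direct transcription of equations 4.74-4.78 might look as follows. This is only a sketch with illustrative function names; it assumes that v is a unit vector and that arc angles are given in [0, 2π] with φ_0 < φ_1, as in the line model above.

```python
import math

def closest_point_on_segment(p, A, v, l):
    """Equations 4.74-4.76: project p onto the segment (A, v, l) and clamp."""
    s = (p[0] - A[0]) * v[0] + (p[1] - A[1]) * v[1]
    t = min(max(s, 0.0), l)
    return (A[0] + t * v[0], A[1] + t * v[1])

def closest_point_on_arc(p, M, r, phi0, phi1):
    """Equations 4.77-4.78: radial projection if the polar angle of p - M lies
    inside [phi0, phi1], otherwise the closer arc endpoint."""
    vx, vy = p[0] - M[0], p[1] - M[1]
    phi = math.atan2(vy, vx) % (2.0 * math.pi)
    if phi0 <= phi <= phi1:
        norm = math.hypot(vx, vy)
        if norm == 0.0:          # p coincides with the center: any radial point
            return (M[0] + r, M[1])
        return (M[0] + r * vx / norm, M[1] + r * vy / norm)
    e0 = (M[0] + r * math.cos(phi0), M[1] + r * math.sin(phi0))
    e1 = (M[0] + r * math.cos(phi1), M[1] + r * math.sin(phi1))
    d0 = math.hypot(p[0] - e0[0], p[1] - e0[1])
    d1 = math.hypot(p[0] - e1[0], p[1] - e1[1])
    return e0 if d0 <= d1 else e1
```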
Figure 4.17: Calculating the closest point p* of a point p for (a) a straight line segment (A, v, l) and (b) a circular arc (M, r, φ_0, φ_1).

The Quality Matrix

On average the perceived line contours consist of approximately 400 points, and since the line model has 22 elements, about 8800 correspondence evaluations have to be carried out if we want to examine a hypothetical pose x̂. Since the line model is always the same, we can optimize the calculation by precomputing the correspondence relationships. We discretize the area of the playing field into small cells (the MATRIX) and precompute the distance to the closest point for the center of each cell. When evaluating the matching distance function d(x̂, A), we let each point p := x̂ ∘ p_r index its corresponding cell and look up the distance δ_{p_r} := ‖x̂ ∘ p_r − p*‖. When pre-computing the grid, we verify whether or not the calculated distances exceed the outlier threshold t_outlier and store c_outlier in the respective cells in these cases. Thus, the distance function evolves to

d(x̂, A) := (1/n) Σ_{p_r ∈ A} δ_{p_r}.   (4.79)
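The pre-computation and the lookup of equation 4.79 could be sketched as below. The cell size, array layout and names are illustrative assumptions; closest_model_point stands for the per-element search described above and is assumed to return the pair (closest point, distance).

```python
import numpy as np

CELL = 5.0                      # cell size in cm (illustrative choice)
T_OUTLIER = C_OUTLIER = 50.0    # thresholds from the text

def precompute_quality_matrix(closest_model_point, w=600.0, h=300.0, margin=100.0):
    """Store, for every cell center, the (capped) distance to the nearest model line."""
    nx = int(2 * (w + margin) / CELL)
    ny = int(2 * (h + margin) / CELL)
    Q = np.empty((ny, nx))
    for iy in range(ny):
        for ix in range(nx):
            cx = -(w + margin) + (ix + 0.5) * CELL
            cy = -(h + margin) + (iy + 0.5) * CELL
            d = closest_model_point((cx, cy))[1]
            Q[iy, ix] = d if d <= T_OUTLIER else C_OUTLIER
    return Q

def matching_distance(Q, pose, points, w=600.0, h=300.0, margin=100.0):
    """Equation 4.79: transform the contour points by the pose and average the lookups."""
    x, y, phi = pose
    c, s = np.cos(phi), np.sin(phi)
    total = 0.0
    for px, py in points:                          # points in robot coordinates
        gx, gy = x + c * px - s * py, y + s * px + c * py
        ix = min(max(int((gx + w + margin) / CELL), 0), Q.shape[1] - 1)
        iy = min(max(int((gy + h + margin) / CELL), 0), Q.shape[0] - 1)
        total += Q[iy, ix]
    return total / max(len(points), 1)
```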
114 This pre-computed quality matrix is illustrated in figure Figure 4.18: Each cell in the quality matrix stores the distance to the closest line point. High intensities correspond to large distances. The quality matrix is used to efficiently calculate the distance function d( x, A), expressing how good the point set A fits to the field lines subject to the robot s pose x. Relative Correction by Particles By means of the distance function d( x, A) we can efficiently verify the hypothesis for the robot s pose x. In particular, when we have an initial estimate P odo, we can distribute a small number of particles (20 in our experiments) in the neighborhood of P odo. The distribution of the particles should account for the system state of the robot, including its velocities. That is, the variance of the particles should be highest in the direction of highest uncertainty. This can be achieved by sampling from a Gaussian distribution with the covariance matrix P, the error covariance matrix of the system state x. An example with the robot moving forward is illustrated in figure Then the distance function d( x i, A) is evaluated for each sample x i and the best sample x ibest is chosen as the correct 110
115 Figure 4.19: During relative localization 20 particles are distributed according to the robot s motion state around the estimated position of the robot. position. That is z relative vision := x ibest. (4.80) Relative Correction by a Force Field Instead of using samples, one can also simulate an attractive force and a momentum between the line model and the point set A. This results in a gradient descent algorithm similar to the well-known Iterative Closest Point (ICP) algorithm [9], starting at P odo and iteratively adjusting the position until the line contours converge towards a best fit to the model. Instead of calculating the forces online, we precalculate a force field in order to achieve real-time computation. Given a point p in the global coordinate system, we define the force vector function f(p) to be a weighted sum of the vectors v k reaching 111
from p to the closest point p*_k on element k of the line model (straight line segments and circular arcs):

f(p) := Σ_{k=1}^{22} w_k (v_k/‖v_k‖) min(‖v_k‖, f_max),   (4.81)

with

v_k := p*_k − p.   (4.82)

Here, we delimit the force contributions to f_max = 150. The weights w_k exponentially decrease with an increasing distance of the corresponding points p*_k:

w_k := τ e^{−η ‖v_k‖}.   (4.83)

Since we do not use physical units, the constants and the units in which lengths are expressed depend on each other. Here the constants are specified for lengths expressed in centimeters. The constant η defines the decline of the force field (η = 0.1 in our implementation) and τ is a normalizing constant, which is

τ = ( Σ_{i=1}^{22} e^{−η ‖v_i‖} )^{−1}.   (4.84)

Hence, the weights w_k, k = 1, 2, ..., 22 sum up to one.

Given an initial pose x̂_k and a point set A = {p_{r,1}, p_{r,2}, ..., p_{r,n}} representing points on the field lines, which are specified in the local coordinate system of the robot, we calculate a translational force and a momentum, which in turn result in a relative translation and rotation producing an improved estimate x̂_{k+1} of the robot's pose. The overall force vector F(x̂_k, A) acting on the point cloud A, subject to an initial estimate of the robot's pose x̂_k, is computed by

F(x̂_k, A) = (1/n) Σ_{i=1}^{n} f(x̂_k ∘ p_{r,i}).   (4.85)

Pre-computing the forces for all possible positions on a discretization of the playing field yields a force field, which is illustrated in figure 4.20. The momentum m(p, g) induced by the force f(p) on a single point p onto the center of gravity g of the point set A is calculated by

m(p, g) = f(p) · R(π/2)(g − p).   (4.86)
Figure 4.20: The arrows indicate the forces acting on a point at the corresponding position. The length of the arrows indicates the strength of the forces. The force field is used to iteratively calculate a translation and rotation of a point cloud and a corresponding robot position in order to make the points fit to the model of the field lines.

Here, R(π/2) is the rotation matrix which rotates a vector 90 degrees (definition on page 78). Then the overall momentum M(x̂, A, g) acting on g with respect to the robot's pose x̂ is

M(x̂, A, g) = (1/n) Σ_{i=1}^{n} m(x̂ ∘ p_{r,i}, g).   (4.87)

With x̂_k = (x_k, y_k, φ_k)ᵀ we compute x̂_{k+1} by

(x_{k+1}, y_{k+1})ᵀ = (x_k, y_k)ᵀ + γ F(x̂_k, A)   (4.88)
φ_{k+1} = φ_k + γ M(x̂_k, A, g).   (4.89)
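One correction step of equations 4.85-4.89 could be sketched as follows. The step width γ = 0.1 is the value given below; force_lookup stands for the pre-computed force field and the remaining names are illustrative.

```python
import math

GAMMA = 0.1   # step width gamma from the text

def force_field_step(pose, points_r, force_lookup):
    """One iteration of equations 4.85-4.89.

    pose         -- current estimate (x, y, phi)
    points_r     -- field-line points in robot coordinates
    force_lookup -- maps a global point to its pre-computed force vector
    """
    x, y, phi = pose
    c, s = math.cos(phi), math.sin(phi)
    # Transform the point cloud into global coordinates and fetch the forces.
    pts = [(x + c * px - s * py, y + s * px + c * py) for px, py in points_r]
    forces = [force_lookup(p) for p in pts]
    n = len(pts)
    # Overall translational force (equation 4.85).
    Fx = sum(f[0] for f in forces) / n
    Fy = sum(f[1] for f in forces) / n
    # Center of gravity of the point cloud.
    gx = sum(p[0] for p in pts) / n
    gy = sum(p[1] for p in pts) / n
    # Overall momentum about g (equations 4.86-4.87); R(pi/2)(g - p) = (-(gy - py), gx - px).
    M = sum(f[0] * (-(gy - p[1])) + f[1] * (gx - p[0])
            for f, p in zip(forces, pts)) / n
    # Pose update (equations 4.88-4.89).
    return (x + GAMMA * Fx, y + GAMMA * Fy, phi + GAMMA * M)
```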
118 Figure 4.21: This figure shows the force vectors acting on the points. Each force vector is a weighted sum of a list of vectors. The number of vectors in each list is equal to the number of straight and curved line segments in the model of the field lines and each vector points to the closest point of the respective line segment, measured from the current point. The force vectors for all possible locations can be pre-computed, since the model of the field lines is static. Then, given a point, it suffices to look up the force vector in the lookup table. Here, the lookup table is a two-dimensional discretization of the playing field. After discarding outliers, an overall translational force can be computed by a weighted sum of the forces. Here γ = 0.1 is a constant, specifying the step width for the iterations. Both figure 4.21 and 4.22 are based on the same exemplary field contours, assuming that they were detected in an image and that they were transformed into the global world coordinate system. The figures show the forces and momenta that would arise, respectively. In our overall setup, the line contours are extracted every third frame, yielding a frequency of ten with a frame rate of 30 frames per second. With the robot s maximum speed 114
119 center of gravity Figure 4.22: Similar to figure 4.21 the forces on the points are considered. However, for calculating a rotational momentum, only the component of the individual force vectors, which are perpendicular to a line connecting the center of gravity and the current point are shown. These components are aggregated to an overall momentum as described in the text. of approximately 350 cm/s the robot can move a maximum distance of approximately 35 cm. Furthermore, we have a good initial estimate of this movement by odometry. Assuming an odometric error of at most 10%, the error that has to be corrected for can be bounded by approximately 3.5 cm. This is the reason why even one iteration is sufficient to correct for these errors. Thus, using the force field pattern approach for relative visual correction, the final measurement of layer 2 is: z relative vision := x k+1. (4.90) 115
120 Considering Fusion Relative localization is processed after image analysis and therefore runs at the frame rate of the video stream. Here, the difficulty arises in using the odometric information that corresponds to the time interval of the respective visual information. This is not trivial because odometry and vision data are subject to different time delays. For instance, vision has a typical delay of 150 milliseconds, while odometry has a delay of approximately 8 milliseconds. For instance, when the robot is standing still and then starts rotating, the rotation will be observed by odometry after 8 milliseconds, but the visual processing will not detect the rotation until 150 milliseconds after the rotation. Here, we can distinguish two different times, the mental time at which an event is detected by the processing system, and the physical time at which the event happened in the physical world. When capturing odometric and visual information, we measure the times when the information is processed by making use of the processor s performance counter, which allows a time resolution below one microsecond (10 6 s). These measured times are the mental times. To obtain physical times we have to subtract the respective time delays of the preprocessing, which includes both hardware and software processing. We store the incoming odometric information together with the corresponding physical time stamps in a queue. Thus, we always have the data of the last 2-3 seconds and we can access the data by physical times. Within the video callback function we also take a time measurement and subtract the 150-millisecond time delay. Then, for two successive images, we have two physical times and we can search for the corresponding odometry information in the data buffer. Since the data is sorted by times within the data buffer, we can use binary search to efficiently find the respective indices of the time interval and we obtain the corresponding odometric information. The precise handling of time is not that important for global localization because here, we re only interested in a rough position of the robot. However, relative localization works on a very fine scale in time and space and the resulting position s precision is of great importance for estimating position and speed not only of the robot itself, but also of other objects that are perceived with relative coordinates from the robot s perspective. Usually, relative localization is active 99 percent of the time and global localization only 1 percent. 116
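A minimal sketch of the time-stamped odometry buffer described above is given below. The 150 ms and 8 ms delays are the values quoted in the text; the class layout and names are illustrative, and the caller is left to compose the returned increments into a relative movement.

```python
import bisect
from collections import deque

ODO_DELAY = 0.008      # seconds; preprocessing delay of odometry
VISION_DELAY = 0.150   # seconds; preprocessing delay of vision

class OdometryBuffer:
    """Keeps the last few seconds of odometry increments, indexed by physical time."""
    def __init__(self, horizon=3.0):
        self.horizon = horizon
        self.times = deque()   # physical time stamps, ascending
        self.data = deque()    # odometry increments, e.g. (dx, dy, dphi)

    def push(self, mental_time, increment):
        t = mental_time - ODO_DELAY            # mental time -> physical time
        self.times.append(t)
        self.data.append(increment)
        while self.times and self.times[0] < t - self.horizon:
            self.times.popleft()
            self.data.popleft()

    def between_frames(self, mental_t0, mental_t1):
        """Return the increments recorded between the physical times of two images."""
        t0 = mental_t0 - VISION_DELAY
        t1 = mental_t1 - VISION_DELAY
        times = list(self.times)
        i0 = bisect.bisect_left(times, t0)     # binary search, as described above
        i1 = bisect.bisect_right(times, t1)
        return list(self.data)[i0:i1]          # caller composes these increments
```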
Adapting the System Dynamics Approach

This section describes how the system dynamics approach can be applied instead of the MATRIX method to correct for small errors in the estimate of the robot's system state. Based on the model of the field lines, the dynamic model and the observation model, the idea is to predict the appearance and position of the field lines at a few locations in the images and to use only a few detectors to correct the estimate of the system state. In the following, we will describe the kinds of features, the detectors and how the estimate of the system state is corrected. Two different approaches of defining appropriate features are considered: tangent points (see figure 4.23) and point triplets (see figure 4.27).

Figure 4.23: In the first approach, we consider a feature to be a point on a field line together with the orientation of a tangent to the field line at the respective point. The two examples (a) and (b) show that the corresponding field line can be straight or curved. The width of the field line at the point is also a component of the corresponding feature vector. In the text, we refer to such a feature as a tangent point p := (x_w, y_w, φ_w, w_w)ᵀ.

The first version, tangent points, results in a complicated mechanism to measure the components of the feature in the image, and the resulting mathematical formulation becomes complicated. Thus, the definition of the features is modified later, yielding the point triplet features. Both versions will be described here, because the second approach can most easily be understood when considering the first approach at the beginning. Furthermore, by defining different features, the flexibility of the overall approach becomes apparent. The basic idea is to consider the orientation and position of the field lines at a few locations on the playing field.
Figure 4.24: A set of detectors with 16 different orientations and widths is used for the field line detection. The response of the detector is the average intensity of pixels below the light areas of the detector minus the average intensity of pixels below the dark areas.

By observing how these fixation points behave in the image, the estimates of the system state are calculated. Thus, in an initial approach, we define a feature that will later be referred to as a tangent point. Of course, a point does not have a tangent. Rather, with tangent point we consider a point on a curved or straight field line together with the tangent direction of the corresponding line in the model taken at the
123 respective point. Additionally, we include the width of the corresponding field line as a component of the feature-vector. We denote a tangent point with p := (x w, y w, φ w, w w ) T where x w and y w specify its position in the global world coordinate system, φ w is the polar angle of the tangent direction and w w denotes the width of the line. Here, the subscript w emphasizes that the values are specified in world coordinates. This definition is illustrated in figure Note that we always know the coupling between a tangent point and the curved or straight line, whose tangent at that point is considered. This coupling is important when later calculating the Jacobian matrix C, which relates changes in the system state to changes of features in the images. Figure 4.25: Given the position of the robot and the obstacles, two visible tangent points are determined. The points are depicted by small, white disks and the corresponding orientations by small, black line fragments. The cones marked by thin black lines show the parts of the field lines that are occluded by obstacles. Using the observation model, we can predict the position of a tangent point in the image, but we also want to predict the orientation and width of the corresponding line. For the tangent direction, we consider two points in the world coordinate system that are slightly spaced to each side of the tangent point, map these points onto the image and calculate the angle of their difference vector. Similarly, to predict the width of the line in the image, we consider two points around the tangent point, but this time perpendicularly located at both sides of the field line, enclosing the field line in the middle. We denote this mapping with y = g(p, x (k), k c ), (4.91) 119
124 where x (k) is the estimated system state at time step k and k c are known factors of the optical mapping (i.e. specifying the mirror s center in the image). The resulting vector y = (x, y, φ, w ) T consists of four components describing the predicted position, orientation and width of the projected tangent point in the image. Our goal is to measure the discrepancy between the predicted feature y and the real feature y. Since we expect the difference to be small, we use y as a starting point for the detection process. That is, we have an initial estimate of the position, orientation and width of the tangent point in the image, and we choose an appropriate detector from a set of pre-computed filters defined for different orientations and widths of the expected line (see figure 4.24). With the above definition of the tangent points, considering a single point in the image is insufficient to measure the feature s true values y in the image. At least one, better two auxiliary points have to be used. Figure 4.25 shows the robot close to the penalty area and two tangent points that are used in figure 4.26 to measure the discrepancy between the predicted and real appearance of the feature. However, the need for auxiliary points in the image to measure the discrepancy in the orientation strongly suggests that the idea of the tangent point is misrepresented. Although this representation has the advantage of a small feature-vector - thus, resulting in small matrices later - it is more straight-forward to consider the auxiliary points from the very beginning, that is at the level of the world model. Hence, we discard the idea of tangent points and instead define a new feature that we refer to as a point triplet in the following. A point triplet is simply a set of three points on a straight or curved line, equally spaced at a small distance ( 20cm). Figure 4.27 illustrates two cases for straight and curved lines. The corresponding feature vector p is defined as p := (x w1, y w1, x w2, y w2, x w3, y w3 ) T, (4.92) with x wi, y wi as the cartesian coordinates of the ith point (i = 1, 2, 3) specified in the global world coordinate system. The feature vector p was already defined for the tangent points. However, discarding this approach, we redefine this vector. Based on the current estimate of the system state, the position of the point triplet in the image can be predicted. y (k) := g(p, x (k), k c ). (4.93) 120
125 1 * y 1 T 1 * y 2 T 2 2 Figure 4.26: This figure shows how the discrepancy between predicted and real measurements is determined using tangent points features. It shows that auxiliary points are required to measure the orientation. This is one reason, why the idea of the tangent points is discarded and replaced by point triplet features later. The figure shows a closeup of an image at the corner of the penalty area. The estimate of the robot s position is slightly wrong. Thus, the predicted feature locations y 1 and y 2 do not precisely cover the pixels corresponding to the respective field lines. Here, y 1 and y 2 have three components including the predicted x, y-position plus the predicted orientation. To measure the true location of each feature, three line detectors are applied along straight lines. The lines are oriented perpendicular to the predicted orientation of the corresponding field line. Each of the detectors then finds the position of maximum response, which is on a pixel of the field line. Fitting a straight line to the three detected points then yields the measured orientation of the field line. Later, when we discard tangent points and use point triplets instead, the measurements will solely consist of pixel coordinates, not including angular components. Now, y (k) = (x 1, y1, x 2, y2, x 3, y3) T contains the predicted pixel coordinates of the projected points in the image. Also, we predict the orientation and width of the field line at the respective points and select an appropriate detector. The detectors yield the 121
positions of maximal responses, which are summarized in y(k) = (x_1, y_1, x_2, y_2, x_3, y_3)ᵀ. This step is illustrated in figure 4.28.

Figure 4.27: In a second approach, we define a point triplet to be a feature consisting of three points on a field line, equally spaced at a small distance (between 20 and 80 cm). The two examples (a) and (b) show that the corresponding field line can be straight or curved. The resulting feature vector has the simple form p := (x_{w1}, y_{w1}, x_{w2}, y_{w2}, x_{w3}, y_{w3})ᵀ.

Next, we consider the discrepancy between the predicted and measured feature values:

Δy(k) = y(k) − y*(k).   (4.94)

There is a coupling between a change Δx(k) of the system state and the measurement discrepancy Δy(k). However, this mapping is non-linear because of the distortion of the image and the angular components in the system state. Note that this non-linearity is typical not only for omni-directional vision systems, but also for perspective vision. Hence, we linearize the mapping at the predicted estimate x*(k) of the system state:

Δy(k) = C(k) Δx(k).   (4.95)

Here,

Δx(k) = x(k) − x*(k),   (4.96)

and C(k) is the Jacobian at time step k, the derivative of the vector function g (see equation 4.93) with respect to the system state x. We will describe later how it is calculated. At this point, we consider y(k) to contain only one point triplet. In this case, the mapping described by matrix C(k) is typically not bijective, that is, C(k) cannot be inverted.
Figure 4.28: The position of a point triplet in the image, y* = (x*_1, y*_1, x*_2, y*_2, x*_3, y*_3)ᵀ, is predicted based on the current estimate of the system state. A local detector is applied along lines orthogonal to the predicted orientation of the field lines at the respective points. Based on the prediction of the field line widths and orientations, the appropriate detector is selected out of a set of pre-computed detectors that vary in width and orientation. The detectors then measure the true position of the point features in the image, y = (x_1, y_1, x_2, y_2, x_3, y_3)ᵀ.

Hence, it is not possible to calculate the system state by replacing Δx(k) in equation 4.95 with equation 4.96 and solving for x(k):

x(k) = x*(k) + C(k)^{−1} Δy(k).   (4.97)

Moreover, such a calculation would contradict the overall approach, since we want to include the dynamic and measurement models to interpret the discrepancy Δy(k). Indeed, we will collect more than one point triplet in the vector y(k) later, and C(k) will be non-singular in many cases. But even then, we do not use equation 4.97 to calculate the system state. Instead, we apply a Kalman filter:

x̂(k) = x*(k) + K(k) Δy(k).   (4.98)
Here, K(k) is the so-called gain matrix, which is calculated by

K = P Cᵀ (C P Cᵀ + R)^{−1}.   (4.99)

Here, P is the current error covariance matrix. The matrix is updated by

P = P − K C P.   (4.100)

The number of rows and columns of the matrix P equals the number of system states. R is the measurement noise covariance matrix. Its number of rows and columns corresponds to the dimension of y. The matrix R contains the covariances between the individual quantities of the measurement vector y. These covariances depend on the currently selected features. A feature which is close to the robot gives rise to smaller entries than a distant feature. The reason is that closer objects have a higher resolution in the image due to the catadioptric image formation.

The Jacobian C(k) plays a fundamental role in the overall approach. Its components are the partial derivatives of the vector function g, describing the optical mapping, with respect to the components of the system state x, taken at x*. Remember that the system state was defined as

x = (x, y, φ, v_x, v_y, ω)ᵀ,   (4.101)

with x, y, φ representing the robot's position and orientation (the robot's pose, or reduced system state) and v_x, v_y, ω specifying the translational and rotational velocities (see page 73). The appearance of features in the image depends not only on the robot's pose and other objects, but also on its translational and rotational velocities. For instance, when the robot rotates very fast and the exposure time of the camera is short, e.g. because of dim light, features like the field lines tend to smear along the rotational direction. However, in the following we neglect these effects. It suffices to consider the partial derivatives with respect to the reduced system state. When considering only one point triplet p(k) = (x_{w1}, y_{w1}, x_{w2}, y_{w2}, x_{w3}, y_{w3})ᵀ and setting

(x*_1, y*_1, x*_2, y*_2, x*_3, y*_3)ᵀ := y*(k) = g(p, x*(k), k_c),   (4.102)
the Jacobian has the form:

C(k) :=
[ ∂x*_1/∂x   ∂x*_1/∂y   ∂x*_1/∂φ ]
[ ∂y*_1/∂x   ∂y*_1/∂y   ∂y*_1/∂φ ]
[ ∂x*_2/∂x   ∂x*_2/∂y   ∂x*_2/∂φ ]
[ ∂y*_2/∂x   ∂y*_2/∂y   ∂y*_2/∂φ ]
[ ∂x*_3/∂x   ∂x*_3/∂y   ∂x*_3/∂φ ]
[ ∂y*_3/∂x   ∂y*_3/∂y   ∂y*_3/∂φ ]   (4.103)

Here, we omitted the variable k in the components of the matrix for the sake of readability, but of course, the matrix is not static, instead depending on the time step k. Because of the distortion by the robot's mirror, the matrix C(k) cannot be computed analytically. Instead, it has to be computed numerically in each time step k by combining knowledge of the model of the field lines, the current estimate of the system state and the detection process.

The differential components of the Jacobian C(k) at time step k are approximated by finite differences. For that approximation, it is predicted how the measurements of the detectors evolve in the image when individual components of the system state are slightly modified. That is, small, finite differences Δx, Δy and Δφ are added to the components of the current estimate x*(k) of the system state, and the resulting geometric constellation of the field lines in the image, before and after the modification, is investigated. Predicting the image position of the currently used features, the question is what the corresponding detectors would measure if the state changed as supposed by the finite differences. Since the measurements of the detectors are determined by applying them along straight scan lines and returning the position of maximum response, a prediction of the measurement can be made by calculating the intersection points between hypothetical scan lines and the predicted field lines according to the modified system state. Relating the measurements to the finite differences in the system state yields the approximate components of the Jacobian matrix C. This calculation is illustrated in figure 4.29.
Figure 4.29: The Jacobian C is calculated numerically by finite differences. Figure (b) shows the field lines as they would appear to the robot when it is located as in (a). In (c1), (d1) and (e1), the robot's current pose is slightly modified in one of its components (x, y, φ)ᵀ. The original pose is compared to each of the modified poses; in particular, the corresponding predictions of the field lines in the image are compared. Assuming a single point triplet, the response of the detector is predicted by calculating where the detector's scan lines would intersect the geometry in the image that corresponds to the modified poses. In figures (c2), (d2) and (e2) the gray contours reflect the field lines that correspond to the robot's original pose. Each is the same closeup of (b). The black lines reflect the field lines corresponding to the modified poses, and the straight scan lines and black dots illustrate the expected detector measurements. To calculate the intersection points, only the local geometry around the point triplet has to be used. Thus, figures (c-e) show cut-outs. Relating the expected detector measurements to the modification of the robot's pose yields an approximation of the entries of the Jacobian matrix (f).
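The finite-difference construction of C(k) and the subsequent Kalman correction (equations 4.98-4.100) might be sketched as follows. Here predict_measurement stands for the combination of the observation model g and the simulated scan-line intersections described above; the step sizes and names are illustrative assumptions.

```python
import numpy as np

def numeric_jacobian(predict_measurement, pose, deltas=(1.0, 1.0, 0.01)):
    """Approximate C = d(predicted detector measurement)/d(pose) by forward differences.

    predict_measurement(pose) must return the predicted detector readings for the
    current features (e.g. the six pixel coordinates of one point triplet) as a
    1-D array; pose is the reduced system state (x, y, phi)."""
    pose = np.asarray(pose, dtype=float)
    y0 = np.asarray(predict_measurement(pose), dtype=float)
    C = np.zeros((y0.size, pose.size))
    for j, dj in enumerate(deltas):
        modified = pose.copy()
        modified[j] += dj                     # add a small finite difference
        yj = np.asarray(predict_measurement(modified), dtype=float)
        C[:, j] = (yj - y0) / dj
    return C

def kalman_correct(x_pred, P_pred, C, y_meas, y_pred, R):
    """Equations 4.98-4.100: gain matrix, state correction, covariance update."""
    K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)
    x_new = x_pred + K @ (y_meas - y_pred)
    P_new = P_pred - K @ C @ P_pred
    return x_new, P_new
```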
131 Up to this point, only one point triplet was used to correct the current estimate of the system state. Now, we will use several point triplets and collect all their components in a single feature vector. The point triplets used will not remain the same over time. Rather, a selection algorithm will dynamically choose different features. But how many and which features should be used? A feature selection process must carry out the following three steps. First, visible point triplets have to be determined. For that, the portions of the field lines that are not occluded by obstacles have to be determined. Second, point triplets on the field that can be easily recognized with feature detectors have to be selected. They should not be close to obstacles, corners or junctions of the field lines to avoid detection problems. Also, they should not be too distant. Third, from the set of potential point triplets, the combination most suited to correct the current estimate of the system state must be determined. The feature selection process must consider the obstacles to avoid the selection of point triplets that will be occluded in the image. Obstacles can also be tracked with the system dynamics approach. However, in the following, we simply assume that we know these obstacle positions. When observing the playing field from above, the left and right side of an obstacle form a cone together with the position of the robot. Field lines within this cone and behind the obstacle are occluded in the omni-directional images (see figure 4.30). Thus, if the intersection points between the field lines and the sides of the cones are calculated, a list of visible curved and straight line segments can be computed as illustrated in figure On each visible line segment, one or several potential point triplets can be considered. However, the line segments should be long enough to allow a point triplet to be placed in some distance w from the end points of the segment. Otherwise, they should be discarded to avoid detection problems due to imprecisions in the current estimate of the system state. If a line segment is long enough, several point triplets can be placed on it, with the constraint that they are spaced at a distance of at least d = 10cm. It is important to understand that a single point triplet on a straight line is not able to recover the component of a movement that is parallel to the line. This is the wellknown aperture problem. However, by selecting constellations of point triplets that are not parallel, this problem can be overcome. After having selected a set H of l potential point triplets, the best k (i.e. k=2) should 127
be selected. Here, the best point triplets are those that allow the best estimate of the system state, that is, those with the smallest resulting error covariance matrix P.

Figure 4.30: The shaded areas in (a) show sections occluded by obstacles. The visible parts of the field lines (b) can be determined by calculating the intersection points between the field lines and rays starting at the robot's position and passing through each side of the obstacles.

The error covariance matrix P is recursively computed from the old matrix P by

P = P − K C P.   (4.104)

Wünsche developed a criterion for the efficiency of a feature combination in his doctoral thesis [86]. Underlying a Gauss-Markov estimator, the error covariance matrix is calculated by

P = (C_Nᵀ C_N)^{−1},   (4.105)

where

C_N = R^{−1} C_R S.   (4.106)

Here, C_R is the Jacobian matrix based on the reduced system state, and S is a scaling matrix that is necessary to relate errors in the individual components of the system state to each other. Wünsche defined

J = |C_Nᵀ C_N| = |P^{−1}|   (4.107)
as a criterion to decide for a feature combination. The feature combination that maximizes J should be selected. In this way, the determinant of P is minimized, favoring small diagonal elements in P, which represent the error variances of the system state components. At the same time, uncorrelated errors are favored. Using a Kalman filter to calculate the error covariance matrix P, the definition

J := |P^{−1}|   (4.108)

is the analog to equation 4.107. However, the problem is that the determinant does not consider the different units of the components of P. For instance, a variance of 1.0 radians in the orientation of the robot is much more severe than a variance of 1.0 centimeters in the Cartesian position. Thus, the definition should be adjusted to

J := |S² P^{−1}|,   (4.109)

with S being a scaling matrix that relates the components to each other.

Selecting k features from the set H of l potential features, (l choose k) combinations are possible. In many applications the number of potential features is restricted. With k = 2 and l = 20 this would result in (20 choose 2) = 190 required evaluations of the criterion J. Selecting 4 from 10 features produces 210 combinations, which is a treatable computational load. However, when k and l increase, the combinatorial possibilities explode. For instance, when selecting 6 from 50 possible features, approximately 1.6 · 10⁷ combinations are possible. In order to reduce the computational load, Wünsche's idea was not to compute the best combination of features in each step, but instead to try to improve the current feature combination by exchanging single features. In this way, each of the currently used k features must be considered for exchange with one of the l − k unused features. That is, the criterion J must be evaluated only k(l − k) + 1 times. In the example, where 6 from 50 features have to be selected, only 265 evaluations have to be carried out in each step. Compared to the number of combinations that had to be evaluated before, this corresponds to a reduction of the computational load of several orders of magnitude.
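A sketch of the exchange strategy under the criterion of equation 4.109 is given below. The helper covariance_for and the scaling matrix S are assumptions (the former stands for whatever computation yields the error covariance P for a candidate feature set); the loop structure follows the description above and evaluates the criterion k(l − k) + 1 times.

```python
import numpy as np

def criterion_J(P, S):
    """Equation 4.109: J = |S^2 P^-1|; larger values are better."""
    return np.linalg.det(S @ S @ np.linalg.inv(P))

def improve_selection(current, unused, covariance_for, S):
    """Try to improve the current feature set by exchanging single features.

    covariance_for(feature_set) is assumed to return the error covariance P
    that the filter would reach with that set of point triplets."""
    best_set = list(current)
    best_J = criterion_J(covariance_for(best_set), S)   # the +1 evaluation
    for i, f_old in enumerate(current):
        for f_new in unused:
            candidate = list(current)
            candidate[i] = f_new                         # exchange one feature
            J = criterion_J(covariance_for(candidate), S)
            if J > best_J:
                best_set, best_J = candidate, J
    return best_set, best_J
```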
4.13 Layer 3: Feature Recognition

In the third layer, we run a feature detection process that allows us to find sub-structures like corners or the center circle, which yield strong hints for the robot's position. A similar approach was taken in [39], where straight lines and circles are detected for localization. Our approach differs from the previous method in the way we detect the features. While the approach in [39] is based on the Hough transform, which is computationally very expensive, we detect the features using a constructive approach. That is, we try to iteratively compose features from smaller parts. Our method is based on the representation of the field lines as obtained by the new region tracking algorithm described on page 66. Here, the field lines are already organized as chains of points.

Representation of the Line Contours

We use our region tracking algorithm to extract the field lines from the images. We determine the boundary of all green regions in the images and we search for green-white-green transitions perpendicular to the boundary curves. After having extracted the lines, they are represented by the pair (P, C), where P = {p_0, ..., p_{n−1}} is a set of n points with Cartesian x, y-coordinates and C = (c_0, ..., c_{l−1}) supplies connectivity information that partitions P into l point sequences. Here, each c_i = (s_i, e_i) is a tuple of indices determining the start and end point in P that belong to the corresponding point sequence. That is, point sequence i consists of the points p_{s_i}, ..., p_{e_i}. By manipulating the connectivity information C, point sequences can efficiently be split or merged. The line contours as extracted from the images are distorted due to the mirror in the omni-directional vision system. In the following, we assume that the distortion has been compensated (see page 98). However, even without removing the distortion correctly, we are still able to detect most of the features. Figure 4.31 can provide an impression of the initial data. In many cases, there are long point sequences that correspond precisely to the shape of the field lines. However, there are also outliers, missing lines, and small line fragments due to occlusion and detection errors.
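The (P, C) representation might be realized as in the following sketch; splitting and merging only rewrite the connectivity tuples, the point array itself stays untouched. The class and method names are illustrative, not taken from the thesis code.

```python
class LineContours:
    """Field-line contours: a flat point list P plus connectivity tuples C."""
    def __init__(self, points, connectivity=None):
        self.P = list(points)                       # [(x, y), ...]
        # each tuple (s, e) marks the first and last index of one point sequence
        if connectivity is not None:
            self.C = list(connectivity)
        else:
            self.C = [(0, len(self.P) - 1)] if self.P else []

    def sequence(self, i):
        s, e = self.C[i]
        return self.P[s:e + 1]

    def split(self, i, k):
        """Split sequence i at absolute point index k; k ends the first part
        and starts the second, only C is rewritten."""
        s, e = self.C[i]
        if s < k < e:
            self.C[i:i + 1] = [(s, k), (k, e)]

    def merge(self, i, j):
        """Merge sequence j into sequence i when they are contiguous in P."""
        (si, ei), (sj, ej) = self.C[i], self.C[j]
        if ei == sj or ei + 1 == sj:
            self.C[i] = (si, ej)
            del self.C[j]
```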
135 Figure 4.31: Four examples of extracted line contours. Some long sequences of points correspond precisely to the shape of the field lines. But there are also outliers, missing lines, and small line fragments due to occlusion and detection errors. 131
136 Quality and Quantity of Features In the following sections, we will concentrate on the recognition of specific features which we refer to as high-level features. The idea of hierarchically ordering the features stems from the fact that there exist features, such as the center circle, which yield a unique clue of the robot s position, while others give rise to a whole set of possible positions. Figure 4.32 provides a first idea of different feature types that might be detected in the line contours. corner X-junction T-junction straight line arc center circle parallelism U-structure quarter circle double corner rectangle Figure 4.32: Examples of different types of features are illustrated. One can distinguish between low-level features that give rise to many possible robot poses or high-level features which yield only a few possible robot poses from which the feature could have been observed. In order to put the feature detection process onto a well-defined base, we first establish a mathematically sound definition of what a high-level feature is. For that, we define two terms, the quality and the quantity of a feature type in the following. The quality of a feature type is the dimension of the space of possible robot poses once a feature of the 132
respective type is detected. The quantity is the frequency of a feature in the task domain. Of course, we will favor features with a low quantity number. In the following we will establish these definitions more precisely.

In the context of the field lines, one of the simplest possible features is a single point on a line. When detecting such a point p_i in the image (subscript i for image), the point can be transformed from the image coordinate system into the local coordinate system of the robot (applying the distance function). Let us denote this representation by p_r (subscript r for robot):

p_r := Y_f^{−1}(p_i).   (4.110)

Here Y_f^{−1} is the generic function which maps the feature f from image space to local world space. Then we know the angle at which the point is seen and its distance. Let us assume for the moment that there is no uncertainty in the measurement. Then the question is: where could the robot be so that a point on a field line is seen at the given angle and distance?

In order to express this question more formally, we have to make some definitions. First of all, given a feature type f (i.e. a point, line, corner, ...) we define L_f to be the set of all poses (position and orientation) where such features exist within the field model, specified within the global coordinate system. For instance, L_p is the set of all points on the field lines, L_l the set of all straight line segments, L_c the set of all corners, etc. Then, given a feature type f, we ask for the set X_f of robot poses such that for all x̂ ∈ X_f there exists an instance p ∈ L_f such that p would appear at the location of p_i in the image:

X_f(p_i) := { x̂ | ∃ p ∈ L_f : Y_f(x̂ ∘ p) = p_i }.   (4.111)

Note that each element in X_f is a three-dimensional vector with the last component specifying an angle. This last component complicates the following definitions, and we will first consider X̃_f, which is X_f without the final component. Typically, the set X̃_f can be thought of as a union of k sets X̃_{1f}, X̃_{2f}, ..., X̃_{kf}, with each set being an affine subset of IR². Then we define the quantity of a feature type to be k and the quality to be the dimension of the X̃_{if}, i = 1, 2, ..., k. In the case that the sets have different dimensions
138 (which is not the case in this thesis) one should more carefully define the quality as: quality(f) = max i=1,2,...,k (dim( X if )) (4.112) For instance, if we consider the center circle to be a feature type, X center circle consists of two possible robot poses, one on each side of the playing field, from which the robot could have perceived the detected center circle p i in the image. Thus, X center circle decomposes into two subsets of dimension zero. Hence, quality(center cirlce) = 0. Accordingly the quantity of the center circle feature is two. If we consider a single point as a feature, the quality is two, because X point consists of areas of possible points. The quantity in this case is infinite. We will consider another example with straight line segments as features. For the sake of simplicity, imagine that our world model L straight lines would consist of only 1 (instead of 17) straight line segment(s) and that l i is a detected line segment in the image and l r is the corresponding segment in the local coordinate system of the robot as depicted in figure 4.33b). Then figure 4.33c) shows the set X straight lines and its decomposition into the sets X 1straight lines and X 2straight lines. Now we can establish a hierarchy of features. At the top are features with quality zero, which we refer to as high-level features. Features having the same quality are ordered by their quantity with lower quantities yielding a higher position in the hierarchy. In the following sections, we will describe a complex feature detection method. The method begins with low-level features such as points and lines and iteratively constructs higher-level features. At the very top of the construction process, there are five different high-level features: the center circle, and four different corners arising from the structure of the penalty area. Before we detail the detection process, we will first describe how the pose of the robot, which is the measurement vector z rec of the third layer, can be inferred once a high-level feature is detected Direct Pose Inference by High-Level Features Figure 4.34 shows the five high-level features we are going to detect. The four corners seem to be congruent, but their location within the penalty area will be recognized such that left,right, inner and outer corners can be distinguished. Each feature type defines a local coordinate system. We refer to the origin of this 134
system, together with its orientation, as the reference mark of the feature.

Figure 4.33: This figure illustrates the definitions of quality and quantity of a feature. In this example, the feature type straight line segment is considered and it is assumed that the world model L_straight lines consists of only one line l_0 (see (a)). In (b) the robot is shown having detected a line l_i in the image, which is transformed into the robot's local coordinate system, where it is denoted by l_r. Figure (c) then depicts the sets X̃_1,straight lines and X̃_2,straight lines, which constitute the positions from which the robot could have seen the line in order that its perception fits to the model L_straight lines. Since each set X̃_i,straight lines represents a one-dimensional line segment, quality(straight line) = 1, and since there exist two sets, quantity(straight line) = 2.
Figure 4.34: The center circle and four different corners (right outer, right inner, left inner, left outer) are recognized by the system. Although the shapes of the different corners are identical, the system is able to identify the position of a detected corner within the penalty area. Within one side of the playing field, each of the corners represents a unique feature, since its position with respect to the penalty area is simultaneously detected in the recognition process.

Figure 4.35: This figure shows the position of the reference marks considered in relation to the local geometry of the features (right inner/outer corner, center circle, left inner/outer corner).
In particular, the set L_f is the set of poses of reference marks in the global system. It specifies at which locations the feature f is instantiated in the line model. Figure 4.36 shows the sets L_center circle and L_left inner corner; the sets of the other three corners are defined accordingly.

Figure 4.36: The set L_f of global poses of the reference marks of each type of feature is shown.

When detecting a high-level feature, we actually infer the pose of its reference mark within the extracted field lines after having removed the distortion, that is, within the local coordinate system of the robot. Let us denote this detected pose of feature f by q_{r,f}. Then we compare q_{r,f} with the poses p_i ∈ L_f, and for each combination we determine a relative transformation M_i (a movement consisting of a translation and rotation, as defined on page 75) of the robot that matches the poses to each other. With x̂_{k−1} being the last estimate of the robot's pose, consider figure 4.37 for the calculation of this transformation:

q_f = x̂_{k−1} ∘ q_{r,f}   (4.113)
M_i = q_f ∘ (p_i ∘ (0, 0, π/2)ᵀ)   (4.114)

Here, q_f is the pose of the reference mark in the global system corresponding to the last estimated robot pose, which is going to be adjusted.
Figure 4.37: A high-level feature f is detected, the reference mark of which is at pose q_{r,f} in the local coordinate system corresponding to the robot's pose x̂_{k−1} at time step k−1. Then a pose p_i of the global model L_f is considered, and our goal is to calculate the movement M of the robot such that the perceived feature matches the model feature.

Finally, we take the smallest movement and define the corresponding robot pose to be the measurement vector of layer 3:

z_{k,recognition} = x̂_{k−1} + M_{i_smallest}.   (4.115)

Here, x̂_{k−1} is the last estimate of the robot's pose, and the smallest movement M_{i_smallest} is considered to be given in the global coordinate system. In order to determine the smallest movement M_{i_smallest}, we have to define a metric on the space of movements; thus, we have to relate a difference in orientation to a difference in translation. With M = (x, y, φ)ᵀ, we use the following distance function:

d(M) = √( x² + y² + (γ φ)² ),   (4.116)

with γ = 300 cm/π enforcing the influence of the angle, which is given in radians, while x and y are specified in centimeters. Thus, a rotation of 180 degrees is weighted as severely as a translation of about 300 centimeters.

This section described how the robot's pose can be inferred when a high-level feature is detected. The following sections are dedicated to the actual recognition process.
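Before turning to the recognition process itself, a compact sketch of this selection step (equations 4.115-4.116) is given below. The candidate movements M_i are assumed to have been computed beforehand, one per model pose p_i of the detected feature type; the representation of movements as (dx, dy, dphi) triples and the function names are assumptions.

```python
import math

GAMMA = 300.0 / math.pi   # cm per radian, from equation 4.116

def movement_norm(m):
    """Equation 4.116: d(M) with the orientation difference scaled by gamma."""
    dx, dy, dphi = m
    return math.sqrt(dx * dx + dy * dy + (GAMMA * dphi) ** 2)

def recognition_measurement(x_prev, candidate_movements):
    """Equation 4.115: add the smallest candidate movement to the last pose.

    candidate_movements -- one movement (dx, dy, dphi), given in the global
    frame, per model instance p_i of the detected feature type."""
    def normalized(m):
        dx, dy, dphi = m
        # wrap the angular part so that, e.g., +350 and -10 degrees compare equally
        dphi = (dphi + math.pi) % (2.0 * math.pi) - math.pi
        return (dx, dy, dphi)
    best = min((normalized(m) for m in candidate_movements), key=movement_norm)
    x, y, phi = x_prev
    return (x + best[0], y + best[1], phi + best[2])
```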
143 Smoothing the Lines After the field lines are extracted from the images, the actual feature recognition starts. It consists of approximately 20 steps, beginning with a smoothing operation, which is required to simplify the detection of local maxima in curvature later. Smoothing is efficiently done by procedure 4, which is listed in appendix A. Figure 4.38 shows the effect of smoothing. (a) (b) Figure 4.38: The lines (a) before and (b) after smoothing Splitting the Lines The field lines consist of curved and straight lines. For the further detection process, we want to classify the perceived line contours into these two categories. However, as depicted in figure 4.39, the perceived lines often consist of concatenated curved and straight segments. To classify them separately, we have to split the lines at the junctions. The latter coincide with points of local maximum curvature. We retrieve these points by first calculating a curvature measure for each point and then finding the local maxima. Although sophisticated methods exist to calculate curvature measures([6]), we have adopted a very simple approach for the sake of efficiency. For each point, we reference two other points, one in front of and one behind the current point. With these three points, we define two approximate tangent vectors, one reaching from the left to the center point and one from the center to the right point. Finally, we define the curvature at the center point to be the angle by which the first vector has to be rotated to equal the second vector, divided by the distance between the centers of both vectors. In order to be resistent 139
144 (a) (b) Figure 4.39: Often a perceived line consists of concatenated straight and curved segments as illustrated near (a) the center circle and (b) a quater circle at the corner of the playing field. against local noise in the curvature, we choose the enclosing points at some distance to the center point. Since all the points are approximately equally spaced, we can afford to use an index distance instead of a precise geometric distance. That is, for a point p[i] at index i, we choose the enclosing points to be p[i w] and p[i + w] (w = 4 in our implementation). This implies that the curvature can not be calculated for the first and last w points. The principle is illustrated in figure 4.40 and the corresponding pseudo code is given in procedure 5 (see appendix A). p[i] p[i-w] a b p[i+w] Figure 4.40: The curvature (dimension [1/cm]) at point p[i] is defined to be the angle α by which a has to be rotated to equal b, divided by the distance between the centers of both vectors. To detect local maxima of the curvature measure, we have adopted the approach given by procedure 6 in appendix A. While traversing the curvature values, we detect intervals 140
145 of values that exceed a given threshold and within each interval, we determine the index with the maximal value (see figure 4.41). To avoid extrema being too close together, a new interval is opened only if it is at least some given distance from the previous interval. curvature curvature threshold indices of local extrema index position Figure 4.41: The local maxima are found by searching for the highest values within intervals that exceed a given threshold. Figure 4.42 shows the locations of the split points for various lines. The splitting can efficiently be performed by alternating the entries in the incidence array. If line l is split at k locations, the corresponding entry I[l] is replaced by k + 1 new entries that specify the start and end points of each segment according to the location of the split points. Figure 4.42: Some examples that demonstrate the location of the split points. 141
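The curvature measure and the interval-based search for its local maxima might look as follows. The index offset w = 4 is the value from the text; this is a sketch with illustrative names, not a transcription of the appendix procedures.

```python
import math

W = 4  # index offset of the enclosing points

def curvature(points):
    """Signed angle between the two approximate tangent vectors at p[i],
    divided by the distance between their midpoints (dimension 1/cm).
    The first and last W points get curvature 0."""
    kappa = [0.0] * len(points)
    for i in range(W, len(points) - W):
        ax = points[i][0] - points[i - W][0]
        ay = points[i][1] - points[i - W][1]
        bx = points[i + W][0] - points[i][0]
        by = points[i + W][1] - points[i][1]
        angle = math.atan2(ax * by - ay * bx, ax * bx + ay * by)  # rotate a onto b
        mx_a = (points[i][0] + points[i - W][0]) / 2.0
        my_a = (points[i][1] + points[i - W][1]) / 2.0
        mx_b = (points[i][0] + points[i + W][0]) / 2.0
        my_b = (points[i][1] + points[i + W][1]) / 2.0
        dist = math.hypot(mx_b - mx_a, my_b - my_a) or 1.0
        kappa[i] = angle / dist
    return kappa

def local_maxima(kappa, threshold, min_gap):
    """Index of the largest |curvature| inside each run of values above the
    threshold; a new run is opened only min_gap indices after the previous one."""
    maxima, i, last_end = [], 0, -10**9
    while i < len(kappa):
        if abs(kappa[i]) > threshold and i - last_end >= min_gap:
            best = i
            while i < len(kappa) and abs(kappa[i]) > threshold:
                if abs(kappa[i]) > abs(kappa[best]):
                    best = i
                i += 1
            maxima.append(best)
            last_end = i
        else:
            i += 1
    return maxima
```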
146 Corner Detection This section describes a processing step which was applied in the system as it was in 2003, where the penalty area consisted only of one rectangle (see page 8). Although the processing step is not included in the actual system it might be interesting for researchers, who want to develop a similar feature detection method. The split points not only determine the locations to subdivide the lines into parts, but as a byproduct also can yield the first features. We denote a split point at which the curvature is approximately 90 as a corner feature and we can distinguish two different types depending on whether the curvature is positive or negative. Remember that the lines are detected by stepping around the boundary of the regions neighboring the fieldlines and that the sense of rotation is always the same (see page 63). Thus, in the case of a clockwise rotational direction, if the region has a convex boundary at the point where the corner feature has been detected, the curvature is negative, while it is positive if the boundary is concave. Hence, we denote the respective features as convex and concave corner features. Both cases are illustrated in figure As illustrated in figure concave corner > 0 < 0 convex corner (a) (b) Figure 4.43: Depending on the shape of the region s boundary, a (a) convex or (b) concave corner feature is detected. 4.44, convex corner features occur quite frequently. In contrast, concave corner features occur only at the corners of the penalty area, which are directed inwards into the field. Essentially, concave corner features should also arise at the outer corner markings of the playing field, but the corner posts occlude the lines and no continuous sequence of points is extracted at these locations. Therefore, detecting a concave corner feature gives a very strong hint to the robot s position on the playing field. Figure 4.45 gives an overview of the locations on the playing field where the respective features emerged, underlying a 142
147 playing field as it was defined by RoboCup in the only concave corner feature Figure 4.44: Convex corner features are drawn by dark blue corner arrows. The only detected concave corner feature is drawn by a longer light blue arrow. (a) (b) Figure 4.45: The locations of (a) concave and (b) convex corner features, underlying a field line structure as it was given in Padova
Classification

After the lines have been split, we want to classify each fragment as either straight or curved, and we employ the following test criterion: similar to the curvature measure above, we determine the angle between two vectors, but this time spanned by the first, the middle and the last point of the current line fragment. If the absolute value of the angle exceeds a threshold t_φ (t_φ = 0.4 radians in our application), then the line is declared to be curved, otherwise straight. Figure 4.46 shows a typical result and demonstrates that sometimes false classifications occur. The bent line passing along the corner of the penalty area has not been split because its curvature is too smooth and weak. Therefore, instead of yielding two straight fragments, the line has been classified as curved. This is wrong because the field markings of the penalty area do not contain any curved lines. However, identifying lines as curved serves primarily the detection of the center circle, and the corresponding detection process can cope with a small number of wrong classifications. The straight line fragments, the curved parts and the corner features serve as building blocks for more complicated features.

Constructing Arcs and Straight Lines

Previously, point sequences were classified into straight and curved segments. Next, we construct straight lines and circular arcs from the respective data. A straight line can be approximated by the start and end point of each point sequence. Similarly, one can simply construct an arc using the start, middle and end point by considering the perpendicular bisectors of the two segments they span: the bisectors intersect at the center of the circle that includes the arc (see figure 4.47). In order to verify whether the points really have the shape of a circular arc, we approximate the radius r by the mean distance of the points p_i from the center m:

r = (1/n) Σ_{i=0}^{n−1} ‖p_i − m‖,   (4.117)

where n is the number of points in the point sequence. Next, we compute the mean deviation from this radius. Finally, we relate this measure to the radius in order to allow larger arcs to have a higher variance. That is, the relative deviation σ_rel from the radius
r is

σ_rel = (1/r) √( (1/n) Σ_{i=0}^{n−1} ( ‖p_i − m‖ − r )² ).   (4.118)

Figure 4.46: Line fragments classified as curved are drawn as thick gray strokes, while straight fragments are depicted as thin black lines. As illustrated on the left, false classifications can occur. Points of curved sequences are not filled.

It seems that two loops are required, one for calculating r and one for calculating σ_rel, but this is not necessary. In terms of a random variable X being the distance of a point to the center, equation 4.117 is the expectation EX of X and equation 4.118 is the standard deviation of X divided by EX. That is,

σ_rel = √(Var X) / EX.   (4.119)
Figure 4.47: The three points p_0, p_{n/2} and p_{n−1} of a point sequence of length n form two line segments. The center point m of the circular arc is approximated by calculating the intersection point of the perpendicular bisectors of the segments.

Since Var X = E[(X − EX)²] = E[X²] − (EX)², we have

σ_rel = √( E[X²] − (EX)² ) / EX.   (4.120)

Thus, r and σ_rel can be computed efficiently within the same loop by accumulating EX and E[X²]. In figure 4.48, some examples of arcs with the respective values of σ_rel are depicted. We consider arcs for which σ_rel exceeds 0.3 as unreliable and discard them.

Figure 4.48: Some examples that illustrate the values of σ_rel for various curved point sequences. A precise arc with the estimated radius is shown for each point sequence. The closer the shape resembles the arc, the lower σ_rel is. Point sequences with σ_rel > 0.3 are discarded.
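The single-loop computation of r and σ_rel via E[X] and E[X²] (equation 4.120) could be written as follows; fit_arc_center in the usage comment stands for the perpendicular-bisector construction of figure 4.47 and is not shown here.

```python
import math

def arc_radius_and_deviation(points, m):
    """One pass over the point sequence: accumulate E[X] and E[X^2] of the
    center distances X, then return r = E[X] and
    sigma_rel = sqrt(E[X^2] - E[X]^2) / E[X]  (equation 4.120)."""
    n = len(points)
    s1 = s2 = 0.0
    for px, py in points:
        d = math.hypot(px - m[0], py - m[1])
        s1 += d
        s2 += d * d
    ex, ex2 = s1 / n, s2 / n
    var = max(ex2 - ex * ex, 0.0)            # guard against rounding below zero
    r = ex
    sigma_rel = math.sqrt(var) / ex if ex > 0 else float("inf")
    return r, sigma_rel

# Example usage (fit_arc_center is assumed):
#   r, sigma_rel = arc_radius_and_deviation(seq, fit_arc_center(seq))
#   keep = sigma_rel <= 0.3
```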
Figure 4.49: The goal is to detect the center circle. However, it can happen that arcs are detected which do not originate from a circle. Two typical cases are illustrated. On the left of figures (a) and (b) the original point sequences are depicted. Circular arcs are constructed from each curved point sequence, and the corresponding circle which includes the arc is depicted on the right. Gray circles have σ_rel > 0.3 and are discarded; circles of accepted arcs are painted black. Also, the two perpendicular bisectors that were used to calculate the center point of each arc (see figure 4.47) are depicted by thin straight lines. If circular arcs originate from the same circle, their center points lie close together.
Grouping Arcs and Detecting the Center Circle

Although the arc detection discussed above discards curved point sequences that do not have the form of a circular arc, arcs that are not part of the center circle can still emerge. Some of these spurious arcs are part of the quarter circles, but there are also cases where arcs emerge from straight lines due to noise, detection errors, or a badly calibrated distance function. Figure 4.49 shows some of these cases. To be robust against such outliers, we group arcs in order to determine whether several arcs support the same circle or whether only a single spurious arc has been detected. Grouping also allows a more precise detection of the overall circle.

Figure 4.50: In figure (a) the center points of all detected arcs are depicted by small disks filled with a smooth black-white transition. In figure (b) the resulting clusters are depicted, with the area of the disks indicating the weight of each cluster. All point sequences that contributed to the greatest cluster are painted red. A circle is constructed with the center being the center of the main cluster and the radius being the average distance of the red points from the cluster center. However, this circle is just a first approximation, and the solution will be refined by an iterative method later.
In order to group the arcs, we observe that the center points of arcs belonging to the same circle should lie close together. Therefore, we search for clusters of the center points of the arcs; we will refer to them simply as center points in the following description of the clustering algorithm. Initially, we place a cluster at the first center point. Then we traverse the remaining center points and calculate the respective distance to the cluster center. If the distance exceeds a variable threshold t_dist = 0.6r, where r is the radius of the current arc, we open a new cluster at the respective position. Otherwise, we adapt the cluster center to represent the weighted mean position of all assigned center points. Here, the weights are the lengths of the arcs, which we approximate by the number of points of the corresponding point sequences. When several clusters have been opened, a center point first determines the closest cluster and then verifies whether the distance exceeds t_dist. In this way, we obtain a set of clusters and choose the one with the greatest weight. If the weight is greater than a threshold t_w = 20, we consider the cluster to be the approximate center of a circle. To determine a first approximation of its radius, we calculate the mean distance of all points of the arcs that have contributed to the cluster. However, as illustrated in figure 4.50, the circle is just an approximation.

Up to now, the circle has emerged from point sequences that have been classified as curved. However, point sequences that originate from the center circle of the playing field could have been misclassified as straight and hence are not part of the points from which the circle is determined. We want to include these points before refining the initially found circle. Figure 4.51 shows an example of such lines that are part of the circle but have been classified as straight. Remember that for classification, the angle between the two vectors spanned by the first, middle and last point of each point sequence is used. If the absolute value of this angle exceeds a threshold t_φ = 0.4, the point sequence is considered to be curved, otherwise straight. Now, in the presence of misclassifications, one could assume that the threshold t_φ = 0.4 is too high. But when lowering the threshold, point sequences that emerge from straight lines are sometimes classified as curved. In fact, there is no threshold t_φ that ensures that all the data is classified correctly. Instead, an initial idea of the underlying structure of a point sequence has to be present in order to classify ambiguous data correctly.
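The weighted incremental clustering of the arc centers described above can be sketched as follows (a simplified C++ illustration; the data layout and names are assumptions, not the original implementation):

#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { double x, y; };

struct ArcInfo {
    Pt     center;    // estimated arc center m
    double radius;    // estimated radius r
    int    nPoints;   // length of the point sequence, used as weight
};

struct Cluster {
    Pt     center;            // weighted mean of the assigned arc centers
    double weight;            // sum of arc lengths (approximated by point counts)
    std::vector<int> arcs;    // indices of the contributing arcs
};

static double Dist(const Pt& a, const Pt& b) { return std::hypot(a.x - b.x, a.y - b.y); }

std::vector<Cluster> ClusterArcCenters(const std::vector<ArcInfo>& arcs)
{
    std::vector<Cluster> clusters;
    for (std::size_t i = 0; i < arcs.size(); ++i) {
        const ArcInfo& a = arcs[i];
        int best = -1; double bestDist = 1e30;
        for (std::size_t c = 0; c < clusters.size(); ++c) {      // closest existing cluster
            double d = Dist(clusters[c].center, a.center);
            if (d < bestDist) { bestDist = d; best = static_cast<int>(c); }
        }
        double tDist = 0.6 * a.radius;                           // variable threshold t_dist
        double w = static_cast<double>(a.nPoints);
        if (best < 0 || bestDist > tDist) {                      // open a new cluster
            clusters.push_back({a.center, w, {static_cast<int>(i)}});
        } else {                                                 // weighted mean update
            Cluster& c = clusters[best];
            c.center.x = (c.center.x*c.weight + a.center.x*w) / (c.weight + w);
            c.center.y = (c.center.y*c.weight + a.center.y*w) / (c.weight + w);
            c.weight  += w;
            c.arcs.push_back(static_cast<int>(i));
        }
    }
    return clusters;
}

The cluster with the greatest weight is then accepted if its weight exceeds the threshold t_w = 20, as described above.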
Figure 4.51: This figure shows the effect of misclassified point sequences on the detection of the center circle. In (a) several point sequences are classified as straight although they originate from the center circle. Since arcs are constructed only from curved sequences (b), the detected circle is imprecise (c).

Thus, before refining the solution, it is important to determine the point sequences that have been misclassified and to include the corresponding points when improving the solution.

Refining the Initial Solution of the Circle

We use the initial hypothesis of the circle to detect and reclassify ambiguous data. We consider all point sequences that have been classified as straight and investigate whether they could have been part of the hypothetical circle. Here, some tolerance is required, since the initial hypothesis is just an approximation of the real circle. Two conditions are verified. First, the distance from the center of the actual straight point sequence to the center of the hypothetical circle is compared with the radius of the circle. Second, the direction of the point sequence is compared with the tangent direction of the circle at the corresponding location. If the difference between the measured distance and the radius is more than half of the radius, we discard the point sequence; that is, we do not consider it to be a misclassified curved sequence that is part of the circle. Similarly, we discard the point sequence if the difference in direction is too large. In order to describe the conditions more precisely, let m be the center and r the radius of the initial circle. Furthermore, let p denote the center of the actual straight point sequence and let v be the directional vector of the point sequence, which is the vector from the
first to the last point of the point sequence. Then we consider the intersection point s of the circle with a ray beginning at the center of the circle and passing through p (see figure 4.52).

Figure 4.52: This figure illustrates the different symbols used in equations 4.121 to 4.125 to determine whether a point sequence belongs to the circle although it has been classified as straight. This step is important in order to refine the initial solution of the detected circle. The meaning of the symbols is explained in the text. Part (b) shows a close-up of the geometry around the point sequence in part (a).

Formally, s is given by
\[ s = m + r\,n, \tag{4.121} \]
where
\[ n = \frac{p - m}{\|p - m\|}. \tag{4.122} \]
The tangent direction t of the circle at point s is obtained by rotating n by 90 degrees. With $n = (n_x, n_y)^T$, we have
\[ t = \begin{pmatrix} -n_y \\ n_x \end{pmatrix}. \tag{4.123} \]
We discard the actual point sequence if
\[ \|p - s\| > 0.5\,r, \tag{4.124} \]
or if
\[ \frac{|t^T v|}{\|v\|} < \cos 25^\circ. \tag{4.125} \]
In this way, we detect additional point sequences that also belong to the circle. Thus, we have the points from which the initial circle was constructed plus the points added due to the initial hypothesis of the circle. Next, we adjust the position and radius of the circle to fit these points. Here, we use Landau's method [52], a simple method that iteratively adjusts the center and the radius of the circle. We restrict the number of iterations to 4, where the computational cost per iteration is linear in the number of points. Figure 4.53 illustrates a typical example, showing the initial and the refined circle.

Figure 4.53: The position and radius of the initial circle are refined. Points which are considered to originate from the circle are painted with a diagonal pattern. Initially (a), only the points of curved point sequences are considered to be part of the circle, and thus the initial circle is not optimal. In (b) additional points have been determined by identifying point sequences close to the initial circle. Then the initial circle is iteratively adjusted to the points by Landau's method [52]. Only a few iterations are required (two in this example).

Finally, we search for a straight line which passes roughly through the center of the circle. If such a line can be found, we consider the circle to be the center circle of the playing field.
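A compact sketch of the two tests of equations 4.121 to 4.125 (illustrative C++ with assumed names; it is not the code used on the robots):

#include <cmath>

struct Pt { double x, y; };

static const double kPi = 3.14159265358979323846;

// Tests whether a point sequence classified as straight could belong to the
// hypothetical circle (center m, radius r). p is the midpoint of the sequence,
// v the vector from its first to its last point.
bool BelongsToCircle(const Pt& m, double r, const Pt& p, const Pt& v)
{
    double dx = p.x - m.x, dy = p.y - m.y;
    double dist = std::hypot(dx, dy);
    if (dist < 1e-9) return false;
    double nx = dx / dist, ny = dy / dist;         // n = (p - m)/|p - m|
    double sx = m.x + r*nx, sy = m.y + r*ny;       // s = m + r n

    // First condition: the sequence must lie close enough to the circle.
    if (std::hypot(p.x - sx, p.y - sy) > 0.5*r) return false;

    // Second condition: its direction must roughly match the tangent t = (-n_y, n_x).
    double tx = -ny, ty = nx;
    double vlen = std::hypot(v.x, v.y);
    if (vlen < 1e-9) return false;
    double c = std::fabs(tx*v.x + ty*v.y) / vlen;
    return c >= std::cos(25.0*kPi/180.0);
}

Point sequences passing this test contribute their points to the subsequent refinement of the circle by Landau's method.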
Determining the Principal Directions

Straight field lines of the playing field are either parallel or perpendicular to each other. We want to determine these orientations in the extracted contours; we will refer to them as principal directions and determine them from the straight lines constructed previously. Typically, spurious straight lines are present, so we apply a clustering algorithm to cope with the outliers. For each line i, we calculate its orientation φ_i, normalizing the orientation to lie within [0, π]. Each line votes for the angles φ_i and φ_i + π/2; hence, each line votes for a pair of perpendicular orientations. Here, we apply the same clustering method as for the center circle, but this time working on one-dimensional values. We open a new cluster if the angular difference to the best cluster exceeds 0.3 radians. Otherwise, the cluster center is adjusted, with the weights being the lengths of the contributing lines. Finally, the cluster with the greatest weight determines the first principal direction ψ_0. The second principal direction ψ_1 equals the first, rotated by 90 degrees. In the following sections, we will consider two lines through the origin having the directions ψ_0 and ψ_1, respectively. We will refer to these lines as the principal axes a_0 and a_1. Figure 4.54 shows two examples.

Discarding Unreliable and Grouping Collinear Lines

Having determined the main axes a_0 and a_1, we consider three types of lines: those which are perpendicular to a_0, those which are perpendicular to a_1, and those whose orientation differs from both ψ_0 and ψ_1 by more than 0.3 radians. We consider the latter lines unreliable and discard them. The following step is performed for both a_0 and a_1 together with the respective perpendicular lines; therefore, we will simply write a instead of a_0 and a_1 in the following. Let L = {l_0, l_1, ..., l_n} denote the set of lines l_i which are perpendicular to a. Furthermore, let m_i be the midpoint of line l_i. We consider the orthogonal projections of all m_i onto the axis a. Lines that are collinear will yield close projection points. Since a passes through the origin, each projection point of m_i can be represented by a single value t_i, which is the distance of the projected point to the origin. This relationship is illustrated in figure 4.55. Collinear lines will have the same t_i values. However, since the lines are not precisely collinear, the t_i will differ slightly. Thus, to find groups of collinear lines that are perpendicular to a, we search for clusters of similar t_i values. Again, we apply the same clustering algorithm as described for the circle detection and the determination of the principal directions.
Figure 4.54: Two examples of the determination of principal directions are shown. The original line contours are painted gray. Straight lines are drawn as black arrows and the principal axes found are shown. As can be seen, the method is robust against outliers.

A new cluster is opened if the distance to an existing cluster is greater than 20 centimeters. Each cluster stores the lines that were assigned to it. Thus, each cluster represents a group of collinear lines which are perpendicular to the respective main direction. The lines within each group are replaced by a single line that encompasses the full range of the original lines. Finally, we sort the groups by their one-dimensional cluster centers t_j. Note that the difference t_j - t_k of two groups of parallel lines is simply the distance between the lines. By sorting by t_j, we obtain a topological order of the groups of collinear lines, which will be very useful for the detection of the penalty area and the corresponding corners. Figure 4.56(b) illustrates the results of this processing step for two examples.
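For illustration, the one-dimensional orientation clustering that yields the first principal direction ψ_0 might look as follows in C++ (the names and the incremental angular update are assumptions; only the thresholds and the voting scheme are taken from the text):

#include <cmath>
#include <cstddef>
#include <vector>

struct LineSeg { double phi; double length; };   // orientation in [0,pi), length used as weight

static const double kPi = 3.14159265358979323846;

// Smallest difference between two orientations that are periodic in pi.
static double AngDiff(double a, double b)
{
    double d = std::fmod(std::fabs(a - b), kPi);
    return std::min(d, kPi - d);
}

// Returns the first principal direction psi0; psi1 is psi0 rotated by 90 degrees.
double FindPrincipalDirection(const std::vector<LineSeg>& lines)
{
    struct Cluster { double angle; double weight; };
    std::vector<Cluster> clusters;

    for (const LineSeg& l : lines) {
        // Each line votes for a pair of perpendicular orientations.
        double votes[2] = { std::fmod(l.phi, kPi), std::fmod(l.phi + 0.5*kPi, kPi) };
        for (double v : votes) {
            int best = -1; double bestD = 1e30;
            for (std::size_t c = 0; c < clusters.size(); ++c) {
                double d = AngDiff(clusters[c].angle, v);
                if (d < bestD) { bestD = d; best = static_cast<int>(c); }
            }
            if (best < 0 || bestD > 0.3) {             // open a new cluster
                clusters.push_back({v, l.length});
            } else {                                    // weighted update of the cluster center
                Cluster& c = clusters[best];
                double delta = v - c.angle;             // wrap into (-pi/2, pi/2]
                while (delta >  0.5*kPi) delta -= kPi;
                while (delta <= -0.5*kPi) delta += kPi;
                c.angle  += delta * (l.length / (c.weight + l.length));
                c.weight += l.length;
            }
        }
    }
    double psi0 = 0.0, bestWeight = -1.0;
    for (const Cluster& c : clusters)
        if (c.weight > bestWeight) { bestWeight = c.weight; psi0 = c.angle; }
    return psi0;
}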
Figure 4.55: In figure (a), straight lines that are approximately perpendicular to the principal axis a are considered. Lines whose orientation differs by more than 0.3 radians from a perpendicular to the axis (e.g., line l_7) are discarded. The midpoints of the lines l_i are projected onto the principal axis a, yielding the values t_i. Clustering the t_i values yields groups of collinear lines at t_I and t_II. In figure (b), the lines in each group are replaced by a single line, l_I and l_II, respectively.
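The grouping of collinear lines illustrated in figure 4.55 can be sketched along the same lines (again illustrative C++ with assumed names; coordinates are assumed to be in meters so that the 20 cm threshold becomes 0.20):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { double x, y; };

struct LineGroup {
    double t;                   // 1-D cluster center along the principal axis
    double weight;              // sum of the lengths of the contributing lines
    std::vector<int> members;   // indices of the grouped lines
};

// Projects the midpoints of the lines perpendicular to the principal axis
// (unit direction axisDir through the origin) and clusters the projection
// values t_i; tMerge corresponds to the 20 cm threshold of the text.
std::vector<LineGroup> GroupCollinearLines(const std::vector<Pt>& midpoints,
                                           const std::vector<double>& lengths,
                                           const Pt& axisDir, double tMerge = 0.20)
{
    std::vector<LineGroup> groups;
    for (std::size_t i = 0; i < midpoints.size(); ++i) {
        double t = midpoints[i].x*axisDir.x + midpoints[i].y*axisDir.y;  // signed distance along a
        int best = -1; double bestD = 1e30;
        for (std::size_t g = 0; g < groups.size(); ++g) {
            double d = std::fabs(groups[g].t - t);
            if (d < bestD) { bestD = d; best = static_cast<int>(g); }
        }
        if (best < 0 || bestD > tMerge) {
            groups.push_back({t, lengths[i], {static_cast<int>(i)}});
        } else {
            LineGroup& g = groups[best];
            g.t = (g.t*g.weight + t*lengths[i]) / (g.weight + lengths[i]);
            g.weight += lengths[i];
            g.members.push_back(static_cast<int>(i));
        }
    }
    // Sorting by t yields the topological order exploited below for the penalty area.
    std::sort(groups.begin(), groups.end(),
              [](const LineGroup& a, const LineGroup& b){ return a.t < b.t; });
    return groups;
}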
Figure 4.56: The results of grouping straight lines are shown. Straight collinear lines are grouped and joined to form single straight lines that are perpendicular to the principal axes. They are topologically ordered along the respective principal direction. The ordering is shown by the dashed thin lines and the corresponding numbers. The original contours and discarded straight lines are painted gray.
Detecting the Corners of the Penalty Area

The rectangle marking the goal area and the rectangle marking the penalty area produce three lines parallel to the baseline. The lines are spaced at distances of 50 and 100 centimeters (see figure 4.57). Having grouped and sorted the sets of collinear lines, we can easily detect such a structure. Here, we allow a tolerance of 20 centimeters when verifying the distances. Note that the structure emerges at the start or end of the sequence of sorted collinear line groups, since the lines are the outermost existing lines. Having detected such a structure, the direction towards the goal is known, since the lines are spaced asymmetrically. Thus, we can distinguish between the left and right sides of the lines.

Figure 4.57: At the penalty area the three parallel lines h_0, h_1 and h_2 are spaced at distances of 50 and 100 centimeters. This structure can easily be recognized in the field lines.

Some additional constraints are necessary in order to avoid spurious detections. First, if we find three parallel lines that have the given structure in their distances, we calculate the overall length of the structure in the direction of the lines. This length should not exceed the length of the penalty area, which is five meters; we allow a tolerance of 50 centimeters. A second constraint is that no lines perpendicular to the three lines lie beyond the goal line. In order to find the respective corners, we simply verify whether we find perpendicular lines whose endpoints are close to the given lines h_1 and h_2.
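As an illustration of the spacing test, a hedged C++ sketch follows (names are assumptions; the order of the 50 cm and 100 cm gaps is deliberately left open here, since only their asymmetry is needed to decide on which side the goal lies):

#include <cmath>
#include <cstddef>
#include <vector>

// Checks whether the sorted projection values t (one per group of collinear
// lines, in meters) start or end with three lines whose gaps match the
// 50 cm / 100 cm structure of figure 4.57, with a 20 cm tolerance.
// Returns +1 if found at the start, -1 if found at the end, 0 otherwise.
int FindPenaltyLineTriple(const std::vector<double>& t)
{
    const double tol = 0.20;
    auto isTriple = [&](double g0, double g1) {
        bool a = std::fabs(g0 - 0.5) < tol && std::fabs(g1 - 1.0) < tol;
        bool b = std::fabs(g0 - 1.0) < tol && std::fabs(g1 - 0.5) < tol;
        return a || b;
    };
    if (t.size() < 3) return 0;
    if (isTriple(t[1] - t[0], t[2] - t[1])) return +1;
    std::size_t n = t.size();
    if (isTriple(t[n-2] - t[n-3], t[n-1] - t[n-2])) return -1;
    return 0;
}

The additional constraints on the overall length of the structure and on lines beyond the goal line would be checked afterwards, as described above.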
Figure 4.58: The corners are detected by identifying the line structure h_0, h_1, h_2 and searching for perpendicular lines that are close to the endpoints of h_1 and h_2. The vector v indicates the direction towards the goal.

Results of the Feature Detection

With our omni-directional vision system, the robot cannot see arbitrarily far, at least not with a sufficiently high resolution. Thus, it is interesting to examine the distances up to which the robot can detect features. To verify from which positions on the playing field each feature can be detected, we let the robot execute a specific trajectory, which scans
the field as illustrated in figure 4.59. Then we logged all locations where the respective features were detected and marked the corresponding position of the robot, as shown in figure 4.59. In this experiment, we removed all obstacles from the field. As can be seen, the robot can detect a high-level feature from all positions. However, the playing field in our laboratory is smaller than the playing field used in the competitions, and on the large playing field a few locations exist where the robot is not able to detect any high-level features. These locations lie between the center circle and the corners of the penalty area. Overall localization is not affected by this, because layer 1 and layer 2 are able to keep track of the localization without recognizing features.

Next, we consider how feature detection is influenced by occlusion. The most critical parts are the corners of the penalty area. If an obstacle stands directly on a corner, the corner is not detected. This is the reason why there are no markings in figure 4.59 at the locations of the corners: the robot itself occludes them while executing the trajectory. It might be possible to adjust the detection of the corners in a way that compensates even for direct occlusion. To do so, one must infer the corner position by calculating the intersection of two straight lines. However, the danger is that spurious corners arise from straight line segments. Thus, we decided to go for a more conservative detection algorithm that produces no false positives at the cost of sometimes not detecting a feature. Since there exist four different corners at the penalty area, it is extremely unlikely that all of them are occluded at the same time. Even if all four corners are occluded, the overall localization method is not affected, since the bottom two layers maintain localization without feature recognition once they have obtained a correct initial start position.

The detection of the center circle is extremely robust against occlusion. Even if only small parts of the circle are visible, the circle can be detected in most cases. Figure 4.60 illustrates different scenarios with obstacles occluding the center circle and shows under which circumstances the circle can be detected.
Figure 4.59: Each panel (left outer corners, right outer corners, center circle, left inner corners, right inner corners) shows the robot executing a trajectory which scans the field, with the positions marked from which the respective feature was detected.
Figure 4.60: The center circle is successively occluded by an increasing number of obstacles, and the processing steps for the center circle detection are illustrated (original contours, arc construction, extension, detection). The robot was unable to detect the center circle only in (g).
4.14 Results of the Overall Localization

In order to test the influence of different environments on localization, we examined two different playing fields. The first had a green carpet and artificial lighting, while the second had a reflective linoleum floor with natural light shining from one side through an array of windows. Since the reflections on the floor were almost white in the images, we reduced the influence of natural light by adding artificial light from above and using venetian blinds, although the blinds were not completely shut. On both playing fields, we let the robot automatically move on a trajectory forming a figure eight (see figure 4.61). The robot did not lose its position on either field. After some minutes we stopped the experiment. Moreover, when the robot was manually transferred to an unknown position, it immediately recovered its correct position after perceiving a feature. The maximum positional error while driving was about 10 cm.

Figure 4.61: The path shows a robot moving on the field along a predefined figure. As can be seen, the average deviation is approximately 10 cm for a robot driving at 0.8 m/s.

While the robot moved, we logged all extracted line contours in a file and later manually verified the feature recognition for all frames. Not a single false positive was detected on either field. Finally, the system was employed during our participation at the world championships 2004 in Lisbon. Here, the playing field had a size of 12 x 8 meters and we played more
than 10 games using up to 6 robots. During a game, localization is more difficult, since opponents and team robots are moving and continuously occlude different parts of the playing field. Nevertheless, localization worked very well and we could not observe any failures. In particular, when robots had to be taken out of the field due to hardware problems and were manually placed at a new location, the feature detection method proved to be valuable for initial localization. In all cases, the robots were able to localize after a very short time, typically below a second.
Chapter 5

Conclusions and Future Work

The Middle Size league of RoboCup was the underlying scenario for this thesis: fully autonomous robots perceiving their environment with an omni-directional video camera were built, and a computer vision system able to reliably localize the robots in real-time within this competitive environment was developed. All methods proposed in this thesis were applied at the world championships in Lisbon 2004 and were demonstrated to work, while staying both efficient and robust. All the software, including computer vision, motor control, communication, and behavior, runs on a single Pentium III 900 MHz processor. Region tracking is performed at 30 frames per second and feature recognition at 10 frames per second; this lower rate is not due to speed limitations, but solely because a higher rate is not necessary. Despite these high demands, the processor load stays below 80 percent. To the knowledge of the author, the developed system is the first vision system that performs real-time recognition of a whole palette of shape-based features and uses them for robust navigation, within a competitive and highly dynamic environment where occlusion frequently occurs.

The following sections try to put this work into a more global context and to indicate directions for further research. To identify required future research, we ask whether and how the developed vision system could be generalized: which changes would have to be made, and what is missing, to turn the specific vision system into a general vision system. Here, the demand is high: by a general vision system, we mean a system that allows one to do everything that a human can do by means of the visual sense: skiing, driving a car at night, reading a book, doing the dishes, and so on.
5.1 Considering the System Dynamics Approach

The system dynamics approach (SDA), which is due to Dickmanns [26] and Wünsche [86], is certainly an essential achievement. It has been applied in a variety of applications [8, 50, 59, 83, 90]. However, although the general framework is clear, many problems still have to be solved to produce a truly general vision system. These problems are described in the following sections.

Automatic Modeling

The most challenging problem is that of automatic modeling. Although the SDA has been applied in many different scenarios, the types of objects considered were always user-specified. For autonomous vehicles driving on a highway, the model of the highway and of other vehicles was predefined [27]. Although these models typically have some adaptable parameters, the scope of flexibility is very limited. In our RoboCup application, the model of the field lines is also pre-given and static. In the real world, there are many situations where unknown objects might appear. For instance, when a human hikes on a trail with stones and tree roots sprawling across the way, these objects can have an unpredictable shape. A general system must be able to build models of these objects online. Of course, some kind of templates for the models could exist. For instance, stones might be approximated by polyhedra; however, the number of faces and their geometric constellation would have to be determined online by the system. Also, different objects might need different types of modeling. For instance, when trying to build a rescue robot to extinguish fires, it is presumably inadequate to model the flames and fumes with polyhedra; in contrast to rigid objects, their form changes continuously.

When trying to build a framework that achieves automatic modeling, the focus should not be the creation of a large database of preexisting models; the complexity of such a database would explode, since there is an infinite number of possible objects. It would be better to have a method that is able to efficiently create the models online from the observations as soon as they are needed. Of course, background knowledge in a database could assist this process. A similar problem is currently investigated in the context of SLAM algorithms, which allow localization and map building at the same time [84]. Here, parameters of objects are included in the system state, and both the state values of the robot and the
environment are estimated at the same time. However, typically only one type of object (e.g., simple artificial beacons) is considered.

Similarly, the motion model of objects is sometimes not known a priori. In RoboCup, for instance, different teams have robots with completely different motion properties: some robots have omni-directional drives, others have a steerable front wheel with two rigid rear wheels that are unable to move sideways. Moreover, since the robots are autonomous and have a behavior that controls their movements, it is difficult to find an appropriate motion model. In the future, one could try to use several hypothetical motion models at the same time and choose the most likely one after having observed the behavior of the object for a while. This idea is similar to that of Monte Carlo Localization, where several hypotheses for the robot's position are considered until one main cluster crystallizes.

Automatic Learning of Feature Detectors

All applications of the system dynamics approach use a set of predefined feature detectors. However, once the problem of automatic modeling is solved and the type and appearance of objects become very flexible, the appearance of features can vary greatly. Thus, methods able to create appropriate detectors online have to be found. When a new type of object comes into sight, even new types of features might have to be invented, depending on the properties of the object.

The Problem of Feature Selection

Although Wünsche described a method of how to select appropriate features in [86], the underlying environment in which he developed the approach was very simple. A few polygonal, white objects against a clear black background were considered, and the visibility test of a feature consisted only of measuring whether the corresponding face of the feature was directed away from the observer. The problem is that if the scenario and the internal model of the environment are complex - for instance a 3D model with some hundreds of faces - the visibility test becomes computationally expensive. Furthermore, not only the visibility, but also whether and how a feature can be detected in an image has to be determined. Even if the visibility of a feature has been established, the kind of detector that is suitable must be calculated. If the background has the same color as the
object, it might not even be possible to detect it. One idea to test the visibility and to efficiently determine the appearance of a feature would be to use 3D rendering hardware to project the internal estimate of the model onto an artificial image plane. One could then use the rendered image to determine whether a feature is visible and how it appears. Another interesting approach would be to investigate how far the emerging research field of kinetic data structures [36] could be used to develop efficient methods for the visibility test. Here, the underlying idea is to develop data structures that are able to adapt efficiently to changing situations without the need to recompute the entire solution from scratch.

5.2 Top-Down Versus Bottom-Up Methods

There are two principles that have to be combined in a vision system. On the one hand, there are bottom-up methods that take solely the video stream as input, try to detect features (e.g., by applying a Canny edge detector), and then try to group the features and recognize objects or extract other information such as optical flow or disparity maps for stereo vision. In the extreme, these methods do not use prior knowledge about the domain of application. On the other hand, there are methods like the system dynamics approach, which use as much knowledge as possible to interpret visual information. They are much more efficient than bottom-up methods; however, their problem is that they need an initial estimate. Moreover, when run in isolation, they are not able to recover from catastrophic errors. Thus, both approaches must be combined: the bottom-up method creates hypotheses for features and objects, which can then in turn be tracked using top-down methods.

The proposed region tracking method contains elements of both principles. On one hand, the method uses no prior user-specified information about the domain (except the colors). Thus it is able to track the green regions and extract the field lines without having an initial estimate of the robot's pose. On the other hand, the method does use prior knowledge - it takes the results of the last frame as a starting point for the next. However, this knowledge is created by the method itself and not superimposed a priori. Having extracted the field lines, the feature recognition method also works without prior knowledge of the robot's pose. It is able to reliably yield an estimate of
the system state. Once an estimate is available, tracking and updating cannot be done more efficiently than with the system dynamics approach. Once the system has entered this top-down recursive estimation phase, the low-level bottom-up processes still have to run, since new objects could enter the field of sight.

Interestingly, the duality between bottom-up and top-down methods can be observed in some modern art paintings. For instance, in Dietrich Stalmann's painting in figure 5.1, a face is shown which is over-painted in a second layer with colored, artificial strokes. What is interesting is that the human visual system is able to perceive and operate with these strokes, although they can certainly not be attached to an existing model of an object; their form and constellation are new. Maybe this is the specific attraction of this painting: the background shows an object that is well known in its type, while at the same time a different layer exists which shows physically uninterpretable strokes. Although a painting is a static image, these two layers can be seen as representative of top-down methods like the system dynamics approach, and of bottom-up methods, which are required to initially detect and learn the shape and appearance of new objects and to create initial hypotheses for the top-down methods.

5.3 Criticizing the Proposed Feature Recognition Approach

In this thesis a constructive feature recognition approach was proposed, which allows the real-time recognition of a whole palette of different features. Here, the term constructive indicates the way in which recognition is performed: high-level features or objects, such as the center circle, are constructed from smaller parts, such as arcs. Although the method has proved to be robust and efficient, and although it improved the robustness of the overall localization significantly, it can be criticized: although the method uses prior information about the task domain, it does not exploit this knowledge to the degree it could. The reason is that the method does not make a hypothesis for the robot's position, but rather tries to construct features like the center circle without using this information. Significant improvements in efficiency could be made if hypotheses about the robot's position were made at a very early processing step. Then the knowledge about the task domain could be exploited to a higher degree. During the initial phase, or when
recovering from erroneous localization, several hypotheses (e.g., 6-12) about the robot's position could be considered in parallel, and feature recognition could be performed with each recognition process being able to fully exploit the available prior information.

Another problem is that the proposed feature recognition approach is very specialized to the geometry of the RoboCup field lines. Furthermore, the prior knowledge that is used during recognition is implicitly integrated into the source code. It is of utmost importance to develop methods that do not couple processing and prior knowledge in such an inflexible way. It would be desirable to have a method that takes a model of the environment as input and automatically organizes, and maybe even programs, the feature recognition in a way that is most efficient for the given domain.
Figure 5.1: This painting by Dietrich Stalmann consists of two different layers. The background layer shows a realistic human face. The foreground layer consists of colored strokes whose interpretation is difficult for a visual system: no physical objects exist that would give rise to such a constellation of strokes. However, the human visual system can detect them and operate with them, although they are completely new to the system. Thus, the two layers can be seen as representative of top-down methods like the system dynamics approach, which could be applied for tracking the face, and of bottom-up methods, which are able to perceive and learn unknown shapes and yield first hypotheses for objects.
5.4 The Problem of Light

Three influences determine the appearance of the color of an object in the image: the object itself, the illuminating light, and the receptors detecting the reflected light. Thus, from a single pixel in an image it is not known how much of its color is caused by the settings of the camera (hue, gamma, saturation, exposure, ...), by the object's surface, or by the light. A yellow sheet of paper, for instance, might be a white sheet of paper illuminated by a yellow lamp, or a yellow sheet of paper in daylight.

The overall physics involved in this process is complex: a light source emits electromagnetic radiation at different frequencies with different energy. The spectrum of emitted wavelengths varies with the angle of the rays leaving the light source. The radiation is then reflected, refracted, absorbed, and sometimes even re-emitted at different frequencies, and these properties again depend on the object's surface properties and on the angle of both the incoming light and the direction from which the object is viewed. The light enters the optical system of the camera, where lenses produce chromatic aberrations and other effects. Then it reaches the receptors, which have specific spectral characteristics; their responses depend on the power and frequencies of the stimulating light, and effects like oversaturation may appear. Finally, not a single object and light source is present, but rather a whole set of objects and light sources, including ambient light. Light reflected by one object can reach another object before it enters the camera.

The image in figure 5.2 is an example of how complex situations can be. Assume a vision system which should enable a robot to grasp one of the depicted objects in the showcase, and that the vision system is based on the system dynamics approach. According to the philosophy of the approach, not only a model of the objects, but also a model of the light sources has to be used. However, with the demand for flexibility, the system must be able to create the model by itself. That is, it has to infer the position, number and kind of light sources and the surface properties of the objects. The problem is that if the model is not detailed enough, many effects that are present in the image cannot be explained. On the other hand, if the model is precise, the number of parameters that have to be estimated will be huge. For instance, let us consider a physically inspired analytic reflectance model like that of [20]. Here, the model considers how the light is reflected from a single point on a surface, depending on the angle from which the point is illuminated and the angle from which it is viewed.
Figure 5.2: Silver objects and a glass vase are viewed through the glass door of a showcase. The striped cover of a sofa in the surroundings of the case is reflected in the glass, its colors blending with the colors of the interior objects. At the same time, the silver objects reflect their surroundings, which in total produces a rich spectrum of visual effects that are difficult to model and predict.

Without going into the details of the model, we look at a summary of the symbols used in table 5.1. As can be seen, a lot of parameters have to be specified in order to model the reflectance properties of a single point. Moreover, the considered model is just an approximation of the reflectance of real materials, and it describes only a subclass of materials. That is, different models would have to be applied for different materials, which first requires material identification. But even if modeling of the light sources and the surface properties can be achieved, it becomes difficult to predict the appearance of features in real-time. Assume that we want to track the silver cup at the bottom left of figure 5.2, and that we have determined a set of visible features, one of which is the right edge of the cup.
α      Angle between N and H
θ      Angle between L and H or V and H
λ      Wavelength
D      Facet slope distribution function
d      Fraction of reflectance that is diffuse
dω_i   Solid angle of a beam of incident light
E_i    Energy of the incident light
F      Reflectance of a perfectly smooth surface
f      Unblocked fraction of the hemisphere
G      Geometrical attenuation factor
H      Unit angular bisector of V and L
I_i    Average intensity of the incident light
I_ia   Intensity of the incident ambient light
I_r    Intensity of the reflected light
I_ra   Intensity of the reflected ambient light
k      Extinction coefficient
L      Unit vector in the direction of the light
m      Root mean square slope of facets
N      Unit surface normal
n      Index of refraction
R_a    Ambient light reflectance
R      Total bidirectional reflectance
R_d    Diffuse bidirectional reflectance
R_s    Specular bidirectional reflectance
s      Fraction of reflectance that is specular
V      Unit vector in direction of the viewer
w      Relative weight of a facet slope

Table 5.1: This table summarizes the symbols used in the physically-inspired reflectance model [20]. The length of the list indicates that a huge number of parameters has to be determined to model the interaction of light and objects in a physically-inspired way.

To apply an appropriate detector, we have to predict the appearance of the feature in the image. However, since the material is reflective, the appearance depends on what is reflected, that is, on other objects and light sources. Figure 5.3 shows a cut-out of the right side of the cup. As can be seen, the situation is even more complex, since the cup is viewed through a glass in the door, which itself reflects objects in the surroundings of the showcase. Horizontal stripes of a sofa blend over the side of the cup, and only a weak edge arising from the side of the cup can be seen. To predict this complex appearance, even modern rendering hardware would be too slow, since ray tracing methods would have to be applied. Finally, even if modeling and prediction of the appearance could be accomplished in real-time, it is doubtful whether the required precision could be achieved.

Of course, these considerations do not mean that the system dynamics approach could not be applied, but the question is how things should be modeled. Instead of using a physically deep model of the light and surface reflectance properties, one could try to use a model that is not physically inspired but can explain the appearance in a simpler way. For instance, one could model the surface of objects by several semi-transparent layers: the bottom layer for the intrinsic color or texture of an object, one layer for shading and illumination effects, one for specular effects, one for reflection, and so on. In a dynamic environment, specular highlights or shadows can move, even on a non-moving object.
Figure 5.3: The left image is a cut-out of figure 5.2, showing a silver cup within a showcase. Assume that the cup is tracked with the system dynamics approach and that a detector at the right side of the cup has been determined to be visible. A cut-out of this right side is shown in the right part of the figure. To select an appropriate detector, the appearance of the expected edge of the cup has to be predicted. However, this prediction will be very complicated, since the cup is viewed through a glass door, which itself reflects objects in the surroundings. At the right side of the cup, horizontal stripes can be seen that belong to a sofa located nearby.

However, instead of trying to explain these movements by other light sources and objects, one could try to create geometric and dynamic models of moving shadows and lights within the layers of the object's surface. Once this is accomplished, it is still possible to
apply deeper models, and the inference of the corresponding parameters will probably be easier after having first derived the parameters of the simpler model.
Chapter 6

Summary of Contributions

The goal of this thesis was to develop a method that enables autonomous mobile robots to localize themselves on a soccer playing field by perceiving the field lines with an omnidirectional vision system. In particular, the goal was to achieve real-time recognition of shape-based features like the center circle, corners and T-junctions. Besides this overall goal - the achievement of which was demonstrated at the world championships of the RoboCup MidSize league in 2004 - several advancements evolved from its pursuit:

The first is a new algorithm which is able to track large homogeneous regions very efficiently. It can track several regions at the same time and extract their boundary contours. The efficiency is accomplished by not processing the entire images but concentrating on parts where regions are expected. The algorithm exploits the fact that corresponding regions in successive images often overlap, and it extends the region growing paradigm: tracking is accomplished by alternating shrinking and growing. This idea is new and very general. A whole class of algorithms can be derived from it, and they are not restricted to image processing. When solutions to a problem have to be computed successively, and the setting of the problem changes continuously in time, the idea is to reduce an already computed solution to an intersection and to derive the new solution from the reduced one. In its concrete application to tracking regions, we have shown that the algorithm outperforms existing solutions.

The algorithm not only tracks the regions, but also yields their boundary contours. These can be used to apply local detectors in order to find features or objects close to the contours. The contours are given as lists of
points, and tangent and normal directions can easily and efficiently be calculated. Thus, feature detectors can be applied very selectively, and they have to be applied at only a few locations. For instance, when searching for edges, a detector which responds to edges having a predefined direction can be used.

The method also exposes a useful interface for higher-level algorithms. For instance, visual attention can be implemented by controlling which regions should be tracked; the primary benefit is that when a region is excluded from tracking, the algorithm is much faster, since fewer pixels are accessed in the images. Another interface is the extracted boundary contour. Since the algorithm is based on region growing, connected boundaries are guaranteed. The boundary can serve as a basis for the recognition of objects and movements, or as a reference into the images. In RoboCup, for instance, we track the regions of the green playing field, and the boundary helps us to find other objects on the playing field. This is because all objects on the playing field, like the ball and other robots, are next to the green regions in the images. Therefore, their boundaries can be used as a reference into the image indicating where to search for the objects, and the objects can be detected rapidly.

The second contribution builds on the first. It was shown how the region tracking algorithm can be used to extract the field lines of the RoboCup playing field, and an efficient algorithm was described which is able to recognize 6 different shape-based features in real-time. The recognition of further features can easily be integrated into the framework. To the knowledge of the author, it is the first system that allows robust real-time recognition of such a large number of complex features, with all the processing running on a single Pentium III 900 MHz processor. Besides the strengths of the approach, we have also described its weaknesses, and possibilities for further research were indicated.

Furthermore, it was shown how the feature-recognition process is combined with odometric information and a relative matching method in a three-layered localization framework. The relative matching method is also a new contribution: it uses a precalculated force field to register perceived field lines to a global model of the playing field. The approach, which is meanwhile known as the MATRIX method, has been adopted by several other teams in RoboCup.
Appendix A

Pseudo-code of Described Algorithms

A.1 Thinning

The idea of thinning a region is to successively remove pixels from its boundary under the constraint of maintaining its connectedness. Therefore, pixels that connect different segments of the area must not be removed. For a given pixel, we determine the number of segments it connects; if it connects only one segment, the pixel can be removed. To calculate this number, a simple observation is needed: when going once around the pixel and counting the number of transitions between pixels belonging to the area and pixels not belonging to the area, the number of transitions is twice the number of connected segments. This relationship is illustrated in figure A.1(a). When labeling the neighboring pixels according to figure A.1(b) and assuming 8-connectedness, the pseudo-code of procedure 3 performs the thinning. The symbol && expresses a logical AND operation and the symbol ! expresses a logical NOT operation. The sum of boolean values has to be interpreted as the sum of the respective values 0 or 1.
Figure A.1: Determining the crossing number.

Procedure 3 Thinning()
Input:  array I   the pixel array; the value 1 represents the area to thin, 0 the background
Output: array I   the image with the thinned regions
{
    bool finished;
    do{
        finished=true;
        for (all pixels p in I){
            if (p>0){
                Get the values of the 8 neighboring pixels into a1,a2,...,a8
                byte sigma=a1+a2+a3+a4+a5+a6+a7+a8;
                if (sigma>1){
                    byte chi=(a1!=a3)+(a3!=a5)+(a5!=a7)+(a7!=a1)+
                             2*((!a1&&a2&&!a3)+(!a3&&a4&&!a5)+
                                (!a5&&a6&&!a7)+(!a7&&a8&&!a1));
                    if (chi==2){
                        p=0;              //remove pixel
                        finished=false;
                    }
                }
            }
        }
    } while (!finished);
}

A.2 Smoothing the Line Contours

During feature recognition, the line contours are smoothed before splitting them at high-curvature points. We apply the following method:
Procedure 4 SmoothLine()
Input:  array p   the array of points of the actual line
        np        the number of points
Comments: V2 is the type for a 2D point and vector.
{
    float q_0 =0.5f;
    float q_m1=0.25f;
    float q_p1=0.25f;
    for (int i=1;i<np-1;i++){
        p[i]=q_m1*p[i-1] + q_0*p[i] + q_p1*p[i+1];
    }
}

A.3 Calculation of a Curvature Measure

During feature recognition, a curvature measure is calculated at each point of the line contours.

Procedure 5 CalculateLocalCurvatureMeasure()
Input:  array p   the array of points of the actual line
        np        the number of points
        whalf     the number of points to interleave in both directions from the actual point
                  for the calculation of the angle
Output: array curvature   the array to receive the curvatures
Comments: V2 is the type for a 2D point and vector.
          PolarAngleOf(v) yields the polar angle of vector v.
          AngleFromTo(alpha,beta) yields the angle (in [-pi,...,pi]) by which alpha has to be
          rotated to equal beta; the shortest direction of rotation is chosen.
{
    //Pad the first and last entries with zeros, since the curvature
    //of the corresponding points cannot be calculated...
    for (int k=0;k<whalf;k++)   curvature[k]=0;
    for (k=np-whalf;k<np;k++)   curvature[k]=0;

    //Calculate the curvatures for the middle part...
    for (int i=whalf;i<np-whalf;i++){
        int ia:=i-whalf;
        int ib:=i+whalf;
        V2 m :=p[i];
        V2 va:=m-p[ia];
        V2 vb:=p[ib]-m;
        curvature[i]:=AngleFromTo(PolarAngleOf(va),PolarAngleOf(vb));
    }
}

A.4 Extracting Local Maxima in Curvature

Once the curvature measure is calculated for all line contours, the goal is to identify points with local maxima in curvature. The following procedure performs this calculation:
Procedure 6 GetLocalCurvatureMaxima()
Input:  nmax        the maximum number of extrema the array pos can store
        curvature   the array of curvature values as computed by procedure 5
        nc          the number of curvature values (equals the number of points of the line)
        thres       the curvature threshold
        mindistance the minimal allowed index distance between two intervals
Output: pos           the array to receive the indices of the local maxima
        return value  the number of extrema
Comments: thres=0.5, mindistance=3 in our implementation
{
    float indexofactualextremum=-mindistance-1;   //the index of the actual extremum
    int   nextrema=0;                             //the number of extrema
    float bestcurvature=0;                        //the actual best curvature
    bool  foundanyextrema=false;

    for (int i=0;i<nc;i++){
        float aktcurvature=fabs(curvature[i]);
        if (aktcurvature>thres){                  //if the curvature exceeds the given threshold
            float dist=i-indexofactualextremum;
            if (dist>mindistance){
                //open a new extremum
                if (foundanyextrema){
                    if (nextrema<nmax) pos[nextrema++]=indexofactualextremum;
                }
                bestcurvature=aktcurvature;
                indexofactualextremum=i;
                foundanyextrema=true;
            }
            else{
                if (aktcurvature>bestcurvature){
                    bestcurvature=aktcurvature;
                    indexofactualextremum=i;
                }
            }
        }
    }
    if (foundanyextrema){
        if (nextrema<nmax) pos[nextrema++]=indexofactualextremum;
    }
    return nextrema;
}
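For orientation only, a hypothetical driver that chains the three procedures of this appendix for one extracted contour; the C++ signatures, the buffer sizes and the value of whalf are assumptions and not part of the original code:

// Hypothetical glue code; the call signatures below are assumed from the
// Input/Output descriptions of procedures 4 to 6 and may differ from the
// actual implementation.
void ProcessContour(V2* p, int np)
{
    const int   whalf       = 4;      // assumed half window for the curvature measure
    const float thres       = 0.5f;   // curvature threshold used by procedure 6
    const int   mindistance = 3;      // minimal index distance between maxima
    const int   nmax        = 64;     // assumed capacity of the result array

    static float curvature[4096];     // assumed to be large enough for one contour
    int splitIndices[64];

    SmoothLine(p, np);                                              // procedure 4
    CalculateLocalCurvatureMeasure(p, np, whalf, curvature);        // procedure 5
    int nSplits = GetLocalCurvatureMaxima(nmax, curvature, np,
                                          thres, mindistance,
                                          splitIndices);            // procedure 6
    // The contour is then split at splitIndices[0..nSplits-1] and every
    // fragment is classified as straight or curved as described in chapter 4.
    (void)nSplits;
}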
Appendix B

Source Code of the Region Tracking Algorithm

In the following sections the complete code listing of the proposed region tracking algorithm is given. The source code is written in C++ and does not require any further libraries. All supplementary data structures are listed to allow potential users to easily implement and test the algorithm.

The region tracker assumes that tracking takes place on a two-dimensional grid. This grid does not necessarily have to be an image. To adapt the tracker to a specific purpose, only one function, the homogeneity function, has to be written. This user-specified procedure is passed to the tracker during initialization using the concept of function pointers. The homogeneity function tells the tracker whether or not the cell (x, y) meets the homogeneity criterion; by providing the function, the homogeneity criterion is effectively defined. The tracker can work on single pixels, defining a certain color class as a homogeneity criterion, for instance, or it can work on blocks of pixels. The homogeneity criterion does not have to be based on color similarity; for instance, one could define homogeneity criteria based on texture within a block of pixels.
B.1 The Homogeneity Criterion

Assume that 24 bit RGB is used as image format and that a global pointer BYTE* pImageData is available, which points to the actual image. Then an example of a homogeneity function is:

DWORD IsGreen(int x, int y)
{
    DWORD bytetripleindex=(y*dx+x)*3;
    DWORD greenindex=bytetripleindex+1;
    BYTE  greenvalue=pImageData[greenindex];
    if (greenvalue>200)
        return 0x00000001;
    return 0;
}

Note that this function is just a pseudo-function; it is not the function which is used in our system. The homogeneity function does not have to care whether (x,y) is a valid image position; the tracker ensures that. One might wonder why the return value is a 32-bit value and not a boolean value. The reason is that the region tracker performs a bitwise AND operation between the classmask parameter passed at initialization and the mask returned by the homogeneity function to verify whether a specific location in the image meets the homogeneity criterion. The following pseudo-code gives another example of a homogeneity function, in the case of a tracker working on blocks of 8 x 8 pixels.

/////////////////////////////////////////////////////////////////////////////
DWORD getyuv422_88_mask(int x, int y)
{
    int x0=x*8;
    int y0=y*8;
    int greenpixelcounter=0;
    DWORD mask=0;
    for (int ix=0;ix<8;ix++){
        for (int iy=0;iy<8;iy++){
            int ax=x0+ix;
            int ay=y0+iy;
            if (the pixel at ax,ay is green){
                greenpixelcounter++;
            }
        }
    }
    if (greenpixelcounter>=12){
        mask |= 0x00000001;   //Assume, the first bit means "green"
    }
    return mask;
}
/////////////////////////////////////////////////////////////////////////////

B.2 The "Don't touch" Array

Sometimes one might want to forbid the tracker to track regions in certain areas within the image.
For instance, when using an omni-directional catadioptric setup, one might want to restrict the tracker to the area within the mirror. Thus, the user provides an array of the same size as the image (when working on pixels; otherwise the size of the grid). Each element is a byte, with the value being zero if the tracker is allowed to access the corresponding position and one if it is not allowed. The array is plain, that is, cell (x,y) is at index y*dx+x. A pointer to the array has to be provided during initialization.

B.3 Initializing the Region Tracker

The initialization of the tracker is described by means of an example in the following. Let us assume a resolution of 640 x 480 pixels with the video format being YUV422. Assume further that the tracker should work on units of 8 x 8 pixels. Then the initializing code looks as follows:

/////////////////////////////////////////////////////////////////////////////
#include "FU_RegionTracker.h"

int dx=640;
int dy=480;
int subsamplingfactor=8;
int unitdx=dx/subsamplingfactor;
int unitdy=dy/subsamplingfactor;

BYTE*pDT=new BYTE[unitdx*unitdy];   //The "don't touch" array allows to
                                    //define areas in the image, which should
                                    //not be accessed by the region tracker.
setmemory(pDT,unitdx*unitdy,1);     //Allow the tracker to work on the
                                    //whole image.
GetMaskFuncXY getmaskxy= getyuv422_88_mask;
                                    //The function pointer type GetMaskFuncXY is
                                    //defined in "FU_RegionTracker.h".
                                    //"getyuv422_88_mask" is your own function.
                                    //The region tracker will call this function
                                    //in order to access your images. A more
                                    //detailed description follows in the next
                                    //section.
DWORD classmask=0x00000001;         //Track regions of color class 1

CFU_RegionTracker* pregiontracker=new CFU_RegionTracker(unitdx,unitdy,pDT,getmaskxy,
                                                        subsamplingfactor,classmask);
/////////////////////////////////////////////////////////////////////////////

The following sections list the complete source code of the tracker, encompassing the following files:

B.4 FU_RegionTracker.h

This file, together with the corresponding cpp-file, implements the core of the region tracker.
Filename                      implements...
FU_RegionTracker.h, .cpp      the core class of the region tracker
FU_Contours.h, .cpp           the class storing the contour information
FV2.h, .cpp                   a small class implementing simple 2D vectors
UTIL_TopfX.h                  a fast queue for storing the drops
UTIL_Heap.h                   a heap structure which is used during initial seeding

Table B.1: The listed files, implementing the region tracker.

#ifndef FU_REGION_TRACKER
#define FU_REGION_TRACKER

#include "UTIL_Heap.h"
#include "FU_Contours.h"
#include "UTIL_TopfX.h"
#include "FV2.h"

#ifndef BYTE            //8 Bit
#define BYTE unsigned char
#endif

#ifndef DWORD           //32 Bit
#define DWORD unsigned long
#endif

typedef DWORD (*GetMaskFuncXY)(int x, int y);

//////////////////////////////////////////////////////////////////////
typedef union {
    struct{
        short x;
        short y;
    } s;
    DWORD value;
} DROP;

//////////////////////////////////////////////////////////////////////
typedef struct{
    int  x;
    int  y;
    BYTE dir;
} EdgePiece;

//////////////////////////////////////////////////////////////////////
class CSeedRegion
{
public:
    CSeedRegion(){};
    CSeedRegion(int xx,int yy, int ww):x(xx),y(yy),w(ww){};
    int x;
    int y;
    int w;
    bool operator<(CSeedRegion w2){return (w<w2.w);};
    bool operator>(CSeedRegion w2){return (w<=w2.w);};
    bool operator>=(CSeedRegion w2){return (w>=w2.w);};
    bool operator<=(CSeedRegion w2){return (w<=w2.w);};
    bool operator==(CSeedRegion w2){return (w==w2.w);};
    ~CSeedRegion(){};
};

//////////////////////////////////////////////////////////////////////
class CFU_RegionTracker
{
public:
    CFU_RegionTracker(int dx,int dy,BYTE*pDT_p,GetMaskFuncXY getmaskxy_p,
                      int subsamplingfactor_p,DWORD mask_p);
    virtual ~CFU_RegionTracker();
    void ReInit();
    bool GetCentreOfGravity(FV2 &c);
    void Track();

public:
    CFU_Contours boundaries;

private:
    void GetBoundaries();
    void SeedControl();
    bool FindNBestSeeders(int n);
    int  DetermineSizeOfRegion(int sx, int sy, DWORD mask);
    void Grow();
    void Shrink();
    bool DropNextToFrame(EdgePiece &f, int x, int y);
    bool GetUndeletedEdgePiece(EdgePiece &f);
    void FrameToPoints(IV2 &p0, IV2 &p1, EdgePiece *f);

private:
    bool bstarting;
    GetMaskFuncXY getmaskxy;   // the homogeneity function
    int   dx, dy;              // dimension of the image (in picture elements)
    int   ncells;              // = dx*dy
    BYTE* occu;                // array of size dx*dy to mark which picture elements are occupied by a drop
    int   gdx, gdy;            // dimension of the connectivity grid (gdx=dx+1; gdy=dy+1)
    int   nconcells;           // = gdx*gdy
    BYTE* con;                 // the connectivity grid: array of size gdx*gdy; it stores the boundaries of the regions
    BYTE* pdt;                 // "Don't Touch" array
    CUTIL_TopfX<DROP>* pcommondrops;
    int trackcounter;
    int id;
    CUTIL_Heap<CSeedRegion> seedheap;
    int  shiftx[4];
    int  shifty[4];
    int  conshiftx[4];
    int  conshifty[4];
    int  dropanddirtopointx[4];
    int  dropanddirtopointy[4];
    BYTE nextdir[4][3];
    DWORD mask;
    DWORD ioffirstundeleted;
    CUTIL_TopfX<DROP> growlist;
    CUTIL_TopfX<DROP> shrinklist;
    IV2 gravitysum;
    int nsize;
    int subsamplingfactor;
};

#endif

B.5 FU_RegionTracker.cpp

This file constitutes the core of the region tracker source code. Here the growing and the shrinking are implemented.

//////////////////////////////////////////////////////////////////////
// FU_RegionTracker.cpp
//////////////////////////////////////////////////////////////////////
#include "stdafx.h"
#include "FU_RegionTracker.h"
//////////////////////////////////////////////////////////////////////
/**
 * INPUT: int dx_p, dy_p : the size of the image being tracked; if you don't work on pixels but on blocks of pixels,
191 * dx_p,dy_p is the size of the subsampled grid * BYTE*pDT_p : the "Dont t Touch" array. The user has to create this array the size being dx*dy, * containing BYTEs indicating whether the tracker is allowed * to access the corresponding cell (value=0) or whether it is not allowed (value=1) * getmaskxy_p : a function pointer to the user specified homgeneity function * subsamplingfactor_p : the factor of subsampling, for instance when tracking on blocks of 4x4 pixels, * the factor is 4, when tracking on single pixels the factor is 1. * : This factor is required to produce contour coordinates which correspond to the plain pixels. * mask_p : The mask which is used to logical "AND"-combine the results of the homgeneity function. * Only regions with AND-combinations not equal to zero are tracked. */ CFU_RegionTracker::CFU_RegionTracker(int dx_p, int dy_p, BYTE*pDT_p, GetMaskFuncXY getmaskxy_p, int subsamplingfactor_p,dword mask_p) :seedheap(false),shrinklist(dx_p*dy_p),growlist(dx_p*dy_p) { id=1; //the id of the region mask=mask_p; //the mask of the region subsamplingfactor=subsamplingfactor_p; //the subsampling factor (the number of pixels per unit along the x, and y-axis, respectively) dx=dx_p; //the horizontal number of units (one unit has subsamplingfactor pixels in horizontal and vertical direction) dy=dy_p; //the vertical number of units (one unit has subsamplingfactor pixels in horizontal and vertical direction) ncells=dx*dy; occu=new BYTE[nCells]; gdx=dx+1; //the dimensions of the connectivity grid... gdy=dy+1; nconcells=gdx*gdy; con=new BYTE[nConCells]; //the connectivity grid getmaskxy=getmaskxy_p; //image access function pdt=pdt_p; if (!pdt){ //DT= "Dont Touch" //this array must have size dx*dy bytes. A cell different from zero indicates that the corresponding unit in the image //should not be covered by a region //error message return; //pdt must exist bstarting=true; dropanddirtopointx[0]=0; dropanddirtopointy[0]=1; dropanddirtopointx[1]=1; dropanddirtopointy[1]=1; dropanddirtopointx[2]=1; dropanddirtopointy[2]=0; dropanddirtopointx[3]=0; dropanddirtopointy[3]=0; shiftx[0]= 0 ; shifty[0]= 1; shiftx[1]= 1 ; shifty[1]= 0; shiftx[2]= 0 ; shifty[2]=-1; shiftx[3]=-1 ; shifty[3]= 0; nextdir[0][0]=3; nextdir[0][1]=0; nextdir[0][2]=1; nextdir[1][0]=0; nextdir[1][1]=1; nextdir[1][2]=2; nextdir[2][0]=1; nextdir[2][1]=2; nextdir[2][2]=3; nextdir[3][0]=2; nextdir[3][1]=3; nextdir[3][2]=0; conshiftx[0]= 1; conshifty[0]= 0; conshiftx[1]= 0; conshifty[1]=-1; conshiftx[2]= -1; conshifty[2]= 0; conshiftx[3]= 0; conshifty[3]= 1; trackcounter=0; pcommondrops=new CUTIL_TopfX<DROP>(dx*dy); gravitysum.x=gravitysum.y=0; boundaries.reservedefaultmemory(); ReInit(); ////////////////////////////////////////////////////////////////////// void CFU_RegionTracker::ReInit() { ZeroMemory(con,sizeof(con[0])*nConCells); ZeroMemory(occu,sizeof(occu[0])*nCells); 187
192 nsize=0; gravitysum.x=gravitysum.y=0; growlist.clean(); shrinklist.clean(); boundaries.reset(); ////////////////////////////////////////////////////////////////////// CFU_RegionTracker::~CFU_RegionTracker() { delete [] pcommondrops; delete [] con; delete [] occu; ////////////////////////////////////////////////////////////////////// /** * You might modify the beginning of the function. Currently a complete reinitialization * is performed every 300 frames. In some applications you might not need any reinitialization. */ void CFU_RegionTracker::Track() { if (bstarting){ if ((trackcounter%30)==0) ReInit(); bstarting=false; else{ if ((trackcounter%300)==0) ReInit(); SeedControl(); Shrink(); Grow(); GetBoundaries(); trackcounter++; ////////////////////////////////////////////////////////////////////// void CFU_RegionTracker::SeedControl() { bool foundanyseedpixel=false; if (growlist.isempty()&&shrinklist.isempty()){ if (FindNBestSeeders(10)){ foundanyseedpixel=true; ////////////////////////////////////////////////////////////////////// bool CFU_RegionTracker::FindNBestSeeders(int n) { seedheap.clear(); for (int i=0;i<ncells;i++){ if (!occu[i]&&!pdt[i]){ DWORD c=getmaskxy(i,0); if (c&mask){ int ax=i%dx; int ay=i/dx; int asize=determinesizeofregion(ax,ay,mask); CSeedRegion p(ax,ay,asize); seedheap.insert(p); ZeroMemory(occu,dx*dy); if (seedheap.getsize()){ for (int i=0;i<ncells;i++){ occu[i]&=0x7f; bool foundany=false; for (int k=0;k<n;k++){ if (seedheap.getsize()){ CSeedRegion p=seedheap.extracttop(); DWORD o=p.y*dx+p.x; occu[o]=id; DROP d; d.s.x=(short)p.x; 188
193 d.s.y=(short)p.y; growlist.append(d); nsize++; gravitysum.x+=p.x; gravitysum.y+=p.y; foundany=true; else{ break; Grow(); GetBoundaries(); return foundany; ////////////////////////////////////////////////////////////////////// int CFU_RegionTracker::DetermineSizeOfRegion(int sx, int sy, DWORD mask) { int masse=0; long o=sy*dx+sx; occu[o] =0x80; masse++; CUTIL_TopfX<DROP>*pTopf=pCommonDrops; DROP d;d.s.x=sx;d.s.y=sy; ptopf->append(d); while(!ptopf->isempty()) { DROP d=ptopf->extractlast(); long x=d.s.x; long y=d.s.y; for (long k=0;k<4;k++){ long ax=x+shiftx[k]; long ay=y+shifty[k]; o=ay*dx+ax; if ((ax>=0&&ax<dx&&ay>=0&&ay<dy)&&(!pdt[o])&&(occu[o]==0)){ DWORD amask=getmaskxy(ax,ay); if (amask&mask){ occu[o] =0x80; masse++; DROP d; d.s.x=ax; d.s.y=ay; ptopf->append(d); return masse; ////////////////////////////////////////////////////////////////////// void CFU_RegionTracker::Grow() { CUTIL_TopfX<DROP>*pResultingDrops=&shrinkList; CUTIL_TopfX<DROP>*pDropPool=&growList; DWORD o; while(!pdroppool->isempty()) { DROP d=pdroppool->extractlast(); bool generated=false; for (long k=0;k<4;k++){ long ax=d.s.x+shiftx[k]; long ay=d.s.y+shifty[k]; o=ay*dx+ax; bool rand=false; if (ax<0) rand=true; else if (ay<0) rand=true; else if (ay>=dy) rand=true; else if (ax>=dx) rand=true; if (!rand) rand=pdt[o]; DWORD amask=0; if (!rand){ if (!occu[o]){ amask=getmaskxy(ax,ay); 189
194 if (amask&mask){ occu[o]=id; DROP nd; nd.s.x=ax; nd.s.y=ay; pdroppool->append(nd); gravitysum.x+=ax; gravitysum.y+=ay; nsize++; else{ rand=true; else{ if (occu[o]!=id) rand=true; if (rand){ generated=true; int gx=d.s.x+dropanddirtopointx[k]; int gy=d.s.y+dropanddirtopointy[k]; DWORD o0=gy*gdx+gx; BYTE conmask=1<<k; DWORD color=color; con[o0] =conmask; if (generated) { presultingdrops->append(d); ////////////////////////////////////////////////////////////////////// void CFU_RegionTracker::Shrink() { CUTIL_TopfX<DROP>*pResultingDrops=&growList; CUTIL_TopfX<DROP>*pDropPool=&shrinkList; DWORD o; //Clearing for (int i=0;i<pdroppool->n;i++){ DROP d=pdroppool->getat(i); o=d.s.y*dx+d.s.x; occu[o]=0; gravitysum.x-=d.s.x; gravitysum.y-=d.s.y; nsize--; while(!pdroppool->isempty()) { DROP d=pdroppool->extractlast(); DWORD ao=d.s.y*dx+d.s.x; DWORD amask=getmaskxy(d.s.x,d.s.y); if (amask&mask){ presultingdrops->append(d); else{ bool generated=false; for (long k=0;k<4;k++){ long ax,ay; ax=d.s.x+shiftx[k]; ay=d.s.y+shifty[k]; o=ay*dx+ax; if ((ax>=0&&ax<dx&&ay>=0&&ay<dy)&&(!pdt[o])&&(occu[o]==id)){ occu[o]=0; DROP nd; nd.s.x=ax; nd.s.y=ay; gravitysum.x-=ax; gravitysum.y-=ay; nsize--; pdroppool->append(nd); 190
195 //Setting for (i=0;i<presultingdrops->n;i++){ DROP d=presultingdrops->getat(i); o=d.s.y*dx+d.s.x; occu[o]=id; gravitysum.x+=d.s.x; gravitysum.y+=d.s.y; nsize++; ////////////////////////////////////////////////////////////////////// bool CFU_RegionTracker::DropNextToFrame(EdgePiece &f, int x,int y) { int gx[4]; int gy[4]; gx[0]=x;gy[0]=y+1; gx[1]=x+1;gy[1]=y+1; gx[2]=x+1;gy[2]=y; gx[3]=x;gy[3]=y; for (int i=0;i<4;i++){ BYTE mask=1<<i; DWORD to=gy[i]*gdx+gx[i]; if (con[to]&mask){ f.x=x; f.y=y; f.dir=i; return true; return false; ////////////////////////////////////////////////////////////////////// bool CFU_RegionTracker::GetUndeletedEdgePiece(EdgePiece &f) { DWORD i=ioffirstundeleted; DWORD n=shrinklist.n; DROP*p=shrinkList.pMem; while(i<n){ if (DropNextToFrame(f,p[i].s.x,p[i].s.y)){ ioffirstundeleted=i; return true; i++; return false; ////////////////////////////////////////////////////////////////////// void CFU_RegionTracker::GetBoundaries() { CFU_Contours&b=boundaries; b.reset(); EdgePiece f; IV2 p0,p1; ioffirstundeleted=0; int everyx=2; while(getundeletededgepiece(f)){ FrameToPoints(p0,p1,&f); int ldir=f.dir; bool found; int counter=0; do{ if ((counter%everyx)==0){ b.point[b.npoints].x=p0.x*subsamplingfactor; b.point[b.npoints].y=p0.y*subsamplingfactor; b.npoints++; DWORD o0=p0.y*gdx+p0.x; found=false; 191
196 if (con[o0]){ for (int i=0;i<3;i++){ BYTE testdir=nextdir[ldir][i]; BYTE mask=1<<testdir; if (con[o0]&mask){ p1.x=p0.x+conshiftx[testdir]; p1.y=p0.y+conshifty[testdir]; found=true; BYTE burner=~mask; con[o0]&=burner; ldir=testdir; break; p0=p1; counter++; while(found); b.inci[++b.nlines]=b.npoints; b.smoothandgrowinsitu(1.0f); ////////////////////////////////////////////////////////////////////// void CFU_RegionTracker::FrameToPoints(IV2&p0,IV2&p1, EdgePiece *f) { switch(f->dir){ case 0: p0=iv2(f->x,f->y+1); p1=iv2(f->x+1,f->y+1); break; case 1: p0=iv2(f->x+1,f->y+1);p1=iv2(f->x+1,f->y); break; case 2: p0=iv2(f->x+1,f->y); p1=iv2(f->x,f->y); break; case 3: p0=iv2(f->x,f->y); p1=iv2(f->x,f->y+1); break; ////////////////////////////////////////////////////////////////////// bool CFU_RegionTracker::GetCentreOfGravity(FV2 &c) { if (!nsize) return false; c.x=gravitysum.x/(float)nsize; c.y=gravitysum.y/(float)nsize; c*=subsamplingfactor; return true; B.6 FV2.h This file defines a simple 2D point or vector. It is used to represent the extracted field contours. ////////////////////////////////////////////////////////////////////// #ifndef FV2_H #define FV2_H #define FLOAT float ////////////////////////////////////////////////////////////////////// /*The class for a floating point 2D point or vector*/ class FV2 { public: FV2(); FV2(float x,float y); ~FV2(); FV2 operator-(fv2&v); FV2 operator+(fv2&v); FV2 operator*(float l); FV2 operator/(float l); FLOAT operator*(fv2 v); FV2 operator-(); void operator /=(FLOAT f); void operator *=(FLOAT f); 192
197 void operator +=(FV2&v); void operator -=(FV2&v); FV2 operator~(); //yields the orthogonal vector (left turning) public: FLOAT x; FLOAT y; ; ////////////////////////////////////////////////////////////////////// FV2 operator*(float f,fv2 v); ////////////////////////////////////////////////////////////////////// /*The class for an integer 2D point or vector*/ class IV2 { public: IV2(); IV2(int x,int y); ~IV2(); public: int x; int y; ; ////////////////////////////////////////////////////////////////////// #endif B.7 FV2.cpp This file implements a simple 2D point or vector. It is used to represent the extracted field contours. #include "stdafx.h" #include "FV2.h" ////////////////////////////////////////////////////////////////////// // Konstruktion/Destruktion ////////////////////////////////////////////////////////////////////// FV2::FV2() { ////////////////////////////////////////////////////////////////////// FV2::FV2(FLOAT x,float y):x(x),y(y) { ////////////////////////////////////////////////////////////////////// FV2::~FV2() { ////////////////////////////////////////////////////////////////////// FV2 FV2::operator*(FLOAT l) { return FV2(x*l,y*l); ////////////////////////////////////////////////////////////////////// FV2 FV2::operator/(FLOAT l) { return FV2(x/l,y/l); ////////////////////////////////////////////////////////////////////// FV2 FV2::operator -() { return FV2(-x,-y); ///////////////////////////////////////////////////////////////////////// void FV2::operator +=(FV2&v) { x+=v.x; y+=v.y; ///////////////////////////////////////////////////////////////////////// void FV2::operator -=(FV2&v) { x-=v.x; y-=v.y; 193
198 ///////////////////////////////////////////////////////////////////////// void FV2::operator /=(FLOAT f) { x/=f; y/=f; ///////////////////////////////////////////////////////////////////////// void FV2::operator *=(FLOAT f) { x*=f; y*=f; ///////////////////////////////////////////////////////////////////////// FLOAT FV2::operator*(FV2 v) { return x*v.x+y*v.y; ////////////////////////////////////////////////////////////////////// FV2 FV2::operator~() { return FV2(-y,x); ////////////////////////////////////////////////////////////////////// FV2 FV2::operator-(FV2&v) { return FV2(x-v.x,y-v.y); ////////////////////////////////////////////////////////////////////// FV2 FV2::operator+(FV2&v) { return FV2(x+v.x,y+v.y); ////////////////////////////////////////////////////////////////////// FV2 operator*(float f,fv2 v){ return FV2(v.x*f,v.y*f); ////////////////////////////////////////////////////////////////////// ////////////////////////////////////////////////////////////////////// ////////////////////////////////////////////////////////////////////// IV2::IV2() { ////////////////////////////////////////////////////////////////////// IV2::IV2(int x,int y):x(x),y(y) { ////////////////////////////////////////////////////////////////////// IV2::~IV2() { B.8 FU Contours.h This file defines a class to represent the extracted boundary contours. In each frame, the tracker yields new field contours. The field contours are defined as an array of points and an incidence structure that partitions the points into segments. #ifndef FU_CONTOURS_H #define FU_CONTOURS_H #include "FV2.h" #define MAX_CONTOUR_POINTS #define MAX_FU_LINES 2000 /** CFU_Contours stores several closed curves (referred to as lines in the source code), each consisting of a sequence of points. */ class CFU_Contours { public: int GetNumberOfLines(); 194
199 ; int GetNumberOfPoints(); void Reset(); CFU_Contours(); virtual ~CFU_Contours(); void ReserveDefaultMemory(); void AccessLine(FV2*&p,int&nP,int iline); void SmoothAndGrowInsitu( float extent); void Draw(void*pDrawingContext); FV2 *point; int *inci; int npoints; int nlines; #endif B.9 FU Contours.cpp This file implements the class representing the extracted boundary contours. In each frame, the tracker yields new field contours. The field contours are defined as an array of points and an incidence structure that partitions the points into segments. ////////////////////////////////////////////////////////////////////// #include "Stdafx.h" #include "FU_Contours.h" //#include "..\..\Common\ImageWnd.h" extern class CImageWnd*gpImageWnd; ////////////////////////////////////////////////////////////////////// CFU_Contours::CFU_Contours() { npoints=0; nlines=0; inci=0; point=0; ////////////////////////////////////////////////////////////////////// CFU_Contours::~CFU_Contours() { if (point) delete [] point; if (inci) delete [] inci; ////////////////////////////////////////////////////////////////////// /**Reset the contours, that is clear all points and lines. *Prepare the data structure for storing new contour data. */ void CFU_Contours::Reset() { npoints=0; nlines=0; if (inci) inci[0]=0; ////////////////////////////////////////////////////////////////////// /**Reserve memory for MAX_FU_LINES lines and MAX_CONTOUR_POINTS points*/ void CFU_Contours::ReserveDefaultMemory() { point=new FV2[MAX_CONTOUR_POINTS]; inci= new int[max_fu_lines+1]; ////////////////////////////////////////////////////////////////////// /**Draw the contours. Add your own source code here.*/ void CFU_Contours::Draw(void*pDrawingContext) { //add system dependent drawing code// CDC*pDC=(CDC*)pDrawingContext; for (int l=0;l<nlines;l++){ 195
200 int np; FV2*p; AccessLine(p,nP,l); for (int i=0;i<np;i++){ //add system dependent drawing code:// CPoint sp=gpimagewnd->worldtoscreen(v2(p[i].x,p[i].y)); if (!i){ //add system dependent drawing code:// pdc->moveto(sp); else{ //add system dependent drawing code:// pdc->lineto(sp); ////////////////////////////////////////////////////////////////////// /**Smooth and enlarge the lines *Input: extent : the amount of how much the lines are enlarged * If this value is zero, the lines are just smoothed. *Remarks: Each line is a closed curve. For enlarging the lines each * point is moved in direction of the normal at that point. * This function should be optimized. Identical caluclations * are performed several times. */ void CFU_Contours::SmoothAndGrowInsitu( float extent) { for (int i=0;i<nlines;i++){ FV2*p; int np; AccessLine(p,nP,i); for (int i=0;i<np;i++){ int im_2=(i+np-2)%np; //Indices to neighboring points... int im_1=(i+np-1)%np; int ip_1=(i+np+1)%np; int ip_2=(i+np+2)%np; FV2 v_m1=p[i]-p[im_1];//tangent vectors,... FV2 v_m2=p[i]-p[im_2]; FV2 v_p1=p[ip_1]-p[i]; FV2 v_p2=p[ip_2]-p[i]; //Calculate the smoothed tangent vector... FV2 v=0.1f*v_m1+0.4f*v_m2+ 0.4f*v_p1+ 0.1f*v_p2; //Vector v should be normalized here //However, the points have approximately the same distance. //Thefore, instead of an expensive normalization to length one, //we multipy the following constant, assuming an average distance //of 8 pixels between the points. v*=0.125f; //The function also will work, if the assumption is not true, //since it can be tuned with the parameter "extent". FV2 n=~v*extent; //~v yields the vector orthogonal to v (left turning) //Calcualte the new smoothed point p[i]=0.25*p[im_1]+0.25*p[ip_1]+0.5*(p[i]+n); ////////////////////////////////////////////////////////////////////// /**Call this function to access the points of line "iline". *Input: * iline : the zero-based index of the line *Output by reference: * * p : a reference to a pointer which will receive the * adrress of the first point of the line * np : a reference to an interger which receives the * number of points of line "iline". * *Example: Draw the points of the third line * * FV2*p; 196
201 * int np; * AccessLine(p,nP,2); * for (int i=0;i<np;i++){ * DrawPoint(p[i].x,p[i].y); * */ void CFU_Contours::AccessLine(FV2 *&p, int &np, int iline) { int i0=inci[iline+0]; int i1=inci[iline+1]; np=i1-i0; p=&point[i0]; ////////////////////////////////////////////////////////////////////// /**returns the number of lines*/ int CFU_Contours::GetNumberOfLines() { return nlines; ////////////////////////////////////////////////////////////////////// /**returns the number of points*/ int CFU_Contours::GetNumberOfPoints() { return npoints; B.10 UTIL HEAP.h This file defines a template class for a heap structure. It is used, when the tracker performs initial seeding. Here, all regions are determined and inserted in the heap, with the largest regions being on top of the tree structure. #ifndef UTIL_HEAP #define UTIL_HEAP //////////////////////////////////////////////////////////////////////////////////////////// //This template class provides a simple heap, whose maximum size is //fixed. //We made this limitation because some dynamic data structures, i.e. STL, //are often not fully platform independent. //Replace the array "T*a" by a dynamic data structure to have //a dynamic heap. //////////////////////////////////////////////////////////////////////////////////////////// template <class T>class CUTIL_Heap { public: T*a; int ne; int maxsize; bool blowontop; //actual number of elements //maximum number of elements //lowest or greatest element at top? ///////////////////////////////////////////// CUTIL_Heap(bool blowontop_p=true,int maxsize_p=10000) { maxsize=maxsize_p; blowontop=blowontop_p; ne=0; a=new T[maxSize]; ; ////////////////////////////////////////////// ~CUTIL_Heap(){delete [] a;; ///////////////////////////////////////////// inline int GetSize() {return ne; ///////////////////////////////////////////// inline void Clear() { ne=0; inline int LeftSon(int i) { return (i<<1)+1; inline int RightSon(int i) { return (i<<1)+2; inline int GetFather(int i){ return (i-1)/2 ; ///////////////////////////////////////////// bool IsHeapAt(int i) { 197
202 int l=leftson(i); int r=rightson(i); int n=getsize(); if (blowontop){ if (l<n){ if (a[l]<a[i]) { return false; if (r<n){ if (a[r]<a[i]) { return false; else{ if (l<n){ if (a[l]>a[i]) { return false; if (r<n){ if (a[r]>a[i]) { return false; return true; ////////////////////////////////////////////////////////////////////// bool CheckHeap(){ for (int i=0;i<getsize();i++){ if (!IsHeapAt(i)) return false; return true; ////////////////////////////////////////////////////////////////////// void Heapify(int i){ int l=leftson(i); int r=rightson(i); int extrem; int n=getsize(); if (blowontop){ if (l<n&&a[l]<a[i]) extrem=l; else extrem=i; if (r<n&&a[r]<a[extrem]) extrem=r; else{ if (l<n&&a[l]>a[i]) extrem=l; else extrem=i; if (r<n&&a[r]>a[extrem]) extrem=r; if (extrem!=i) { T tmp=a[i]; a[i]=a[extrem]; a[extrem]=tmp; Heapify(extrem); ///////////////////////////////////////////// void BuildHeap(){ int n=ne; if (n<=1) return; int lastparent=getfather(n-1); for (int i=lastparent;i>=0;i--){ Heapify(i); 198
203 ///////////////////////////////////////////// T ExtractTop(){ int n=ne; if (n<1){ //should not happen return T(); T top=a[0]; a[0]=a[n-1]; ne--; Heapify(0); return top; ///////////////////////////////////////////// T RemoveAt(int i){ int n=ne; T t=a[i]; a[i]=a[n-1]; a.removeat(n-1); if (i!=n-1){ i=climb(i); Heapify(i); return t; ///////////////////////////////////////////// void ReplaceTop(T e){ if (!ne) { //should not happen: return; a[0]=e; ne--; Heapify(0); ///////////////////////////////////////////// void Insert(T key){ if (ne>=maxsize) { //Error message: OVERFLOW return; a[ne++]=key; int i=ne-1; if (i==0) return; int f=getfather(i); if (blowontop){ while (i>0&&a[f]>key){ a[i]=a[f]; i=f; if (i!=0) f=getfather(i); else{ while (i>0&&a[f]<key){ a[i]=a[f]; i=f; if (i!=0) f=getfather(i); a[i]=key; ///////////////////////////////////////////// int Climb(int i){ if (i==0) return 0; T key=a[i]; int f=getfather(i); if (blowontop){ while (i>0&&a[f]>key){ a[i]=a[f]; 199
204 i=f; if (i!=0) f=getfather(i); else{ while (i>0&&a[f]<key){ a[i]=a[f]; i=f; if (i!=0) f=getfather(i); a[i]=key; return i; ; //////////////////////////////////////////////////////////////////////////////////////////// #endif B.11 UTIL TOPFX.h This file implements an efficient queue for the drops. It takes advantage of the fact, that the maximum number of elements in the queue cannot exceed the number of cells. #ifndef UTIL_TOPFX #define UTIL_TOPFX template <class ElementTyp> class CUTIL_TopfX { public: CUTIL_TopfX(long nnmax){nmax=nnmax;n=0;size=sizeof(elementtyp);pmem=new ElementTyp[nmax];; virtual ~CUTIL_TopfX(){delete [] pmem;; void Append(ElementTyp el); void Delete(long i); void DeleteLast(); bool IsEmpty(){return n==0;; void Clean(){n=0; ElementTyp GetAt(long i); ElementTyp GetLast(); ElementTyp ExtractLast(); inline GetIndexOfLast(){return n-1;; ElementTyp* GetAndAppend(); long nmax; long n; long size; ElementTyp*pMem; ; ////////////////////////////////////////////////////////////////////////// template <class ElementTyp> void CUTIL_TopfX<ElementTyp>::Append(ElementTyp el) { pmem[n]=el; n++; ////////////////////////////////////////////////////////////////////////// template <class ElementTyp> void CUTIL_TopfX<ElementTyp>::Delete(long i) { pmem[i]=pmem[n-1] n--; ////////////////////////////////////////////////////////////////////////// template <class ElementTyp> void CUTIL_TopfX<ElementTyp>::DeleteLast() { n--; ////////////////////////////////////////////////////////////////////////// template <class ElementTyp> ElementTyp CUTIL_TopfX<ElementTyp>::ExtractLast() { ElementTyp a=pmem[n-1]; 200
    n--;
    return a;
}
//////////////////////////////////////////////////////////////////////////
template <class ElementTyp>
ElementTyp CUTIL_TopfX<ElementTyp>::GetAt(long i)
{
    return pMem[i];
}
//////////////////////////////////////////////////////////////////////////
template <class ElementTyp>
ElementTyp CUTIL_TopfX<ElementTyp>::GetLast()
{
    return pMem[n-1];
}
//////////////////////////////////////////////////////////////////////////
template <class ElementTyp>
ElementTyp* CUTIL_TopfX<ElementTyp>::GetAndAppend()
{
    return &pMem[n++];
}

#endif
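To show how the listings above fit together at run time, here is a minimal, hypothetical per-frame driver, assuming the initialization from Section B.3. It is not part of the original sources: CaptureNextFrame and ProcessContour are placeholders for the application's own frame grabbing and contour consumption, and pRegionTracker is assumed to be the tracker created in Section B.3. Only Track, GetCentreOfGravity, boundaries, GetNumberOfLines and AccessLine come from the listed classes.

//////////////////////////////////////////////////////////////////////
// Hypothetical per-frame driver for the region tracker. The extern
// declarations are placeholders, not part of the original sources.
//////////////////////////////////////////////////////////////////////
#include "FU_RegionTracker.h"

extern bool CaptureNextFrame();                    // assumed: grabs the next frame
extern void ProcessContour(const FV2* p, int nP);  // assumed: consumes one closed contour
extern CFU_RegionTracker* pRegionTracker;          // created as shown in Section B.3

void RunTracking()
{
    while (CaptureNextFrame()) {
        // One call per frame: seeds (if necessary), shrinks, grows and
        // extracts the region boundaries.
        pRegionTracker->Track();

        // Centre of gravity of the tracked region, in pixel coordinates.
        FV2 cog;
        if (pRegionTracker->GetCentreOfGravity(cog)) {
            // ... e.g. pass cog.x, cog.y on to the localization module ...
        }

        // Iterate over the extracted closed contours.
        CFU_Contours& b = pRegionTracker->boundaries;
        for (int l = 0; l < b.GetNumberOfLines(); l++) {
            FV2* p;
            int  nP;
            b.AccessLine(p, nP, l);                // points of contour l
            ProcessContour(p, nP);
        }
    }
}
//////////////////////////////////////////////////////////////////////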
The Physics and Math of Ping-pong and How It Affects Game Play. By: Connor Thompson & Andrew Johnson
The Physics and Math of Ping-pong and How It Affects Game Play 1 The Physics and Math of Ping-pong and How It Affects Game Play By: Connor Thompson & Andrew Johnson The Practical Applications of Advanced
Introduction to the Smith Chart for the MSA Sam Wetterlin 10/12/09 Z +
Introduction to the Smith Chart for the MSA Sam Wetterlin 10/12/09 Quick Review of Reflection Coefficient The Smith chart is a method of graphing reflection coefficients and impedance, and is often useful
Development of Docking System for Mobile Robots Using Cheap Infrared Sensors
Development of Docking System for Mobile Robots Using Cheap Infrared Sensors K. H. Kim a, H. D. Choi a, S. Yoon a, K. W. Lee a, H. S. Ryu b, C. K. Woo b, and Y. K. Kwak a, * a Department of Mechanical
FRC WPI Robotics Library Overview
FRC WPI Robotics Library Overview Contents 1.1 Introduction 1.2 RobotDrive 1.3 Sensors 1.4 Actuators 1.5 I/O 1.6 Driver Station 1.7 Compressor 1.8 Camera 1.9 Utilities 1.10 Conclusion Introduction In this
RF Measurements Using a Modular Digitizer
RF Measurements Using a Modular Digitizer Modern modular digitizers, like the Spectrum M4i series PCIe digitizers, offer greater bandwidth and higher resolution at any given bandwidth than ever before.
Monash University Clayton s School of Information Technology CSE3313 Computer Graphics Sample Exam Questions 2007
Monash University Clayton s School of Information Technology CSE3313 Computer Graphics Questions 2007 INSTRUCTIONS: Answer all questions. Spend approximately 1 minute per mark. Question 1 30 Marks Total
Multi-ultrasonic sensor fusion for autonomous mobile robots
Multi-ultrasonic sensor fusion for autonomous mobile robots Zou Yi *, Ho Yeong Khing, Chua Chin Seng, and Zhou Xiao Wei School of Electrical and Electronic Engineering Nanyang Technological University
USE OF A DIGITAL SURVEY VEHICLE FOR PAVEMENT CONDITION SURVEYS AT AIRPORTS. Paul W. Wilke, P.E. Applied Research Associates, Inc.
USE OF A DIGITAL SURVEY VEHICLE FOR PAVEMENT CONDITION SURVEYS AT AIRPORTS Paul W. Wilke, P.E. Applied Research Associates, Inc. August 11, 2014 Words: 3,280 Figures: 3 Tables: 0 Photographs: 3 ABSTRACT
Buffer Operations in GIS
Buffer Operations in GIS Nagapramod Mandagere, Graduate Student, University of Minnesota [email protected] SYNONYMS GIS Buffers, Buffering Operations DEFINITION A buffer is a region of memory used to
TRIMBLE ATS TOTAL STATION ADVANCED TRACKING SYSTEMS FOR HIGH-PRECISION CONSTRUCTION APPLICATIONS
TRIMBLE ATS TOTAL STATION ADVANCED TRACKING SYSTEMS FOR HIGH-PRECISION CONSTRUCTION APPLICATIONS BY MARTIN WAGENER APPLICATIONS ENGINEER, TRIMBLE EUROPE OVERVIEW Today s construction industry demands more
Introduction to Pattern Recognition
Introduction to Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University [email protected] CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)
Fourth generation techniques (4GT)
Fourth generation techniques (4GT) The term fourth generation techniques (4GT) encompasses a broad array of software tools that have one thing in common. Each enables the software engineer to specify some
Mobile Robotics I: Lab 2 Dead Reckoning: Autonomous Locomotion Using Odometry
Mobile Robotics I: Lab 2 Dead Reckoning: Autonomous Locomotion Using Odometry CEENBoT Mobile Robotics Platform Laboratory Series CEENBoT v2.21 '324 Platform The Peter Kiewit Institute of Information Science
Mobile robots. Structure of this lecture. Section A: Introduction to AGV: mobile robots. Section B: Design of the ShAPE mobile robot
slide n. 1 Mobile robots Development of the ShAPE mobile robot Ing. A.Tasora Dipartimento di Ingegneria Industriale Università di Parma, Italy [email protected] slide n. 2 Structure of this lecture Section
Collision Prevention and Area Monitoring with the LMS Laser Measurement System
Collision Prevention and Area Monitoring with the LMS Laser Measurement System PDF processed with CutePDF evaluation edition www.cutepdf.com A v o i d...... collisions SICK Laser Measurement Systems are
Computers Are Your Future. 2006 Prentice-Hall, Inc.
Computers Are Your Future 2006 Prentice-Hall, Inc. Computers Are Your Future Chapter 3 Wired and Wireless Communication 2006 Prentice-Hall, Inc Slide 2 What You Will Learn... ü The definition of bandwidth
Reflection and Refraction
Equipment Reflection and Refraction Acrylic block set, plane-concave-convex universal mirror, cork board, cork board stand, pins, flashlight, protractor, ruler, mirror worksheet, rectangular block worksheet,
VECTORAL IMAGING THE NEW DIRECTION IN AUTOMATED OPTICAL INSPECTION
VECTORAL IMAGING THE NEW DIRECTION IN AUTOMATED OPTICAL INSPECTION Mark J. Norris Vision Inspection Technology, LLC Haverhill, MA [email protected] ABSTRACT Traditional methods of identifying and
Sensor Integration in the Security Domain
Sensor Integration in the Security Domain Bastian Köhler, Felix Opitz, Kaeye Dästner, Guy Kouemou Defence & Communications Systems Defence Electronics Integrated Systems / Air Dominance & Sensor Data Fusion
Optical Illusions Essay Angela Wall EMAT 6690
Optical Illusions Essay Angela Wall EMAT 6690! Optical illusions are images that are visually perceived differently than how they actually appear in reality. These images can be very entertaining, but
Study of the Human Eye Working Principle: An impressive high angular resolution system with simple array detectors
Study of the Human Eye Working Principle: An impressive high angular resolution system with simple array detectors Diego Betancourt and Carlos del Río Antenna Group, Public University of Navarra, Campus
A Hybrid Software Platform for Sony AIBO Robots
A Hybrid Software Platform for Sony AIBO Robots Dragos Golubovic, Bo Li, and Huosheng Hu Department of Computer Science, University of Essex Wivenhoe Park, Colchester CO4 3SQ, United Kingdom {dgolub,bli,hhu}@essex.ac.uk
Tips and Technology For Bosch Partners
Tips and Technology For Bosch Partners Current information for the successful workshop No. 04/2015 Electrics / Elektronics Driver Assistance Systems In this issue, we are continuing our series on automated
The process components and related data characteristics addressed in this document are:
TM Tech Notes Certainty 3D November 1, 2012 To: General Release From: Ted Knaak Certainty 3D, LLC Re: Structural Wall Monitoring (#1017) rev: A Introduction TopoDOT offers several tools designed specifically
