Real-time pedestrian detection in FIR and grayscale images


Real-time pedestrian detection in FIR and grayscale images

Dissertation submitted for the degree of Doktor-Ingenieur (Dr.-Ing.) at the Fakultät für Elektrotechnik und Informationstechnik of the Ruhr-Universität Bochum

Submitted by Aalzen Jaap Wiegersma, Drachten

Bochum, 2006


Contents

1 Introduction
   1.1 Problem description
   1.2 Object detection in computer vision
       1.2.1 Segmentation
       1.2.2 Classification
       1.2.3 Tracking
   1.3 Related work in pedestrian detection
       1.3.1 Initial detection in grayscale images
       1.3.2 Initial detection in FIR images
       1.3.3 Classification
       1.3.4 Tracking
   1.4 Selected approach
       1.4.1 Image preprocessing
       1.4.2 Initial detection
       1.4.3 Classification
       1.4.4 Tracking
   1.5 Contributions
   1.6 Outline of this thesis

2 Initial detection
   2.1 Image pre-processing
       2.1.1 Image enhancement
       2.1.2 Image features
   2.2 Gradient based segmentation
       2.2.1 Grouping vertical gradients
       2.2.2 Scanning a vertical edge detector through the whole image
   2.3 Temperature segmentation in FIR images

3 Classification
   3.1 Classifiers
       3.1.1 Neural networks
       3.1.2 Support vector machines
   3.2 Image features for classification
       3.2.1 Rectangle features
       3.2.2 Histograms of gradients and orientations features
   3.3 Training and validation of the classifier
       3.3.1 Creating training datasets
       3.3.2 Training neural networks
       3.3.3 Training support vector machines
   3.4 Optimization of the classification
       3.4.1 Bootstrapping
       3.4.2 Component classification
       3.4.3 A simplified approximation to the support vector machine decision rule

       3.4.4 Feature selection
             Principal component analysis
             Adaboost
             Multi-objective optimization
   3.5 Scanning a classifier through the image at every position and scale

4 Tracking
   4.1 Tracking using the Hausdorff distance
   4.2 Mean shift tracking
   4.3 Tracking with Condensation
   4.4 Integrating initial detections through time
   4.5 Tracking the classification output

5 Experimental results
   5.1 Initial detection
       5.1.1 Results on FIR images
       5.1.2 Results on grayscale images
   5.2 Classification
       5.2.1 Results on FIR images
       5.2.2 Results on grayscale images
       5.2.3 Results of component classification
       5.2.4 Results of classifier optimization on classification performance
       5.2.5 Results of classifier optimization on classification speed

   5.3 Scanning a classifier through the image at every position and scale
   5.4 Tracking
       5.4.1 Results on FIR images
       5.4.2 Results on grayscale images

6 Discussion
   6.1 Achievements and limitations
       6.1.1 Initial detection
       6.1.2 Classification
       6.1.3 Tracking
       6.1.4 Complete system performance
   6.2 Further work
   6.3 Summary and conclusions

List of Figures

1.1 Examples of pedestrian detection
2.1 Rectangle filters
2.2 Calculation of the rectangle sum
2.3 Image gradients and energy
2.4 Grouping vertical gradients in a FIR image
2.5 Grouping vertical gradients in a grayscale image
2.6 Use of the gradient sign for initial detection
2.7 Combining regions of interest
2.8 A scanning window
2.9 Scanning an edge detector through the image at every position and scale
2.10 Scanning an edge detector through the image at every position and scale
2.11 Temperature segmentation in FIR images
2.12 Combining regions of interest
Rectangle features for classification
Calculation of image features in subregions
Subregions for component classification

Classifying the whole image at every scale and resolution
Model contour of the Condensation tracker
Integrating initial detections through time
Matching ground truth data
Suitability of initial detections for classification
FIR images at different temperature ranges
Initial detection results
Percentage positive detections
Calculation times of initial detection for FIR images
Illumination changes in grayscale images
Initial detection results in grayscale images
Calculation times of initial detection, grayscale
An example ROC curve
Results of support vector machine classification, orientations features
Results of support vector machine classification, gradients and orientations features
Results of support vector machine classification, rectangle features
Results of neural network classification, orientations features
Results of neural network classification, gradients and orientations features
Results of neural network classification, rectangle features

Processing times, feature vector calculation
Classification times of classifier/image feature combinations
Classification results, grayscale images, orientations features
Classification results, grayscale images, gradients and orientations features
Classification results, grayscale images, rectangle features
Results of component classification
Results of PCA feature selection
Results of Adaboost feature selection
Results of multi-objective optimization feature selection
Classification times of optimization algorithms
Classification times of optimization algorithms
Pedestrian tracking in FIR images
Processing times, tracking algorithms
Difficulties for initial detection because of vertical structures in the background
Difficulties for initial detection because of many vertical structures in the background
Pedestrian pushing a stroller
A group of pedestrians in a FIR image
Persistence of false positives over time
Example of the Hausdorff tracker in FIR images

Tracking a car in FIR images with the Hausdorff tracker
Tracking a car in grayscale images with the Hausdorff tracker

List of Algorithms

1 Connected components labeling
2 Grouping vertical gradients
3 Scanning a vertical edge detector
4 Temperature segmentation in FIR images
5 Adaboost (algorithm from [39])
6 Assignment of a rank to an individual (algorithm from [16])
7 Crowding-distance-assignment (algorithm from [16])
8 Scanning a classifier through the image at every scale and location
9 The Hausdorff tracker
10 Mean shift tracking
11 The Condensation algorithm
12 Initial detection tracker
13 Tracking the classification output

Chapter 1
Introduction

1.1 Problem description

This thesis describes a complete system for the detection of pedestrians in monocular camera images and monocular FIR (far infrared) images recorded from a moving car. The objective of this work is to have a system that can warn the driver of the car when a pedestrian crosses the street. Two examples of a street scene and the output of the detection system are shown in figure 1.1. There are a few important properties such a system should have:

(a) FIR pedestrian detection (b) Pedestrian detection in a grayscale image
Figure 1.1: Examples of pedestrian detection.

- It should be very reliable with respect to false detections; it should have a high detection rate and a low false alarm rate.
- It should be very reliable with respect to environment conditions; it should be able to operate in a complex city environment as well as on a country road.
- It should be very reliable with respect to weather conditions; it should be able to operate under various weather conditions, for example high temperatures, low temperatures, rain, and snow.
- It should be able to operate reliably in a dynamic environment; both the car and the pedestrians are usually moving.
- It should be able to operate in real-time on moderate hardware. The aim of the system presented in this work is to run at at least 20 frames per second on a 1 GHz desktop processor. This processing speed is roughly what can be expected to be available for driver assistance systems in cars in the coming years.

Developing a system which satisfies all of these properties is not trivial. Therefore, the system described here is divided into four components:

- An image pre-processing component which calculates relevant features from the image.
- An initial detection component which generates regions of interest in the image. Several initial detection routines can be used for different environment conditions. The initial detection routine can also have several parameter setups for different weather conditions.

- A classification component which determines which of the regions of interest generated by the initial detection routine contain pedestrians. The quality of the classification component influences the number of false detections of the system.
- A tracking component which keeps track of pedestrians in successive frames once they have been positively classified by the classification component. The tracking component should be able to handle the motion of the car and the motion of the pedestrian. The tracking component makes it possible to stabilize detections over time and to predict the time of contact in case of a possible collision.

This thesis describes a system that can be used for pedestrian detection in grayscale camera images and for pedestrian detection in far infrared images. The FIR detection system described here is used by BMW as a reference system for testing commercial systems.

1.2 Object detection in computer vision

Computer vision research is applied in fields of computer science like robotics and medical image analysis, in industrial applications like driver assistance systems and assembly, and in law enforcement applications like face detection and fingerprint comparison. This section gives a short overview of how the methods in this thesis relate to other fields of computer vision.

1.2.1 Segmentation

The purpose of segmentation is to extract interesting regions from the image. An example of segmentation in an image recorded from a medical scanner is locating the exact boundaries of an organ in the image.

An example of segmentation in pedestrian detection in FIR images is locating bright image regions. In this thesis, the term initial detection is used for segmenting regions of interest from the image. The remainder of this section describes segmentation methods which are applied in various fields of computer vision.

Color is a cue which can be used for segmentation, for example by grouping pixels of similar color value into a region. An application of color segmentation is for example a robot vision system which segments objects on a table for grabbing. An example of a color based segmentation method is graph based segmentation. In graph based segmentation, the image to be segmented is represented as a graph $G = (V, E)$ where each node $v_i \in V$ corresponds to a pixel in the image, and the edges $(v_i, v_j) \in E$ connect pairs of neighboring pixels. Each edge in the graph has a weight $w((v_i, v_j))$ which is a measure of dissimilarity in color between the two pixels connected by that edge. A segmentation S is a partition of V into components where each component $C \in S$ is a connected component in a graph $G' = (V, E')$, where $E' \subseteq E$. The edges between two pixels inside a component should have relatively low weights and the edges between pixels in different components should have relatively high weights. An example of a graph based segmentation method is [18].

Motion also provides information that can be used for segmentation. One motion based segmentation method for static cameras is background subtraction. In background subtraction, an image of the background is stored. In successive camera frames, the current image is subtracted from the stored image. A difference between the current image and the stored image indicates that something has moved at the pixels in the image where the difference is unequal to zero. Usually, the stored image of the background needs to be updated frequently because of changes in illumination conditions over time. Applications of background subtraction are for example people detection in security camera images and traffic monitoring.
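The background update is often implemented as a running average over frames. The following is a minimal sketch of this idea (assuming grayscale NumPy frames; the adaptation rate and threshold are illustrative values, not taken from this thesis):

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    # Running average: slowly adapt the stored background to the
    # current frame so that gradual illumination changes are absorbed.
    return (1.0 - alpha) * background + alpha * frame.astype(np.float64)

def foreground_mask(background, frame, threshold=25.0):
    # Pixels whose difference to the background exceeds the threshold
    # are marked as moving foreground.
    return np.abs(frame.astype(np.float64) - background) > threshold
```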

Optic flow [24] is the apparent motion of brightness in the image. Like background subtraction, optic flow can be used for object segmentation. If the image intensity at a point p in the image at time t is $I(p_x, p_y, t)$, the derivative of I with respect to t is

$$\frac{dI}{dt} = \frac{\partial I}{\partial p_x}\frac{dp_x}{dt} + \frac{\partial I}{\partial p_y}\frac{dp_y}{dt} + \frac{\partial I}{\partial t}.$$

The assumption made in the calculation of the optic flow is that the intensity of a point p does not change over time, so $\frac{dI}{dt} = 0$. This gives the first optic flow constraint:

$$-\frac{\partial I}{\partial t} = \frac{\partial I}{\partial p_x}\frac{dp_x}{dt} + \frac{\partial I}{\partial p_y}\frac{dp_y}{dt}. \qquad (1.1)$$

The second assumption made in optic flow is that it is constant in a small region $S_p$ in the image over time. So measuring the gradients in a region $S_p$ around p makes it possible to solve for the optic flow vector. The optic flow in each point in this region is constrained by 1.1. The optic flow $v = (\frac{dp_x}{dt}, \frac{dp_y}{dt})$ can be calculated by minimizing

$$E_p = \sum_{(x,y) \in S_p} \left( \frac{\partial I}{\partial p_x} v_x + \frac{\partial I}{\partial p_y} v_y + \frac{\partial I}{\partial t} \right)^2.$$

An application of optic flow in a static camera environment is the segmentation of moving objects from the scene. At locations where the magnitude of the optic flow vector is larger than zero, there is a moving object. The direction of the vector indicates in which direction the object is moving.
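Minimizing $E_p$ is a linear least-squares problem in v. A minimal sketch of solving it for one window (NumPy; the gradient images are assumed to be precomputed):

```python
import numpy as np

def optic_flow_window(Ix, Iy, It):
    # Minimize the sum over the window of (Ix*vx + Iy*vy + It)^2
    # by linear least squares; Ix, Iy, It hold the spatial and
    # temporal derivative values inside the window S_p.
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)  # one row per pixel
    b = -It.ravel()
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v  # (vx, vy)
```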

An application of optic flow is the detection of moving people from a static camera, for example from a security camera or from a standing car. An application of optic flow in a moving camera environment is obstacle avoidance. The whole scene recorded from a moving camera appears to be moving because of the movement of the camera itself. Objects closer to the camera generate a flow vector field of stronger magnitude than objects further away from the camera because of motion parallax. To avoid a collision with an object in the scene, there should be a movement into the direction where the flow vector field has a low magnitude.

Another computer vision technology that can be used for object segmentation is stereo vision. From two calibrated cameras, a disparity map can be calculated which provides relative distance information from pixels in the image to the camera. A disparity map can be calculated using a correlation measure based on for example intensity values. The correlation measure is used to find the corresponding position of a gray level value from the first image in the second image. The disparity is the difference in position of a gray level value. A large disparity of an image region means the corresponding object in that region is relatively closer to the camera than an object in an image region with a smaller disparity. The disparity value of an image region can be used for segmenting it from the image. Example applications of stereo vision are pedestrian detection from a moving car ([42], [45], and [20]) and obstacle detection in robotics.
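A minimal sketch of disparity computation by block matching (NumPy; the block size, disparity range, and the sum-of-squared-differences measure are illustrative choices, not the method of any of the cited systems):

```python
import numpy as np

def disparity_map(left, right, max_disp=32, block=5):
    # For each pixel in the left image, find the horizontal shift of
    # the most similar block in the right image (smallest SSD).
    h, w = left.shape
    half = block // 2
    disp = np.zeros((h, w), dtype=np.int32)
    L, R = left.astype(np.float64), right.astype(np.float64)
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            patch = L[y - half:y + half + 1, x - half:x + half + 1]
            best_ssd, best_d = np.inf, 0
            for d in range(max_disp):
                cand = R[y - half:y + half + 1,
                         x - d - half:x - d + half + 1]
                ssd = np.sum((patch - cand) ** 2)
                if ssd < best_ssd:
                    best_ssd, best_d = ssd, d
            disp[y, x] = best_d  # larger disparity = closer object
    return disp
```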

The Active Contour Model [26] is a method which detects contours (edges) in an image. A snake is initialized around the object to be segmented and moves along its interior normal until it stops at the edge of the object. A contour (snake) is a spline represented by a vector $v(s) = (x(s), y(s))$ where s is the arc length. Finding a contour in an image means minimizing an energy function

$$E_{snake} = \int_0^1 E_{int}(v(s)) + E_{image}(v(s)) + E_{ext}(v(s)) \, ds$$

where $E_{int}$ represents the internal energy (bending and discontinuity) of the contour, $E_{image}$ represents the forces caused by the image, and $E_{ext}$ represents the external forces caused by a higher level process. The energy function $E_{snake}$ is minimized using variational calculus. The Active Contour Model is widely used in medical image processing, for example for the segmentation of organs in images recorded by a medical scanner.

An alternative to the spline representation of Active Contour Models is the Level Set representation [10], which can detect contours with discontinuities. The Active Contour Model from [10] is based on the Mumford-Shah functional [29] for image segmentation. The energy function $E(c_1, c_2, \varphi)$ for a piecewise smooth image f is

$$E(c_1, c_2, \varphi) = \int_\Omega (f - c_1)^2 H(\varphi) + \int_\Omega (f - c_2)^2 (1 - H(\varphi)) + \mu \int_\Omega |\nabla H(\varphi)|$$

where $\Omega$ is the image domain, f is the image, $c_1$ and $c_2$ are the average values of f inside and outside the object respectively, $H(\varphi)$ is the Heaviside function, and $\mu$ is a constant for weighting the length term of the contour. The energy function $E(c_1, c_2, \varphi)$ is minimized using methods from variational calculus. An example application of the Level Set representation of the Active Contour Model is segmentation in images recorded from a medical scanner. An improvement to Active Contour Models is to add a shape prior term $E_{shape}$ to the energy functions. In this way, a prior shape of the object to be segmented can be used to improve the segmentation results. Examples of the use of a shape prior term in the energy function are [12, 9].

1.2.2 Classification

The purpose of classification in the context of computer vision is to learn to distinguish between images of a target class of objects and images of a non-target class of objects. A classifier is usually trained on a set of target objects and a set of non-target objects so that it can estimate the class of an unseen image example. The training is done on a database of target images and a database of non-target images which are given a class label by a human observer. From the labeled target and non-target images the classifier learns to distinguish between the two classes. Commonly used classifiers for computer vision applications are nearest neighbour classifiers, neural networks, and support vector machines. In a two class classification problem, a k-nearest neighbour classifier selects the k example points closest to the example being classified and votes between those example points to determine the class of the example. Neural networks and support vector machines are explained in sections 3.1.1 and 3.1.2, respectively.

An important choice is which kind of image features are used for classification. Usually, the gray values of an image are not directly used for classification because they are strongly dependent on the illumination conditions. First, features are calculated from the image and these features are used as input to the classifier. Usually, the features used are some kind of gradient response, for example gradients from a Sobel filter, the orientations of the gradients, or wavelets. Example applications of neural networks and support vector machines are face detection [35] and car detection [22]. In chapter 3, the use of classifiers for pedestrian detection is described in detail.
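As an illustration of the simplest of these classifiers, a minimal k-nearest neighbour vote over feature vectors could look as follows (a sketch, not code from this thesis):

```python
import numpy as np

def knn_classify(train_X, train_y, x, k=3):
    # Majority vote among the k training examples closest to x in
    # Euclidean distance; train_X holds one feature vector per row.
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    values, counts = np.unique(train_y[nearest], return_counts=True)
    return values[np.argmax(counts)]
```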

1.2.3 Tracking

The purpose of tracking in the context of computer vision is usually to determine some properties (usually position and scale) of an object in a next time step or next image frame. Tracking usually involves an estimation and a confirmation of the estimation. Tracking makes it possible to integrate information about an object through time and to estimate the position and scale of an object in the near future. This can be useful for estimating time to contact, for example. In addition, this makes it possible to limit processing in the next frame to the area in the image around where the tracker estimated the object to be. This is important in real-time image processing, for example. Two commonly used methods for estimating the state of a system from measurements are the Kalman filter and Condensation [25]. They consist of two main steps: a prediction step based on a dynamical model of the object which estimates the state (often position, in computer vision) of the object in the next time step, and an update step which updates the prediction based on some measurement. The main difference between Condensation and a Kalman filter is that Condensation is based on factored sampling, which does not require a Gaussian measurement density. This property makes it possible to track objects in a cluttered scene. An example application of tracking with a Kalman filter is vehicle tracking from a static camera [5]. An example application of Condensation is tracking humans from a static camera [25].

Color can also be used as a feature for tracking. The Mean Shift Tracker [11] is based on color histograms. The Mean Shift Tracker is described in detail in section 4.2. An example application of the Mean Shift Tracker is face tracking in color images.
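The predict/update cycle can be made concrete with a one-dimensional constant-velocity Kalman filter. This is a generic sketch with illustrative noise parameters, not the filter used in any of the cited systems:

```python
import numpy as np

def kalman_step(x, P, z, q=1e-2, r=1.0):
    # x: state [position, velocity]; P: state covariance;
    # z: measured position; q, r: process/measurement noise levels.
    F = np.array([[1.0, 1.0], [0.0, 1.0]])  # dynamical model: p += v
    H = np.array([[1.0, 0.0]])              # we measure position only
    # Prediction step: propagate state and covariance
    x = F @ x
    P = F @ P @ F.T + q * np.eye(2)
    # Update step: correct the prediction with the measurement
    S = H @ P @ H.T + r                     # innovation covariance
    K = P @ H.T / S                         # Kalman gain
    x = x + (K * (z - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P
```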

1.3 Related work in pedestrian detection

Recently, there has been a lot of interest in driver assistance applications like lane detection, car detection, traffic sign recognition, and pedestrian detection. The reasons for this are the desire to make traffic safer, the desire to make driving more comfortable for drivers, and the recent technological progress in camera systems and computer hardware which makes this research possible. At the time of writing, several driver assistance systems are commercially available, for example a lane departure warning system from Citroen and a FIR pedestrian detection system from Honda. This section gives an overview of relevant other pedestrian detection research.

1.3.1 Initial detection in grayscale images

In [42], [45], and [20], stereo vision is used to generate hypotheses of pedestrians. A depth map is calculated from which foreground objects can be segmented using range thresholding. The advantage of using stereo is that processing can be limited to a number of objects close to the car. This speeds up further processing and limits the number of false detections. A disadvantage is that much processing capacity or special hardware is required to calculate the disparity map. Two cameras are needed for the detection process, making it more expensive than monocular processing. Also, it is not clear if it is possible to keep the cameras calibrated over a period of many years for use in everyday driving.

A different approach is shape-based detection. A shape matching method for initial detection in monocular images is presented in [21]. In order to detect a pedestrian in a grayscale image, a hierarchy of edge templates of pedestrians is matched with a distance transform image calculated from a feature image of the original grayscale image. Templates of pedestrians in different poses are grouped together in prototypes. The grouping of templates is done at multiple levels, resulting in a hierarchy.

The leaves of the hierarchy contain all templates and the nodes of the hierarchy contain the prototypes. The hierarchy of templates is scanned at a coarse-to-fine scale through the image. If a template at a certain node in the hierarchy matches (the distance measure between template and image is below a certain threshold), the templates below that node are processed. In [32], saliency maps calculated from intensity, color, and orientation features are used for initial detection. In [7] and [3], a combination of a shape-based method and stereo is used for initial detection. An edge image is calculated from a grayscale image. Stereo vision is used to eliminate background edges. A symmetry map is calculated from the vertical foreground edges. A bounding box is created around symmetrical vertical objects and a search for the head of the pedestrian is performed through matching a head model. Further processing discards bounding boxes which do not have the correct width to height ratio or which are too homogeneous to contain a pedestrian. In [14], initial detection is performed by calculating image entropy. In image areas with much structure, a template is matched with image contours. Depth information from stereo vision is used to adjust the size of the template for matching image contours. A restriction of using image features like edges for initial detection is that good quality image features are required. Often, it is difficult to obtain good image features from low contrast grayscale images.

One important cue humans use for object detection is motion. Humans are good at detecting relative motion, for example detecting a moving pedestrian from a moving car. Motion is generally not a good cue for a real-time pedestrian detection system: the calculation of motion information using optical flow is expensive and does not provide a good segmentation result because different parts of the pedestrian are moving at different speeds. It does not provide a complete segmentation; for classification it is necessary to have a complete bounding box around an object.

Also, because usually the car is moving at a relatively high speed in comparison to the pedestrian, the whole scene is moving and the motion of the pedestrian is negligible. There are also many pedestrians that are not moving, so motion information is useless in this case. In [23], leg motion is used for the detection of pedestrians in color images. The input image is clustered on color values. Usually, the legs of a pedestrian are combined into the same cluster. The temporal change in the shape of such a cluster is used to distinguish it from clusters belonging to other objects in the image. The limitation of using motion for detection is that it can only detect moving pedestrians. Also, the clustering of a color image is time consuming and does not always provide good segmentation results if the input image has a complex background.

1.3.2 Initial detection in FIR images

The most straightforward approach to pedestrian detection in FIR images is to use body heat. People appear brighter in a FIR image than most other objects because of their body temperature. In [30], a probabilistic template is created from a training database containing pedestrians in different poses. The template is created by thresholding bright regions from the images in the training database. The probability of each pixel in the template belonging to a pedestrian is calculated based on how frequently it is thresholded in the training database. The template is scanned through the image at various scales from a coarse to fine resolution. In [43], thresholding is also used to segment bright regions from the image. After thresholding, bright regions that have an incorrect width to height ratio for pedestrians are discarded, as are regions which are unlikely to be a pedestrian based on their position in the image. In addition to detecting complete pedestrians, a detector that searches for pedestrian heads is also used to generate initial detections.

In [4], the FIR image and its vertical edges are searched for symmetrical regions. Symmetrical regions in cold image areas are discarded. A vertical histogram of edges is calculated in each of the bounding boxes surrounding the symmetrical regions. The shape of the histogram is used to discard bounding boxes which do not match a pedestrian. Also, bounding boxes with an incorrect aspect ratio and bounding boxes which do not meet perspective constraints are discarded. The problem with using image brightness for detection is that at higher outside temperatures, people do not necessarily appear brighter in the image than the background.

1.3.3 Classification

It is possible to completely omit the initial detection step using a pattern recognition approach. In [31], a support vector machine classifier is scanned through the image at every scale and every location. The classifier is trained on Haar wavelets calculated from a database of pedestrian and non-pedestrian color images. An improved version of this system [28] is based on component classifiers. There are multiple classifiers, each classifying a body part. The final classification result is calculated from the outputs of the component classifiers. In [15], a support vector machine trained on histograms of oriented gradients is scanned at every location and scale through the image. The scanning window is divided into cells. A histogram is calculated in each of the cells. The combined histograms are the input to the support vector machine classifier. In [40], a classifier trained on rectangle features is scanned through the image at every location and scale. Additionally, motion information is used for classification. Motion information is extracted by calculating differences between images in time. Adaboost [19] is used to train a classifier.

The features for classification are selected from all possible rectangle features and all possible motion features. The limitation of using motion filters is that these can only be used with a static camera. In [38], a convolutional neural network is scanned through the image at every scale and location. In the convolutional neural network architecture [27], feature selection is performed simultaneously with training the neural network. The feature selection is implemented in a hidden layer with shared weights which are optimized together with the weights of the classification, so that a complete classification/feature selection architecture is optimized. A drawback of these approaches is that scanning a classifier at every scale and location through the image is too slow for real-time processing. In addition, because there is always a certain misclassification rate, it would generate too many false detections, because many classifications are performed per frame.

An alternative to scanning a classifier through the image at every scale and location is to classify the output of the initial detection routine. In [45], the output of a stereo initial detection routine is classified with a neural network. The image gradients are the features used for classification. The regions of interest from the initial detection routine are rescaled to a fixed size for classification. In [21], the regions of interest from the initial detection routine are classified with a radial basis function network. To select negative examples for training, a bootstrapping procedure is used. The classifier is trained in an iterative way; at each iteration it is validated on a new set of negative examples. The negative examples are added to the training dataset and the classifier is retrained. In [37], the data for training the classifier is divided into mutually exclusive training clusters. Each cluster contains data from a particular pose, a particular articulation, and a particular illumination condition. The idea of clustering the training data is that reducing the variability of the training data by dividing it into clusters is more effective than training a classifier on all data.

The regions of interest for classification are divided into 9 subregions. A classifier is trained multiple times for each subregion, once per training cluster. The features for classification are histograms of gradient orientations. The histograms are weighted by smoothed gradient magnitudes to achieve invariance to illumination changes. The classifier is trained using Adaboost, which selects features from all subregions. In [42], regions of interest generated by a stereo algorithm searching for the legs of a pedestrian are fed into a feed-forward time delay neural network. The input to the network consists of pixel values from multiple frames. Neurons in a higher layer are only connected to a subselection of neurons in the lower layer, called receptive fields, which makes it possible to detect specific leg poses and motion patterns.

1.3.4 Tracking

One approach to tracking is building a model of human shapes and matching this model with image data. In [2, 34], a linear shape model of a pedestrian is built from pedestrian silhouettes. A B-spline with a fixed number of equally spaced control points is fitted to the contours of each of the silhouettes. After alignment of the shapes, a mean shape and a set of modes of variation are generated with Principal Component Analysis. The Condensation algorithm [25] is used for tracking. Each sample represents a possible position of the object to be tracked. A zero-order motion model is used because no assumptions can be made about how the pedestrian or the car are moving. The observation density for weighting the samples is generated by measuring the distance to the nearest image feature along the normals of the shape model. Oriented edges discretized into eight bins are used as features for tracking. A drawback of this approach is that it cannot handle scale well. The shape model would have to be matched to the image at multiple resolutions for each sample. This makes it expensive and introduces aliasing problems.

Also, a linear shape model cannot capture the many possible poses and orientations of a pedestrian. Work related to learning non-linear shape models can be found in, for example, [12]. In [43], the heads of pedestrians in FIR images are tracked. The position of the pedestrian in the next frame is estimated with a Kalman filter and the measurement update is performed by calculating an exact position around the estimated position with a mean-shift method. It is not clear how scaling is handled. In [32], a classifier is scanned through the image at each position and each resolution. The output of the classifier, which is interpreted as a probability, is used for tracking. Condensation [25] is used for the propagation of classifier outputs through time. The position and scale of a person in the image are represented by the state density in the Condensation algorithm. The classifier outputs are represented by the observation density in the Condensation algorithm. Instead of using the standard factored sampling, the posterior state density of the previous frame is directly taken as the prior for the current frame. The pedestrian detection system is run and the state density is reinforced or inhibited in the locations where the support vector machine outputs are high. To perform detection of people, the densities are thresholded. This method can effectively handle scaling. The disadvantage of this method is that, unlike the two previously described tracking methods, it cannot be used without running the whole detection system first.

1.4 Selected approach

The detection system described in this document consists of four components: an image pre-processing component, an initial detection component, a classification component, and a tracking component.

1.4.1 Image preprocessing

The image pre-processing component performs the calculation of image features for the initial detection component, classification component, and tracking component. It also contains methods for image enhancement like smoothing, contrast stretching, and histogram equalization.

1.4.2 Initial detection

The initial detection component selects regions of interest in the image which may contain pedestrians. So unlike systems that scan a classifier through the whole image at each scale and location ([31], [15], and [40]), processing is limited to certain regions in the image. It is desirable to have an initial detection routine because otherwise too much processing time is spent on uninteresting parts of the image. Also, there is always a certain false positive classification rate. The larger the number of objects that are classified per frame, the larger the number of false positive classifications per frame. For a production system, the false positive rate should be as low as possible. Motion information is not used, because it cannot be assumed that a pedestrian is moving in the scene. The assumption is made that a single pedestrian can be segmented from the scene. The methods described here are not specifically designed for the detection of pedestrians that are overlapping each other, although this may work.

In this work, three different initial detection routines are used. Two initial detection routines are gradient based, the other is region based. The gradient based initial detection routines calculate the vertical gradients from the intensity image.

Image locations with high vertical gradient response are regions of interest. The first routine calculates the vertical gradients of the image and combines vertical structures into regions of interest. The second routine scans a vertical edge detector through the image at every scale and location. The filter size of the edge detector is made dependent on the size of the scan window. This makes the initial detection routine suitable for detecting objects at all scales. These routines are mainly interesting for FIR images. In grayscale images, there is usually so much gradient response from background structures that it is difficult to segment pedestrians based on gradient information alone. The region based initial detection routine operates directly on the intensity values. This routine combines regions with similar intensity values. For FIR images, human body temperature makes pedestrians appear bright in the image, at least at low outside temperatures. In FIR images, bright regions with a vertical shape are regions of interest.

1.4.3 Classification

The classification of pedestrians is challenging because of the high variability among objects in the pedestrian class. Also, there are many pedestrian-like objects in traffic scenes, for example trees, poles, parts of buildings, and parts of cars. Therefore, an advanced classification component is required. Because many objects per frame may be classified, the classifier should also be very efficient. In this work, support vector machines and neural networks are used for classification.

An important choice is the type of image features used for classification. In this work, several features are tested both on video sequences and FIR sequences. The features that are used here are rectangle features [39], histograms of image gradients, and histograms of orientations computed from the image gradients.

The reason for using rectangle features is that they can be calculated very efficiently, which makes them very suitable for a real-time system. The histograms of orientations from the image gradients are used because they are invariant to small translations inside the regions from which the histograms are calculated.

For a real system, the classification performance should be nearly perfect. In particular, the false positive rate (false alarms) should be practically zero. In practice, this is difficult to achieve. A classifier trained on a complete set of image features usually has too high a misclassification rate to be useful in a real system. Therefore, in this work, the classification system is optimized with different feature selection methods. The idea is that not all features are stable for classification. Selecting the subset of stable features from all features and using only the stable features for classification improves classification performance. To improve classification performance even more, the training data is divided into subdatasets. The training data is divided based on pedestrian orientation, pedestrian size, and environment conditions like outside temperature. Classification speed is also important for a real system. In order to improve the classification speed of support vector machines, the number of support vectors is reduced using the method from [36], and with a multi-objective optimization.
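To make the orientation-histogram features concrete, a minimal sketch follows (NumPy; the bin count and normalization are illustrative assumptions, not the exact parameters used in this thesis):

```python
import numpy as np

def orientation_histogram(gx, gy, bins=8):
    # Histogram of gradient orientations for one image region, weighted
    # by gradient magnitude; gx, gy are the region's gradient images.
    angles = np.arctan2(gy, gx)              # orientation in (-pi, pi]
    magnitudes = np.abs(gx) + np.abs(gy)     # cheap magnitude approximation
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi),
                           weights=magnitudes)
    total = hist.sum()
    return hist / total if total > 0 else hist  # normalized feature vector
```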

1.4.4 Tracking

Reliable tracking of pedestrians is also challenging: the contrast between the pedestrian and the background is often low, the resolution of pedestrians in traffic scenes is often low, there are many background objects that resemble pedestrians, due to the movement of the car pedestrians often scale strongly within a few frames, multiple pedestrians may occlude each other, and pedestrians usually change shape because they move.

In this work, five tracking methods are applied for tracking pedestrians. The first tracker is based on the Hausdorff distance. Once a pedestrian is detected in a certain frame, a model is created from the gradient image calculated in the region of interest which contains the pedestrian. In the next frame, the tracker searches in the area around the previous location for the position with the smallest Hausdorff distance to the model. The second tracker is a mean shift tracker. A model is created for tracking by calculating a histogram of image features of the detected pedestrian. In the next frame, the position containing the pedestrian is determined by minimizing the difference between a target histogram and the model histogram. The third tracker is a Condensation tracker. The tracker is based on propagating a state density of tracked objects over time. The fourth tracker integrates initial detections through time. This tracker is based on the assumption that a pedestrian which is detected in a certain frame will be detected at approximately the same location in the next frame. The fifth tracker integrates classifications through time. This tracker is based on the assumption that a pedestrian which is classified in a certain frame will be classified at approximately the same location in the next frame.

1.5 Contributions

This work has resulted in a pedestrian detection system for grayscale and FIR images. The two main goals were to develop a real-time detection system which operates with a high detection rate under various environment conditions. The FIR version of the system runs easily at 20 frames per second on a 1 GHz processor. Both the grayscale and FIR detection systems are strongly optimized for a high detection rate.

Three new real-time initial detection routines were developed: two gradient based routines and one region based routine. A systematic overview of the performance of these detection routines is presented for a variety of conditions by comparing the results of the detection routines to ground truth data. As mentioned in section 1.3.3, several types of classifiers and image features are described in the pedestrian detection literature. It is not clear, however, which classifier/image feature combination works best. A systematic overview of the performance of support vector machines and neural networks in combination with multiple feature types is presented in this thesis. In addition, to improve classification performance, different classifiers are trained for different temperature ranges (in the case of FIR images) and for different appearances of pedestrians, based on orientation. Also, to improve classification performance, feature selection is performed with PCA, Adaboost, and multi-objective optimization. In order to achieve real-time classification speed, the number of support vectors of a support vector machine is reduced. An overview of all classification results for different classifiers trained on different image features is presented in this work.

Five tracking methods are evaluated on pedestrian data. The first is based on the Hausdorff distance. The second is a mean shift tracker. The third is a Condensation tracker. The fourth is based on integrating initial detections through time. And the fifth is based on integrating classification outputs through time. These trackers can be used to stabilize detections over time and, for example, make it possible to estimate the time of contact with the pedestrian. An overview of the tracking performance of all trackers on FIR images and grayscale images is presented.

1.6 Outline of this thesis

The rest of this thesis is structured in the following way: chapter two contains a description of the initial detection routines and the image preprocessing methods they use. Chapter three describes the methods used for classification and optimization of the classification. Chapter four describes the methods used for tracking. Chapter five contains the experimental results. Chapter six contains the discussion and conclusions.

Chapter 2
Initial detection

The goal of the initial detection is to find regions of interest in the image. It can be seen as a focus of attention mechanism which limits processing to interesting parts of the image. These regions of interest can contain pedestrians or other objects. It is desirable to have an initial detection routine, as opposed to scanning a classifier through the whole image at all locations and scales, for the following reasons:

- There is always an unavoidable number of misclassifications. By classifying only the output of the initial detection routine instead of the whole image, the number of false detections is reduced.
- Limiting processing to interesting image regions saves calculation time.

It appears that it is necessary to have different initial detection routines for different image types and different conditions. For example, it may be necessary to have a different initial detection routine for color/grayscale images than for FIR images. This section describes three initial detection routines: two gradient based initial detection routines which use the image gradients for generating initial detections, and a region based initial detection routine which operates on the intensity values directly.

An important rate for a detection system is the rate of positive examples detected as positive by the system. Another important rate is the rate of negative examples detected as positive by the system. In the literature, inconsistent terminology seems to exist for these rates. In this work, these rates are called the true positive rate and the false positive rate, respectively.

2.1 Image pre-processing

2.1.1 Image enhancement

Before the detection algorithms are applied to an image, it is usually preprocessed to enhance its quality. To enhance the contrast in the image, contrast stretching can be applied. Usually, before the image features are calculated, smoothing is performed to remove image artifacts.

Contrast stretching

Contrast stretching enhances an image by stretching the range of intensity values to the maximum possible range. It does this by applying a linear scaling to the image. The pixel with minimum intensity in the image is set to the lowest possible value, the pixel with maximum intensity in the image is set to the highest possible value, and the other pixels are interpolated between the lowest and highest possible values. To perform contrast stretching, the minimum intensity value min and the maximum intensity value max are calculated from the input image. The stretched image j can be calculated from the original image i using

$$j(x, y) = \begin{cases} \text{lower} & \text{if } i(x, y) = \min \\ \text{upper} \cdot \frac{i(x, y) - \min}{\max - \min} & \text{if } \min < i(x, y) < \max \\ \text{upper} & \text{if } i(x, y) = \max \end{cases}$$

where x and y are image coordinates, lower is the lowest possible intensity value, and upper is the highest possible intensity value. To reduce the effect of outliers, min and max can be selected at the pixel values at, for example, 3% and 97% of the image histogram, respectively.
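A direct implementation of this mapping, including the percentile-based choice of min and max, could look like the following sketch (NumPy; the 3%/97% percentiles follow the text above, the 8-bit output range is an assumption):

```python
import numpy as np

def contrast_stretch(image, lower=0, upper=255, p_low=3, p_high=97):
    # Map the p_low/p_high percentiles of the input to the lowest/highest
    # possible values, linearly interpolating everything in between.
    lo = np.percentile(image, p_low)
    hi = np.percentile(image, p_high)
    stretched = (image.astype(np.float64) - lo) / max(hi - lo, 1e-9)
    stretched = lower + (upper - lower) * stretched
    return np.clip(stretched, lower, upper).astype(np.uint8)
```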

Smoothing

A convolution kernel for smoothing is usually constructed with the 2D Gaussian

$$G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}.$$

For performance reasons, in this work smoothing is performed with two 1-dimensional approximations to a Gaussian,

$$G(x) = G(y) = \{0.25, 0.5, 0.25\}$$

for a kernel size of three, and

$$G(x) = G(y) = \{0.0625, 0.25, 0.375, 0.25, 0.0625\}$$

for a kernel size of five. The floating point values in the kernels are selected in a way that a convolution with a kernel can be performed efficiently with integer math. For example, G(x) = {0.25, 0.5, 0.25} can be efficiently calculated as G(x) = {1, 2, 1} shifted to the right by two bits.
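A sketch of this integer-math separable smoothing for the {1, 2, 1} kernel (assuming an 8-bit grayscale NumPy image; border pixels wrap here for brevity, a real implementation would pad):

```python
import numpy as np

def smooth_121(image):
    # Separable {1, 2, 1}/4 smoothing using integer math only;
    # the division by 4 per axis is a right shift by two bits.
    img = image.astype(np.int32)
    # Horizontal pass: weighted sum of left, center, right neighbours
    h = (np.roll(img, 1, axis=1) + 2 * img + np.roll(img, -1, axis=1)) >> 2
    # Vertical pass on the horizontally smoothed result
    v = (np.roll(h, 1, axis=0) + 2 * h + np.roll(h, -1, axis=0)) >> 2
    return v.astype(np.uint8)
```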

2.1.2 Image features

Rectangle features

The rectangle features applied here are the same as those used in [39] for real-time face detection. Figure 2.1 shows an example of the filters used to calculate these features. The sum of the pixels in the white region is subtracted from the sum of the pixels in the dark region. The filter in figure 2.1(a) gives a high output at a vertical edge, the filter in figure 2.1(b) gives a high output at a horizontal edge, and the filter in figure 2.1(c) gives a high output at a diagonal edge. The motivation for using these features is that they are extremely fast to calculate. They can be calculated in constant time, regardless of their size, using an integral image representation related to summed area tables from texture mapping in computer graphics. The integral image at location x, y is the sum of the pixels above and to the left of x, y:

$$ii(x, y) = \sum_{x' \le x,\, y' \le y} i(x', y')$$

where ii(x, y) is the integral image and i(x, y) is the original image. The integral image can be calculated using the following formulas:

$$s(x, y) = s(x, y - 1) + i(x, y)$$
$$ii(x, y) = ii(x - 1, y) + s(x, y)$$

where s(x, y) is the cumulative column sum, $s(x, -1) = 0$, and $ii(-1, y) = 0$. Using the integral image, any rectangular sum can be calculated in only four array references. An example calculation is shown in figure 2.2. The sum over rectangle D can be calculated with $(ii(4) + ii(1)) - (ii(2) + ii(3))$, where the values 1, 2, 3, and 4 are coordinates of the integral image.

Figure 2.1: Rectangle filters.

Figure 2.2: Calculation of the rectangle sum.
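The integral image and the four-reference rectangle sum can be sketched as follows (NumPy; the inclusive coordinate convention is an assumption of this sketch):

```python
import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img over all pixels above and to the left of
    # (x, y), inclusive; two cumulative sums implement the recurrences.
    return img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x0, y0, x1, y1):
    # Sum of the original image over the rectangle with top-left (x0, y0)
    # and bottom-right (x1, y1), inclusive, using four array references.
    total = ii[y1, x1]
    if x0 > 0:
        total -= ii[y1, x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1, x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total
```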

Gradients

The image gradients are calculated by convolving the smoothed image with two 1-dimensional kernels. For a filter size of 5, the horizontal gradient image $g_x$ is calculated with the Sobel-like derivative kernel $\{-1, -2, 0, 2, 1\}$ applied along the x-axis. The vertical gradient image $g_y$ is calculated with the same kernel applied along the y-axis:

$$g_y = \{-1, -2, 0, 2, 1\}.$$

An approximation to the energy image E is calculated from the horizontal gradient image and the vertical gradient image:

$$E = |g_x| + |g_y|.$$

The gradient images and energy images can be used for edge detection. Also, the angle of orientation $\Theta$ of the gradient is calculated from $g_x$ and $g_y$:

$$\Theta = \arctan\left(\frac{g_y}{g_x}\right).$$

An example of an image, its vertical gradients, its horizontal gradients, and its energy is shown in figure 2.3.

(a) A FIR image (b) Its vertical gradient image (c) Its horizontal gradient image (d) Its energy image
Figure 2.3: Image gradients and energy.
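A sketch of these gradient, energy, and orientation computations (NumPy; the border handling of the convolution is an implementation choice of this sketch):

```python
import numpy as np

def gradients(image):
    # Size-5 derivative kernel along each axis, the |gx| + |gy| energy
    # approximation, and the quadrant-aware gradient orientation.
    k = np.array([-1, -2, 0, 2, 1], dtype=np.float64)
    img = image.astype(np.float64)
    gx = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, img)
    gy = np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, img)
    energy = np.abs(gx) + np.abs(gy)
    theta = np.arctan2(gy, gx)  # arctan(gy / gx), resolving the quadrant
    return gx, gy, energy, theta
```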

2.2 Gradient based segmentation

Two different gradient based initial detection routines are described in this section. The first is based on clustering vertical gradients. The second is based on scanning a vertical edge detector through the image at every location and scale.

2.2.1 Grouping vertical gradients

The initial detection routine that groups vertical gradients works as follows: the vertical gradient image is calculated with the convolution kernel from section 2.1.2. A threshold is applied to create a binary image where the regions with a high gradient magnitude are foreground pixels. In the case of FIR images, the threshold is calculated from the image histogram of the vertical gradient image. The intensity value at 95% in the cumulative image histogram of the vertical gradient image is selected as the threshold. The threshold is selected at 95% because this ensures that only image regions with strong vertical gradients are considered for further processing. In the case of color/grayscale images, the threshold is set to a fixed value. This threshold is calculated in the same way as for infrared images: an image histogram is calculated and the threshold is selected at the intensity value at 95% in the cumulative image histogram. Using the connected components algorithm described in algorithm 1, the foreground pixels in the binary image are clustered into regions. Regions with a width to height ratio (width divided by height) from 0.5 to 1.5 may contain a pedestrian and are regions of interest. The values of 0.5 and 1.5 were selected based on measuring width to height ratios from different front view and side view pedestrian images.
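The 95% cumulative-histogram threshold can be sketched as follows (NumPy; this is equivalent to taking a high percentile of the gradient magnitudes):

```python
import numpy as np

def histogram_threshold(values, fraction=0.95, bins=256):
    # Select the intensity at the given fraction of the cumulative
    # histogram of the absolute values (e.g. vertical gradient magnitudes).
    magnitudes = np.abs(values).ravel()
    hist, edges = np.histogram(magnitudes, bins=bins)
    cumulative = np.cumsum(hist) / magnitudes.size
    idx = np.searchsorted(cumulative, fraction)
    return edges[min(idx, bins - 1)]

def binarize_gradients(vertical_gradient, threshold):
    # Foreground pixels have absolute gradient magnitude above threshold.
    return np.abs(vertical_gradient) > threshold
```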

Algorithm 1 Connected components labeling.
Input: A binary image.
Output: An image where each foreground region is assigned a unique label.
1. Repeat steps 2 and 3 for each pixel in the image.
2. Starting at the first pixel: scan the image row by row until a foreground pixel is found.
3. Consider all adjacent 8-connected pixels to the current pixel (west pixel, south pixel, south west pixel, and south east pixel) which were already visited:
(a) If all of the visited pixels are background pixels, assign a new label to the current pixel.
(b) If one or more of the visited pixels are foreground pixels of the same group, assign the label of that pixel to the current pixel.
(c) If multiple visited pixels belong to a different group, assign one of the labels of those pixels to the current pixel and mark that the different labels belong to the same group.
4. In an additional pass over the whole image, assign a unique group label to the pixels of groups which were marked as equal.

An example of this method for FIR images is shown in figure 2.4. Figure 2.4(a) shows the input image, figure 2.4(b) shows the vertical gradient image, figure 2.4(c) shows the binary image, and figure 2.4(d) shows the regions of interest. An example of this method for grayscale images is shown in figure 2.5. Figure 2.5(a) shows the input image, figure 2.5(b) shows the vertical gradients, figure 2.5(c) shows the binary image, and figure 2.5(d) shows the regions of interest. In the case of FIR images, there is usually a transition from dark (background) to light (region of interest) to dark (background). So additionally, the sign of the gradient is used to discard regions of interest which cannot contain a pedestrian.
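A compact two-pass implementation in the spirit of algorithm 1, using a union-find structure for the label equivalences (a sketch; the neighbour convention here visits the previous row and the west pixel, which differs slightly from the text's own scan order):

```python
import numpy as np

def connected_components(binary):
    # First pass: provisional labels plus recorded equivalences;
    # second pass: resolve every label to its equivalence-class root.
    labels = np.zeros(binary.shape, dtype=np.int32)
    parent = [0]  # union-find array; parent[i] == i means root

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    next_label = 1
    h, w = binary.shape
    for y in range(h):
        for x in range(w):
            if not binary[y, x]:
                continue
            # Already-visited 8-connected neighbours
            neighbours = [labels[yy, xx]
                          for yy, xx in ((y, x - 1), (y - 1, x - 1),
                                         (y - 1, x), (y - 1, x + 1))
                          if yy >= 0 and 0 <= xx < w and labels[yy, xx] > 0]
            if not neighbours:
                parent.append(next_label)      # start a new group
                labels[y, x] = next_label
                next_label += 1
            else:
                roots = [find(n) for n in neighbours]
                target = min(roots)
                labels[y, x] = target
                for r in roots:                # mark groups as equal
                    parent[find(r)] = find(target)
    for y in range(h):                         # second pass
        for x in range(w):
            if labels[y, x]:
                labels[y, x] = find(labels[y, x])
    return labels
```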

(a) A FIR image (b) Its vertical gradients (c) Its binary image (d) The initial detections
Figure 2.4: Grouping vertical gradients in a FIR image.

An example of the use of the sign of the gradients is shown in figure 2.6. The gray regions contain negative gradients, the white regions contain positive gradients. Only regions of interest consisting of a positive gradient region followed (from left to right) by a negative gradient region are accepted.

(a) A grayscale image (b) Its vertical gradient image (c) Its binary image (d) The initial detections
Figure 2.5: Grouping vertical gradients in a grayscale image.

Figure 2.6: Use of the gradient sign for initial detection.

Often, in the case of FIR images, the feet and the head of pedestrians are brighter than the upper body because of the insulation of the coat. These upper body regions do not generate strong gradient magnitudes. Therefore, once a region of interest is generated, a search along a vertical line is performed upwards for another region of interest. This makes it possible to combine the feet with the head. When another region of interest is found, the two regions are combined and it is tested whether the combined region matches the width to height ratio of a pedestrian. An example of this is shown in figure 2.7. In figure 2.7(b), the complete pedestrian cannot be segmented as one region. After the combination of the feet with the head, the pedestrian is segmented as one object (figure 2.7(c)). The complete initial detection method is described in algorithm 2.

(a) A FIR image (b) The binary image (c) The initial detections
Figure 2.7: Combining regions of interest.

Algorithm 2 Grouping vertical gradients.
Input: An image.
Output: A list of initial detections.
1. Perform smoothing on the input image.
2. Calculate the vertical gradients from the smoothed image.
3. Calculate an image histogram from the vertical gradient image and select the intensity value at 95% of the histogram as threshold.
4. Create a binary image from the vertical gradient image using the threshold. Foreground pixels are those pixels whose absolute gradient magnitude is larger than the threshold.
5. Cluster the foreground pixels into regions using algorithm 1.
6. Repeat steps 7 and 8 for each foreground region.
7. If the width to height ratio of the region matches the width to height ratio of a pedestrian, add the coordinates of the region to the output list. Additionally, in FIR images, if the sign of the gradients of the region does not match the sign of a warm object against a cold background, discard the region.
8. Search for another region above the current region. If the width to height ratio of the combined region matches the width to height ratio of a pedestrian, add the coordinates of the combined region to the output list.

2.2.2 Scanning a vertical edge detector through the whole image

A different approach to finding vertical structures in the image is to scan a search window through the image at every position and scale. In practice, it is sufficient to scan through a certain range in the image because there are positions where there cannot be any pedestrians because of perspective constraints. For example, at the top of the image there is usually sky, so there is no need to search there.

The search window consists of two or four edge detectors which search for vertical structures. An example of a search window is shown in figure 2.8. In order to make the detection results invariant to the size of the objects in the image, the size of the filters of the edge detectors is adapted to the size of the search window. The advantage of this method over the initial detection routine which groups vertical features is that when a pedestrian is located under another object with vertical structure (traffic sign, tree, building), it can still be detected correctly in the search window in which only the pedestrian fits. The vertical feature grouping initial detection routine would only find the pedestrian clustered together with the other object. Some experimentation is necessary to find a threshold value for the edge detector. In general, a lower threshold must be selected in grayscale images, in which there is not much contrast between the pedestrian and the background. In FIR images, a higher threshold can be selected because of the higher contrast between pedestrians and background. The complete method is described in algorithm 3.

Figure 2.8: A scanning window.

Algorithm 3 Scanning a vertical edge detector
Input: An image.
Output: A list of initial detections.
1. Select a threshold dependent on the image type (grayscale or fir).
2. Repeat steps 3 and 4 for every scale and image position.
3. In the scanning window, calculate the vertical edge response with four vertical filters: two in the left half of the scanning window and two in the right half.
4. If all four responses are larger than the threshold, add this scan window to the output list.

This initial detection routine can be implemented very efficiently using the integral image described in section 2.1.2.
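The integral image makes each filter response a constant-time operation. The sketch below shows the idea for one window size; the exact filter placement inside the window and the scanning step are simplifications, and the scale loop of algorithm 3 is omitted for brevity.

import numpy as np

def integral_image(img):
    # Summed-area table with a zero border: any rectangle sum becomes
    # four table lookups, independent of the rectangle size.
    ii = np.cumsum(np.cumsum(img.astype(np.int64), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, x, y, w, h):
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def edge_response(ii, x, y, w, h):
    # A two-column vertical edge filter: left half minus right half.
    half = max(w // 2, 1)
    return abs(rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h))

def scan_vertical_edges(image, threshold, win=(24, 48), step=4):
    ii = integral_image(image)
    rows, cols = image.shape
    ww, wh = win
    hits = []
    for y in range(0, rows - wh, step):
        for x in range(0, cols - ww, step):
            # Four filters: two in the left half, two in the right half.
            responses = [edge_response(ii, x + dx, y + dy, ww // 2, wh // 2)
                         for dx in (0, ww // 2) for dy in (0, wh // 2)]
            if min(responses) > threshold:
                hits.append((x, y, ww, wh))
    return hits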

An example of the output of this initial detection routine on fir images is shown in figure 2.9, and an example on grayscale images is shown in figure 2.10. The regions of interest are marked with orange rectangles.

Figure 2.9: Scanning an edge detector through the image at every position and scale. (a) A fir image; (b) the initial detections.

Figure 2.10: Scanning an edge detector through the image at every position and scale. (a) A grayscale image; (b) the initial detections.

2.3 Temperature segmentation in fir images

Region based segmentation groups pixels with similar intensity values into regions. In fir images, humans appear brighter than most other objects because of their body temperature. Temperature is therefore a property that can be used for segmenting objects from the image. Temperature segmentation is a method that is often used for the initial detection of pedestrians in fir images [30], [43], [4]. In this work, a related method is applied. A threshold is calculated from the outside temperature using the image histogram of intensity values (as described in section 2.2.1). The threshold is selected at 95% of the cumulative image histogram. A high threshold results for high outside temperatures and a low threshold for low outside temperatures. With the threshold, a binary image is calculated. Pixels with an intensity value higher than the threshold are selected as foreground; the other pixels are selected as background. The foreground pixels in the binary image are warm areas and the background pixels are cold areas. The foreground areas are clustered with algorithm 1.

Clusters with the correct width to height ratio (as described in section 2.2.1) for a pedestrian are selected as regions of interest. An example of this method is shown in figure 2.11: figure 2.11(a) shows the input fir image, figure 2.11(b) shows the binary image, and figure 2.11(c) shows the regions of interest marked with orange rectangles.

Figure 2.11: Temperature segmentation in fir images. (a) A fir image; (b) the binary image; (c) the initial detections.

A problem with using image brightness for the initial detection of pedestrians is that pedestrians are often not uniformly bright. Usually, the head and the feet of the pedestrian are much brighter than the body because of the insulation of the coat. An improvement of this method creates additional regions of interest by trying to combine each region of interest with a higher located region of interest.

In this way, the region containing the feet is combined with the region containing the head. An example of this improved method is shown in figure 2.12: figure 2.12(a) shows the fir image, figure 2.12(b) shows the binary image, and figure 2.12(c) shows an initial detection consisting of two regions. The complete algorithm is described in algorithm 4.

Figure 2.12: Combining regions of interest. (a) A fir image; (b) its binary image; (c) the initial detections.

Algorithm 4 Temperature segmentation in fir images
Input: A fir image.
Output: A list of regions of interest.
1. Select an intensity threshold at 95% of the cumulative image histogram.
2. Create a binary image from the fir image using the threshold. Pixels with an intensity value higher than the threshold are foreground pixels.
3. Cluster the foreground pixels in the binary image into regions using algorithm 1.
4. Repeat steps 5 and 6 for each of the foreground regions.
5. If the width to height ratio of the region matches the width to height ratio of a pedestrian (as described in section 2.2.1), add the coordinates of the region to the output list.
6. Search for another region directly above the current region. If the width to height ratio of the combined region matches the width to height ratio of a pedestrian, add the coordinates of the combined region to the output list.
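A compact rendering of algorithm 4 with NumPy/SciPy follows; the aspect-ratio interval, the connected-component labelling in place of algorithm 1, and the horizontal alignment test used when combining regions are assumptions of this sketch.

import numpy as np
from scipy import ndimage

def temperature_segmentation(fir, ratio=(0.3, 0.7)):
    # Steps 1-2: threshold at 95% of the cumulative intensity histogram.
    binary = fir > np.percentile(fir, 95.0)
    # Step 3: cluster warm pixels into regions (stand-in for algorithm 1).
    labels, _ = ndimage.label(binary)
    boxes = [(s[1].start, s[0].start,
              s[1].stop - s[1].start, s[0].stop - s[0].start)
             for s in ndimage.find_objects(labels)]
    # Step 5: regions with a pedestrian-like width/height ratio.
    rois = [b for b in boxes if ratio[0] <= b[2] / b[3] <= ratio[1]]
    # Step 6: try to combine each region with one directly above it,
    # e.g. the feet with the head (the alignment margin is an assumption).
    for x, y, w, h in boxes:
        for x2, y2, w2, h2 in boxes:
            if y2 + h2 <= y and abs((x2 + w2 / 2) - (x + w / 2)) < w:
                cw = max(x + w, x2 + w2) - min(x, x2)
                ch = (y + h) - y2
                if ratio[0] <= cw / ch <= ratio[1]:
                    rois.append((min(x, x2), y2, cw, ch))
    return rois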

Chapter 3

Classification

The purpose of classification is to determine which of the regions of interest, provided for example by the initial detection routines, contain pedestrians and which do not. The classification consists of the following steps:

- the generation of representative, mutually exclusive training and validation sets of positive examples (pedestrians) and negative examples (non-pedestrians),
- the extraction of features from the training and validation examples,
- the training of the classifier with the training features,
- the evaluation of the generalization performance of the classifier on the validation examples.

3.1 Classifiers

The classifier is the learning algorithm which is taught how to separate between two classes of input: features from positive examples and features from negative examples.

The classifier also performs the forward propagation: the assignment of a class label (positive or negative) to an unseen example. In this work, the classification is always a two-class problem. The classifier learns to separate between two classes of objects, pedestrians and non-pedestrians, so the output of the classifier is always interpreted as a binary value, pedestrian or non-pedestrian. Two types of classifiers are applied in this work: neural networks and support vector machines.

3.1.1 Neural networks

A neural network is a set of processing units which communicate through weighted connections. The processing units (neurons) are divided up into layers. A neural network has one input layer which receives input from outside the network; zero, one or more hidden layers which receive input from the input layer and from each other (in the case that there are multiple hidden layers); and an output layer which receives input from the hidden layer(s). The output value $y_k$ of a neuron $k$ in the network at time step $t$ is calculated in the following way:

$y_k(t) = F_k\left(\sum_j w_{jk}(t-1)\, y_j(t-1) + \Theta_k(t)\right)$

where $w_{jk}$ is the weighted connection from neuron $j$ to neuron $k$, $\Theta_k(t)$ is the bias of neuron $k$, and $F_k$ is the activation function of neuron $k$. In this work, a linear activation function is used for the output neurons. A sigmoidal activation function is used for the hidden neurons:

$F(s_k) = \frac{1}{1 + e^{-s_k}}$

The neural network used in this work is a three layer feedforward neural network, where the propagation through the network is only forward. The network is not fully connected; neurons only receive input from neurons in the previous layer. The number of hidden neurons varies depending on the problem. The number of neurons in the output layer is one. The number of input neurons is equal to the number of features of the data. During training, the weights and biases of the connections in the network are updated so that the output of the network $d^p$ is as close as possible to the target value $y^p$ for all examples $p$. This is done by minimizing an error function with gradient descent. The error function $E$ is defined as

$E = \frac{1}{2} \sum_p (d^p - y^p)^2.$

For a neural network with at least one hidden layer, the error is minimized with error back-propagation. The change in a weight is proportional to the negative of the derivative of the error:

$\Delta_p w_j = -\gamma\, \frac{\partial E^p}{\partial w_j}$

where $p$ is the current training example, and $j$ the index of the weight. This can be rewritten as

$\Delta_p w_{jk} = -\gamma\, \frac{\partial E^p}{\partial s_k^p}\, y_j^p$

where $s_k^p$ is the input to neuron $k$. For an output neuron $o$ it holds that

$\frac{\partial E^p}{\partial s_o^p} = (d_o^p - y_o^p)\, F_o'(s_o^p)$

and for a hidden neuron $h$ it holds that

$\frac{\partial E^p}{\partial s_h^p} = F'(s_h^p) \sum_{o=1}^{N_o} (d_o^p - y_o^p)\, F_o'(s_o^p)\, w_{ho}.$

For more information about neural networks, see for example [6].
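A minimal NumPy sketch of such a network follows, with a sigmoidal hidden layer, a linear output neuron, and a per-example gradient descent step; the layer sizes, learning rate, and weight initialization are illustrative assumptions.

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

class ThreeLayerNet:
    def __init__(self, n_in, n_hidden=10, lr=0.01):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, n_hidden)
        self.b2 = 0.0
        self.lr = lr

    def forward(self, x):
        # Sigmoidal hidden layer, linear output activation.
        self.h = sigmoid(self.W1 @ x + self.b1)
        return self.W2 @ self.h + self.b2

    def backprop(self, x, target):
        # One gradient descent step on E = 1/2 (d - y)^2 for one example.
        d = self.forward(x)
        delta_o = d - target                       # dE/ds at the linear output
        delta_h = self.h * (1 - self.h) * self.W2 * delta_o
        self.W2 -= self.lr * delta_o * self.h
        self.b2 -= self.lr * delta_o
        self.W1 -= self.lr * np.outer(delta_h, x)
        self.b1 -= self.lr * delta_h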

3.1.2 Support vector machines

The basic idea of support vector machines is to map the training data into a high dimensional feature space $F$ and calculate a separating hyperplane in that feature space. A separating function $f: \mathbb{R}^N \rightarrow \{\pm 1\}$ is calculated from the training examples $(x_1, y_1), \ldots, (x_l, y_l) \in \mathbb{R}^N \times \{\pm 1\}$. By using a mapping $\Phi: \mathbb{R}^N \rightarrow F$ it is not necessary to work in the high dimensional space $F$, because there exists a kernel $k(x, x')$ for which holds: $k(x, x') = (\Phi(x) \cdot \Phi(x'))$. Training a support vector machine consists of calculating a hyperplane $w \cdot x - b = 0$ which maximally separates the training data in $F$. This means minimizing $\|w\|^2$ subject to $y_i(w \cdot x_i + b) \geq 1$. This gives the quadratic optimization problem: minimize $\frac{1}{2}\|w\|^2$ subject to $y_i((w \cdot \Phi(x_i)) + b) \geq 1$. Introducing the Lagrange parameters $\alpha_i$ gives the Lagrangian

$\frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left( y_i((w \cdot \Phi(x_i)) + b) - 1 \right)$

which should be minimized with respect to $w$ and $b$ and maximized with respect to $\alpha_i$. The resulting separation function is

$f(x) = \mathrm{sign}\left( \sum_{i=1}^{l} \alpha_i y_i k(x, x_i) + b \right)$

where $l$ is the number of support vectors (the training data points which define the separating hyperplane), and $\alpha_i$ are the Lagrange multipliers. Two examples of kernels for classification are the radial basis function kernel $k(x, y) = \exp(-\gamma \|x - y\|^2)$, where $\gamma$ is the width of the Gaussian, and the polynomial kernel $k(x, y) = (x \cdot y)^d$, where $d$ is the degree of the polynomial. For more information about support vector machines see for example [13].
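The forward propagation can be sketched directly from the separation function; the support vectors, weights, and bias below would come from training (for example with a standard SVM package), and the radial basis function kernel is one of the two kernels mentioned above.

import numpy as np

def rbf_kernel(x, y, gamma):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, gamma):
    # f(x) = sign(sum_i alpha_i y_i k(x, x_i) + b)
    s = sum(a * y * rbf_kernel(x, sv, gamma)
            for a, y, sv in zip(alphas, labels, support_vectors))
    return 1 if s + b >= 0 else -1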

3.2 Image features for classification

An important choice is the type of image features that is used for classification. The type of image features influences the classification performance and the forward propagation speed. Before the features are calculated, the regions of interest are rescaled to a fixed size so that the number of features is constant in each region of interest. For an image size of 320x240 pixels, the window size used is 24x48 pixels. This window size is selected because it is large enough to calculate image features from (as described in section 2.1.2); on the other hand, it is small enough to avoid a very large input space to the classifier.

3.2.1 Rectangle features

The rectangle features are the features described in section 2.1.2. Two different scales of filters are used to capture different levels of detail: a filter size of four by four pixels and a filter size of eight by eight pixels. Both filters have an overlap of 2 pixels in the horizontal and vertical direction. An example of the application of the filters on pedestrian images is shown in figure 3.1. For the classification of pedestrians, the filter outputs are normalized from 0.1 to 0.9 for neural network classification and from -1.0 to 1.0 for support vector machine classification.

Figure 3.1: Rectangle features for classification. (a) A grayscale image; (b) its vertical gradients.

3.2.2 Histograms of gradients and orientations features

The gradients and orientations features are those from section 2.1.2. First, the gradient magnitude image and the orientations image are calculated with a filter size of 5 pixels. This filter size is selected because of the low image resolution of 320x240 pixels; for a higher resolution image, a larger filter size would be selected. An example of the gradient calculation was shown in figure 2.3. The region of interest is divided into m by n fields. An example of this is shown in figure 3.2 for m is 4 and n is 8. In each of these fields, a histogram of 8 bins is calculated of both the gradient magnitudes and the orientations of the gradient. The index of the bin in the histogram is calculated by linearly scaling the gradient magnitude and orientation between their minimum and maximum values. The histograms of all fields are concatenated and normalized from 0.1 to 0.9 for neural network classification and from -1.0 to 1.0 for support vector machine classification. Both the gradient magnitudes and gradient orientations are used as features for classification. In this way, the information provided by the strength of the gradient as well as the information provided by its direction is used.

Figure 3.2: Calculation of image features in subregions.
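A sketch of the feature computation follows; the gradient operator, the histogram scaling, and the handling of field boundaries are assumptions of this example.

import numpy as np

def gradients_orientations_features(roi, m=4, n=8, bins=8):
    # Gradient magnitude and orientation per pixel (operator is an assumption).
    gy, gx = np.gradient(roi.astype(np.float32))
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx)
    fh, fw = roi.shape[0] // n, roi.shape[1] // m
    histograms = []
    for i in range(n):
        for j in range(m):
            field = (slice(i * fh, (i + 1) * fh), slice(j * fw, (j + 1) * fw))
            # One 8-bin histogram per field for magnitudes and orientations,
            # binned between the respective minimum and maximum values.
            histograms.append(np.histogram(magnitude[field], bins=bins)[0])
            histograms.append(np.histogram(orientation[field], bins=bins)[0])
    features = np.concatenate(histograms).astype(np.float32)
    # Normalization to [0.1, 0.9] for neural network classification.
    return 0.1 + 0.8 * features / max(features.max(), 1e-9)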

3.3 Training and validation of the classifier

3.3.1 Creating training datasets

The classifier is trained with positive examples of pedestrian images and negative examples of non-pedestrian images. The initial detection routines and a bootstrapping method (section 3.4.1) are used to generate training data; a classifier is thus trained on data from a particular initial detection routine. The reason for using the initial detection routines for generating training data is that in this way, the classes of positive and negative examples are well defined. In particular, the class of non-pedestrians is in principle infinitely large; by using the output from the initial detection routines, the negative examples are limited to those that are generated by the initial detection routine. For each initial detection routine, a separate classifier is trained, because different initial detection routines generate different positive and negative examples. The training data is created by manually marking the pedestrians in image sequences. This is done by drawing a rectangle around a pedestrian in each frame, which results in a set of ground truth data. The initial detection routines are applied to the image sequences and their output is compared to the ground truth data. When an initial detection is at the same coordinates as a labeled pedestrian in the ground truth data, it is stored as a positive example in the training data; otherwise it is stored as a negative example. Separate labels are used for pedestrians in front/rear views and pedestrians in side views to reduce the amount of inner class variability, and a separate classifier is trained for front/rear views and side views. The idea is that the classification is improved when the classes of objects are better defined. The complete set of image data is divided into three approximately equally large, mutually exclusive subdatasets. The first subset is the training data, which is used for training the classifier. The second subset is the validation dataset, which is used to prevent overfitting during training of the classifier. The third subset is the test dataset, which is used to evaluate the performance of an optimized classifier. There are always three times as many negative examples as there are positive examples in the training and validation datasets. This results in classifiers with a low false positive rate. For a real system, false detections are unacceptable, so the classifier is trained in a way to have the lowest false positive rate possible.

Once the available data has been divided into subdatasets, the image features are calculated and stored as feature vectors. For each of the initial detection routines, feature vectors are generated.

3.3.2 Training neural networks

For training neural networks, early stopping is used. The network is trained on the training dataset, and at each training iteration the classification error on the validation dataset is monitored. During a certain number of iterations after the training has started, both the training error and the validation error decrease. From a certain iteration on, the training error keeps decreasing but the validation error increases. From this point on, the network starts overfitting the data: it memorizes the training data and does not generalize well anymore on the validation data. The network at the training iteration where the validation error starts to increase is selected as the final network. The optimal network from the early stopping procedure is then tested on a test dataset, mutually exclusive from the training and validation datasets. After some experimentation, the number of hidden neurons is fixed at a value of 10, such that a direct comparison between the performance of different networks is possible. The value of 10 neurons was found to be the minimum required for the optimal performance of the network: decreasing the number of hidden neurons below 10 results in a reduced classification performance.

3.3.3 Training support vector machines

For training support vector machines, a grid search is used to find the optimal value for C, the trade-off between the maximization of the margin of the separating hyperplane and training error minimization, and, in the case of a radial basis function kernel, to find the optimal value for γ, the width of the Gaussian in the kernel. The grid search is performed by training a support vector machine on a training dataset and evaluating its performance on a validation dataset mutually exclusive to the training dataset. The evaluation criteria are true positive rate, false positive rate, and number of support vectors. The optimal support vector machine from the grid search is then tested on a test dataset, mutually exclusive from the training and validation datasets.

3.4 Optimization of the classification

The standard classification with neural networks and support vector machines is often not optimal. Also, in the case of support vector machines, the training generates so many support vectors that the classification is too slow for real-time pedestrian detection. Therefore, some form of optimization is necessary, both in classification performance and classification time.

3.4.1 Bootstrapping

The class of negative training examples is usually not well defined. To generate a dataset of representative negative examples, a bootstrapping technique can be applied. This works as follows: a training dataset is generated from all positive training examples and a small number of negative examples. A classifier is trained on this data and is tested on a dataset not used for training. The false positives from the test dataset are added to the training dataset and a new classifier is trained on the training dataset. Again, this classifier is tested on another test dataset and the false positives from this dataset are added to the training examples. This procedure is repeated until a desired false positive rate is achieved.
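The loop below sketches this procedure; train_fn stands for training either type of classifier, negative_pool is a set of examples known to contain no pedestrians, and all names are illustrative.

def bootstrap(positives, negatives, negative_pool, train_fn,
              target_fpr, max_rounds=10):
    # Repeatedly retrain, collect false positives on unseen negative data,
    # and add them to the training negatives (sketch, not the exact setup).
    classifier = train_fn(positives, negatives)
    for _ in range(max_rounds):
        false_positives = [x for x in negative_pool
                           if classifier.predict(x) == 1]
        if len(false_positives) / max(len(negative_pool), 1) <= target_fpr:
            break
        negatives = negatives + false_positives
        classifier = train_fn(positives, negatives)
    return classifier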

3.4.2 Component classification

The idea of component classification is to train a classifier for each part of a pedestrian. A set of local classifiers whose outputs are combined may give better results than one global classifier. Figure 3.3 shows an example of the subregions: there are seven subregions in the region of interest, and a classifier is trained for each subregion. The final classification decision is made by voting over the outputs of the subregion classifiers. For example, for each of the regions of figure 3.3(b), a neural network is trained. When the trained networks are applied to an unseen example, this results in seven network outputs. The network outputs are voted to get the final output: if at least four networks give a positive output, the final result is positive; otherwise the final result is negative.

Figure 3.3: Subregions for component classification.
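The voting itself is compact; extract_subregion and the trained subregion networks are placeholders for the components described above.

def classify_components(roi, subregion_nets, extract_subregion):
    # Positive overall result if at least four of the seven local
    # classifiers give a positive output.
    votes = sum(1 for k, net in enumerate(subregion_nets)
                if net.predict(extract_subregion(roi, k)) == 1)
    return votes >= 4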

3.4.3 A simplified approximation to the support vector machine decision rule

The number of support vectors of a decision surface scales approximately with the number of training examples. This makes the forward propagation through a support vector machine slow compared to other classifiers such as neural networks. To improve the forward propagation speed, a reduced set of vectors can be calculated from the original set of support vectors. To calculate a reduced set of vectors which approximates the original decision surface, the algorithm from [8] is applied. From the original decision surface

$\Psi = \sum_{a=1}^{N_s} \alpha_a y_a \Phi(s_a)$

where $\alpha_a$ are the weights calculated during training and $y_a \in \{-1, 1\}$ are the class labels of the $N_s$ support vectors $s_a$, a reduced vector set $z_k$ of size $N_z < N_s$ is calculated with decision surface

$\Psi' = \sum_{k=1}^{N_z} \gamma_k \Phi(z_k)$

where $\gamma_k$ are the weights. This is done by minimizing the Euclidean distance $\rho = \|\Psi - \Psi'\|$. The reduced set of vectors $z_k$, $k = 1, \ldots, N_z$, is calculated with an unconstrained conjugate gradient method.

3.5 Feature selection

Feature selection is applied to select a stable set of features for classification from the whole set of features and to speed up the forward propagation of the classifier: the smaller the dimension of the input data, the faster the forward propagation.

3.5.1 Principal component analysis

Principal component analysis transforms a set of possibly correlated variables into a smaller number of uncorrelated variables. In feature selection this means finding a linear subspace of the complete set of features. The method for generating a linear subspace of a lower dimension is the Karhunen-Loève transform, a standard technique from statistical pattern recognition. Principal component analysis on a dataset consists of the following steps:

1. Calculation of the mean of each of the dimensions in the dataset.
2. Subtraction of the mean of each dimension in the dataset such that the mean of the dataset is zero.
3. Calculation of the covariance matrix of the data.
4. Calculation of the eigenvectors and eigenvalues of the covariance matrix.
5. Selection of the eigenvectors with the highest eigenvalues as the principal components of the dataset.
6. Calculation of the reduced dataset by projecting the original dataset onto the selected eigenvectors.

For classification this is used as follows: principal component analysis is applied on the feature vectors of pedestrian images in the training dataset. The matrix containing the eigenvectors for transforming the training dataset into the reduced features dataset is stored. The transformation matrix is used to transform the feature vectors of the training data to a lower dimension, and a classifier is trained on the reduced set of feature vectors. Before each forward propagation through the classifier, the transformation is applied to the feature vector using the stored matrix. The purpose of principal component analysis for a real-time system is to reduce the input space of the classifier. For testing the performance of classification with principal component analysis, the full set of 224 input features to a neural network is reduced to 100 features and to 50 features.
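The steps above map directly onto a few NumPy calls; this sketch returns the stored mean and transformation matrix that are applied before each forward propagation.

import numpy as np

def fit_pca(X, n_components):
    # X: one feature vector per row. Steps 1-5: mean, covariance,
    # eigendecomposition, selection of the strongest eigenvectors.
    mean = X.mean(axis=0)
    covariance = np.cov(X - mean, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    return mean, eigenvectors[:, order]

def transform(x, mean, W):
    # Step 6: project a feature vector onto the principal components;
    # applied once before each forward propagation through the classifier.
    return (x - mean) @ W

# Example: reduce the 224 input features to 50.
# mean, W = fit_pca(training_features, 50)
# reduced = transform(feature_vector, mean, W)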

3.5.2 Adaboost

Adaboost [19] is a learning algorithm which constructs a strong classifier from a number of weak classifiers. Adaboost is explained in algorithm 5. Adaboost can be used for feature selection by limiting the number of iterations in the algorithm: the most discriminative features are found in increasing iteration order. In this work, a single layer perceptron with a sigmoidal activation function is used as the weak learner. The Adaboost algorithm is iterated until a certain classification performance is achieved or until the number of desired features is reached. Like principal component analysis, Adaboost can be used to reduce the input dimension of the classifier. In addition, it can boost the classification performance by finding the most discriminative features.

Algorithm 5 Adaboost. Algorithm from [39].
Input: A set of examples {(x_1, y_1), ..., (x_n, y_n)} with labels $y_i \in \{-1, 1\}$.
Output: A strong classifier h(x).
1. Initialize the weights of the training examples: $w_{1,i} = \frac{1}{n}$ for all $i$.
2. Repeat steps 3 until 5 for $t = 1, \ldots, T$.
3. Normalize the weights: $w_{t,i} \leftarrow \frac{w_{t,i}}{\sum_{j=1}^{n} w_{t,j}}$.
4. For each feature $j$, train a weak learner $h_j$. The error of the classifier is $\epsilon_j = \sum_i w_{t,i}\, |h_j(x_i) - y_i|$. Select the weak learner $h_t$ with the lowest error $\epsilon_t$.
5. Update the weights: $w_{t+1,i} = w_{t,i}\, \beta_t^{1 - e_i}$, where $e_i = 0$ if $x_i$ is classified correctly, $e_i = 1$ if $x_i$ is classified incorrectly, and $\beta_t = \frac{\epsilon_t}{1 - \epsilon_t}$.
6. The final strong classifier is

$h(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \geq \frac{1}{2} \sum_{t=1}^{T} \alpha_t \\ 0 & \text{otherwise} \end{cases}$

with $\alpha_t = \log \frac{1}{\beta_t}$.
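The sketch below implements discrete Adaboost with one-feature decision stumps as weak learners, rather than the single layer perceptrons used in this work; labels are in {-1, +1} and the weight update follows the standard formulation, so it should be read as an illustration of the feature-selection idea, not as algorithm 5 verbatim.

import numpy as np

def adaboost(X, y, T):
    # X: (n, d) feature matrix; y: labels in {-1, +1}; T rounds,
    # which equals the number of selected features.
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(T):
        best = None
        for j in range(d):                       # one stump per feature
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] > thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        alpha = 0.5 * np.log(max(1 - err, 1e-12) / max(err, 1e-12))
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)           # reweight the examples
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble                              # the j's are the features

def adaboost_predict(ensemble, x):
    score = sum(a * s * (1 if x[j] > thr else -1)
                for a, j, thr, s in ensemble)
    return 1 if score >= 0 else -1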

3.5.3 Multi-objective optimization

Feature selection with a genetic algorithm works as follows: the indices of the feature vector are coded on a binary chromosome. A one at a feature index on the chromosome means the feature is selected; a zero means the feature is not selected. A genetic algorithm is used to evolve a population of individuals, each with one chromosome. The chromosomes of the individuals are all initialized to random values (zero or one). The genetic algorithm performs selection, crossover, mutation, and fitness evaluation of each individual. In [44], a genetic algorithm is used to select a subset of features for medical classification problems. In [17], a multi-objective feature selection is used with classification performance and feature dimension as fitness criteria. In [33], the genetic algorithm evolves a real valued chromosome; in this way, the genetic search results in a relative weighting of the features. In this work, a binary tournament selection is used to select the individuals for reproduction. Tournament selection with n individuals means that n individuals are drawn from the population; the individual with the highest fitness value among the n individuals is selected. Uniform crossover is used to generate an offspring population from the parent population. This means each gene on a chromosome of an offspring is randomly selected from the corresponding genes on the parent chromosomes. The probability of crossover is set to 0.7. During mutation, the value of a bit can be flipped. The probability of mutation is set to 1.0 / (number of features). The fitness of an individual is calculated by training and validating a classifier, as described in sections 3.3.2 and 3.3.3, on the features that have the value one in the chromosome. The fitness value has two objectives: the true positive rate on the validation dataset and the false positive rate on the validation dataset.

An important choice in multi-objective optimization is how individuals with multiple fitness values are sorted before reproduction. In this work, the method from [16] is used. This method is based on a comparison operator $\prec_n$. Each individual $i$ in the population has two attributes: a rank $i_{rank}$ and a crowding distance $i_{distance}$. The calculation of the rank is shown in algorithm 6; the calculation of the crowding distance is shown in algorithm 7. The sort function in algorithm 7 sorts on objective value. The parameters $f_m^{max}$ and $f_m^{min}$ are the maximum and minimum value of objective $m$, respectively. The operator $\prec$ in algorithm 6 is the domination operator: an individual $p$ with $n$ objectives $\{p_1, \ldots, p_n\}$ dominates an individual $q$ if

$\forall i \in \{1, \ldots, n\}: p_i \geq q_i \quad \wedge \quad \exists i \in \{1, \ldots, n\}: p_i > q_i.$

If individuals do not dominate each other, they are assigned the same rank. All solutions of the optimization that are not dominated by a different solution make up the Pareto optimal set. The comparison operator $\prec_n$ is defined as

$i \prec_n j \quad \text{if} \quad (i_{rank} < j_{rank}) \ \text{or} \ ((i_{rank} = j_{rank}) \ \text{and} \ (i_{distance} > j_{distance})).$
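The two comparisons at the heart of this sorting are compact enough to state directly; the sketch assumes both objectives are oriented so that larger values are better (for example the true positive rate and the negated false positive rate).

def dominates(p, q):
    # p, q: tuples of objective values, larger is better by convention here.
    return (all(a >= b for a, b in zip(p, q))
            and any(a > b for a, b in zip(p, q)))

def crowded_compare(i, j):
    # The operator defined above: lower rank wins; ties are broken by the
    # larger crowding distance. i and j carry .rank and .distance.
    return i.rank < j.rank or (i.rank == j.rank and i.distance > j.distance)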

Algorithm 6 Assignment of a rank to an individual. Algorithm from [16].
Input: A population P with fitness values assigned to each individual p ∈ P.
Output: A rank assigned to each individual p ∈ P.
1. Repeat steps 2 until 6 for each p ∈ P.
2. S_p = Ø, n_p = 0.
3. Repeat steps 4 until 5 for each q ∈ P.
4. If (p ≺ q) then S_p = S_p ∪ {q}.
5. Else if (q ≺ p) then n_p = n_p + 1.
6. If n_p = 0 then p_rank = 1, F_1 = F_1 ∪ {p}.
7. i = 1.
8. Repeat steps 9 until 15 while F_i ≠ Ø.
9. Q = Ø.
10. Repeat steps 11 until 13 for each p ∈ F_i.
11. Repeat steps 12 until 13 for each q ∈ S_p.
12. n_q = n_q − 1.
13. If n_q = 0 then q_rank = i + 1, Q = Q ∪ {q}.
14. i = i + 1.
15. F_i = Q.

Algorithm 7 Crowding-distance-assignment. Algorithm from [16].
Input: A population P of size N with fitness values assigned to each individual in P.
Output: A crowding distance assigned to each individual.
1. Repeat step 2 for i = 1, ..., N.
2. P(i)_distance = 0.
3. Repeat steps 4 until 7 for each objective m.
4. P = sort(P, m).
5. P(1)_distance = P(N)_distance = ∞.
6. Repeat step 7 for i = 2, ..., N − 1.
7. P(i)_distance = P(i)_distance + (P(i+1).m − P(i−1).m) / (f_m^max − f_m^min).

3.6 Scanning a classifier through the image at every position and scale

It is interesting to compare the initial detection based approaches described in this work with methods where a classifier is scanned through the image at every position and scale. Examples of such systems are [39], [32] for face detection, and [40], [32] for pedestrian detection. The working of such a system is described in algorithm 8. For training a classifier for this method, manually labeled regions generated as described in section 3.3.1 are the positive examples. The negative examples are generated by applying the bootstrapping method described in section 3.4.1 to randomly selected regions of interest from image sequences which contain no pedestrians. As an alternative, the classifiers trained on the initial detection routines can be used.

Algorithm 8 Scanning a classifier through the image at every scale and location.
Input: An image.
Output: A list of positive classifications.
1. Repeat steps 2 until 4 for every scale and image position.
2. In the current scanning window, calculate the image features for classification.
3. Classify the feature vector from the current scanning window.
4. If the classification output is positive, add the coordinates of the scanning window to the output list.

To improve the method in algorithm 8, the following modifications are made: a separate classifier is trained for side views and front/rear views of pedestrians, and a separate classifier is trained for pedestrians at a low resolution and pedestrians at a high resolution. The optimization methods from section 3.4 and the feature selection methods from section 3.5 are applied. An example of classifying the whole image at every position and scale is shown in figure 3.4.
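A direct rendering of algorithm 8 follows; the classifier, the feature extraction, the scanning step, and the scale set are placeholders.

def scan_classifier(image, classifier, extract_features,
                    window=(24, 48), step=4, scales=(1.0, 1.5, 2.0)):
    positives = []
    for scale in scales:
        w, h = int(window[0] * scale), int(window[1] * scale)
        for y in range(0, image.shape[0] - h, step):
            for x in range(0, image.shape[1] - w, step):
                # Steps 2-4: features, classification, collect positives.
                features = extract_features(image[y:y + h, x:x + w])
                if classifier.predict(features) == 1:
                    positives.append((x, y, w, h))
    return positives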

Figure 3.4: Classifying the whole image at every scale and resolution.

Chapter 4

Tracking

The purpose of tracking is to keep track of an object through successive image frames after a positive classification. In the case of pedestrians, this is challenging, for the following reasons:

- pedestrians keep changing shape while they are moving,
- they usually increase in size in the image because of the movement of the car,
- a reliable motion model is not available because of the motion and pitching of the car,
- the contrast between the pedestrian and the background is often low, so it is difficult to calculate reliable image features for tracking,
- there are usually changes in illumination conditions,
- the background in urban traffic scenes is complex and often contains many pedestrian-like objects, and
- the resolution of pedestrians in the image is often small, so not much information is available for tracking.

4.1 Tracking using the Hausdorff distance

The Hausdorff distance is a measure of inequality between two sets of points. Given two sets of points $P = \{p_1, \ldots, p_m\}$ and $Q = \{q_1, \ldots, q_n\}$, their Hausdorff distance is

$H(P, Q) = \max\left( \max_{p \in P} \min_{q \in Q} \|p - q\|,\ \max_{q \in Q} \min_{p \in P} \|q - p\| \right).$

If $d_1$ is the maximum distance from points of $P$ to the nearest point in $Q$ and $d_2$ is the maximum distance from points of $Q$ to the nearest point in $P$, then the Hausdorff distance is the maximum of these two distances. The partial Hausdorff distance can be used to measure the inequality between subsets of two sets of points. Given two sets $P = \{p_1, \ldots, p_m\}$ and $Q = \{q_1, \ldots, q_n\}$, their partial Hausdorff distance is

$H_k(P, Q) = K^{th}_{p \in P} \min_{q \in Q} \|p - q\|$

where $1 \leq k \leq m$. Here, the k-th ranked element is used instead of the maximum element. The partial Hausdorff distance makes it possible to locate objects which are partially occluded.
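Both distances can be computed with a few NumPy operations; this sketch uses a brute-force nearest-neighbour search, with the K-th ranked distance taken in ascending order so that frac = 1 recovers the directed Hausdorff distance.

import numpy as np

def directed_partial_hausdorff(P, Q, frac=1.0):
    # Nearest-neighbour distance from every point of P to the set Q,
    # then the K-th ranked value with K = frac * |P|.
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    d = np.sqrt(((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)).min(axis=1)
    k = max(int(frac * len(P)), 1)
    return np.sort(d)[k - 1]

def hausdorff(P, Q):
    # Symmetric Hausdorff distance: the maximum of the two directed distances.
    return max(directed_partial_hausdorff(P, Q),
               directed_partial_hausdorff(Q, P))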

The tracker based on the Hausdorff distance [41] works as follows. The tracker is initialized with a set of image features P (intensity values or gradients); this is the model set of the tracker. In the next frame, at several image positions and scales with feature set Q, the partial Hausdorff distance is calculated between the feature set P and the feature set Q. The location of the feature set Q with the smallest partial Hausdorff distance between P and Q is selected as the new location and scale of the tracked object. A Kalman filter with a linear motion model is used to estimate the x position and y position of the search region for the pedestrian in the next frame, in which the feature set Q is calculated. The process noise and measurement noise for the Kalman filter are drawn from a normal distribution. A problem with using the Hausdorff distance for tracking is that when a model for tracking is generated from the image features, the model does not contain just image features from the object to be tracked but also image features from the background. This invalidates the tracking model when the object is moving. In order to separate model features from background features, the following model adaptation mechanism is used. The set of image features in the next frame is compared to the image features of the tracker model. A list is made of image features which are present in the model but do not correspond to an image feature in the next frame. Also, a list is made of image features which are present in the next frame but do not correspond to a feature in the model. From the first list, model features are removed randomly with a certain probability; from the second list, image features are randomly added to the model features with a certain probability. The degree of adaptiveness of the tracker can be controlled by varying the probability parameter. The complete Hausdorff tracker is described in algorithm 9. In this work, for pedestrians, only the upper half of the body is tracked, because the movement of the legs would invalidate the model of the tracker within a few frames. In order to handle the problem of background features in the model, the tracker model is adapted every few frames because of the often high relative speed difference between the pedestrian and the car.

Algorithm 9 The Hausdorff tracker.
Input: A feature image I, the position of the object in the previous frame $p_{t-1}$, the feature set of the model P.
Output: The position of the object in the current frame $p_t$.
1. Estimate with a Kalman filter the next position $\hat{x}_t$ of the object.
2. Search in I, in the area around the estimated position $\hat{x}_t$, for the location $p_t$ with feature set Q such that the partial Hausdorff distance between P and Q is minimal.
3. If the tracker model needs to be updated this frame, make a list $L_1$ of image features which are present in the model but do not correspond to an image feature in frame t, and make a list $L_2$ of image features which are present in frame t but do not correspond to a feature in the model.
4. If the tracker model needs to be updated this frame, randomly remove features in $L_1$ from the model and randomly add features from $L_2$ to the model.

4.2 Mean shift tracking

Mean shift tracking [11] is based on the assumption that the position and scale of an object do not change much from one frame to the next. A target model with image feature z has a density function $q_z$; the target candidate located at position y has a feature distribution $p_z(y)$. Mean shift tracking finds the position y whose density $p_z(y)$ matches the density $q_z$ best. The metric used for the density similarity is the Bhattacharyya coefficient:

$\rho(y) \equiv \rho[p(y), q] = \int \sqrt{p_z(y)\, q_z}\, dz. \quad (4.1)$

For sampled data, the discrete density is calculated from the m-bin histogram: $\hat{q} = \{\hat{q}_1, \ldots, \hat{q}_m\}$ with $\sum_{u=1}^{m} \hat{q}_u = 1$ for the model, and $\hat{p}(y) = \{\hat{p}_1(y), \ldots, \hat{p}_m(y)\}$ with $\sum_{u=1}^{m} \hat{p}_u = 1$ for the candidate.

A kernel is used to assign a weighting to the pixels: pixels further away from the center of the target are given a lower weight than pixels at the center, because pixels further away from the center are more likely to be affected by occlusion or background. The kernel applied in [11] is the Epanechnikov kernel

$k(x) = \begin{cases} \frac{1}{2} c_d^{-1} (d + 2)(1 - x) & \text{if } x < 1 \\ 0 & \text{otherwise} \end{cases}$

for normalized image coordinates x. To find the new location $\hat{y}_1$ of the target, the mean shift vector is applied:

$\hat{y}_1 = \frac{\sum_{i=1}^{n_h} x_i w_i\, g\left( \left\| \frac{\hat{y}_0 - x_i}{h} \right\|^2 \right)}{\sum_{i=1}^{n_h} w_i\, g\left( \left\| \frac{\hat{y}_0 - x_i}{h} \right\|^2 \right)} \quad (4.2)$

where $\hat{y}_0$ is the location of the target in the previous frame, g is the kernel for weighting, $w_i$ is the weight of pixel $x_i$, and $n_h$ is the number of pixels within the radius h of the kernel around the current location $\hat{y}_0$. The weights $w_i$ are calculated as follows:

$w_i = \sum_{u=1}^{m} \sqrt{\frac{\hat{q}_u}{\hat{p}_u(\hat{y}_0)}}\, \delta[b(x_i) - u]$

where b is a function that assigns the index of a bin to a pixel. The complete tracking algorithm is described in algorithm 10. In this work, for tracking pedestrians, the features z used for calculating the densities are intensity values or image gradients. The radius h of the kernel depends on the size of the region of interest to be tracked: for a large region of interest a large radius is used, for a small region of interest a small radius is used.

Algorithm 10 Mean shift tracking.
Input: The distribution $\hat{q}$ of the target object and its position $\hat{y}_0$ in the previous frame; an image with pixels x.
Output: The position $\hat{y}_1$ of the object in the current frame.
1. Calculate the distribution $\{\hat{p}_u(\hat{y}_0)\}_{u=1,\ldots,m}$ in the current frame at $\hat{y}_0$ and the Bhattacharyya coefficient $\rho[\hat{p}(\hat{y}_0), \hat{q}]$ with (4.1).
2. Calculate the new position $\hat{y}_1$ of the target with (4.2).
3. Calculate the Bhattacharyya coefficient $\rho[\hat{p}(\hat{y}_1), \hat{q}]$ at the new position $\hat{y}_1$.
4. Repeat step 5 while $\rho[\hat{p}(\hat{y}_1), \hat{q}] < \rho[\hat{p}(\hat{y}_0), \hat{q}]$.
5. $\hat{y}_1 = \frac{1}{2}(\hat{y}_0 + \hat{y}_1)$.
6. If $\|\hat{y}_1 - \hat{y}_0\| > \epsilon$: set $\hat{y}_0 = \hat{y}_1$ and go to step 1.
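The density comparison and the per-pixel weights translate as follows; using a flat kernel profile in the shift step is a simplification of the Epanechnikov profile used above.

import numpy as np

def bhattacharyya(p, q):
    # Discrete Bhattacharyya coefficient between two m-bin histograms.
    return np.sum(np.sqrt(p * q))

def mean_shift_weights(bin_indices, q_model, p_candidate):
    # w_i = sqrt(q_u / p_u(y0)) for the bin u of pixel x_i; bin_indices
    # holds b(x_i) for every pixel within the kernel radius.
    p_safe = np.maximum(p_candidate, 1e-12)
    return np.sqrt(q_model[bin_indices] / p_safe[bin_indices])

def mean_shift_step(positions, weights):
    # New location as the weighted mean of the pixel positions (flat
    # kernel profile g; an assumption of this sketch).
    w = weights[:, None]
    return (positions * w).sum(axis=0) / w.sum()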

4.3 Tracking with Condensation

In the Condensation algorithm [25], tracking is performed by propagating a state density $p(x_t|Z_t)$ over time with Bayes' rule:

$p(x_t|Z_t) = k_t\, p(z_t|x_t)\, p(x_t|Z_{t-1})$

where $x_t$ is the state at time t, $z_t$ are the observations from the image at time t, $p(z_t|x_t)$ is the observation density at timestep t, the prior $p(x_t|Z_{t-1})$ is a prediction step from the posterior density $p(x_{t-1}|Z_{t-1})$ of the previous timestep t−1, and $k_t$ is a constant. The prior density of timestep t is generated by factored sampling. The posterior density $p(x_{t-1}|Z_{t-1})$ of the previous timestep t−1 is represented by a set of N weighted samples $\{(s_{t-1}^{(n)}, \pi_{t-1}^{(n)}),\ n = 1, \ldots, N\}$. N samples are sampled with replacement from the sample set; each sample $s_{t-1}^{(n)}$ is selected with a probability of $\pi_{t-1}^{(n)}$. Samples with high weights may be selected multiple times. In general, each selected sample undergoes a deterministic step followed by a random step; this gives the prior density $p(x_t|Z_{t-1})$ of timestep t. Now, the observation density $p(z_t|x_t)$ is used to calculate the weights of the samples, which gives the posterior density $p(x_t|Z_t)$ of timestep t. In this work, the prior density $p(x_t|Z_{t-1})$ of timestep t is calculated by applying only a random step: it is assumed that a reliable motion model of a pedestrian does not exist, because the camera may move or stand still and the pedestrian may move or stand still. In addition, the pedestrian may be moving perpendicular or parallel to the camera axis. Therefore, only random motion is used. For tracking pedestrians from a moving car, x position, y position, and scale are the parameters for tracking; each sample represents a position and scale. For calculating the observation density from the image data, the contour based approach from [25] is followed. A small set of model contours is calculated by performing a principal component analysis on a set of manually labeled contours (as described in [1]). An example of a model contour is shown in figure 4.1. A sample represents a contour at a certain position and scale. The distance from the contour along a fixed set of M normals along the contour is used to calculate the observation density $p(z|x)$ of a sample:

$p(z|x) \propto \exp\left( -\sum_{m=1}^{M} \frac{1}{2rM}\, \min\big( \|z_1(s_m) - r(s_m)\|;\ \mu \big) \right)$

where $r(s_m)$ is a model contour, $z_1(s_m)$ is the closest feature to $r(s_m)$ along normal m, and r and µ are constants. In the case of multiple model contours, the contour with the smallest sum is used for calculating the observation density. The sample with the smallest sum of distances is selected as the current position/scale of the pedestrian. For tracking pedestrians, the image energy is used as the feature for tracking.

The complete outline of the Condensation algorithm is shown in algorithm 11.

Figure 4.1: Model contour of the Condensation tracker.

Algorithm 11 The Condensation algorithm.
Input: An initial sample set $\{(s_1^1, \pi_1^1), \ldots, (s_1^N, \pi_1^N)\}$ representing positions and scales at timestep 1.
Output: A sample set $\{(s_T^1, \pi_T^1), \ldots, (s_T^N, \pi_T^N)\}$ at timestep T.
1. Repeat steps 2 until 5 for t = 1, ..., T iterations.
2. Sample with replacement N samples from the set $\{(s_t^1, \pi_t^1), \ldots, (s_t^N, \pi_t^N)\}$; each sample $s_t^i$ is selected with probability $\pi_t^i$.
3. Each of the selected samples undergoes a deterministic motion followed by a random motion.
4. Each of the samples is weighted by calculating its observation density. This gives the sample set $\{(s_{t+1}^1, \pi_{t+1}^1), \ldots, (s_{t+1}^N, \pi_{t+1}^N)\}$ of timestep t + 1.
5. The sample $s_{t+1}^i$ with the smallest sum of distances, calculated for its observation density, is selected as the current position/scale of the pedestrian.
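One timestep of the sampling scheme can be sketched as follows; the observation function and the noise scale of the purely random motion model are placeholders.

import numpy as np

def condensation_step(samples, weights, observe, noise_scale, rng):
    # samples: (N, 3) rows of (x, y, scale); weights: (N,).
    n = len(samples)
    # Factored sampling: resample with replacement, proportional to weights.
    indices = rng.choice(n, size=n, p=weights / weights.sum())
    # Only a random step, as no reliable pedestrian motion model exists.
    new_samples = samples[indices] + rng.normal(0.0, noise_scale,
                                                samples[indices].shape)
    # Re-weight with the observation density p(z|x).
    new_weights = np.array([observe(s) for s in new_samples])
    return new_samples, new_weights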

4.4 Integrating initial detections through time

A very simple model-free tracking method is based on the assumption that if a pedestrian is present in a certain frame, it will be present at approximately the same location and at approximately the same scale in the next frame. To find the pedestrian in the next frame, the output of the initial detection routine is used: the positions and scales of the regions of interest from the initial detection routine are evaluated, and if a matching region of interest is found, it is selected as the location of the pedestrian in the next frame. An example of this method is shown in figure 4.2. The advantage of using the initial detection for tracking is that it is invariant to the scaling of the pedestrians through successive frames, since the initial detection is developed to detect pedestrians at various sizes. It is also developed for finding pedestrians in complex, structured environments and under changing illumination conditions. The initial detection tracker is described in algorithm 12.

Algorithm 12 Initial detection tracker.
Input: The current center position $(x_{current}, y_{current})$ and scale $s_{current}$ of the tracked region of interest; a list $(x_1, y_1, s_1), \ldots, (x_N, y_N, s_N)$ of N initial detections in the next frame.
Output: A new center position $(x_{new}, y_{new})$ and scale $s_{new}$ in the next frame.
1. Initialize j = 1, $d_j = \sqrt{(x_{current} - x_j)^2 + (y_{current} - y_j)^2}$ and $e_j = |s_{current} - s_j|$.
2. Repeat steps 3 and 4 for i = 2, ..., N.
3. $d_i = \sqrt{(x_{current} - x_i)^2 + (y_{current} - y_i)^2}$ and $e_i = |s_{current} - s_i|$.
4. If $d_i < d_j$ and $e_i \leq e_j$, set j = i.
5. $(x_{new}, y_{new}) = (x_j, y_j)$ and $s_{new} = s_j$.
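A compact reading of algorithm 12 follows; using a fixed relative scale tolerance instead of the running scale comparison is an assumption of this sketch.

import math

def match_initial_detection(current, detections, scale_tol=0.2):
    # current, detections: (x, y, s) center positions and scales.
    xc, yc, sc = current
    best, best_distance = None, float("inf")
    for x, y, s in detections:
        distance = math.hypot(xc - x, yc - y)
        if distance < best_distance and abs(sc - s) <= scale_tol * sc:
            best, best_distance = (x, y, s), distance
    return best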

Figure 4.2: Integrating initial detections through time. (a) A positive classification in a fir image; (b) the initial detections in the next frame; (c) the initial detection selected as the new position of the pedestrian.

4.5 Tracking the classification output

When detection is performed by scanning a classifier through the whole image at every scale and location, as described in section 3.6, the classification output can also be used for tracking. The tracker described in this section resembles the tracker from [32], where Condensation is used for propagating a density of classification outputs over time. The assumption is made that the classification outputs decrease proportionally to the distance from the pedestrian in both location and scale. In addition, the assumption is made that the location and scale of a pedestrian do not change much from one frame to another.

Tracking a pedestrian then means locating the highest peak in the classification outputs near the position of the pedestrian in the previous frame. This method can possibly reduce the false positive detection rate, because it can be expected that false positives do not generate a similar peak in the classification output as a pedestrian does. An example of the classification outputs at a few scales was shown in figure 3.4. The complete method is described in algorithm 13.

Algorithm 13 Tracking the classification output.
Input: An initial set of positions $\{p_1, \ldots, p_n\}$ containing a pedestrian.
Output: A set of new positions after T frames.
1. Repeat steps 2 until 5 for t = 1, ..., T.
2. Calculate the classification outputs at each location and scale in frame t.
3. Repeat steps 4 and 5 for each of the positions $p \in \{p_1, \ldots, p_n\}$.
4. Find the highest peak in the classification outputs around p.
5. Set the position and scale of p to the position and scale of the highest peak.
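The peak search reduces to an argmax over a neighbourhood in position and scale; the search radius and the one-step scale neighbourhood are assumptions of this sketch.

import numpy as np

def track_peak(scores, previous, radius=8):
    # scores[s, y, x]: classification output per scale and position;
    # previous: (s, y, x) of the pedestrian in the last frame.
    s0, y0, x0 = previous
    s_lo = max(s0 - 1, 0)
    y_lo, x_lo = max(y0 - radius, 0), max(x0 - radius, 0)
    window = scores[s_lo:s0 + 2, y_lo:y0 + radius + 1, x_lo:x0 + radius + 1]
    s, y, x = np.unravel_index(np.argmax(window), window.shape)
    return (s + s_lo, y + y_lo, x + x_lo)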

Chapter 5

Experimental results

5.1 Initial detection

In this section, the performance of the initial detection routines, evaluated on fir and grayscale images, is described. To evaluate the performance of the initial detection routines, the output of each of the routines is compared to manually labeled ground-truth data. The comparison is performed as follows: for each frame of a video sequence in the ground-truth database, the output of the initial detection routine is calculated. Then the output of the initial detection routine is matched to the ground-truth data. The matching process is shown in figure 5.1. The ground-truth data is represented by red outlined rectangles and the output of the initial detection routine is represented by blue outlined rectangles. There is a true positive detection if a region of interest output by the initial detection routine corresponds to a region of interest in the ground-truth database. This is shown by the overlap of the red outlined rectangle with the blue outlined rectangle in figure 5.1. There is a false positive detection if the initial detection routine outputs a region of interest and there is no corresponding region of interest in the ground-truth database. This is shown by the blue outlined rectangle in figure 5.1 for which there is no corresponding red outlined rectangle.

There is a false negative detection if the ground-truth database contains a region of interest for which there is no corresponding region of interest output by the initial detection routine. This is shown by the red outlined rectangle in figure 5.1. Ideally, the true positive rate of an initial detection routine is one and the false positive rate is zero. In practice, a compromise has to be found between the true positive detection rate and the false positive detection rate.

Figure 5.1: Matching ground truth data.

A certain amount of deviation is allowed for a correspondence. This is shown in figure 5.1: the blue outlined rectangle from the initial detection routine does not exactly match the red outlined rectangle from the ground-truth database. A deviation of 10% of the width and height of the ground-truth region of interest is allowed in the horizontal and vertical direction. The deviation is allowed because not all the initial detection routines deliver accurate regions of interest. In addition, the ground-truth database is produced by manually labeling image sequences, and there is usually a certain amount of inaccuracy in the ground-truth data. The deviation value of 10% is selected because at this level of deviation it is still possible to classify the output of the initial detection routine.

The classifier requires a certain level of accuracy in order to properly learn the class of pedestrians: at a larger deviation, the spatial distribution of image features inside the region of interest makes the region of interest a non-typical pedestrian example. An example of this is shown in figure 5.2. Figure 5.2(a) contains an example suitable for training a classifier: the pedestrians fit nicely in the regions of interest. Figure 5.2(b) contains an example unsuitable for training a classifier: the pedestrian does not fit the region of interest.

Figure 5.2: Suitability of initial detections for classification. (a) Initial detections suitable for classification; (b) initial detections unsuitable for classification.
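The correspondence test can be written down directly; treating the 10% bound as a limit on both the position and the size deviation is one plausible reading of the criterion.

def matches_ground_truth(detection, truth, tolerance=0.10):
    # detection, truth: (x, y, w, h) rectangles; a match requires the
    # deviation to stay within 10% of the ground-truth width and height.
    dx, dy = abs(detection[0] - truth[0]), abs(detection[1] - truth[1])
    dw, dh = abs(detection[2] - truth[2]), abs(detection[3] - truth[3])
    return (dx <= tolerance * truth[2] and dy <= tolerance * truth[3]
            and dw <= tolerance * truth[2] and dh <= tolerance * truth[3])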

In order to properly measure the performance of the initial detection routines, the ground-truth database should contain a large set of representative sequences. The fir images ground-truth database for initial detection contains 4443 pedestrians for the temperature range of 5 degrees, 3510 pedestrians for the temperature range of 15 degrees, and 2904 pedestrians for the temperature range of 25 degrees. The grayscale images ground-truth database for initial detection consists of 50 sequences containing at least one pedestrian each. A pedestrian which is present in multiple frames is counted once for each frame in which it is present. A measurement is made as soon as the pedestrian is large enough for detection. In the 320x240 pixel images used, a minimum size of 10x20 pixels is used. The exact thresholds of the initial detection routines are not mentioned because they depend strongly on the type of camera used, especially in fir images. The results displayed in this section are consistently measured on data from a single fir camera and a single grayscale camera.

5.1.1 Results on fir images

To measure the performance of the initial detection routines on fir sequences, the sequences are divided based on outside temperature. This is done because the intensity distribution of fir images changes with temperature. Four temperature ranges are selected: 5, 15, 25, and 35 degrees, each with a maximum deviation of 3 degrees. An example of fir images at different temperature ranges is shown in figure 5.3. The exact temperature is always supplied as an argument to the initial detection routine.

Figure 5.3: Fir images at different temperature ranges. (a) Fir image at 5 degrees; (b) fir image at 15 degrees; (c) fir image at 25 degrees; (d) fir image at 35 degrees.

Figure 5.4 shows the mean true positive rate of the initial detection routines on the different temperature ranges. The 35 degrees range is omitted because image processing in fir images is not possible anymore at this temperature range with the particular camera used: at 35 degrees there is too little contrast between objects in the image and the background, which is clearly visible in figure 5.3. It is not possible to report the false positive rate of the initial detection routines, because there is no ground truth data of negative examples. It is, however, possible to calculate the percentage of positive detections of the total number of detections. This is shown in figure 5.5.

Figure 5.4: Initial detection results.

Figure 5.5: Percentage positive detections.

Figure 5.6 shows the calculation times of the initial detection routines for fir images. These values do not include the time required for image pre-processing. All values are measured on a computer with a 1470 MHz AMD Athlon processor.

Figure 5.6: Calculation times of initial detection for fir images.

5.1.2 Results on grayscale images

In the case of grayscale images, the distribution of intensity values is mainly dependent on the lighting conditions. Because the illumination conditions may change while the car is moving, the sequences are not divided on illumination conditions; there is a single database containing all grayscale sequences. An example of a sudden change in illumination conditions is shown in figure 5.7. In figure 5.7(a), the image brightness is rather dark, while in figure 5.7(b), which is recorded 6 frames later, the sudden appearance of sunlight causes the image to be much brighter than the previous image.

Figure 5.7: Illumination changes in grayscale images.

A set of sequences containing images under various illumination conditions makes it more difficult for the initial detection to find regions of interest, so a compromise has to be found between generality and performance of the initial detection routines. Of course, the region based brightness segmentation cannot be applied in the case of grayscale images. Figure 5.8 shows the true positive rates of the initial detection routines on grayscale images and the percentage of positive detections of the total number of detections.

Figure 5.8: Initial detection results in grayscale images.

Figure 5.9 shows the calculation times of the initial detection routines for grayscale images. These values include the time required for image pre-processing. All values are measured on a computer with a 1470 MHz AMD Athlon processor.

Figure 5.9: Calculation times of initial detection in grayscale images.

5.2 Classification

The performance of the classification is measured by calculating the true positive rate and the false positive rate of a classifier/image feature combination. This is usually visualized with a so-called ROC (Receiver Operating Characteristic) curve, which displays the true positive rate of the classifier at a certain false positive rate. An example of a ROC curve is shown in figure 5.10. For neural networks, the ROC curve is created by varying the threshold (normally 0.5) at the output neuron. For support vector machines, the ROC curve is created by varying the threshold b in the separating function $f(x) = \mathrm{sign}\left( \sum_{i=1}^{l} \alpha_i y_i k(x, x_i) + b \right)$.
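The curve is produced by sweeping the decision threshold over the raw classifier outputs; the sketch assumes scores where larger means more pedestrian-like and binary ground-truth labels.

import numpy as np

def roc_curve(scores, labels, n_points=100):
    # scores: raw classifier outputs; labels: 1 for pedestrians, 0 otherwise.
    scores, labels = np.asarray(scores), np.asarray(labels)
    positives = max((labels == 1).sum(), 1)
    negatives = max((labels == 0).sum(), 1)
    points = []
    for t in np.linspace(scores.min(), scores.max(), n_points):
        predicted = scores >= t
        tpr = (predicted & (labels == 1)).sum() / positives
        fpr = (predicted & (labels == 0)).sum() / negatives
        points.append((fpr, tpr))
    return points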

Figure 5.10: An example ROC curve.

A separate classifier is trained for each classifier/image feature combination. In addition, separate classifiers are trained for side views and front/rear views of pedestrians. In general, there are two different ways of generating data for training a classifier. The first is manually labeling the output of the initial detection routines. The second is manually creating ground truth data of pedestrians (positive examples) and using random image parts not containing pedestrians as negative examples for training; the bootstrapping method from section 3.4.1 is used to create a representative dataset of negative examples. In fir images the object size for classification is selected as 20x40 pixels; in grayscale images it is selected as 24x48 pixels. Regions of interest smaller than 20x40 pixels in fir images or smaller than 24x48 pixels in grayscale images are not classified, because too little information is available for feature calculation.

To measure the performance of the classification, a representative set of training, validation, and test data has to be selected. All three sub-datasets consist of mostly urban sequences and some country road sequences.

5.2.1 Results on fir images

As with the initial detection in fir images, the training dataset is divided into sub-datasets based on temperature range. As mentioned in section 5.1.1, the intensity distribution of fir images changes with temperature. The same three temperature ranges as used for the initial detection are applied: 5, 15, and 25 degrees, each with a maximum deviation of 3 degrees. For evaluating the classifier, the datasets of 5 and 15 degrees are used, because a large amount of data is available for these temperature ranges. The training dataset of 5 degrees consists of 729 pedestrian images generated from the temperature initial detection, 2581 pedestrian images generated from the vertical gradients initial detection, and 3165 pedestrian images generated from the scanning vertical edge detector initial detection. The training dataset of 15 degrees consists of 287 front/back view pedestrian images generated from the temperature initial detection, 1728 pedestrian images generated from the vertical gradients initial detection, and 1448 pedestrian images generated from the initial detection based on scanning a vertical edge detector through the image. The training dataset is divided into 3 subdatasets: one for training, one for the validation of the trained classifier, and one test dataset. The datasets are divided in a way such that no training image is part of more than one subset. There are always three times as many negative training examples in the dataset as there are positive training examples; the reason for having more negative than positive examples is to achieve a low false positive rate.

Figures 5.11, 5.12, and 5.13 show the ROC curves for the support vector machine classification on a test set where the training data is generated by the initial detection routines from sections 2.2 and 2.3, for different image features.

(a) Orientations features, 5°C (b) Orientations features, 15°C. Figure 5.11: Results of support vector machine classification, orientations features.

(a) Gradients and orientations features, 5°C (b) Gradients and orientations features, 15°C. Figure 5.12: Results of support vector machine classification, gradients and orientations features.

(a) Rectangle features, 5°C (b) Rectangle features, 15°C. Figure 5.13: Results of support vector machine classification, rectangle features.

Figures 5.14, 5.15, and 5.16 show the ROC curves for the neural network classification on a test set where the training data is generated by the initial detection routines from section 2.3.

(a) Orientations features, 5°C (b) Orientations features, 15°C. Figure 5.14: Results of neural network classification, orientations features.

(a) Gradients and orientations features, 5°C (b) Gradients and orientations features, 15°C. Figure 5.15: Results of neural network classification, gradients and orientations features.

(a) Rectangle features, 5°C (b) Rectangle features, 15°C. Figure 5.16: Results of neural network classification, rectangle features.

Figure 5.17 shows the processing times of the feature vector calculation that precedes forward propagation through the classifier.

Figure 5.17: Processing times of the feature vector calculation.

Figure 5.18 shows the classification times of the different classifier/feature combinations.

Figure 5.18: Classification times of the classifier/image feature combinations.

5.2.2 Results on grayscale images

As with the initial detection in grayscale images, one single dataset is used for training a classifier for grayscale images. The reason is that, because the distribution of intensity values in the image depends on changing illumination conditions, it is impossible to select a classifier for a particular illumination condition beforehand. Figure 5.8 shows that the performance of the initial detection routines on grayscale images is limited. Therefore only ground truth data in combination with the bootstrapping method from section 3.4.1 is used to generate training data. The training dataset consists of 444 front/back view pedestrian images and 342 side view pedestrian images. There are three times as many negative examples in the training database as there are positive examples. The remaining positive examples are divided between the validation and test databases. Figures 5.19, 5.20, and 5.21 show

the support vector machine and neural network classification results on a test set of grayscale images where the training data is generated using the bootstrapping method from section 3.4.1. A sketch of this bootstrapping loop is given below.
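The loop can be sketched as follows; `train_fn`, `sample_random_windows`, and `scan_windows` are hypothetical placeholders for the classifier training and window extraction steps described in section 3.4.1, and the number of rounds is an illustrative choice:

```python
def bootstrap_training_set(train_fn, positives, background_images, rounds=3):
    """Harvest hard negatives: train, scan pedestrian-free images, and
    add every false positive to the negative set before retraining."""
    negatives = sample_random_windows(background_images)   # placeholder: initial negatives
    for _ in range(rounds):
        clf = train_fn(positives, negatives)
        for window in scan_windows(background_images):     # placeholder: all candidate windows
            if clf(window):             # positive on background = false positive
                negatives.append(window)                   # keep it as a hard negative
    return train_fn(positives, negatives)
```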

(a) Support vector machine, orientations features (b) Neural network, orientations features. Figure 5.19: Classification results on grayscale images, orientations features.

(a) Support vector machine, gradients and orientations features (b) Neural network, gradients and orientations features. Figure 5.20: Classification results on grayscale images, gradients and orientations features.

(a) Support vector machine, rectangle features (b) Neural network, rectangle features. Figure 5.21: Classification results on grayscale images, rectangle features.

Figure 5.18 shows the classification times of the different classifier/feature combinations for FIR and grayscale images.

5.2.3 Results of component classification

Figure 5.22 shows the results of the component classification of FIR images with a support vector machine and a neural network trained on histograms of gradients and orientations features on data from 5°C.

Figure 5.22: Results of component classification.

5.2.4 Results of classifier optimization on classification performance

Figure 5.23 shows the results of the feature selection methods from section 3.5 on the classification performance. All data is from FIR images recorded at 5°C. Figure 5.23 shows the results of a support vector machine trained on a feature set of histograms of

orientations features reduced with a principal component analysis; the reduction step is sketched after figure 5.23. Figure 5.24 shows the results of Adaboost compared to a neural network trained on the full set of histograms of gradients and orientations features. As sub-feature set for Adaboost, a single histogram of 8 bins is used; the maximum number of weak learners is limited to 20 (the boosting loop is sketched after figure 5.25). Figure 5.25 shows the results of multi-objective optimization feature selection compared to the full set of histograms of orientations features.

Figure 5.23: Results of PCA feature selection.
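The PCA reduction step preceding support vector machine training can be sketched as follows; this is a standard eigendecomposition of the covariance matrix, and the function name and NumPy usage are illustrative:

```python
import numpy as np

def pca_reduce(features, k):
    """Project feature vectors (one per row) onto the top-k principal
    components before training the support vector machine."""
    mean = features.mean(axis=0)
    centered = features - mean
    cov = np.cov(centered, rowvar=False)          # covariance of the features
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
    basis = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # top-k components
    # keep mean and basis so test data can be projected the same way
    return centered @ basis, mean, basis
```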

Figure 5.24: Results of Adaboost feature selection.

Figure 5.25: Results of multi-objective optimization feature selection.
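The boosting loop behind the Adaboost results can be sketched as below, assuming the setup described above: one 8-bin histogram per weak learner and at most 20 weak learners. The weak learner search `best_weak_learner` is a hypothetical placeholder; the re-weighting follows the standard discrete Adaboost rule.

```python
import numpy as np

def adaboost(histograms, labels, max_learners=20):
    """Boosting over sub-feature sets: each weak learner sees one 8-bin
    histogram (a slice of the full feature vector).
    histograms: (n_samples, n_histograms, 8); labels: array of +1/-1."""
    n = len(labels)
    w = np.full(n, 1.0 / n)                    # sample weights
    ensemble = []
    for _ in range(max_learners):
        # placeholder: search all histograms for the weak learner with the
        # lowest weighted error; returns a +1/-1 predictor and its error
        weak, err = best_weak_learner(histograms, labels, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # learner weight
        w *= np.exp(-alpha * labels * weak(histograms))   # up-weight mistakes
        w /= w.sum()
        ensemble.append((alpha, weak))
    return ensemble
```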

5.2.5 Results of classifier optimization on classification speed

This section shows the results of the support vector machine reduction algorithm described in section 3.4.3 and of the feature selection algorithms from section 3.5. Figures 5.26 and 5.27 display the classification time in milliseconds for each of the optimization algorithms.

(a) Support vector reduction (b) Adaboost. Figure 5.26: Classification times of the optimization algorithms.

(a) Principal component analysis (b) Feature selection with multi-objective optimization. Figure 5.27: Classification times of the optimization algorithms.

5.2.6 Scanning a classifier through the image at every position and scale

For scanning a classifier through the image at every position and scale, the training dataset for grayscale images is generated from manually labeled ground truth data and the bootstrapping method from section 3.4.1. For FIR images, the training datasets generated by the initial detection routines are used. The support vector machine classifier trained on a complete set of image features is used for FIR images as well as for grayscale images. Figure 3.4 showed an example of scanning a classifier through the image at every position and scale. As can be seen from figure 3.4, there are often multiple positive classifications on one object at different scales. There are a total of 4863 classifications per frame for FIR images and for grayscale images. Figures 5.11, 5.14, and 5.19 show the false positive rate for each of the classifier/feature combinations. Because of the high number of classifications per frame, there is a high average number of false positives per frame. Also of interest is the computation time per frame. Figure 5.18 shows the computation time in milliseconds per frame for the different classifier/feature combinations. The scanning procedure itself is sketched below.
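A sketch of such a scan, assuming the 20x40 FIR window size from section 5.2; the scale factor, stride, and the `rescale` helper are illustrative assumptions, not parameters reported in this work:

```python
def scan_image(image, clf, win=(20, 40), scale_step=1.2, stride=4):
    """Classify a fixed-size window at every position and scale by
    repeatedly shrinking the image (detects progressively larger objects)."""
    detections, s = [], 1.0
    while image.shape[1] * s >= win[0] and image.shape[0] * s >= win[1]:
        resized = rescale(image, s)                 # placeholder resampling
        for y in range(0, resized.shape[0] - win[1] + 1, stride):
            for x in range(0, resized.shape[1] - win[0] + 1, stride):
                if clf(resized[y:y + win[1], x:x + win[0]]):
                    # map the hit back to original image coordinates
                    detections.append((x / s, y / s, win[0] / s, win[1] / s))
        s /= scale_step                             # next (coarser) scale
    return detections
```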

5.3 Tracking

To measure the performance of the tracking algorithms, a similar method is applied as for measuring the performance of the initial detection routines. The output of the tracking algorithms is matched to manually labeled ground-truth data (the same ground-truth data as used for initial detection and classification). The same procedure as described in section 5.1 is used for comparing the output of the tracking algorithms to the ground-truth data. A small amount of deviation from the ground-truth data is allowed: as with the initial detection, a deviation of 10% of the width and height of the ground-truth region of interest is allowed in the horizontal and vertical direction (a sketch of this match criterion is given below). Usually, the tracker is started by a positive classification. In order to make the results of the tracking algorithms independent of the initial detection algorithms and the classification when measuring the tracker performance, the tracker is started manually. As with measuring the performance of the initial detection routines and the classification, a representative set of data is selected for testing the tracker. This set consists of a mixture of country road and urban scenarios. To avoid the tracking results being influenced by pedestrians moving out of the image (for example caused by the car moving past the pedestrian), only tracking runs are counted where the tracker loses track of the pedestrian before it moves out of the image.
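The tolerance test described above might look as follows; `det` and `gt` are assumed to be simple records with `x`, `y`, `width`, and `height` fields, and the exact form of the comparison is one plausible reading of the 10% rule:

```python
def matches_ground_truth(det, gt, tol=0.10):
    """Accept `det` if its position and size deviate from the ground-truth
    box `gt` by at most `tol` of the box width (horizontally) and of the
    box height (vertically)."""
    dx, dy = tol * gt.width, tol * gt.height
    return (abs(det.x - gt.x) <= dx and abs(det.y - gt.y) <= dy and
            abs(det.width - gt.width) <= dx and abs(det.height - gt.height) <= dy)
```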

5.3.1 Results on FIR images

Figure 5.28 shows the average number of frames a pedestrian is tracked with the different tracking algorithms on FIR sequences. The test dataset consists of 30 sequences containing at least one pedestrian. Only the tracker which integrates initial detections through time, the tracker which integrates classifications through time, and the tracker based on the Hausdorff distance are shown. The mean shift tracker and the Condensation tracker do not produce any useful results at all and are not displayed.

Figure 5.28: Pedestrian tracking in FIR images.

Figure 5.29 shows the processing times for the different tracking algorithms on FIR images.

Figure 5.29: Processing times of the tracking algorithms.

5.3.2 Results on grayscale images

Only the tracker which integrates classifications over time generates somewhat acceptable results on grayscale images. The average number of frames tracked is 17, with a standard deviation of 14 frames. The test dataset for evaluating this tracker consists of 30 sequences containing at least one pedestrian. The processing time for this tracker is the same as for the tracker which integrates classifications over time in FIR images (see figure 5.29). The tracker which integrates initial detections through time, the tracker based on the Hausdorff distance, the mean shift tracker, and the Condensation tracker do not produce any useful results at all.

Chapter 6

Discussion

6.1 Achievements and limitations

The results of the initial detection algorithms are reasonable for FIR images and disappointing for grayscale images, as figures 5.4 and 5.8 demonstrate. For both image types the true positive detection rate is quite low and the false positive rate is high. The true positive rate for FIR images is still acceptable: at a frame rate of 20 frames per second, the probability that a pedestrian is detected within a few frames is still high. The true positive rate for grayscale images is so low that the initial detection routines are not useful for grayscale images. As figures 5.11, 5.12, 5.13 and figures 5.14, 5.15, 5.16 for FIR images and figures 5.19, 5.20, 5.21 for grayscale images demonstrate, the results of classification are very good. Figure 5.22 and figures 5.23, 5.24, and 5.25 show that the component classifier and the optimization methods for classification lead to better classification performance. Figure 5.28 and section 5.3.2 show that, aside from the tracking method which integrates the output of the classifier that is scanned through the whole image at every position and scale, the tracking results are very disappointing.

This holds for FIR images as well as for grayscale images. The Hausdorff tracker, the mean shift tracker, and the Condensation tracker perform so poorly when tracking pedestrians from a moving car that they are not useful for this purpose. Before going into more detail about the strengths and weaknesses of the components of the detection system, it should be mentioned that the disappointing results of the initial detection routines for grayscale images and of the tracking methods can to some degree be explained by the complexity of pedestrian detection. As mentioned in section 1.2, many computer vision applications operate in a controlled and static environment where color or motion segmentation can be applied, or where a search for a fixed pattern can be used to segment objects from a scene. Although these computer vision problems are not trivial, they are not as complicated as detecting moving, shape-changing objects from a moving camera against a complex structured urban background in low resolution images. Image processing applications for driver assistance systems like lane detection and car detection in highway scenarios also deal with detecting rigid objects against a largely homogeneous background.

6.1.1 Initial detection

Figure 5.4 shows that for FIR images, temperature segmentation performs best at low outside temperatures. This is obvious: at low outside temperatures, many objects have a low intensity value because they are cool, while pedestrians have a high intensity value because of their body temperature. Figure 5.4 also shows that the true positive rate of the temperature segmentation decreases with increasing temperature. The reason is that at higher temperatures, the contrast between the pedestrian and the background is smaller than at lower temperatures. At temperatures higher than 20°C, the temperature segmentation does not perform

satisfactorily anymore. Figure 5.4 shows that for FIR images, the vertical-gradient-based initial detection routine and the initial detection routine which scans a vertical edge detector through the image at every position and scale also perform best at low outside temperatures. Their performance decreases as the outside temperature increases: at higher temperatures the contrast between the pedestrian and the background is smaller, the gradient magnitudes calculated from the intensity image have a weaker response, and the initial detection routines therefore have more difficulty segmenting objects from the background than at lower temperatures. Figure 5.8 shows that for grayscale images, the vertical-gradient-based initial detection routine and the initial detection routine which scans a vertical edge detector through the image at every position and scale do not produce acceptable results. The main difficulty for the initial detection routines is the complex structure of the background. What all initial detection routines basically do is search for vertical objects in the image; a minimal sketch of such a vertical-gradient test is given below. Because of the many vertical structures in the background in urban scenarios, the initial detection routines segment the pedestrian together with a background structure, or do not segment the pedestrian at all because there are vertical structures with a stronger gradient magnitude in the image. Examples are shown in figure 6.1 for FIR images. An example of the many vertical structures in grayscale images and the difficulties this creates for the initial detection routines from section 2.2 is shown in figure 6.2. It may also happen that the initial detection routines only find a part of the pedestrian; the match with the ground truth data, which is used for evaluating the initial detection routines, can for this reason also be unsuccessful.
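Such a vertical-structure search reduces, at its core, to thresholding the horizontal image derivative. A minimal sketch, using a Sobel filter as a stand-in for the gradient filters of chapter 2:

```python
import numpy as np
from scipy.ndimage import sobel

def vertical_structure_mask(image, threshold):
    """Strong horizontal derivatives mark vertical structures. In FIR the
    threshold is adapted to the outside temperature; for grayscale images
    it is fixed (daylight conditions)."""
    gx = sobel(image.astype(float), axis=1)   # derivative across columns
    return np.abs(gx) > threshold             # boolean mask of edge candidates
```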

Figure 6.1: Difficulties for initial detection because of vertical structures in the background.

(a) A grayscale image (b) Its vertical gradients (c) Its initial detections. Figure 6.2: Difficulties for initial detection because of many vertical structures in the background.

In a scene with little background structure, the initial detection performs quite well, both for FIR images and for grayscale images. This was shown, for example, in figure 2.4 for FIR images and figure 2.5 for grayscale images. This is also the reason the results displayed in figure 5.4 are at least somewhat acceptable for FIR images. Especially at low temperatures, many background objects have a low intensity value because they are cool, while pedestrians have a high intensity value because they are warm, so there is no strong gradient response from background structures. This increases the performance of the initial detection of pedestrians. Other difficulties for the initial detection are that pedestrians are not always perfectly vertical structures, that pedestrians change shape when they move, and that pedestrians may be carrying or pushing objects, for example a backpack or a stroller. Figure 6.3 shows an example situation. The initial detection routines described in this work were not specifically adapted to these situations, although they are set up to be quite flexible with respect to the shape of the objects they detect.

Figure 6.3: Pedestrian pushing a stroller.

Even more complications arise when there are groups of pedestrians present in the image. Groups may contain an arbitrary number of pedestrians which may be occluding each other in an

arbitrary number of ways. An example of this is shown in figure 6.4. When pedestrians appear as one single bright blob in FIR images, as shown in figure 6.4, it is not possible to segment them with the initial detection routines described in this work, which are designed for detecting single pedestrians only.

Figure 6.4: A group of pedestrians in a FIR image.

A solution to the problems with the initial detection routines is to omit the initial detection altogether by scanning a classifier through the image at every position and scale, as described in section 3.6. The results of this method are discussed in section 5.2.6. The threshold parameters applied in algorithms 2, 3, and 4 adapt the initial detection routines for FIR images to a specific temperature range. For example, at a higher temperature, the threshold value is lowered because the gradient magnitude response is weaker. For grayscale images, these threshold parameters are set to a fixed value for images recorded in daylight conditions.

6.1.2 Classification

Figures 5.11, 5.12, 5.13 and 5.14, 5.15, 5.16 for FIR images, and

figures 5.19, 5.20, 5.21 for grayscale images show that the classification results are very good in general. The classifier/image feature combinations achieve a high true positive rate at a low false positive rate. However, the results should be considered in the context of the complete detection system. In the case of FIR images, the initial detection routines may generate up to 12 times as many negative detections as positive detections, as shown in figure 5.5. In the case of grayscale images, the classifier is scanned through the whole image at every position and scale. Given the very large number of negative examples, the absolute number of false positive detections per second may still be high. As figures 5.11, 5.12, 5.13 and figures 5.14, 5.15, 5.16 show, all classifier/feature combinations perform better on data from 5°C than on data from 15°C; better means the ROC curve lies higher and more to the left. The reason is that at 5°C, the contrast between the pedestrian and the background is larger than at 15°C, which gives a stronger gradient response, and all classification features use gradient information. Figures 5.11, 5.12, 5.13 and figures 5.14, 5.15, 5.16 show that for FIR images, the classification results of the histograms of gradients and gradient orientations features are generally the best for both support vector machines and neural networks, at both 5°C and 15°C. Second best are the rectangle features; last are the histograms of orientations features. This shows that the magnitude of the gradient in combination with the orientation of the gradient performs better than the orientation of the gradient alone. In addition, a feature set as large as the rectangle feature set (1140 features) is not required for successful classification; the smaller feature set of histograms of gradient magnitudes and gradient orientations (448 features) performs better. From figures 5.19, 5.20,

and 5.21 it becomes clear that in the case of grayscale images, the classification results of the histograms of orientations features are comparable to the results of the rectangle features. By comparing figures 5.11, 5.12, 5.13 to figures 5.14, 5.15, 5.16 and by looking at figures 5.19, 5.20, 5.21, it becomes clear that the results of the support vector machine classification are generally a bit better than the results of the neural network classification. However, the forward propagation speed of neural networks is much higher than that of support vector machines, as can be seen in figure 5.18. Comparing figures 5.11, 5.12, 5.13 and figures 5.14, 5.15, 5.16 to figures 5.19, 5.20, 5.21 shows that the classification results on FIR images are comparable to the classification results on grayscale images. These figures should not be compared directly, however, because the training data for the FIR images is generated by the initial detection routines while the training data for the grayscale images is generated from ground truth data using a bootstrapping method. The classification results on front/back views of pedestrians are comparable to the results on side views of pedestrians. This becomes clear by looking at figures 5.11, 5.12, 5.13 and figures 5.14, 5.15, 5.16 for FIR images and figures 5.19, 5.20, 5.21 for grayscale images. In some figures the front view results are better, while in others the side view results are better. This is remarkable because the class of side views of pedestrians is less well defined than the class of front/back views: the side view class contains pedestrians oriented to the left and to the right, while the front view class contains pedestrians of a homogeneous orientation. Apparently, in the case of side view classification, the classifier has enough representational capability to learn both orientations. Section 5.2 mentions that a minimum region of interest size of 20x40 pixels is used for FIR images and a minimum size of 24x48

pixels is used for grayscale images. Below these minimum sizes there is too little feature information for calculating the histograms of gradient orientations and gradient magnitudes, and the classification performance strongly decreases. In general, it is better to use higher resolution images for object detection. For pedestrian detection, a minimum image size of 640x480 pixels is recommendable. For the detection system described in section 3.6, which scans a classifier through the image at every scale and location, the classification results from figures 5.11, 5.12, 5.13 and figures 5.14, 5.15, 5.16 for FIR images and figures 5.19, 5.20, 5.21 for grayscale images are of importance. The false positive rates of these classifiers are very low. However, with 4863 classifications per frame at 20 frames per second, this still means a high average number of false positive detections per second. For a real production system, this is unacceptable. One possible solution to this problem is to integrate detections over time by tracking the positively classified object: only when an object is classified as a pedestrian in successive frames is it considered a positive detection. A sketch of this rule is given below. Unfortunately, in practice it appears that false detections may be persistent in time. An example of this is shown in figure 6.5.

(a) A false positive classification (b) The same false positive 10 frames later. Figure 6.5: Persistence of false positives over time.
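The temporal integration rule can be sketched as follows; the number of required consecutive frames is an illustrative choice, not a value reported in this work:

```python
def confirmed_detection(history, n_required=5):
    """Report a pedestrian only after n_required consecutive frames with a
    positive classification at roughly the same image location."""
    streak = 0
    for positive in history:        # booleans, oldest frame first
        streak = streak + 1 if positive else 0
        if streak >= n_required:
            return True
    return False
```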

Another issue concerning scanning a classifier through the image at every scale and location is computation time. The computation time consists of the feature calculation time (including preprocessing) and the time for the classifier to perform forward propagation. As can be seen from figures 5.17 and 5.18, the combination of orientations features with a neural network classifier has a computation time of 0.5 milliseconds for feature calculation plus 0.2 milliseconds for forward propagation, i.e. 0.7 milliseconds in total. With 4863 classifications per frame, this gives a computation time of about 3400 milliseconds per frame (twice that for combined front/back view and side view classification). This is clearly too slow for a real-time system, although there is potential for optimization. Still, despite these two drawbacks, this classifier-based system in combination with the tracker which integrates classification outputs through time is the best performing system. By comparing figure 5.22 to figures 5.12 and 5.15 it becomes clear that the component-based classifier performs better than the full-body classifier. The combined answer of a set of classifiers, each trained for a part of a pedestrian, has better discriminative capability than one full-body classifier. Figure 5.17 shows that the rectangle features are the fastest to calculate. All feature types take less than a millisecond to calculate for one region of interest. Figure 5.18 shows the forward propagation speed of the classifier/image feature combinations. Neural networks take much less time than support vector machines. The orientations features take the least time in forward propagation. This is obvious: the orientations feature vector consists of fewer features (224) than the gradients and orientations feature vector (448 features) and the rectangle features vector (1140 features). Figures 5.23, 5.24, and 5.25 show the results of feature selection on the classification performance. The classification results of feature

selection with a PCA are approximately equal to the classification results on the full set of features, as can be seen by comparing figure 5.23 to figure 5.11. This shows the full feature set is not required for an optimal classification result. As figure 5.24 shows, the results of Adaboost are somewhat disappointing, although it achieves its performance with a small subset of features. The reason for the disappointing results is that all types of features used are generated by gradient filters of very small size compared to the whole region of interest containing the pedestrian. Adaboost searches for small feature subsets which are characteristic for the object to be classified, but there are no single key features which are characteristic for a pedestrian; it is a larger set of features which makes good classification performance possible. This is what happens in feature selection with multi-objective optimization: a larger subset of features is selected from the complete set at once. As can be seen from figure 5.25, this gives better results than Adaboost. A disadvantage of using multi-objective optimization for feature selection is that it is almost prohibitively slow. For each individual of each generation, a network has to be trained. Training one support vector machine is slow because of the grid search required to select the training parameters; training a neural network is slow because of the many iterations required for gradient descent. Figures 5.26 and 5.27 show the results of the classifier optimization on the forward propagation time through the network. From figure 5.26(a) it becomes clear that the support vector reduction results in a classifier with a much smaller number of support vectors. A reduction of 95% can be achieved without a loss of classification performance. Figure 5.26(b) compares the forward propagation speed of Adaboost to a neural network trained on all features. Adaboost is much faster than a classifier trained

on the full set of features. Figure 5.27(a) compares the forward propagation speed of a support vector machine trained on all features to a support vector machine trained on a reduced feature set created with principal component analysis. It is obvious that as the number of features gets smaller, the forward propagation speed increases. Figure 5.27(b) compares the forward propagation speed of a support vector machine trained with feature selection based on multi-objective optimization to one trained on the full set of features. Although the feature set selected with multi-objective optimization is smaller (116 features) than the full set (224 features), the forward propagation speed is slower. The reason is that the original training parameters C and γ were used for training this support vector machine; these are not the optimal values for the reduced feature set and resulted in a large set of support vectors. Of the feature selection methods, PCA gives the best classification results. It is also by far the easiest and fastest feature selection method to apply. PCA is therefore strongly preferable over Adaboost and multi-objective optimization. One general limitation of neural networks and support vector machines is that both types of classifiers are black boxes: it cannot be verified that a trained network produces a correct output for an unseen example. This should be taken into consideration when a system using such a classifier is deployed in production, for example for braking a car in the case of a possible collision. There is never a guarantee that the system will make the correct decision.

6.1.3 Tracking

It becomes clear from figure 5.28 and section 5.3.2 that the tracking methods which use image features for tracking, namely the

Hausdorff tracker, the mean shift tracker, and the Condensation tracker, perform poorly on pedestrian images. The integration of initial detections and classification outputs through time performs better; these trackers do not operate on image features directly. There are two conditions which must be satisfied for a tracker to be able to track an object based on its image features: first, the shape of the object should not change much from one frame to the next, so that the distribution of image features also changes little from frame to frame; second, there should be a motion model available for estimating the movement the object makes. In the case of tracking pedestrians from a moving car, usually neither condition is satisfied. First, the shape of a pedestrian changes by its own movement, and its size changes by the movement of the car. Second, a reliable motion model of a pedestrian is not available: it is not always the case that a pedestrian is moving, and because of the possible movement of the car, a pedestrian may be moving in the image even when it is not moving itself. The Hausdorff tracker is based on calculating, at a target location, the Hausdorff distance between a set of image features and the image features of the tracker model; a sketch of this distance is given below. The Hausdorff tracker allows for some degree of discrepancy between the model set of features and the target set of features by randomly removing model features and randomly adding target features. It also allows for some degree of scaling of the object by calculating the target set of features at different scales. Because of the strong scaling of pedestrians caused by the motion of the car and the possible motion of the pedestrian (arm movement, orientation change), the Hausdorff tracker cannot reliably keep track of the pedestrian. What usually happens is that after the tracker is started, it locks onto a part of the pedestrian and after some time loses track of it.
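For reference, the symmetric Hausdorff distance between a model point set and a target point set can be computed as below; this is a direct NumPy sketch of the standard definition, and the random feature removal/addition of the actual tracker is omitted:

```python
import numpy as np

def hausdorff(model_pts, target_pts):
    """Symmetric Hausdorff distance between two (n, 2) point sets:
    the larger of the two directed distances."""
    d = np.linalg.norm(model_pts[:, None, :] - target_pts[None, :, :], axis=2)
    return max(d.min(axis=1).max(),   # directed: model -> target
               d.min(axis=0).max())   # directed: target -> model
```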

An example of this for FIR images is shown in figure 6.6. In addition, because there exists no reliable motion model of a pedestrian, the Kalman filter used in the Hausdorff tracker cannot accurately predict the location of the pedestrian in the next frame. This is what causes the tracker to lose track of its target.

(a) On a positive classification, the tracker is started. (b) Tracking locks onto a part of the pedestrian. Figure 6.6: Example of the Hausdorff tracker in FIR images.

To demonstrate that the tracker can in principle be used for tracking objects from a moving camera, it is applied to tracking cars. As can be seen in figures 6.7 and 6.8, the Hausdorff tracker can successfully track passing cars in FIR images and grayscale images. Cars are easier to track than pedestrians for the following reasons: in contrast to pedestrians, cars do not change shape when they move; the relative speed difference between the camera and a moving car is much smaller than the speed difference between the camera and a pedestrian, so there is much less scaling of the object to be tracked; the resolution of cars in the image is usually higher than the resolution of pedestrians; and the position and scale of a passing car can be estimated more accurately than the position and scale of a pedestrian. The Condensation tracker has related problems.

(a) frame 1 (b) frame 100 (c) frame 170. Figure 6.7: Tracking a car in FIR images with the Hausdorff tracker.

The Condensation tracker contains a set of fixed shapes of pedestrians against which it compares target locations. Because of the strong scaling of pedestrians caused by the motion of the car and the low resolution of the images, the comparison of the fixed shapes with target locations suffers from aliasing problems. In addition, because three dimensions are tracked (horizontal position, vertical position, and scale), the number of samples needed in the Condensation algorithm gets prohibitively large. This effect is increased by the absence of a motion model: if there is no motion model which moves the samples somewhat in the right direction, an even larger number of randomly moving samples is required to achieve the same performance.

(a) frame 1 (b) frame 100 (c) frame 185. Figure 6.8: Tracking a car in grayscale images with the Hausdorff tracker.

If a large number of samples is used for tracking in three dimensions, the tracker gets too slow because of the time it takes to evaluate all the samples. If the number of samples is reduced to improve efficiency, the tracker rapidly loses the object it is tracking. The mean shift tracker was originally designed for tracking objects in color images, because 24-bit color images contain more information for tracking than 8-bit grayscale images. The difference in histograms between the tracker model and the target object is usually not as large in 8-bit intensity images as it is in 24-bit color images. For this reason, it is difficult for the mean shift tracker to track objects in the low-contrast 8-bit intensity images.


More information

Fast and Robust Moving Object Segmentation Technique for MPEG-4 Object-based Coding and Functionality Ju Guo, Jongwon Kim and C.-C. Jay Kuo Integrated Media Systems Center and Department of Electrical

More information

Face Recognition For Remote Database Backup System

Face Recognition For Remote Database Backup System Face Recognition For Remote Database Backup System Aniza Mohamed Din, Faudziah Ahmad, Mohamad Farhan Mohamad Mohsin, Ku Ruhana Ku-Mahamud, Mustafa Mufawak Theab 2 Graduate Department of Computer Science,UUM

More information

Detection and Restoration of Vertical Non-linear Scratches in Digitized Film Sequences

Detection and Restoration of Vertical Non-linear Scratches in Digitized Film Sequences Detection and Restoration of Vertical Non-linear Scratches in Digitized Film Sequences Byoung-moon You 1, Kyung-tack Jung 2, Sang-kook Kim 2, and Doo-sung Hwang 3 1 L&Y Vision Technologies, Inc., Daejeon,

More information

COMPARISON OF OBJECT BASED AND PIXEL BASED CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGES USING ARTIFICIAL NEURAL NETWORKS

COMPARISON OF OBJECT BASED AND PIXEL BASED CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGES USING ARTIFICIAL NEURAL NETWORKS COMPARISON OF OBJECT BASED AND PIXEL BASED CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGES USING ARTIFICIAL NEURAL NETWORKS B.K. Mohan and S. N. Ladha Centre for Studies in Resources Engineering IIT

More information

Novelty Detection in image recognition using IRF Neural Networks properties

Novelty Detection in image recognition using IRF Neural Networks properties Novelty Detection in image recognition using IRF Neural Networks properties Philippe Smagghe, Jean-Luc Buessler, Jean-Philippe Urban Université de Haute-Alsace MIPS 4, rue des Frères Lumière, 68093 Mulhouse,

More information

REAL TIME PEDESTRIAN DETECTION AND TRACKING FOR DRIVER ASSISTANCE SYSTEMS

REAL TIME PEDESTRIAN DETECTION AND TRACKING FOR DRIVER ASSISTANCE SYSTEMS REAL TIME PEDESTRIAN DETECTION AND TRACKING FOR DRIVER ASSISTANCE SYSTEMS SWARAJ PREET SWAIN(109EE0310) SRILOKANATH DALAI(109EE0265) Department of Electrical Engineering National Institute of Technology

More information

VECTORAL IMAGING THE NEW DIRECTION IN AUTOMATED OPTICAL INSPECTION

VECTORAL IMAGING THE NEW DIRECTION IN AUTOMATED OPTICAL INSPECTION VECTORAL IMAGING THE NEW DIRECTION IN AUTOMATED OPTICAL INSPECTION Mark J. Norris Vision Inspection Technology, LLC Haverhill, MA mnorris@vitechnology.com ABSTRACT Traditional methods of identifying and

More information

High-Performance Signature Recognition Method using SVM

High-Performance Signature Recognition Method using SVM High-Performance Signature Recognition Method using SVM Saeid Fazli Research Institute of Modern Biological Techniques University of Zanjan Shima Pouyan Electrical Engineering Department University of

More information

CS231M Project Report - Automated Real-Time Face Tracking and Blending

CS231M Project Report - Automated Real-Time Face Tracking and Blending CS231M Project Report - Automated Real-Time Face Tracking and Blending Steven Lee, slee2010@stanford.edu June 6, 2015 1 Introduction Summary statement: The goal of this project is to create an Android

More information

PHOTOGRAMMETRIC TECHNIQUES FOR MEASUREMENTS IN WOODWORKING INDUSTRY

PHOTOGRAMMETRIC TECHNIQUES FOR MEASUREMENTS IN WOODWORKING INDUSTRY PHOTOGRAMMETRIC TECHNIQUES FOR MEASUREMENTS IN WOODWORKING INDUSTRY V. Knyaz a, *, Yu. Visilter, S. Zheltov a State Research Institute for Aviation System (GosNIIAS), 7, Victorenko str., Moscow, Russia

More information

Video Surveillance System for Security Applications

Video Surveillance System for Security Applications Video Surveillance System for Security Applications Vidya A.S. Department of CSE National Institute of Technology Calicut, Kerala, India V. K. Govindan Department of CSE National Institute of Technology

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Automatic 3D Mapping for Infrared Image Analysis

Automatic 3D Mapping for Infrared Image Analysis Automatic 3D Mapping for Infrared Image Analysis i r f m c a d a r a c h e V. Martin, V. Gervaise, V. Moncada, M.H. Aumeunier, M. irdaouss, J.M. Travere (CEA) S. Devaux (IPP), G. Arnoux (CCE) and JET-EDA

More information

Jiří Matas. Hough Transform

Jiří Matas. Hough Transform Hough Transform Jiří Matas Center for Machine Perception Department of Cybernetics, Faculty of Electrical Engineering Czech Technical University, Prague Many slides thanks to Kristen Grauman and Bastian

More information

T-REDSPEED White paper

T-REDSPEED White paper T-REDSPEED White paper Index Index...2 Introduction...3 Specifications...4 Innovation...6 Technology added values...7 Introduction T-REDSPEED is an international patent pending technology for traffic violation

More information

4.5 3.5 2.5 1.5 0.5 0.02 0.04 0.06 0.08 0.1 0.12 0.14

4.5 3.5 2.5 1.5 0.5 0.02 0.04 0.06 0.08 0.1 0.12 0.14 Using adaptive tracking to classify and monitor activities in a site W.E.L. Grimson C. Stauer R. Romano L. Lee Articial Intelligence Laboratory Massachusetts Institute of Technology fwelg, stauffer, romano,

More information

Multi-feature Hierarchical Template Matching. D.M. Gavrila. Daimler-Benz AG, Research and Technology. 89081 Ulm, Germany

Multi-feature Hierarchical Template Matching. D.M. Gavrila. Daimler-Benz AG, Research and Technology. 89081 Ulm, Germany in Proc. IEEE International Conference on Pattern ecognition, Brisbane, Australia, 1998 Multi-feature Hierarchical Template Matching Using Distance Transforms D.M. Gavrila Daimler-Benz AG, esearch and

More information

REAL TIME TRAFFIC LIGHT CONTROL USING IMAGE PROCESSING

REAL TIME TRAFFIC LIGHT CONTROL USING IMAGE PROCESSING REAL TIME TRAFFIC LIGHT CONTROL USING IMAGE PROCESSING Ms.PALLAVI CHOUDEKAR Ajay Kumar Garg Engineering College, Department of electrical and electronics Ms.SAYANTI BANERJEE Ajay Kumar Garg Engineering

More information

VISION BASED ROBUST VEHICLE DETECTION AND TRACKING VIA ACTIVE LEARNING

VISION BASED ROBUST VEHICLE DETECTION AND TRACKING VIA ACTIVE LEARNING VISION BASED ROBUST VEHICLE DETECTION AND TRACKING VIA ACTIVE LEARNING By VISHNU KARAKKAT NARAYANAN A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE

More information

Color Segmentation Based Depth Image Filtering

Color Segmentation Based Depth Image Filtering Color Segmentation Based Depth Image Filtering Michael Schmeing and Xiaoyi Jiang Department of Computer Science, University of Münster Einsteinstraße 62, 48149 Münster, Germany, {m.schmeing xjiang}@uni-muenster.de

More information

Machine vision systems - 2

Machine vision systems - 2 Machine vision systems Problem definition Image acquisition Image segmentation Connected component analysis Machine vision systems - 1 Problem definition Design a vision system to see a flat world Page

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

Visual-based ID Verification by Signature Tracking

Visual-based ID Verification by Signature Tracking Visual-based ID Verification by Signature Tracking Mario E. Munich and Pietro Perona California Institute of Technology www.vision.caltech.edu/mariomu Outline Biometric ID Visual Signature Acquisition

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

Digital image processing

Digital image processing 746A27 Remote Sensing and GIS Lecture 4 Digital image processing Chandan Roy Guest Lecturer Department of Computer and Information Science Linköping University Digital Image Processing Most of the common

More information

LIST OF CONTENTS CHAPTER CONTENT PAGE DECLARATION DEDICATION ACKNOWLEDGEMENTS ABSTRACT ABSTRAK

LIST OF CONTENTS CHAPTER CONTENT PAGE DECLARATION DEDICATION ACKNOWLEDGEMENTS ABSTRACT ABSTRAK vii LIST OF CONTENTS CHAPTER CONTENT PAGE DECLARATION DEDICATION ACKNOWLEDGEMENTS ABSTRACT ABSTRAK LIST OF CONTENTS LIST OF TABLES LIST OF FIGURES LIST OF NOTATIONS LIST OF ABBREVIATIONS LIST OF APPENDICES

More information

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

More information