ALGORITHMS FOR FACE AND FACIAL FEATURE DETECTION


Degree Programme in Information Technology

XINGHAN LUO
ALGORITHMS FOR FACE AND FACIAL FEATURE DETECTION
Master of Science Thesis

Examiners:
Dr. Atanas Gotchev
Prof. Karen Egiazarian
Institute of Signal Processing, Tampere University of Technology (TUT), Finland

Prof. Joern Ostermann
Institut fuer Informationsverarbeitung (TNT), Leibniz University of Hanover (LUH), Germany

Examiners and topic approved in the Information Technology Department Council meeting on 1st June 2006

Abstract

TAMPERE UNIVERSITY OF TECHNOLOGY
Master's Degree Programme in Information Technology
LUO, XINGHAN: Algorithms for Face and Facial Feature Detection
Master of Science (Technology) Thesis, 58 pages, 5 Appendix pages
December 2007
Major: Image and Video Signal Processing
Examiners: Dr. Atanas Gotchev (TUT), Prof. Karen Egiazarian (TUT), Prof. Joern Ostermann (LUH)
Keywords: face detection, facial feature detection, AdaBoost, cascade, Active Appearance Models, landmarks, optimization, facial animation

Automatic face detection and facial feature localization are important computer vision problems, which have challenged a number of researchers to develop fast and accurate algorithms for their solution. This Master of Science thesis, structured in two parts, addresses two modern approaches for face and facial feature detection, namely cascade Adaptive Boosting (AdaBoost) face detection [4], [5] and Active Appearance Models (AAM) [1], [2] based facial feature detection. In the first part, the AdaBoost method and its extensions are overviewed and a Matlab based test platform is described. To develop it, training and test face databases have been collected and generated. Matlab packages for face detection have been modified to include a multiple detection elimination module and a graphical user interface for easy manipulation and demonstration. The demo software has also been extended with an OpenCV-based implementation. In the second part, facial feature detection as a first step in a facial animation application is studied. An AAM based facial feature detection system, aimed at lip motion and face shape detection/tracking, has been targeted. Its development has included proper landmark definition for lip/face shape models, training/test set selection and marking, and modification of an existing AAM module built on the OpenCV library. Specifically, the detection accuracy has been improved by landmark optimization based on training set self-rebuild convergent iterations. The detection and tracking results presented demonstrate significant improvements compared with previous implementations. As the main contribution of this thesis project, an accurate facial feature geometric database has been set up based on automatic, accurate facial feature detection. It is expected to improve the performance of facial animation [53] techniques being developed for 3DTV-related applications.

Preface

This master thesis has been accomplished during the period of September 2006 to August 2007. The research work has been split into two parts: study and development of face detection techniques, and study and development of facial feature detection techniques for the needs of facial animation systems. The former topic has been carried out in the Institute of Signal Processing, Tampere University of Technology (Finland), while the latter topic has been carried out in the Institut fuer Informationsverarbeitung (TNT), Leibniz University of Hanover (Germany) within the EC-funded 3DTV Network of Excellence. The work was funded by the same project.

I would like to express my sincere appreciation to my kind teacher and supervisor Dr. Atanas Gotchev at TUT, who offered me this interesting thesis topic, gave me valuable and helpful advice and instructions, and has always been supportive and patient. I would also like to thank Prof. Karen Egiazarian, who has provided financial support for the research and given important remarks and suggestions. I would like to extend my special thanks to my daily supervisor at LUH, PhD student Kang Liu, who has always been a nice friend and active colleague, leading the direction of the work and research, sharing his knowledge and experience with me, and guiding and helping me through constructive discussions. Thanks to my German host Prof. Joern Ostermann, who warmly welcomed me and involved me in the ambitious research in his group; he has always paid close attention to the progress of the work and offered precious technical remarks and suggestions.

Xinghan Luo

Table of contents

Abstract
Preface
Abbreviations
List of figures
List of tables
1. Introduction
   1.1 Problem statement
   1.2 Thesis objectives
   1.3 Thesis organization
2. An overview of techniques for face and facial feature detection
   2.1 Techniques for face detection
   2.2 Facial feature detection algorithms
   2.3 Applications
3. AdaBoost and cascade face detection
   3.1 Algorithm basics
      3.1.1 Features and integral image
      3.1.2 AdaBoost core method
      3.1.3 Cascade of classifiers
   3.2 Extensions and software packages
      3.2.1 Floatboost methods
      3.2.2 OpenCV face detector
   3.3 Implementations
      3.3.1 Database collection
      3.3.2 AdaBoost Matlab implementation and modification
      3.3.3 Matlab GUI-based platform for FD
      3.3.4 Multiple detection elimination
4. AAM-based facial feature detection
   4.1 AAM basics
      Building active appearance models
      AAM search
   4.2 AAM tools and AAM-API package
      AAM tools
      AAM-API package
      Object contour detection and tracking
   4.3 Training set optimization
      Training and test set
      Landmark definition and marking
      Optimized training set selection
      Landmark set optimization
      Accuracy assessment
   4.4 Appearance model initial projection
      Exhaustive approximation
      OpenCV-based initialization for face tracking
   4.5 Experimental results and discussion
      Training set landmark optimization results
      Lip and face contour detection and tracking results
      Facial animation application and results
5. Conclusion
Bibliography
Appendix

Abbreviations

FD - Face Detection
FFD - Facial Feature Detection
AdaBoost - Adaptive Boosting
AAM - Active Appearance Models
NN - Neural Network
PCA - Principal Component Analysis
ATR - Automatic Target Recognition
HCI - Human-Computer Interaction
GST - Generalized Symmetry Transform
VPF - Variance Projection Function
PDM - Point Distributed Models
SPDNN - Self-Growing Probabilistic Decision Neural Network
SVM - Support Vector Machine
GWN - Gabor Wavelet Network
OpenCV - Intel's Open Source Computer Vision Library
FERET - The Facial Recognition Technology
GPA - Generalised Procrustes Analysis
LLE - Locally Linear Embedding
VTTS - Visual Text to Speech Synthesizer
TTS - Text To Speech

List of figures

Figure 3.1 Four basis Haar-like features a, b, c, d
Figure 3.2 Integral image
Figure 3.3 Basic scheme of AdaBoost and its main goals
Figure 3.4 Schematic depiction of the detection cascade
Figure 3.5 Extended set of Haar-like features
Figure 3.6 FERET and frontal face samples
Figure 3.7 (a) Matlab GUI for FD
Figure 3.7 (b) Tool bar
Figure 3.7 (c) (1)(2)(3) Single and batch processing
Figure 3.7 (d) Real-time processing report
Figure 3.7 (e) Statistics save and plot
Figure 3.7 (f) Options and parameter setting
Figure 3.8 (a)(b) Multiple detection elimination
Figure 3.9 Merge multiple detections
Figure 3.10 Eliminate low-weight overlapping detections
Figure 3.11 Final result of multiple detection elimination
Figure 4.1 Example AAM texture and shape models of a human face
Figure 4.2 Example appearance models of a human face
Figure 4.3 AAM search on an unseen image
Figure 4.4 Typical AAM-API based object tracking system
Figure 4.5 Example training and test samples
Figure 4.6 Face and lip landmark and contour definition
Figure 4.7 Flow chart for optimized training set selection
Figure 4.8 Example of lip and face training samples by optimized selection
Figure 4.9 Flow chart for training set self-rebuild convergent iterations
Figure 4.10 (a)(b) Example 10-iteration convergence plot for x, y coordinates
Figure 4.11 Example correction for train000 for the mund training set
Figure 4.12 Flow chart for accuracy measurement
Figure 4.13 Comparison of initial approximation for face model
Figure 4.14 (a) Highlight regions
Figure 4.14 (b) (1) Manual landmarks (2) Optimized landmarks
Figure 4.14 (c) (1) Manual landmarks (2) Optimized landmarks
Figure 4.15 (a)(b)(c) Example resulting frames
Figure 4.16 (a)(b)(c) Example of typical mouth tracking errors
Figure 4.17 System block diagram of VTTS facial animation
Figure 4.18 (a)(b)(c) Comparison of synthetic mouth motions in continuous frames

List of tables

Table 2.1 Main FD methods
Table 2.2 Main FFD methods
Table 2.3 FFD methods comparison and evaluation
Table 4.1 Video clip series for training and test samples
Table 4.2 Training sample selection from 4 mund video sequences
Table 4.3 Total number of selected training and test samples
Table 4.4 (a) x, y values of convergence iterations for the No. 10 landmark
Table 4.4 (b) x, y values of convergence iterations for the No. 12 landmark
Table 4.5 Statistics of overall error corrections for 34 landmarks in train000

1. Introduction

When we observe an image, many thought processes that go into interpreting and analysing it occur in our brain, enabling us to de-construct the image into individual objects, to create an understanding of what the image presents, and to further our interpretation of the objects. We then form an opinion of what is inside the image and what is happening. This seems easy and natural, in some sense automatic for us, since we already have a fairly complicated and well-trained living biological vision and signal processing system. This sophisticated system is so efficient and accurate that it enables us to perceive, detect, recognize and analyse objects and scenes, and to be very sensitive to subtle differences. The field of computer vision is very much centred on the replication and simulation of the innate ability to learn and comprehend that a human has, but it remains largely academic and has yet to be accurately achieved.

Among all computer vision research topics, human facial image processing is one of the most essential ones. It includes challenging sub-topics such as face detection, face tracking, face recognition, pose estimation, expression recognition and facial animation. It is also essential for intelligent vision-based human-computer interaction and other applications. Any human face-specific technique uses the positions of faces and facial features, i.e. faces and facial features in an image or an image sequence must first be localized. Algorithms for face detection and facial feature detection are therefore targeted at providing the initial information for any further processing of facial images. Due to the variation in appearance and shape of different faces under different conditions, detecting faces or facial features is considered a very demanding computer vision problem that is not yet completely solved. However, recent years have witnessed several breakthroughs in the field.

In this master thesis, modern algorithms for face and facial feature detection are overviewed. Two state-of-the-art algorithms, namely AdaBoost-based face detection [4], [5] and AAM-based facial feature detection [1], [2], [56], [57], [58], are studied in detail, modified and implemented. Experiments with face databases demonstrate the effectiveness of these modified implementations. The aim of this first introductory chapter is to formulate the problems of face detection and facial feature detection, to specify the thesis objectives and to overview the thesis structure.

1.1 Problem statement

Face Detection (FD) denotes the general problem of determining the locations and sizes of human faces present in digital images [3], [19]. Based on the localized face, the problem of Facial Feature Detection (FFD) is to find the exact location of facial features, such as mouth and eye corners, the lip contour, the jaw contour, and the shape of the entire face [21]. Face and facial feature detection are difficult problems, due to the large variations a face can have in a scene, caused by factors such as intra-subject variations in pose, scale, expression, colour, illumination, background clutter, presence of accessories, occlusions, hair, hats, eyeglasses, beard etc. [20]. The easiest case considers a single frontal face and an uncluttered background, and solutions for this case are available. However, most realistic images generally contain multiple faces and faces with (some of) the following variations:

Pose: the camera angle of an image can vary to the extreme of the image being rotated a total of 180 degrees from an upright frontal view. The orientation of the face in the image can be e.g. frontal, rotated by 45 degrees, profile, upside-down etc. The pose can also occlude some of the facial features of interest.
Presence or absence of facial components: facial components such as a beard, moustache or glasses influence the detection precision.
Expression: facial expressions change the appearance of a face.
Occlusion: faces may be partially obscured by objects or by other faces.
Image orientation: the rotation of the observed image directly affects the possible locations of faces.
Imaging conditions: factors such as lighting and camera response characteristics affect the interpretation of the image in later processing stages.

Ideal FD / FFD systems should identify and locate human faces and facial features in any image with a cluttered background, regardless of their position, scale, in-plane rotation, orientation, pose (out-of-plane rotation) and illumination. Such systems have been an ultimate goal for many researchers. A large number of methods and algorithms have already been developed and implemented. Even though some of them have shown high performance, the ultimate goal is still far from being completely achieved. Recent research efforts have focused on increasing the accuracy, speed and robustness of existing algorithms and implementations against facial variations, either by utilizing better training sets, by extending current methods, or by seeking alternative approaches.

1.2 Thesis objectives

The objective of the work presented in this thesis is twofold. First, it aims at implementing a demo system capable of accommodating, comparing and visualizing different face detection algorithms working on various face image databases. Second, it aims at studying active appearance models for their applicability to facial feature detection, so as to implement an AAM based algorithm for the purposes of facial animation. Experiments with databases containing face image sequences, including training and test set selection and manual facial landmark marking, are also within the scope of the thesis.

1.3 Thesis organization

The thesis is organized into five chapters. Chapter 1 introduces the topic and specifies the thesis objectives. Chapter 2 presents a brief overview of recent state-of-the-art methods for face and facial feature detection, as well as their potential applications. Chapter 3 concentrates on Viola and Jones' AdaBoost and cascade FD methods, different extensions and implementations. The same chapter describes the developed demo system and the experiments performed on selected face databases. Chapter 4 overviews the basics of AAM and its applicability to FFD. Novel approaches for training set optimization, together with experimental results proving their superiority, are presented as well. Chapter 5 summarizes the outcomes of the thesis project and makes conclusions and recommendations for future work.

2. An overview of techniques for face and facial feature detection

2.1 Techniques for face detection

The first FD attempts date back to the 1970s. At that time simple rule-based approaches were used [3]. These approaches made simplifying initial assumptions such as a frontal view of the faces and an uncluttered background. Later, the simple rule-based methods evolved into more complex ones, e.g. methods based on features and employing neural networks. Overall, the methods differ by the following aspects: representation, search strategy, post-processing, precision, scale invariance, and computational complexity. They can be broadly classified into the four categories listed in Table 2.1 [3]. Note, however, that many FD methods combine different strategies and hence their categories might overlap.

Table 2.1 Main FD methods
Knowledge based methods: encode human knowledge of what constitutes a typical face, usually the relationships between facial features. Mainly used for face localization.
Feature invariant methods: locate facial features which are invariant under different conditions and use these to locate the face. They aim to find structural features of a face that exist even when the pose, viewpoint, or lighting conditions vary.
Template matching methods: several standard patterns of a face are stored and their correlation with a test image is computed for detection. The patterns are used to describe the face as a whole or the facial features separately.
Appearance or image based methods: models are learnt from face images by training, and the learnt models are used to perform detection.

In knowledge based FD [22], methods are developed based on rules derived from the researcher's knowledge of human faces. It is easy to come up with simple rules to describe the features of a face and their relationships. The development of a knowledge-based FD system implies that a series of rules is defined prior to implementation, and the system is defined by the scope and accuracy of the rule set. The rules of such systems are generally either strict or loose. Strict: the rules are so strict that a comparatively low detection rate is observed. Loose: the rules define the data so loosely that they lead to high false detection rates. The most prevalent characteristic of a knowledge-based technique is the absence of training. The system is bound by the rules which have been carefully defined. For this reason there is an upper bound on their

effectiveness in a detection task. In contrast to the knowledge based top-down approach, researchers have been trying to find invariant features of faces to be used for detection [4], [5], [6], [7], [8]. The underlying assumption is based on the observation that humans can effortlessly detect faces and objects in different poses and lighting conditions, and so there must exist properties or features which are invariant over these varying conditions. Numerous methods have been proposed to first detect facial features and then to infer the presence of a face. Facial features such as eyebrows, eyes, nose, mouth, and hair-line are commonly extracted using edge detectors. Based on the extracted features, a statistical model is built to describe their relationships and to verify the existence of a face. One problem with these feature-based algorithms is that the image features can be severely corrupted due to illumination, noise, and occlusion. Feature boundaries can be weakened for faces, while shadows can cause numerous strong edges which together render perceptual grouping algorithms useless. Locating useful classification features in an image is a complex task; depending on the type of feature used, there can be an extremely large sample space. Feature invariant methods are designed so that only the critical features required for detection are used. This is usually done in the form of exhaustive feature evaluation exercises with boosting or bagging algorithms. Depending on the type of learning function used, systems that evaluate localized facial features are generally quicker and more robust than their pixel based counterparts.

In template matching FD [9], [10], a standard face pattern (usually frontal) is manually predefined or parameterised by a function. Given an input image, the correlation values with the standard patterns are computed for the face contour, eyes, nose, and mouth independently. The existence of a face is determined based on the correlation values. This approach has the advantage of being simple to implement. However, it has proven to be inadequate for FD, since it cannot effectively deal with variation in scale, pose, and shape. Multi-resolution, multi-scale, sub-template, and deformable template methods have subsequently been proposed to achieve scale and shape invariance. Consider the largely probabilistic task of FD in an image: the image must be examined at multiple resolutions. The human face, although it may change in colour, generally does not change (by a great deal) in shape, so it is feasible to model the characteristics of a human face as a template; this template can then be used to speed up the evaluation of faces at multiple scales. The application of a template is similar to knowledge-based methods [22]: knowledge of human face characteristics can either be learnt, giving a dynamic template, or it can be predefined. A template can be applied at many scales due to the simplistic nature of most common templates, which simply relate to the symmetry of facial features and their relative distances. However, template matching alone is generally weak, and it is therefore often used in conjunction with other feature invariant techniques to strengthen the inference. The information to follow focuses only on techniques where template matching is the basis for face detection.
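As a concrete illustration of the correlation step described above, the following Matlab sketch matches a single predefined face template against a test image using normalized cross-correlation. It is only an illustrative example under assumed inputs ('scene.png' and 'face_template.png' are hypothetical grayscale files), not the implementation of any of the cited methods, and it relies on the Image Processing Toolbox function normxcorr2.

```matlab
% Illustrative template-matching sketch: correlate one face template with a scene.
scene    = im2double(imread('scene.png'));          % hypothetical grayscale test image
template = im2double(imread('face_template.png'));  % hypothetical mean-face template

c = normxcorr2(template, scene);                    % correlation map, values in [-1, 1]
[peak, idx]    = max(c(:));                         % best correlation value and its index
[yPeak, xPeak] = ind2sub(size(c), idx);             % location of the peak in the map

% Top-left corner of the best-matching window in scene coordinates.
yTop = yPeak - size(template, 1) + 1;
xTop = xPeak - size(template, 2) + 1;

if peak > 0.6                                       % ad-hoc acceptance threshold (assumption)
    fprintf('Face-like region at (%d, %d), score %.2f\n', xTop, yTop, peak);
end
```

A single fixed template handles only one scale and pose; in practice the correlation is repeated over an image pyramid or with deformable templates, as discussed above.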
Rather than using templates, which are generally built based upon expert assumptions about the structure of the object being located, appearance based FD [11], [12] employs a machine learning paradigm which allows effective models/templates

to be built from a set of example data. The obvious constraint of such systems is that they are particularly reliant on the learning techniques used and on the sample sets provided to them, yet they are able to provide equivalent if not better results in comparison to the previously identified techniques. In contrast to the template matching methods, where templates are predefined by experts, the templates in appearance-based methods are learned from examples in images. In general, appearance-based methods rely on techniques from statistical analysis and machine learning to find the relevant characteristics of face and non-face images. The learned characteristics are in the form of distribution models or discriminant functions that are consequently used for FD. Meanwhile, dimensionality reduction is usually carried out for the sake of computation and detection efficiency.

2.2 Facial feature detection algorithms

Facial feature detection (FFD) aims at searching for the exact location of facial features, such as eyes, nose, mouth, ears etc., within a given region in an image or image sequence [21]. FFD algorithms output the coordinates of pre-defined facial feature points, which can be connected to draw the contours of the facial features. FFD can be regarded as post-processing of a facial image within a region of interest defined by some pre-processing method, e.g. a manually localized facial region or the output of FD. FFD methods can be broadly classified into the five categories listed in Table 2.2 below:

Table 2.2 Main FFD methods
Knowledge based: summarize rules according to the characteristics of typical facial features, transform the input image to intensify the target features, and find the candidate points or regions of interest.
Geometric based: construct a geometric model with variable parameters according to the shape of the facial features, and define evaluation functions to measure the difference between regions of interest and the model. The model parameters are continuously updated to minimize the evaluation function until the model eventually converges on the facial feature.
Colour based: build a colour model for facial features based on statistics, exhaustively search the potential regions, and select candidates by comparing the colour information of a region with the facial colour model.
Appearance based: map the sub-windows within the facial feature region to points in a high dimensional space. A set of such points can represent facial features of the same type, and the distribution models can be deduced by statistics. Facial features can then be located by matching the potential regions against the models.
Association information based: based on local information of individual facial features, the location of facial features within a face can be used to minimize the search area.

In knowledge based FFD, the knowledge is an experiential description of normal facial features. A human facial image has some obvious characteristics, e.g. the facial region contains the two eyes, nose and mouth, which normally have lower intensity levels than the surrounding regions; the eyes are symmetric, and the nose and mouth lie roughly on this symmetry axis, etc. In order to utilize these basic features for FFD, the input image is transformed to emphasize the desired features and to filter out candidate points or regions. The difficulties of this approach are related to the

accuracy, universality and adaptivity of the description used. Among the knowledge based methods, Yang and Huang [22] proposed the Mosaic Image approach, which divides the image into panes to localize facial features. In another approach, proposed by Kotropoulos [23], flexible rectangular units are used instead of panes to better approximate the human facial shape. In the Geometric Mapping approach [24], [25], the sum of the intensity (or of an intensity function) along the X and Y directions is calculated; the specific points that change in the different directions are summarized, and their locations are combined to identify the feature location. Similarly, Feng and Yuen [26] proposed the Variance Projection Function (VPF) for the mapping. In a threshold-based approach proposed by Zhang [27], the pupils of the eyes are found via a thresholded image, and the rest of the features are found by filtering and edge tracing. In a generalized symmetry approach, Reisfeld et al. [28], [29] defined the so-called Generalized Symmetry Transform (GST). It relies on the strong symmetry of human eyes and on the geometric distribution of facial features, and is robust against rotation, different expressions, lighting conditions etc.

In geometric based FFD, a snake-based approach proposed by Kass [30] utilizes edge detection and image segmentation for FFD [31], [32]. In the variable template approach, Yuille [33] used parameterised variable templates to locate the eyes and mouth. The Point Distributed Models (PDM) approach, proposed by Cootes [34], is a parameterised shape description model which applies a set of discrete control points to depict the object shape and uses PCA to set up kinetic models with restrictions on each control point to keep the deformation within an acceptable range [46]. The application of PDM to FFD can also be found in Lanitis' work utilizing facial models with 152 control points [36]. Compared with snakes, the PDM approach adds more feature information to the models and reduces the sensitivity of the model to noise and local deformation, but at the cost of higher computational complexity.

In colour based FFD, Phung [37] used the so-called cave seeking approach. Image areas recognized as skin by their colour are examined, and the caves found are classified as facial features according to their size, shape, position, etc. In a skin colour modelling approach, Fu used a Self-Growing Probabilistic Decision Neural Network (SPDNN) [38] in the YES colour space to build models for the E and S components of the eye pupils. Other researchers applied the YCbCr space for skin colour modelling.

In appearance based FFD, Waite used a neural network approach to localize eyeballs [39]. The idea is based on the fact that, compared with the whole eyes, the eye micro-features (e.g. the left and right eye corners, the upper and bottom eye socket, and nearby regions) are relatively invariant; therefore, segments of the intensity image around these micro-features can be used to build separate NNs. Similarly, Reinders utilized gradient vectors as NN inputs [40]. The idea is then to search the target area with different NNs, and to filter and combine the results to locate the features. In the classical PCA approach, the Karhunen-Loeve transform (KLT) is used to map the high dimensional vectors representing the human face to a few so-called eigenface vectors in a lower dimensional sub-space, so as to optimally decompose and reconstruct the facial image [46]. Cootes proposed to use multiple PCA models to assist the definition of the initial parameters of PDM [41].
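To make the eigenface idea more concrete, the following Matlab sketch computes a PCA face subspace and projects a new face onto it. X (an N-by-D matrix of N vectorized, equally sized face images), x (one new vectorized face) and the number of retained components k are assumed inputs; this is only a minimal illustration of the KLT/eigenface decomposition, not the code of any cited work.

```matlab
% Minimal eigenface (PCA/KLT) sketch with assumed inputs X, x and k.
meanFace = mean(X, 1);                           % average face (1-by-D)
Xc = X - repmat(meanFace, size(X, 1), 1);        % centre the training faces

[U, S, V] = svd(Xc, 'econ');                     % columns of V are the eigenfaces
k = 20;                                          % number of retained eigenfaces (assumption)
eigenfaces = V(:, 1:k);                          % D-by-k basis of the face subspace

coeffs = (x - meanFace) * eigenfaces;            % low-dimensional representation of x
xHat   = meanFace + coeffs * eigenfaces';        % reconstruction from k coefficients
```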
In a Support Vector Machine (SVM) approach, Pan used a square-shaped scan window and considered the eyebrows and eyes as a single object in

order to reduce the interference of the eyebrows while identifying the eyes [42]. Li proposed a 2-layer SVM approach, in which the idea is to filter out candidate points by an SVM with a linear kernel, and to use an SVM with a polynomial kernel to make the final judgement [43].

In contrast to the above mentioned methods, association information based FFD reduces the number of candidate points by exploiting the relatively fixed facial feature positions. Kin and Cipolla used a probability network approach, with a 3-layer Bayesian probability network to build the facial model [44]. They first search for feature candidate points by combining a Gaussian filter and an edge detector, and then utilize the relations between adjacent points, matching vertical or horizontal pairs, leading to a more precise classification of four facial regions (up, down, left and right). In a Gabor Wavelet Network (GWN) based approach, Feris used a 2-layer GWN tree model where the first layer represents the whole face and the second layer the individual facial features [45]. A GWN tree model is built for every training image and each facial feature is labelled; a collection of training samples sets up a facial database. When searching a new image, the most similar whole-face GWN model is selected from the database, the search is started from the labelled points within this specific model, and the accurate location of the facial features is found by matching the corresponding labelled features in the model.

Table 2.3 below compares the different FFD methods in terms of computational complexity, accuracy and robustness (image quality requirement and the influence of pose, expression and lighting).

Table 2.3 FFD methods comparison and evaluation
Knowledge based: Mosaic Image, Geometric Mapping, Thresholding, Generalized symmetry
Geometric based: Snake, Variable template, ASM
Colour based
Appearance based: Neural networks, PCA, SVM
Association information based: Probability network, Gabor Wavelet Network

2.3 Applications

The applications of FD and FFD techniques can be summarized as follows [3], [21]:
(1) The first step in any fully automatic face or facial expression recognition system.
(2) The first step in surveillance systems targeting face pose estimation and human body movement tracking.
(3) Automatic Target Recognition (ATR) applications and generic object detection or recognition applications.
(4) Human-machine interaction systems.

Furthermore, accurate localization of facial features would enable various further applications such as face recognition, gesture recognition, expression recognition, face image compression and reconstruction, and facial animation. FD and FFD play an important role in 3D applications, such as 3D visual communications and 3DTV [20]. Knowledge about the motion of the human face and body, and about the nature and limits of human motions, can be used to make the processing more efficient. Case-oriented algorithms can perform better than general purpose algorithms. For 3D display systems, detection and tracking of the observer's eyes and viewpoint are necessary to render the correct view according to the observer's position. Face and facial feature localization and tracking algorithms, as well as robust face position estimation and tracking, are also important for improving Human-Computer Interaction (HCI) and facial animation applications.

3. AdaBoost and cascade face detection

There are two main factors which determine the effectiveness of an FD system: the system's detection accuracy and its processing speed. Although the detection accuracy has been improved through many novel approaches during the last ten years, the speed is still a problem impeding the wide use of FD systems in real-time applications. One of the biggest steps toward improving the processing speed and making real-time implementations possible has been the introduction of the AdaBoost and cascade FD, proposed by Viola and Jones [4], [5]. In this chapter, the basics of the AdaBoost and cascade algorithms, their extensions and available implementations are briefly described. Then, a particular implementation is described in detail, focusing on topics such as training/test database assembly, modification of an existing Matlab implementation to include a multiple detection elimination module, and a GUI-based FD demo. FD implementations in OpenCV are also investigated.

3.1 Algorithm basics

The Viola and Jones technique achieves fast and robust FD based on three key contributions, as listed below:
(1) A new image representation called the Integral Image [16]. It allows a very fast computation of the features used by the detector.
(2) A simple and efficient classifier which is built using the AdaBoost learning algorithm [17], [18]. It allows selecting a small number of critical visual features from a very large set of potential features.
(3) A method for combining classifiers in a cascade [4]. Due to this cascade, background regions occupying most of the image area are quickly discarded during the first stages, while promising face-like regions are processed thoroughly during later stages.

The Viola and Jones technique relies on simple rectangular features, reminiscent of Haar basis functions [13]. These features are equivalent to intensity difference readings and are quite easy to compute. Three feature types with varying numbers of sub-rectangles are used: two two-rectangle features, one three-rectangle feature and one four-rectangle feature (these are described in more detail in Subchapter 3.1.1). Using rectangular features in an image instead of pixels provides a number of benefits: a sort of ad-hoc domain knowledge is implied, and a speed increase is achieved over pixel based systems. The calculation of the features is facilitated by the use of an image representation called the integral image. It allows calculating the sum over any rectangle in an image with only four references. The integral image itself can be calculated in one pass over the sample image, which also helps to speed up the algorithm. The integral image is similar to a summed area table used in computer graphics, but here it is applied to pixel area evaluation.

Haar-like features of different scales and positions form a very large feature set. In order to restrict it to a small number of critical features, the training stage utilizes an adaptive boosting algorithm (AdaBoost) [18]. Inference is enhanced with the use of AdaBoost: a small set of features is selected from a large set, and in doing so a strong hypothesis is formed, in this case resulting in a strong classifier. The computational efficiency is improved not only by having a reduced set of features and training the corresponding classifiers, but also by the use of a degenerate tree of classifiers, leading to a cascade structure [4]. This degenerate tree chains classifiers from general to more specific ones. That is, the first few classifiers are general enough to discount an image sub-window and save the time of further evaluation by the more specific classifiers down the chain; this can save a large amount of computation.

3.1.1 Features and integral image

In the Viola and Jones system, a simple feature set is used, related to the feature sets described in the paper of Papageorgiou et al. [13]. Viola and Jones emphasize that the use of a feature-based instead of a pixel-based system is important, especially for FD, due to the benefit of ad-hoc domain encoding. Features can be used to represent and distinguish between both facial information and background in a sample image.

Figure 3.1 Four basis Haar-like features a, b, c, d and examples overlaid on a real facial image

The top row in Fig. 3.1 shows the first and second features selected by AdaBoost. The first feature measures the difference in intensity between the region of the eyes and a region across the upper cheeks. The feature capitalizes on the observation that the eye region is often darker than the cheeks. The second feature compares the intensities in the eye regions to the intensity across the bridge of the nose [5]. In their simplest form, the features can be thought of as pixel intensity set evaluations. The sum of the luminance of the pixels in the white region of the feature is subtracted from the sum of the luminance in the remaining gray section. This difference value is used as the feature value, and such values can be combined to form a weak hypothesis on regions of the image. Within the implementation, four types of Haar-like features are chosen: the first with a horizontal division, the second with a vertical division, the third containing

two vertical divisions, and the last containing both a horizontal and a vertical division. The features are called Haar-like because of their resemblance to Haar basis functions (Haar wavelets) [13]. Having chosen the feature types, what follows is to find a way to compute them quickly. The integral image representation is such an efficient way. As described in [4] and [16], it is a form of summed area table and is constructed by simply taking the sum of the luminance values above and to the left of each pixel in the image. Thus, it is effectively the double integral of the sample image, first along the rows and then along the columns, as illustrated by Fig. 3.2.

Figure 3.2 Integral image. (a) The integral image at location (x, y) contains the sum of the pixels above and to the left. (b) The sum of the pixels within rectangle D can be computed with four array references. The value of the integral image at location 1 is the sum of the pixels in rectangle A. The value at location 2 is A+B, at location 3 is A+C, and at location 4 is A+B+C+D. The sum within D can be computed as 4+1-(2+3).

The brilliance of using an integral image to speed up the feature extraction lies in the fact that the sum over any rectangle in an image can be calculated from that image's integral image with only four references to the integral image, while the calculation of the integral image itself is done in only one pass over the image.
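The following Matlab lines sketch how such an integral image can be built and queried; the file name and window coordinates are illustrative assumptions, not taken from the thesis implementation.

```matlab
% Integral image and four-reference rectangle sum (illustrative sketch).
I  = double(imread('face.pgm'));             % hypothetical grayscale input image
ii = cumsum(cumsum(I, 1), 2);                % sum of pixels above and to the left
ii = [zeros(1, size(ii, 2) + 1); zeros(size(ii, 1), 1), ii];   % zero-pad top row / left column

% Sum of pixels in rows r1..r2 and columns c1..c2 using exactly four references
% to ii -- the 4 + 1 - (2 + 3) rule of Fig. 3.2(b).
rectSum = @(r1, c1, r2, c2) ii(r2+1, c2+1) - ii(r1, c2+1) - ii(r2+1, c1) + ii(r1, c1);

% Example: a two-rectangle Haar-like feature over a 24x24 window,
% upper-half intensity minus lower-half intensity.
f = rectSum(1, 1, 12, 24) - rectSum(13, 1, 24, 24);
```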

3.1.2 AdaBoost core method

Adaptive Boosting (AdaBoost) [60] is a machine learning algorithm, first formulated by Freund and Schapire [18]. It is a meta-algorithm, and can be used in conjunction with many other learning algorithms to improve their performance. AdaBoost is adaptive in the sense that subsequently built classifiers are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers; otherwise, it is less susceptible to the overfitting problem than most learning algorithms. AdaBoost calls a weak classifier repeatedly in a series of rounds t = 1, 2, ..., T. For each call, a distribution of weights D_t is updated to indicate the importance of the examples in the data set for the classification. On each round, the weights of incorrectly classified examples are increased (or alternatively, the weights of correctly classified examples are decreased), so that the new classifier focuses more on those examples. In the FD training algorithm, AdaBoost allows the designer to combine weak and simple learners to form an accurate and complex overall classifier, as shown in Fig. 3.3.

Figure 3.3 Basic scheme of AdaBoost and its main goals: (1) selecting a few sets of features representing faces as well as possible; (2) training a strong final classifier as a linear combination of these best features.

The following description of AdaBoost is based on the works [17] and [18]. In this particular example the weighting is modified slightly, in order to favor the classification of the face class. Whereas the classical implementation sets the initial weights to one over the database size, here the weighting is set such that the face examples have a higher weight or importance. This is shown as an initial step in the algorithm. The core idea behind the use of AdaBoost is the application of a weight distribution to the sample set and the modification of this distribution during each iteration of the algorithm, where the weights are first normalized and then adjusted. At the beginning the weight distribution is flat, but after each iteration of the algorithm each of the weak learners returns a hypothesis and the weight distribution is modified.

The AdaBoost algorithm for the FD application [17], [18]:

Given example images $(x_1, y_1), \ldots, (x_n, y_n)$, where $y_i = 0, 1$ for negative and positive examples respectively.

Initialize the weights $\omega_{1,i} = \frac{1}{2m}, \frac{1}{2l}$ for $y_i = 0, 1$ respectively, where $m$ and $l$ are the number of negatives and positives respectively.

For $t = 1, \ldots, T$:
1. Normalize the weights, $\omega_{t,i} \leftarrow \omega_{t,i} \big/ \sum_{j=1}^{n} \omega_{t,j}$ (3.3), so that $\omega_t$ is a probability distribution.
2. For each feature $j$, train a classifier $h_j$ which is restricted to using a single feature. The error is evaluated with respect to $\omega_t$: $e_j = \sum_i \omega_{t,i} \, |h_j(x_i) - y_i|$.
3. Choose the classifier $h_t$ with the lowest error $\varepsilon_t$.
4. Update the weights: $\omega_{t+1,i} = \omega_{t,i} \, \beta_t^{1-e_i}$ (3.4), where $e_i = 0$ if example $x_i$ is classified correctly, $e_i = 1$ otherwise, and $\beta_t = \frac{\varepsilon_t}{1 - \varepsilon_t}$.

The final strong classifier is:
$$C(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \geq \frac{1}{2} \sum_{t=1}^{T} \alpha_t \\ 0 & \text{otherwise} \end{cases} \qquad (3.5)$$
where $\alpha_t = \log \frac{1}{\beta_t}$.

$T$ hypotheses are constructed, each using a single feature. The final hypothesis is a weighted linear combination of the $T$ hypotheses, where the weights are inversely proportional to the training errors.
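The loop structure of the algorithm can be summarized by the following Matlab sketch. It assumes a precomputed N-by-F matrix X of Haar-feature values and an N-by-1 label vector y with entries 0/1; the per-feature threshold choice is deliberately crude, so this is only a minimal sketch of equations (3.3)-(3.5), not the training code used in the thesis.

```matlab
function [alpha, featIdx, theta, polarity] = adaboostSketch(X, y, T)
% Minimal AdaBoost sketch following (3.3)-(3.5); X, y, T are assumed inputs.
    [N, F] = size(X);
    m = sum(y == 0);  l = sum(y == 1);
    w = zeros(N, 1);
    w(y == 0) = 1 / (2 * m);                 % initial weights for negatives
    w(y == 1) = 1 / (2 * l);                 % initial weights for positives
    alpha = zeros(T, 1); featIdx = zeros(T, 1); theta = zeros(T, 1); polarity = zeros(T, 1);
    for t = 1:T
        w = w / sum(w);                      % (3.3): normalize to a distribution
        bestErr = inf;
        for j = 1:F                          % weak classifier = one feature + threshold + polarity
            thr = median(X(:, j));           % crude threshold choice (assumption)
            for p = [1, -1]
                h = double(p * X(:, j) < p * thr);
                err = sum(w .* abs(h - y));  % weighted error of this weak classifier
                if err < bestErr
                    bestErr = err; hBest = h;
                    featIdx(t) = j; theta(t) = thr; polarity(t) = p;
                end
            end
        end
        beta = bestErr / (1 - bestErr);
        e = abs(hBest - y);                  % 0 if classified correctly, 1 otherwise
        w = w .* beta .^ (1 - e);            % (3.4): down-weight correctly classified examples
        alpha(t) = log(1 / beta);            % weak-classifier weight used in (3.5)
    end
end
```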

3.1.3 Cascade of classifiers

FD is a rare-event detection problem: within any single image, an overwhelming majority of sub-windows are negative, and the target patterns occur at a much lower frequency than non-targets. Consider an image containing a single face: while executing the detection algorithm, a very large number of sub-windows over all positions and scales needs to be explored. Computationally, huge speed-ups are possible if the sparsity of faces among the input sub-windows can be exploited. It is best to remove as many non-face sub-windows from consideration as possible at the very early stages. Viola and Jones used a series of classifiers to achieve this goal, using initially simple features and selecting increasingly more complex features in later stages. Simpler classifiers are used to reject the majority of sub-windows before more complex classifiers are called to focus the subsequent processing on promising regions. The overall form of the detection process is that of a degenerate decision tree; Viola and Jones call it a cascade structure [4], which implements a coarse-to-fine search strategy. Fig. 3.4 depicts the cascade for FD.

Figure 3.4 Schematic depiction of the detection cascade. Green circle: sub-window potentially contains a face. Red circle: non-face sub-window. Stage: strong classifier trained by the AdaBoost algorithm.

A positive result from the first classifier triggers the evaluation of a second classifier which has also been adjusted to achieve higher detection rates. A positive result from the second classifier triggers a third classifier, and so on. A negative outcome at any point leads to the immediate rejection of the sub-window. Viola and Jones' final detector [5] is a 38-layer (stage) cascade of classifiers which includes a total of 6060 features. The speed of the cascaded detector is directly related to the number of sub-windows and to the number of features evaluated per scanned sub-window. They claimed that in practical tests a large majority of the sub-windows are discarded by the first two stages of the cascade, and an average of only about 8 features out of 6060 are evaluated per sub-window, which enables the cascaded detector to process an image in a fraction of a second, roughly 15 frames per second, much faster than any previous method.
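The early-rejection behaviour of the cascade can be sketched in a few lines of Matlab. Here 'stages' is an assumed struct array in which each element holds a strong classifier as a function handle ('classify') and its stage threshold; this illustrates the control flow only, not the Viola and Jones implementation.

```matlab
function isFace = evaluateCascade(stages, subWindow)
% Evaluate one sub-window against a cascade of strong classifiers (sketch).
    isFace = true;
    for s = 1:numel(stages)
        score = stages(s).classify(subWindow);   % stage score from the s-th strong classifier
        if score < stages(s).threshold           % stage threshold tuned for a high detection rate
            isFace = false;                      % early rejection: remaining stages are skipped
            return;
        end
    end
end
```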

3.2 Extensions and software packages

3.2.1 Floatboost methods

In a theoretical setting, AdaBoost can be regarded as a procedure minimizing an upper error bound which is an exponential function of the margin on the training set [18]. However, the ultimate goal in pattern classification applications is to achieve a minimum error rate, and a strong classifier learned by AdaBoost may not necessarily be the best for this criterion. On the other hand, AdaBoost needs an effective procedure for learning the optimal weak classifier, such as the log posterior ratio, which requires estimation of densities in the input data space. When the dimensionality is high, this is a difficult problem. To overcome these problems, Li et al. [61], [63] proposed the so-called FloatBoost to be incorporated into AdaBoost. FloatBoost learning uses a backtrack mechanism after each iteration of AdaBoost learning to minimize the error rate directly, rather than minimizing an exponential function of the margin as in the traditional AdaBoost algorithms. The idea of Floating Search was originally proposed in [62] for feature selection. The backtrack mechanism allows deletion of those weak classifiers that are non-effective or unfavourable in terms of the error rate. This leads to a strong classifier consisting of fewer weak classifiers. Because deletion during backtracking is performed according to the error rate, an improvement in classification error is also obtained.

Floating Search [62] was originally aimed at dealing with the non-monotonicity of straightforward sequential feature selection; non-monotonicity means that adding an additional feature may lead to a drop in performance. When a new feature is added, backtracking is performed to delete those features that cause performance drops. The limitations of sequential feature selection are thus amended, and the improvement is gained at the cost of increased computation due to the extended search. In addition, a statistical model is provided for learning weak classifiers and for effective feature selection in a high dimensional feature space. A base set of weak classifiers, defined as the log posterior ratio, is derived based on an over-complete set of scalar features. The weak classifiers in FloatBoost are formed out of simple features. In each stage, the weak classifier that reduces the error the most is selected. If any previously selected classifiers contribute to the error reduction less than the latest selected one, these classifiers are removed.

In general, AdaBoost is very fast, accurate and simple to implement, but it is a greedy search through a feature space of highly constrained features and needs a considerably long training time. In [61], [63] Li et al. showed that FloatBoost finds a more potent set of weak classifiers through a less greedy search, and yields a strong classifier consisting of fewer weak classifiers yet achieving lower error rates. Though it results in a faster and more accurate classifier at run-time, FloatBoost requires longer training times, reported as 5 times longer than for traditional AdaBoost.

3.2.2 OpenCV face detector

Another extension of the Viola and Jones method has been developed by Lienhart and Maydt [6]. They introduced a novel feature set designed for detecting in-plane rotated faces (cf. Fig. 3.5). In addition, their work presented analyses of different boosting algorithms (i.e. Discrete, Real, and Gentle AdaBoost), compared the performance of stumps and regression trees, and also analyzed the effect of the training data size. The ideas in this publication have been implemented in a face detector software package in Intel's Open Source Computer Vision Library (OpenCV) [66]. OpenCV is a collection of C functions and C++ classes that implement popular image processing and computer vision algorithms.

Figure 3.5 Extended set of Haar-like features

Lienhart and Maydt have made the distinction that feature based systems, as opposed to raw pixel based ones, are important in reducing the in-class variability while increasing the out-of-class variability. They also identified that when feature pools are combined with a selection method, such as Gentle AdaBoost, the capacity of the learning algorithm can be increased. In their technique, the over-complete Haar-like features have been modified. Based on the existing feature set, each feature is rotated by 45 degrees in both the positive and the negative direction; 14 of these rotated features were then selected. It was found that a decrease of approximately ten percent in the false alarm rate is achieved when compared to the original technique by Viola and Jones. The technique also evaluated a number of boosting algorithms and implemented the so-called Gentle AdaBoost, which in comparison places less focus on examples that are generally outliers. This is quite different from the discrete boosting process of AdaBoost [17], [18], yet it proved very effective in reducing the average number of features to be calculated.

In our experiments, the basic functionality of the OpenCV face detector [6], [66], [48] was tested in a Visual C++ environment; the cascade was trained [48] on the face database assembled during the development of this thesis, and the performance was evaluated. The OpenCV face detector was applied to FFD in the later stage of the project (see Subchapter 4.4).


More information

Tracking Moving Objects In Video Sequences Yiwei Wang, Robert E. Van Dyck, and John F. Doherty Department of Electrical Engineering The Pennsylvania State University University Park, PA16802 Abstract{Object

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

LOCAL SURFACE PATCH BASED TIME ATTENDANCE SYSTEM USING FACE. indhubatchvsa@gmail.com

LOCAL SURFACE PATCH BASED TIME ATTENDANCE SYSTEM USING FACE. indhubatchvsa@gmail.com LOCAL SURFACE PATCH BASED TIME ATTENDANCE SYSTEM USING FACE 1 S.Manikandan, 2 S.Abirami, 2 R.Indumathi, 2 R.Nandhini, 2 T.Nanthini 1 Assistant Professor, VSA group of institution, Salem. 2 BE(ECE), VSA

More information

Boosting. riedmiller@informatik.uni-freiburg.de

Boosting. riedmiller@informatik.uni-freiburg.de . Machine Learning Boosting Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de

More information

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Automatic Photo Quality Assessment Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Estimating i the photorealism of images: Distinguishing i i paintings from photographs h Florin

More information

An Energy-Based Vehicle Tracking System using Principal Component Analysis and Unsupervised ART Network

An Energy-Based Vehicle Tracking System using Principal Component Analysis and Unsupervised ART Network Proceedings of the 8th WSEAS Int. Conf. on ARTIFICIAL INTELLIGENCE, KNOWLEDGE ENGINEERING & DATA BASES (AIKED '9) ISSN: 179-519 435 ISBN: 978-96-474-51-2 An Energy-Based Vehicle Tracking System using Principal

More information

OBJECT TRACKING USING LOG-POLAR TRANSFORMATION

OBJECT TRACKING USING LOG-POLAR TRANSFORMATION OBJECT TRACKING USING LOG-POLAR TRANSFORMATION A Thesis Submitted to the Gradual Faculty of the Louisiana State University and Agricultural and Mechanical College in partial fulfillment of the requirements

More information

How To Fix Out Of Focus And Blur Images With A Dynamic Template Matching Algorithm

How To Fix Out Of Focus And Blur Images With A Dynamic Template Matching Algorithm IJSTE - International Journal of Science Technology & Engineering Volume 1 Issue 10 April 2015 ISSN (online): 2349-784X Image Estimation Algorithm for Out of Focus and Blur Images to Retrieve the Barcode

More information

False alarm in outdoor environments

False alarm in outdoor environments Accepted 1.0 Savantic letter 1(6) False alarm in outdoor environments Accepted 1.0 Savantic letter 2(6) Table of contents Revision history 3 References 3 1 Introduction 4 2 Pre-processing 4 3 Detection,

More information

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

More information

Application of Face Recognition to Person Matching in Trains

Application of Face Recognition to Person Matching in Trains Application of Face Recognition to Person Matching in Trains May 2008 Objective Matching of person Context : in trains Using face recognition and face detection algorithms With a video-surveillance camera

More information

Multi-modal Human-Computer Interaction. Attila Fazekas. Attila.Fazekas@inf.unideb.hu

Multi-modal Human-Computer Interaction. Attila Fazekas. Attila.Fazekas@inf.unideb.hu Multi-modal Human-Computer Interaction Attila Fazekas Attila.Fazekas@inf.unideb.hu Szeged, 04 July 2006 Debrecen Big Church Multi-modal Human-Computer Interaction - 2 University of Debrecen Main Building

More information

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

More information

Active Learning with Boosting for Spam Detection

Active Learning with Boosting for Spam Detection Active Learning with Boosting for Spam Detection Nikhila Arkalgud Last update: March 22, 2008 Active Learning with Boosting for Spam Detection Last update: March 22, 2008 1 / 38 Outline 1 Spam Filters

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

More information

Feature Selection vs. Extraction

Feature Selection vs. Extraction Feature Selection In many applications, we often encounter a very large number of potential features that can be used Which subset of features should be used for the best classification? Need for a small

More information

A Learning Based Method for Super-Resolution of Low Resolution Images

A Learning Based Method for Super-Resolution of Low Resolution Images A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 emre.ugur@ceng.metu.edu.tr Abstract The main objective of this project is the study of a learning based method

More information

Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

More information

A Real Time Driver s Eye Tracking Design Proposal for Detection of Fatigue Drowsiness

A Real Time Driver s Eye Tracking Design Proposal for Detection of Fatigue Drowsiness A Real Time Driver s Eye Tracking Design Proposal for Detection of Fatigue Drowsiness Nitin Jagtap 1, Ashlesha kolap 1, Mohit Adgokar 1, Dr. R.N Awale 2 PG Scholar, Dept. of Electrical Engg., VJTI, Mumbai

More information

Face Locating and Tracking for Human{Computer Interaction. Carnegie Mellon University. Pittsburgh, PA 15213

Face Locating and Tracking for Human{Computer Interaction. Carnegie Mellon University. Pittsburgh, PA 15213 Face Locating and Tracking for Human{Computer Interaction Martin Hunke Alex Waibel School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Abstract Eective Human-to-Human communication

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Image Normalization for Illumination Compensation in Facial Images

Image Normalization for Illumination Compensation in Facial Images Image Normalization for Illumination Compensation in Facial Images by Martin D. Levine, Maulin R. Gandhi, Jisnu Bhattacharyya Department of Electrical & Computer Engineering & Center for Intelligent Machines

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

Introduction to Pattern Recognition

Introduction to Pattern Recognition Introduction to Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Colorado School of Mines Computer Vision Professor William Hoff

Colorado School of Mines Computer Vision Professor William Hoff Professor William Hoff Dept of Electrical Engineering &Computer Science http://inside.mines.edu/~whoff/ 1 Introduction to 2 What is? A process that produces from images of the external world a description

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Efficient Attendance Management: A Face Recognition Approach

Efficient Attendance Management: A Face Recognition Approach Efficient Attendance Management: A Face Recognition Approach Badal J. Deshmukh, Sudhir M. Kharad Abstract Taking student attendance in a classroom has always been a tedious task faultfinders. It is completely

More information

Vision based Vehicle Tracking using a high angle camera

Vision based Vehicle Tracking using a high angle camera Vision based Vehicle Tracking using a high angle camera Raúl Ignacio Ramos García Dule Shu gramos@clemson.edu dshu@clemson.edu Abstract A vehicle tracking and grouping algorithm is presented in this work

More information

Taking Inverse Graphics Seriously

Taking Inverse Graphics Seriously CSC2535: 2013 Advanced Machine Learning Taking Inverse Graphics Seriously Geoffrey Hinton Department of Computer Science University of Toronto The representation used by the neural nets that work best

More information

Interactive Offline Tracking for Color Objects

Interactive Offline Tracking for Color Objects Interactive Offline Tracking for Color Objects Yichen Wei Jian Sun Xiaoou Tang Heung-Yeung Shum Microsoft Research Asia, Beijing, China {yichenw,jiansun,xitang,hshum}@microsoft.com Abstract In this paper,

More information

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

More information

Poker Vision: Playing Cards and Chips Identification based on Image Processing

Poker Vision: Playing Cards and Chips Identification based on Image Processing Poker Vision: Playing Cards and Chips Identification based on Image Processing Paulo Martins 1, Luís Paulo Reis 2, and Luís Teófilo 2 1 DEEC Electrical Engineering Department 2 LIACC Artificial Intelligence

More information

Galaxy Morphological Classification

Galaxy Morphological Classification Galaxy Morphological Classification Jordan Duprey and James Kolano Abstract To solve the issue of galaxy morphological classification according to a classification scheme modelled off of the Hubble Sequence,

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

More information

AN IMPROVED DOUBLE CODING LOCAL BINARY PATTERN ALGORITHM FOR FACE RECOGNITION

AN IMPROVED DOUBLE CODING LOCAL BINARY PATTERN ALGORITHM FOR FACE RECOGNITION AN IMPROVED DOUBLE CODING LOCAL BINARY PATTERN ALGORITHM FOR FACE RECOGNITION Saurabh Asija 1, Rakesh Singh 2 1 Research Scholar (Computer Engineering Department), Punjabi University, Patiala. 2 Asst.

More information

Parallelized Architecture of Multiple Classifiers for Face Detection

Parallelized Architecture of Multiple Classifiers for Face Detection Parallelized Architecture of Multiple s for Face Detection Author(s) Name(s) Author Affiliation(s) E-mail Abstract This paper presents a parallelized architecture of multiple classifiers for face detection

More information

Making Machines Understand Facial Motion & Expressions Like Humans Do

Making Machines Understand Facial Motion & Expressions Like Humans Do Making Machines Understand Facial Motion & Expressions Like Humans Do Ana C. Andrés del Valle & Jean-Luc Dugelay Multimedia Communications Dpt. Institut Eurécom 2229 route des Crêtes. BP 193. Sophia Antipolis.

More information

Evaluation of Optimizations for Object Tracking Feedback-Based Head-Tracking

Evaluation of Optimizations for Object Tracking Feedback-Based Head-Tracking Evaluation of Optimizations for Object Tracking Feedback-Based Head-Tracking Anjo Vahldiek, Ansgar Schneider, Stefan Schubert Baden-Wuerttemberg State University Stuttgart Computer Science Department Rotebuehlplatz

More information

Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

More information

HSI BASED COLOUR IMAGE EQUALIZATION USING ITERATIVE n th ROOT AND n th POWER

HSI BASED COLOUR IMAGE EQUALIZATION USING ITERATIVE n th ROOT AND n th POWER HSI BASED COLOUR IMAGE EQUALIZATION USING ITERATIVE n th ROOT AND n th POWER Gholamreza Anbarjafari icv Group, IMS Lab, Institute of Technology, University of Tartu, Tartu 50411, Estonia sjafari@ut.ee

More information

Automated Attendance Management System using Face Recognition

Automated Attendance Management System using Face Recognition Automated Attendance Management System using Face Recognition Mrunmayee Shirodkar Varun Sinha Urvi Jain Bhushan Nemade Student, Thakur College Of Student, Thakur College Of Student, Thakur College of Assistant

More information

A Study of Automatic License Plate Recognition Algorithms and Techniques

A Study of Automatic License Plate Recognition Algorithms and Techniques A Study of Automatic License Plate Recognition Algorithms and Techniques Nima Asadi Intelligent Embedded Systems Mälardalen University Västerås, Sweden nai10001@student.mdh.se ABSTRACT One of the most

More information

FACE RECOGNITION BASED ATTENDANCE MARKING SYSTEM

FACE RECOGNITION BASED ATTENDANCE MARKING SYSTEM Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 2, February 2014,

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training

More information

Tracking and Recognition in Sports Videos

Tracking and Recognition in Sports Videos Tracking and Recognition in Sports Videos Mustafa Teke a, Masoud Sattari b a Graduate School of Informatics, Middle East Technical University, Ankara, Turkey mustafa.teke@gmail.com b Department of Computer

More information

In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data.

In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data. MATHEMATICS: THE LEVEL DESCRIPTIONS In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data. Attainment target

More information

Bernice E. Rogowitz and Holly E. Rushmeier IBM TJ Watson Research Center, P.O. Box 704, Yorktown Heights, NY USA

Bernice E. Rogowitz and Holly E. Rushmeier IBM TJ Watson Research Center, P.O. Box 704, Yorktown Heights, NY USA Are Image Quality Metrics Adequate to Evaluate the Quality of Geometric Objects? Bernice E. Rogowitz and Holly E. Rushmeier IBM TJ Watson Research Center, P.O. Box 704, Yorktown Heights, NY USA ABSTRACT

More information

A Reliability Point and Kalman Filter-based Vehicle Tracking Technique

A Reliability Point and Kalman Filter-based Vehicle Tracking Technique A Reliability Point and Kalman Filter-based Vehicle Tracing Technique Soo Siang Teoh and Thomas Bräunl Abstract This paper introduces a technique for tracing the movement of vehicles in consecutive video

More information

Character Animation from 2D Pictures and 3D Motion Data ALEXANDER HORNUNG, ELLEN DEKKERS, and LEIF KOBBELT RWTH-Aachen University

Character Animation from 2D Pictures and 3D Motion Data ALEXANDER HORNUNG, ELLEN DEKKERS, and LEIF KOBBELT RWTH-Aachen University Character Animation from 2D Pictures and 3D Motion Data ALEXANDER HORNUNG, ELLEN DEKKERS, and LEIF KOBBELT RWTH-Aachen University Presented by: Harish CS-525 First presentation Abstract This article presents

More information

Edge tracking for motion segmentation and depth ordering

Edge tracking for motion segmentation and depth ordering Edge tracking for motion segmentation and depth ordering P. Smith, T. Drummond and R. Cipolla Department of Engineering University of Cambridge Cambridge CB2 1PZ,UK {pas1001 twd20 cipolla}@eng.cam.ac.uk

More information

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM. DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,

More information

A Short Introduction to Computer Graphics

A Short Introduction to Computer Graphics A Short Introduction to Computer Graphics Frédo Durand MIT Laboratory for Computer Science 1 Introduction Chapter I: Basics Although computer graphics is a vast field that encompasses almost any graphical

More information

Chapter 6: The Information Function 129. CHAPTER 7 Test Calibration

Chapter 6: The Information Function 129. CHAPTER 7 Test Calibration Chapter 6: The Information Function 129 CHAPTER 7 Test Calibration 130 Chapter 7: Test Calibration CHAPTER 7 Test Calibration For didactic purposes, all of the preceding chapters have assumed that the

More information

Journal of Industrial Engineering Research. Adaptive sequence of Key Pose Detection for Human Action Recognition

Journal of Industrial Engineering Research. Adaptive sequence of Key Pose Detection for Human Action Recognition IWNEST PUBLISHER Journal of Industrial Engineering Research (ISSN: 2077-4559) Journal home page: http://www.iwnest.com/aace/ Adaptive sequence of Key Pose Detection for Human Action Recognition 1 T. Sindhu

More information

Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -

Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 - Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create

More information

Static Environment Recognition Using Omni-camera from a Moving Vehicle

Static Environment Recognition Using Omni-camera from a Moving Vehicle Static Environment Recognition Using Omni-camera from a Moving Vehicle Teruko Yata, Chuck Thorpe Frank Dellaert The Robotics Institute Carnegie Mellon University Pittsburgh, PA 15213 USA College of Computing

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Using Real Time Computer Vision Algorithms in Automatic Attendance Management Systems

Using Real Time Computer Vision Algorithms in Automatic Attendance Management Systems Using Real Time Computer Vision Algorithms in Automatic Attendance Management Systems Visar Shehu 1, Agni Dika 2 Contemporary Sciences and Technologies - South East European University, Macedonia 1 Contemporary

More information

Part-Based Recognition

Part-Based Recognition Part-Based Recognition Benedict Brown CS597D, Fall 2003 Princeton University CS 597D, Part-Based Recognition p. 1/32 Introduction Many objects are made up of parts It s presumably easier to identify simple

More information

8. KNOWLEDGE BASED SYSTEMS IN MANUFACTURING SIMULATION

8. KNOWLEDGE BASED SYSTEMS IN MANUFACTURING SIMULATION - 1-8. KNOWLEDGE BASED SYSTEMS IN MANUFACTURING SIMULATION 8.1 Introduction 8.1.1 Summary introduction The first part of this section gives a brief overview of some of the different uses of expert systems

More information

Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 269 Class Project Report

Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 269 Class Project Report Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 69 Class Project Report Junhua Mao and Lunbo Xu University of California, Los Angeles mjhustc@ucla.edu and lunbo

More information

Classifying Manipulation Primitives from Visual Data

Classifying Manipulation Primitives from Visual Data Classifying Manipulation Primitives from Visual Data Sandy Huang and Dylan Hadfield-Menell Abstract One approach to learning from demonstrations in robotics is to make use of a classifier to predict if

More information

Illumination, Expression and Occlusion Invariant Pose-Adaptive Face Recognition System for Real- Time Applications

Illumination, Expression and Occlusion Invariant Pose-Adaptive Face Recognition System for Real- Time Applications Illumination, Expression and Occlusion Invariant Pose-Adaptive Face Recognition System for Real- Time Applications Shireesha Chintalapati #1, M. V. Raghunadh *2 Department of E and CE NIT Warangal, Andhra

More information

A Review on Driver Face Monitoring Systems for Fatigue and Distraction Detection

A Review on Driver Face Monitoring Systems for Fatigue and Distraction Detection , pp.73-100 http://dx.doi.org/10.14257/ijast.2014.64.07 A Review on Driver Face Monitoring Systems for Fatigue and Distraction Detection Mohamad-Hoseyn Sigari 1, Muhammad-Reza Pourshahabi 2 Mohsen Soryani

More information

Potential of face area data for predicting sharpness of natural images

Potential of face area data for predicting sharpness of natural images Potential of face area data for predicting sharpness of natural images Mikko Nuutinen a, Olli Orenius b, Timo Säämänen b, Pirkko Oittinen a a Dept. of Media Technology, Aalto University School of Science

More information

1 Example of Time Series Analysis by SSA 1

1 Example of Time Series Analysis by SSA 1 1 Example of Time Series Analysis by SSA 1 Let us illustrate the 'Caterpillar'-SSA technique [1] by the example of time series analysis. Consider the time series FORT (monthly volumes of fortied wine sales

More information

Speed Performance Improvement of Vehicle Blob Tracking System

Speed Performance Improvement of Vehicle Blob Tracking System Speed Performance Improvement of Vehicle Blob Tracking System Sung Chun Lee and Ram Nevatia University of Southern California, Los Angeles, CA 90089, USA sungchun@usc.edu, nevatia@usc.edu Abstract. A speed

More information

Big Data: Rethinking Text Visualization

Big Data: Rethinking Text Visualization Big Data: Rethinking Text Visualization Dr. Anton Heijs anton.heijs@treparel.com Treparel April 8, 2013 Abstract In this white paper we discuss text visualization approaches and how these are important

More information

Real-Time Tracking of Pedestrians and Vehicles

Real-Time Tracking of Pedestrians and Vehicles Real-Time Tracking of Pedestrians and Vehicles N.T. Siebel and S.J. Maybank. Computational Vision Group Department of Computer Science The University of Reading Reading RG6 6AY, England Abstract We present

More information

Multiscale Object-Based Classification of Satellite Images Merging Multispectral Information with Panchromatic Textural Features

Multiscale Object-Based Classification of Satellite Images Merging Multispectral Information with Panchromatic Textural Features Remote Sensing and Geoinformation Lena Halounová, Editor not only for Scientific Cooperation EARSeL, 2011 Multiscale Object-Based Classification of Satellite Images Merging Multispectral Information with

More information