Recognizing focus areas using isophote pupil location

Gijs Kruitbosch
Universiteit van Amsterdam

June 27, 2008

Course: Bachelor Thesis Project 2008
BA Kunstmatige Intelligentie
Universiteit van Amsterdam
Spui 21, 1012 WX Amsterdam
The Netherlands

Supervisors: Theo Gevers and Nicu Sebe
Intelligent Sensory Information Systems
Informatics Institute
Universiteit van Amsterdam
Kruislaan 403, 1098 SJ Amsterdam
The Netherlands

Abstract

In several subject areas within computer science, such as usability research and alternative input devices, eye gaze tracking plays an important role. While there are several existing approaches (e.g. [17, 16, 15]), most of them are either invasive, requiring the attachment of sensors, cameras or controllers to the user's head, or prohibitively expensive, such as those using infrared corneal reflection techniques and stereovision. In this paper, I introduce a method that tracks the eye's gaze using only a webcam. It relies on face detection using boosted cascade classifiers, as proposed in [20], 3D reconstruction using POSIT, as described in [5], and pupil detection using isophote centers, as discussed in [19]. It also uses the isophotes to detect the eye corners, and Lucas Kanade tracking, as developed in [13], to keep track of the eye corners once found. This method was implemented in a proof of concept application using OpenCV. This proof of concept has been evaluated, uncovering several important issues with the approach used. These concern not only the pupil tracking itself, but also the effective localization of a face in 3D. Based on the problems discovered, several approaches for further improvement are suggested. In conclusion, while this implementation did not achieve an accuracy anywhere near sufficient to be usable in general real-world applications, I remain hopeful as to its eventual usefulness in the field.

Acknowledgments

This research could not have happened without the aid of Theo Gevers and Nicu Sebe, my supervisors, or their PhD student Roberto Valenti. Nor would I have been as motivated without the help of Leo Dorst and Andrea Haker. For helping finalize the honours programme in the form of a longer thesis, I am indebted to Bert Bredeweg. Finally, I am grateful to the Mozilla Foundation for facilitating my attendance of the SightCity conference in Frankfurt, where I was able to discuss the issues I faced with several people who had confronted them before.

Contents

1 Introduction
  1.1 Earlier approaches to eye gaze tracking
  1.2 Current approach
2 Research goals
3 Background Theory
  3.1 Human Eyes
  3.2 Pinhole Camera Model
  3.3 3D Transformation
4 Approach
  4.1 Face Detection
  4.2 Pupil Detection
  4.3 Face Location Determination
  4.4 Eyecorner Detection
    4.4.1 Geometrical Constraints
    4.4.2 Aggregating over multiple frames
    4.4.3 Tracking using Lucas Kanade optical flow estimation
  4.5 Eye Gaze Determination
  4.6 Screen Intersect Location
    4.6.1 Ray-plane intersection
    4.6.2 Point in rectangle
    4.6.3 Screen coordinate
  4.7 Synthesis
5 Implementation
  5.1 Limitations
  5.2 Efficiency
6 Discussion
  6.1 Face localization in 3D
  6.2 Eye corner detection
  6.3 Pupil detection
7 Proposed Improvements
  7.1 Alternative face localization
  7.2 Alternative eye corner detection and improvements in pupil detection
8 Conclusion

1 Introduction

Humans have always, to a very large extent, relied on their eyesight when going about their business in the world. Because of this importance as a sensor, detecting the focus of a person's eyes will in many instances give a strong indication of the focus of the person themselves. This focus function works in two directions: visual focus helps someone decide where to focus their attention, and attention helps determine someone's visual focus. While interviews, surveys and introspection can give indirect indications, these approaches are obviously limited. Therefore, it can sometimes be desirable to record the focus of someone's gaze directly, rather than asking them to comment on it. For more background on the complex interactions between attention, thought and the actual visual focus exhibited by the pupil movements, I refer the reader to [7].

1.1 Earlier approaches to eye gaze tracking

Various approaches have been used to determine the focus of someone's gaze. Originally, the most commonly used approach was based on measuring the electric potential differences of the skin area around the eyes. Needless to say, this was quite intrusive, and did not allow for completely free head movement because of the wires attached to the sensors.

A second approach is to have the subject wear special contact lenses or otherwise manipulate the surface of the eye, and to deduce its movement using a coil and electromagnetism. This method is very accurate [23], but has two strong disadvantages: it causes a large amount of discomfort to the subject, and it does not track head movement, making it unsuitable for detecting the actual point of focus (which is, of course, also influenced by the pose of the subject's head).

It is also possible to use photo or video material of the eyes alone to accomplish this task, but this suffers from the same problem as the contact lens method: it is unable to determine head movement and therefore cannot give any indication of the actual point at which the user is focusing. Additionally, this method requires a head-mounted camera, which is also often uncomfortable.

Finally, a more recent and popular method is to capture video material of the user's face and use (infrared) light reflection off the cornea of the eye (for an illustration of the anatomy of the eye, please refer to figure 1). Because of the shape of the eye, several reflections can be visible if a relatively high-resolution camera is used, and the difference in location between these reflections and the pupil location can account for both rotation and translation of the actual eye (for a more in-depth look at the biology of the human eye, please refer to section 3.1). This approach is very accurate, but usually at the expense of requiring head stabilisation, which adds more discomfort for the subject.

1.2 Current approach

Currently, researchers at the Universiteit van Amsterdam are working on pupil location detection based on isophotes. In principle, this approach attempts to use isocenters (the centerpoints of curved isophotes) to locate the pupil. For a more detailed discussion of this approach, I refer the reader to section 4.2. Using various techniques to deduce the other needed data (such as head position, and the position of the pupil relative to the eye corners and face), it would be possible to use this method of eye gaze detection without having to rely on (typically expensive) infrared equipment. This is the approach that I attempted to use for this thesis.
The remainder of this thesis is organized as follows: first I will discuss the research goals of the project (section 2), then some of the basic background theory (section 3), and then a detailed description of the approach I used (section 4), with some notes about the actual implementation in section 5. I will then discuss the functionality and problems of this implementation in section 6. Finally, some ideas for improvements and some conclusions about the project are discussed in sections 7 and 8, respectively.

2 Research goals

In this project, I attempted to investigate the possibilities of using only a webcam to do eye tracking, without using stereovision, infrared lights, headrests, or other tools that are either expensive or invasive for the user. My main research question was:

To what extent is it possible to identify areas of the screen on which the user is focusing, by locating the user's pupil location using an ordinary webcam?

Several auxiliary questions were considered:

- Is it possible to reliably use the detection of the location of the pupil and the face to obtain a gaze direction?
- Using the gaze direction, is it possible to reliably determine which area of the screen is being viewed?
- Can this be done in real time, with an ordinary webcam?

When attempting to answer these questions, the goal was to create an implementation that would do eye tracking with just a webcam. Ideally, it would return a point on the screen that would be accurate to a reasonable degree, and it would work in real time.

3 Background Theory

In this section, I will outline some of the most basic theory that is required to understand the concept, approach and implementation of eye tracking. First I will discuss some of the biological aspects of the human eye, then some theory behind the pinhole camera model used in computer vision, and finally some background on the 3D transformations between the world models of the two viewing systems involved here: the webcam and the subject looking at the computer.

3.1 Human Eyes

In humans, the eyes are one of the most relied-upon senses. In the human eye, light enters through the pupil and is projected by the lens onto the retina. The muscles around the eye are able to adjust the optical power (focal length and so on) of the eye by controlling the curvature of the lens. Using these muscles, humans are able to focus their eyes. On the retina, several types of cells (traditionally called "rods" and "cones") function as light receptors. The cones are best able to distinguish colour in high-intensity light, while the rods are best able to distinguish between dim and achromatic light, roughly corresponding to humans' day- and night-time vision, respectively. When observing objects in situations with sufficient light, humans attempt to use their fovea (see figure 1), an area in the retina with a high density of cones and no rods. This area provides the sharpest vision.

In order to focus on objects in different locations in the world, humans are able to move their eyes while keeping their head in the same position. Note that this is not as self-evident as it seems, for many animals are not able to accomplish this. For these movements, humans use six different muscles, allowing the eye six degrees of freedom: translation in three directions, and rotation about the three different axes. I will not detail the specifics of the movements of the eyes here, but will merely point out that several different movements of the eyes may be distinguished: saccades, in which the eye shifts focus abruptly from one point to another; smooth pursuits, in which the eye tracks a moving object; and fixation, the adjustment of the eye while focusing on a particular stationary point (where the pupil need not necessarily remain entirely still). A rather old, but still quite useful, introduction to the topic of eye movements may be found in [2].

In terms of range, the horizontal field of view that both eyes are capable of viewing spans 114 degrees [10]. This is smaller than the field of view of the individual eyes, as each eye is capable of viewing a small area in the field of view that the other eye cannot reach.

Figure 1: Schematic drawing of the human eye.

3.2 Pinhole Camera Model

In terms of optics, the pinhole camera model is an idealized model of the traditional pinhole camera. In the model, the pinhole is treated as a single point in space, without any lenses being involved. This allows one to treat the transformation from 3D points in space to 2D points on the image plane as a simple projection, which can be represented as a matrix multiplication, as shown in equation 1:

x_2d = P X_3d    (1)

where P is the projection (camera) matrix. Of course, the fact that a normal camera really does have a lens, and the fact that the resulting 2D image in a digital camera is discrete (that is, pixellated) rather than continuous, mean that one has to be careful when applying this model. A more detailed description of why this is the case, what distortion effects can be caused by various lenses, and how to remedy them, is outside the scope of this paper, but may be found in existing literature, e.g. [6] and [9].
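To make the projection in equation 1 concrete, the following minimal C++ sketch applies the pinhole model in its scalar form, x = f·X/Z and y = f·Y/Z, which is the form used for the projections later in this thesis. The focal length and the 3D point are illustrative values, not measurements from the actual setup.

    #include <cstdio>

    // Minimal pinhole projection: a 3D point (X, Y, Z) in camera coordinates is
    // mapped to image coordinates (x, y) = (f * X / Z, f * Y / Z).
    struct Point3 { double X, Y, Z; };
    struct Point2 { double x, y; };

    Point2 projectPinhole(const Point3 &p, double f) {
        // Assumes p.Z > 0, i.e. the point lies in front of the camera.
        return Point2{ f * p.X / p.Z, f * p.Y / p.Z };
    }

    int main() {
        const double f = 500.0;                 // focal length in pixels (illustrative)
        Point3 nose = { 0.05, -0.02, 0.60 };    // a point roughly 60 cm from the camera
        Point2 img = projectPinhole(nose, f);
        std::printf("projected to (%.2f, %.2f)\n", img.x, img.y);
        return 0;
    }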

I will use the pinhole camera model in the attempt to build the eye tracker, without accounting for any lens discrepancies myself. This explicitly runs the risk of not being very accurate, but there are several factors which I decided made endeavouring to compensate for the inaccuracies of the pinhole model unprofitable:

- In an average webcam, the noise and light conditions will usually cause problems in effectively identifying features. The discrepancy caused by lens distortion would, in comparison, be relatively minor.
- The average person will use their computer more or less centrally focused on the screen, above which the webcam will usually be attached. In the middle of the image, the distortions not captured by the pinhole model are smaller, and so they are even less important.
- The resolution of a webcam is so small that any distortion is consequently also small. This is, in fact, another point on which using the pinhole model is not entirely realistic, as the difference between the discrete webcam image and the continuous image the pinhole model assumes becomes even more pronounced. However, there is no way to solve the basic problem that a webcam does not have a very high resolution without violating the assumptions of my research goals (section 2), where I explicitly specified that I wanted to use basic equipment.

3.3 3D Transformation

It is important to realize that in the theory and implementation of the eye tracker, I will deal with two 3D coordinate systems:

- the camera coordinate system;
- the face (subject) coordinate system;

and two 2D coordinate systems:

- the computer screen coordinate system;
- the webcam image coordinate system.

It is important to always keep in mind in which coordinate system calculations are made. Roughly, one can say that the webcam's observation of the subject goes from 2D points in the webcam image, using 3D model coordinates that correspond to the size of a human face, to 3D points in space (in the camera coordinate system) for the face and eye positions. For the actual gaze, we start with a 3D vector in the face model coordinates that is transformed to a 3D vector (point) in space, using the deduced transformation to the camera coordinate system. The ray along this vector is then intersected with the screen plane, obtaining a 3D intersection point. This point then needs to be transformed into a 2D point in screen coordinates. I will explain the specifics of transforming between these different coordinate systems in the next section.

4 Approach

My eventual goal is to use the sequence of frames produced by the webcam to deduce a point at which the subject that is visible in these frames is looking. In principle, there are two factors involved that determine the point on the screen a subject is looking at. These are the position and rotation of the head relative to the screen, and the direction of the gaze of the user's eyes. So, from the 2D webcam image, it is necessary to deduce the 3D position and rotation of the head, and the direction of the gaze of the eyes.

Figure 2: The 3D coordinate systems: the camera (red dot at the origin) coordinate system and X, Y, Z axes, with the face (in green) and the screen (in blue). The face shows its eyes (red) and center (green) with its own coordinate system (grey, unlabelled).

However, it is also necessary to determine the position of the screen relative to the webcam, in order to deduce where on the screen the user is looking. First, I will focus on the detection of the face and pupils, and how we can use these to determine the location of the user's head in 3D. Then I will focus on how we can deduce the direction of the gaze of the user's eyes. Finally, I will consider how we can combine this information to make an informed decision about where the user is actually looking.

4.1 Face Detection

First, we must detect the user's head. In this case, the pupil detection code which I received for my work already used a cascade of boosted classifiers working with Haar-like features, as introduced in the seminal paper by Viola and Jones [20], but there are certainly other possibilities. The detector I used is built into the OpenCV graphics library. It uses learned classifications of simple Haar-like features. Haar-like features represent oriented contrasts in small image regions, such as "dark region left, light region right". Using multiple such features (a cascade) allows one to represent complex structures such as the human face. The face detector is first trained on a set of positive examples (faces) and negative examples (random image noise). The face detector returns a rectangle that encases the face in the 2D image, and therefore gives us rough 2D positions of the face. This is illustrated in figure 3.

Figure 3: The face (red square) as detected by the boosted cascade classifier.
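As an illustration of this step, the sketch below runs OpenCV's boosted cascade face detector on a single webcam frame. It is not the code of the proof of concept itself: it uses the current OpenCV C++ API rather than the 2008-era C API, and the cascade file name is a placeholder that must point to a local copy of a trained frontal-face cascade.

    #include <opencv2/objdetect.hpp>
    #include <opencv2/imgproc.hpp>
    #include <opencv2/videoio.hpp>
    #include <vector>
    #include <cstdio>

    int main() {
        // Placeholder path: a trained frontal-face cascade as shipped with OpenCV.
        cv::CascadeClassifier faceCascade;
        if (!faceCascade.load("haarcascade_frontalface_default.xml")) return 1;

        cv::VideoCapture cam(0);                 // default webcam
        cv::Mat frame, gray;
        if (!cam.read(frame)) return 1;

        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        cv::equalizeHist(gray, gray);            // improve contrast for the classifier

        std::vector<cv::Rect> faces;
        faceCascade.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(80, 80));

        if (!faces.empty())
            std::printf("face at (%d, %d), size %dx%d\n",
                        faces[0].x, faces[0].y, faces[0].width, faces[0].height);
        return 0;
    }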

4.2 Pupil Detection

Just as for the face detection procedure, the code I received already contained a method to detect pupil locations. The pupil detection uses a region in roughly the upper half of the face to look for the eyes and their pupils. In this region, it uses isophotes: curves of equal intensity in the image. It then calculates the curvature of these isophotes, and finally tries to find the centers of these curvatures. For each curvature, a vote is registered for its center, weighted by the curvedness. Because isophote density is greater around the edges of objects, and the curvedness is influenced by this density, isophotes along object edges have a larger say in where isocenters are found. The votes are summed, and the point with the highest number of votes is used to determine the isocenter using a simple mean-shift algorithm. Using this method turns out to be a very workable way of finding the pupil positions. For a more extensive discussion of this method, I refer the reader to the paper by Valenti et al. [19].

4.3 Face Location Determination

Now that we know the location of the face, and two points on the face (the pupils), we can attempt to locate the face in 3D. For this we need one additional parameter: the focal length of the camera. We use the center of the detected face as representing a point, in the nose, at roughly the same depth as the eyes. Now that we have 3 points and the focal length, we are able to use the POSIT algorithm.

POSIT reconstructs the 3D position of a rigid body model (one where the distances between the points in the model do not change), given the 3D model points, the 2D image points corresponding to these model points, and the focal length used to produce the image. The POSIT algorithm assumes that the differences between the Z coordinates of the object model points are very small compared to the distance between the camera and the object model points. This assumption allows for the use of a Scaled Orthographic Projection (SOP) rather than a true perspective projection. So, the approximate image coordinates x_i and y_i of an object point i with camera world coordinates X_i, Y_i and Z_i in a scaled orthographic projection are:

x_i = f X_i / Z_0,  y_i = f Y_i / Z_0

(where Z_0 is the distance from the camera to a reference point on the object), compared to the normal perspective projection:

x_i = f X_i / Z_i,  y_i = f Y_i / Z_i

This simplification allows the calculation of an approximate pose. Then, a simplification of the original object model can be used, where the different object points are at the same Z coordinate (but still on their original line of sight from the camera). If we use this deformation of the object model with the approximate pose we calculated earlier, using the scaled orthographic projection model, we should get back the same image points. If we do, we've found a correct pose. If not, we repeat the previous steps with the newly found image points. It can be shown that iteration using this method converges on the actual pose of the object, given that the distance between the camera and the object is sufficiently large compared to the distances between the object points themselves. For a full explanation of the algorithm and the proofs associated with it, please refer to the original paper [5].
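To give a feel for the SOP approximation that POSIT starts from, the short sketch below compares the scaled orthographic projection x = f·X/Z_0 with the true perspective projection x = f·X/Z_i for a few object points whose depth offsets are small relative to the camera distance. All numbers are illustrative, not taken from the actual face model.

    #include <cstdio>

    // Compare the true perspective projection x = f * X / Z with the scaled
    // orthographic projection x = f * X / Z0 used by POSIT as its approximation.
    int main() {
        const double f  = 500.0;   // focal length in pixels (illustrative)
        const double Z0 = 600.0;   // reference depth of the object in mm (illustrative)

        // Object points: X coordinate and depth offset from Z0, in mm. The offsets
        // are small compared to Z0, which is exactly POSIT's working assumption.
        const double X[3]  = { -35.0, 35.0, 10.0 };  // e.g. two pupils and a nose point
        const double dZ[3] = {   5.0, -5.0, 15.0 };

        for (int i = 0; i < 3; ++i) {
            double persp = f * X[i] / (Z0 + dZ[i]);  // true perspective projection
            double sop   = f * X[i] / Z0;            // scaled orthographic projection
            std::printf("point %d: perspective %7.2f px, SOP %7.2f px, error %5.2f px\n",
                        i, persp, sop, sop - persp);
        }
        return 0;
    }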

4.4 Eyecorner Detection

When using isocenters to deduce the image points corresponding to the pupils, the pupils are the most strongly present amongst the different isocenters, but the eye corners are also usually found. Hence, using the same algorithm described in section 4.2, it is possible to find several candidate points for the eye corners. Below, I will outline how it is possible to distinguish between the eye corners and other isocenters. This can be done using some geometrical constraints, by aggregating data from several frames, and by tracking the actual eye corner once it has been found.

4.4.1 Geometrical Constraints

To distinguish the actual eye corners from the other isocenters, some simple geometrical constraints are used. A formula representing the line segment connecting the two pupils is deduced, and the y coordinate of the eye corner is limited to being close to this line (where "close" is defined relative to the size of the image and the vertical distance between the two pupils). The x coordinate is limited to being at most half the distance between the two pupils away from the nearest pupil. This provides an effective window, no matter where the eyes are looking. For an illustration of these restrictions, please see figure 4.

Figure 4: Illustration of the geometrical constraints (blue) on the isocenters that were recognized as eye corners (green). The pupils are shown in red, and the red square is the face detected by the boosted cascade classifier as described in section 4.1.

Furthermore, the distance between the two eye corners of each eye needs to be approximately the same. Ideally, the latter constraint would take the rotation of the face into account, and set tighter boundaries on the sizes of the eyes (the distances between their corners), but that subtlety has not been represented in the approach taken here due to the relative inaccuracy of the rotation and translation information found using POSIT (please also see section 6 for a more complete discussion of the weaknesses of this system). Because of this inaccuracy, it was not deemed useful to try to represent the rotation in the scale of the eyes: it would most likely lead to a loss in accuracy in the eye corner detection due to the rotation often being incorrect.
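The sketch below expresses these constraints as a simple filter over candidate isocenters. It is only an illustration of the idea: the function name and the vertical tolerance parameter are mine, and the thesis implementation defines "close to the pupil line" relative to the image size and the vertical distance between the pupils rather than as a single fixed number.

    #include <opencv2/core.hpp>
    #include <algorithm>
    #include <cmath>

    // A candidate isocenter is a plausible eye corner only if it lies close to the
    // line through the two pupils and within half the inter-pupil distance of the
    // nearest pupil (section 4.4.1).
    bool isPlausibleEyeCorner(const cv::Point2f &cand,
                              const cv::Point2f &leftPupil,
                              const cv::Point2f &rightPupil,
                              float verticalTolerance) {
        cv::Point2f d = rightPupil - leftPupil;
        float interPupil = std::sqrt(d.x * d.x + d.y * d.y);
        if (interPupil <= 0.0f) return false;

        // Perpendicular distance from the candidate to the line through the pupils.
        float lineDist = std::fabs(d.y * (cand.x - leftPupil.x) -
                                   d.x * (cand.y - leftPupil.y)) / interPupil;
        if (lineDist > verticalTolerance) return false;

        // Horizontal constraint: within half the inter-pupil distance of a pupil.
        float dl = std::fabs(cand.x - leftPupil.x);
        float dr = std::fabs(cand.x - rightPupil.x);
        return std::min(dl, dr) <= 0.5f * interPupil;
    }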

Figure 5: Occasionally, isocenters are found in the eyebrows and other odd locations.

4.4.2 Aggregating over multiple frames

Using the constraints outlined above, several isocenters remain as valid eye corner options. To eliminate more false positives, the isocenters found over several frames are compared, and only those are retained which remain consistent across several frames. It is important to apply some leniency in comparing isocenters across frames: they need not be exactly the same, due to image noise and slight head adjustments. In practice, I found that looking for an isocenter that was within the geometrical constraints, and which reoccurred within a continuous sequence of 5 frames, was an effective approach. Here, a reoccurrence was deemed to exist if an isocenter was found less than 5 pixels away from the first frame's isocenter (approximately 0.78% of the image's width). This approach removes almost all of the incidental isocenters occurring in odd locations such as the eyebrows or cheeks, some examples of which can be seen in figure 5. However, it has the additional disadvantage of losing track of the eye corners when the face changes position, or if an eye corner does not appear as an isocenter all the time. In order to mitigate this, tracking of the eye corner features is used.
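As a sketch of this aggregation step, the small helper class below keeps the candidates from the last five frames and accepts a corner only if a candidate reoccurred within five pixels of it in every stored frame. The class and its interface are illustrative; the thesis implementation is organized differently.

    #include <opencv2/core.hpp>
    #include <cmath>
    #include <deque>
    #include <vector>

    // Keep the isocenter candidates of the last `windowSize` frames and accept a
    // corner only if a nearby candidate occurred in each of them (section 4.4.2).
    class CornerAggregator {
    public:
        CornerAggregator(int windowSize = 5, float tolerancePx = 5.0f)
            : windowSize_(windowSize), tolerancePx_(tolerancePx) {}

        // Call once per frame with all isocenter candidates found in that frame.
        void addFrame(const std::vector<cv::Point2f> &candidates) {
            history_.push_back(candidates);
            if ((int)history_.size() > windowSize_) history_.pop_front();
        }

        // True if `corner` has a candidate within tolerance in every stored frame.
        bool isStable(const cv::Point2f &corner) const {
            if ((int)history_.size() < windowSize_) return false;
            for (const auto &frame : history_) {
                bool found = false;
                for (const auto &c : frame)
                    if (std::hypot(c.x - corner.x, c.y - corner.y) <= tolerancePx_) {
                        found = true;
                        break;
                    }
                if (!found) return false;
            }
            return true;
        }

    private:
        int windowSize_;
        float tolerancePx_;
        std::deque<std::vector<cv::Point2f>> history_;
    };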

4.4.3 Tracking using Lucas Kanade optical flow estimation

The tracking is done using the Lucas Kanade method for optical flow estimation. A full discussion of this algorithm is outside the scope of this thesis, but I will try to explain the basic procedure in this section. For a full discussion, I refer the reader to the seminal paper by Lucas and Kanade, 1981 [13].

The Lucas Kanade method, like other optical flow estimation algorithms, tries to deduce the movement that occurred between two images. Suppose, for simplicity, that we are considering a one-dimensional image F that was translated by some h to produce image G. In order to find h, the algorithm takes the derivative of F. It then assumes that F can be approximated linearly for reasonably small h, so that:

G(x) = F(x + h) ≈ F(x) + h F'(x)

We can formulate the difference between F(x + h) and G over the entire curve as an L2 norm:

E = Σ_x [F(x + h) - G(x)]²

Then, to find the h which minimizes this difference norm, we can set:

0 = ∂E/∂h ≈ ∂/∂h Σ_x [F(x) + h F'(x) - G(x)]² = Σ_x 2 F'(x) [F(x) + h F'(x) - G(x)]

Using the above, we can deduce:

h ≈ ( Σ_x F'(x) [G(x) - F(x)] ) / ( Σ_x F'(x)² )

which can be implemented in an iterated fashion using:

h_0 = 0,
h_{k+1} = h_k + ( Σ_x F'(x + h_k) [G(x) - F(x + h_k)] ) / ( Σ_x F'(x + h_k)² )

Usually, a weighting function is used to account for the fact that at some points, the assumption that F is locally linear holds better than at others. A weighting function allows those points to play a larger part in determining h (and, conversely, gives less weight to points where the assumption of linearity, i.e. F''(x) being close to 0, does not hold). This approach can be generalized to multiple dimensions, using vectors for x and h, and a gradient operator rather than the derivative. The algorithm can also be generalized to take into account shear, rotation, and so on.

In this case, we use the algorithm to obtain the coordinates of the eye corners in some frame a, given the coordinates of the eye corners in an earlier frame b (obtained through the isocenter procedure with aggregation and geometrical limits), along with the image data of frames a and b. This allows the eye corner features to be tracked using the Lucas Kanade method until the approach outlined above finds the eye corners again once the head stabilizes.
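In the proof of concept, the tracking is needed only at the level of "given these corner points in the previous frame, where are they now". A sketch of that step using OpenCV's pyramidal Lucas Kanade implementation is shown below; it uses the current C++ API (the 2008 implementation used the C API of that era), and the window size and pyramid depth are illustrative choices.

    #include <opencv2/video/tracking.hpp>
    #include <opencv2/core.hpp>
    #include <vector>

    // Track previously found eye corners from `prevGray` to `currGray` using
    // pyramidal Lucas Kanade optical flow. Points whose status flag is 0 could
    // not be tracked and are dropped.
    std::vector<cv::Point2f> trackCorners(const cv::Mat &prevGray,
                                          const cv::Mat &currGray,
                                          const std::vector<cv::Point2f> &prevCorners) {
        std::vector<cv::Point2f> curr, tracked;
        if (prevCorners.empty()) return tracked;

        std::vector<unsigned char> status;
        std::vector<float> err;
        cv::calcOpticalFlowPyrLK(prevGray, currGray, prevCorners, curr,
                                 status, err, cv::Size(15, 15), 3);

        for (size_t i = 0; i < curr.size(); ++i)
            if (status[i]) tracked.push_back(curr[i]);
        return tracked;
    }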

4.5 Eye Gaze Determination

When the position of the pupil as well as the positions of the eye corners of at least one eye are all known, it is possible to deduce a gaze vector. This can be done using the distance between the pupil and the eye corners, and correlating this with the visual field of the average human.

Figure 6: Determining the relative position of the pupil (point D) compared to the eye corners (A and C).

In order to do this accurately even if the face (and hence the eye) is rotated, the difference vector from the pupil to the eye corner left of it in the image is split into a vector (AB) along the line (AC) that intersects both eye corners, and a vector (BD) perpendicular to this line (see figure 6). In other words, we project the pupil point from the 2D image coordinate space to a new 2D coordinate space, in which the line segment between the eye corners lies on the x axis, and the left eye corner is the origin. The size of these vectors is compared to the size of the eye, and a corresponding rotation is applied to a unit vector along the Z axis.

Formally, given a horizontal field of view of 114 degrees, centered around the vector that is orthogonal to the front of the face, the horizontal angle of the view α in degrees is given by:

α = 114 · (AB / AC) - 57

where AB is the distance from the pupil to the left eye corner along the line AC, and AC is the distance between the two eye corners (see figure 6).

Similarly, for the vertical angle of the gaze, a view of 90 degrees is used. This value is not necessarily very accurate, but there has been little to no research in this area, and no conclusive data was available. The vertical size of the eye is deemed to be roughly 2/5 of the horizontal size, with the gaze at the center of the view if the pupil is positioned on the line segment between the two eye corners. The same principle is applied, so the vertical angle β in degrees is given by:

β = 45 · DB / (0.4 · AC)

(where DB is the distance from the pupil (D) to the line through the eye corners (AC); DB is taken to be negative if the pupil is below the line segment connecting the two eye corners).

Using these rotations with the unit vector along the Z axis, we are able to determine a gaze vector g. An example of some transformation angles and the resulting vector is visualized in figure 7. Using the rotation angles α and β, it is possible to build two simple rotation matrices, and multiply the unit vector with them. For the rotation about the Y axis (the horizontal rotation) this is:

R_1 = [  cos α   0   sin α ]
      [    0     1     0   ]
      [ -sin α   0   cos α ]

And for the rotation about the X axis (the vertical rotation):

R_2 = [ 1     0        0   ]
      [ 0   cos β   -sin β ]
      [ 0   sin β    cos β ]

This then allows us to compute a gaze vector g from the simple unit vector v:

v = [0, 0, 1]^T

g = R_1 R_2 v

Using this gaze vector, we can determine where the user's gaze intersects with the screen.
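A small numeric sketch of this step is given below: it builds the gaze vector g = R_1 R_2 [0, 0, 1]^T directly from the two angles, with the two rotations written out component-wise. The angle values in main are illustrative.

    #include <cmath>
    #include <cstdio>

    struct Vec3 { double x, y, z; };

    // Build the gaze vector g = R1 * R2 * [0, 0, 1]^T from the horizontal angle
    // alpha (rotation about the Y axis) and vertical angle beta (about the X axis).
    Vec3 gazeVector(double alphaDeg, double betaDeg) {
        const double kDegToRad = 3.14159265358979323846 / 180.0;
        double a = alphaDeg * kDegToRad;
        double b = betaDeg * kDegToRad;

        // R2 * [0, 0, 1]^T = [0, -sin(beta), cos(beta)]^T
        Vec3 v = { 0.0, -std::sin(b), std::cos(b) };

        // Apply R1 (rotation about the Y axis by alpha).
        return Vec3{ std::cos(a) * v.x + std::sin(a) * v.z,
                     v.y,
                     -std::sin(a) * v.x + std::cos(a) * v.z };
    }

    int main() {
        Vec3 g = gazeVector(10.0, -5.0);   // illustrative angles in degrees
        std::printf("gaze vector: (%.3f, %.3f, %.3f)\n", g.x, g.y, g.z);
        return 0;
    }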

Figure 7: The transformation of the vector [0, 0, 1] to gaze vector g using angles α and β for rotation about the Y and X axes, respectively. Also compare with figure 2 to understand which 3D coordinate model is used.

4.6 Screen Intersect Location

Now that we have a vector in the face coordinate system, and know the transformation matrix to go from the face coordinate system to the camera coordinate system, we can use this to obtain the same vector in the camera coordinate system. We can then compute an intersection with the screen by determining the point at which the vector intersects the plane in which the screen lies (assuming that the screen is flat).

4.6.1 Ray-plane intersection

The following is a classic way of doing ray-plane intersection in 3D. It has been described extensively in the literature; for more background, please refer to [8]. Consider the plane in which our screen lies as defined by the classic plane equation:

0 = Ax + By + Cz + D    (2)

Here, A, B, and C are the components of the plane's unit normal. In our case, this unit normal is:

n = [0, 0, 1]^T    (3)

That is, we assume that the screen is positioned in such a way that the normal of its plane is the same as the camera Z axis. In other words, the camera, sitting at the origin, gazes along the Z axis, with the screen and the camera in the X-Y plane. This is not at all required, and arbitrary translations and rotations could be applied to the formulas given here. However, very many laptops these days come with a camera preinstalled on top of the screen, and therefore satisfy this precise criterion already. Because the plane we are interested in is the X-Y plane, we also know that the distance to the origin D is 0. From equations 2 and 3, this means our plane equation is the following:

0 = 0·x + 0·y + 1·z + 0 = z    (4)

This fits our intuition that a point is in the plane iff its z coordinate is zero. For the ray, we define an origin v_0 = [x_0, y_0, z_0] and a direction v_d = [x_d, y_d, z_d]. Now we can parametrize the ray as:

v(t) = v_0 + t·v_d    (5)

Substituting this into equation 2 produces:

0 = A(x_0 + t·x_d) + B(y_0 + t·y_d) + C(z_0 + t·z_d) + D
0 = Ax_0 + By_0 + Cz_0 + D + t(Ax_d + By_d + Cz_d)
-(Ax_0 + By_0 + Cz_0 + D) = t(Ax_d + By_d + Cz_d)
t = -(Ax_0 + By_0 + Cz_0 + D) / (Ax_d + By_d + Cz_d)

All these variables are known, so we can compute t, substitute it into v(t), and obtain an intersection point. In our specific case, the equation is actually much simpler, as substituting equation 5 into equation 4 produces:

0 = z_0 + t·z_d

which clearly saves a lot of tedious computation. Having obtained the intersection point (if any) of the ray with the plane, we need to assess where this point is on the screen.
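The general case translates directly into code. The sketch below implements the ray-plane intersection for an arbitrary plane (A, B, C, D) and, in main, uses the z = 0 screen plane from equation 4; the gaze origin and direction values are illustrative.

    #include <cstdio>

    struct Vec3 { double x, y, z; };

    // Intersect the ray v(t) = v0 + t * vd with the plane A*x + B*y + C*z + D = 0.
    // Returns false if the ray is parallel to the plane.
    bool intersectRayPlane(const Vec3 &v0, const Vec3 &vd,
                           double A, double B, double C, double D, Vec3 &hit) {
        double denom = A * vd.x + B * vd.y + C * vd.z;
        if (denom == 0.0) return false;                    // ray parallel to the plane
        double t = -(A * v0.x + B * v0.y + C * v0.z + D) / denom;
        hit = Vec3{ v0.x + t * vd.x, v0.y + t * vd.y, v0.z + t * vd.z };
        return true;
    }

    int main() {
        // Screen plane as in the text: the X-Y plane, i.e. A = B = D = 0, C = 1 (z = 0).
        Vec3 eye  = { 0.00, 0.10, 0.60 };    // gaze origin in metres (illustrative)
        Vec3 gaze = { 0.05, -0.02, -1.0 };   // gaze direction towards the screen
        Vec3 hit;
        if (intersectRayPlane(eye, gaze, 0, 0, 1, 0, hit))
            std::printf("intersection at (%.3f, %.3f, %.3f)\n", hit.x, hit.y, hit.z);
        return 0;
    }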

4.6.2 Point in rectangle

In our case, we only care about points in the rectangle that is the screen. So, we first test whether the point is inside this rectangle. In order to do this, we perform a naive projection that retains the topology of the rectangle by dropping one of the coordinates from our vectors (this approach to quickly project vectors to 2D is also outlined in [8]). We determine the dominant coordinate in the normal of the plane in which the rectangle lies. In the specific case outlined above, where the screen is in the viewing plane, that is, the camera Z axis is the normal of the screen plane, this is the z coordinate (recall that our normal was [0, 0, 1], from equation 4). This coordinate is then removed from all of our points. When dropping one of the coordinates, we end up with 2D coordinates for the four corners of the screen, c_0 ... c_3, where c_n = (c_nx, c_ny), and the point p = (p_x, p_y). In order to deduce whether the point is in this (arbitrarily rotated) rectangle, the following algorithm was used:

1. For each corner c_n, define:
   - c_m as the next point in clockwise order, so m = (n + 1) mod 4;
   - l_n as the line through c_n and c_m;
   - f_n as the linear formula describing the line l_n;
   - d_n as the vertical distance between the line l_n and p, or, if l_n is vertical, the horizontal distance.

   (Refer to figure 8 for a visual representation of this situation. Note that the screen orientation and proportions in the figure are not realistic: a normal computer screen would be wider than it is high. However, for the sake of the example, these dimensions are more convenient, because d_1 and/or d_3 would otherwise be overly long.)

2. Compute f_n(x):

   f_n(x) = ((c_ny - c_my) / (c_nx - c_mx)) · x + (c_ny - ((c_ny - c_my) / (c_nx - c_mx)) · c_nx)

3. For point p = (p_x, p_y), compute d_n = f_n(p_x) - p_y, or, if l_n is vertical, compute d_n = c_nx - p_x. This is the vertical (or, if l_n is vertical, horizontal) distance between point p and l_n. The sign tells whether p is above or below (or to the left or the right of) the line.

4. p is inside the rectangle if and only if the signs of d_0 and d_2 are opposite and the signs of d_1 and d_3 are opposite.

5. If any of d_0 ... d_3 are 0, the point is on that line, but not necessarily between the line's defining screen corners. To check for the latter property, check the other pair of values. If those two values have opposite signs, or one of them is 0, the point is inside the rectangle.

Figure 8: Diagram showing the similar triangles formed by l_0 ... l_3 (black), d_0 ... d_3 (blue and green), and the normals from l_0 ... l_3 to p (in red).
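A compact sketch of this test is given below. It follows steps 1-4 literally (signed vertical distance per edge, horizontal distance for vertical edges, opposite signs for opposite edges) and leaves out the boundary handling of step 5 for brevity; the corner order and the example point in main are illustrative.

    #include <cstdio>

    struct P2 { double x, y; };

    // Signed vertical distance between p and the line through edge (a, b); for a
    // vertical edge, the signed horizontal distance is used instead (step 3).
    double edgeDistance(const P2 &a, const P2 &b, const P2 &p) {
        if (a.x == b.x)
            return a.x - p.x;                            // vertical edge
        double slope = (a.y - b.y) / (a.x - b.x);
        double fp = slope * p.x + (a.y - slope * a.x);   // f_n(p_x)
        return fp - p.y;
    }

    // p is inside the rectangle iff the distances for opposite edges have opposite
    // signs (step 4). Boundary cases (step 5) are omitted for brevity.
    bool insideRectangle(const P2 corners[4], const P2 &p) {
        double d[4];
        for (int n = 0; n < 4; ++n)
            d[n] = edgeDistance(corners[n], corners[(n + 1) % 4], p);
        return ((d[0] > 0) != (d[2] > 0)) && ((d[1] > 0) != (d[3] > 0));
    }

    int main() {
        P2 screen[4] = { {0.0, 0.0}, {0.29, 0.0}, {0.29, 0.13}, {0.0, 0.13} };
        P2 p = { 0.10, 0.05 };
        std::printf("inside: %s\n", insideRectangle(screen, p) ? "yes" : "no");
        return 0;
    }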

4.6.3 Screen coordinate

Because the topology of the rectangle was retained, we can use basic geometry with d_0 ... d_3 to compute where the point is compared to the corners, and so to compute the screen coordinates. First, note that the screen coordinates are proportional to the normals from lines l_0 ... l_3 to p. Hence, we are interested in the proportions of these normals. Fortunately, the triangles formed by the lines l_0 ... l_3, the line segments corresponding to d_0 ... d_3, and the normals from lines l_0 ... l_3 to p are similar, because point p forms the intersection between these straight lines, and the opposite sides of the rectangle are parallel. Hence, the proportion of the different d_0 ... d_3 values is equivalent to that of the normals from l_0 ... l_3, which allows us to easily calculate the screen coordinates of the point p. This concept should be clear from figure 8.

4.7 Synthesis

I have now treated all the different parts needed to go from the webcam image to a position on the screen. I will recap briefly to explain how the different parts fit together. The steps to go from the webcam image to the screen coordinate are as follows:

1. Detect the face (subsection 4.1). This returns the 2D image coordinates of the face.
2. Detect the pupils (subsection 4.2) in the area of the face where the eyes should be found, using the coordinates obtained in the previous step. This returns the 2D image coordinates of the pupils.
3. Using the position of the eyes and the center of the face, compute the location of the face (subsection 4.3). This returns a transformation from the 3D model coordinate system of the face to the 3D camera coordinate system of the webcam, composed of a rotation matrix and a translation vector.
4. Using residual data from locating the pupils, determine the location of the eye corners (subsection 4.4). This returns the 2D image coordinates of the eye corners.
5. With the information about both the pupils and the eye corners, compute the gaze of the eye in terms of the model of the face (subsection 4.5), and convert the vector coordinates to the coordinate system of the camera rather than the model of the face. This returns a 3D vector that represents the user's gaze in the camera coordinate system.
6. Calculate the intersection point of the user's gaze found in the previous step with the screen, and its coordinates in the screen coordinate system (subsection 4.6).

In the next section, I will detail how these steps were implemented in my proof of concept application.

5 Implementation

I have implemented the method outlined in section 4 using C++ and the OpenCV computer vision library [1]. In order to show the screen coordinate at which the user is looking, I have also used the Macintosh-specific Carbon API, so as to obtain a transparent overlay window [11] as big as the screen, on which the application draws a clearly visible red dot. The current implementation is therefore Mac-specific; however, it would take very little effort to port the Mac-specific code to the Windows platform. All the OpenCV code, and the actual algorithms and synthesis of the data described in section 4, should be cross-platform.

5.1 Limitations

The current implementation has some limitations in terms of generalizability, beyond the platform-specificity and apart from the problems found with the approach implemented (which are discussed in section 6). For one, the implementation currently hardcodes the size of the screen to be 29 by 13 centimeters, and the camera view axis to be perpendicular to this, with the camera centered at 1 centimeter above the screen.

These distances correspond to those found in the Apple Macbook laptop. They would need to be adjusted for other laptops or setups.

Another limitation is the fact that the focal length is currently hardcoded. It would be better to automatically deduce this parameter using some form of calibration. However, this was not the focus of this research, and no time has been invested in doing this. Because the Apple Macbook has a fixed-focus camera, hardcoding this value was not deemed to be a problem in normal usage and testing of the application.

Finally, the size of the human face was hardcoded to values which were found experimentally using measurements of the author's face. Clearly, these would need to be adjusted to well-referenced averages as found by studies of human anatomy. Unfortunately, the author has, despite serious effort, been unable to locate such well-referenced averages.

5.2 Efficiency

The current proof of concept was not written primarily with efficiency in mind. Hence, its performance could easily be improved upon. However, in the current setup, processing one frame takes approximately 130 milliseconds on average, using a webcam image with a 640 by 480 pixel resolution, on an Apple Macbook with a 2.0 GHz Intel Core 2 Duo processor (but with a single-core implementation). This boils down to approximately 7.7 frames per second. While this is definitely not stunning, it is not as unreasonable as it might have been, considering the number of different algorithms and tasks implemented and in use.

6 Discussion

Unfortunately, the implementation does not perform as well as might have been hoped. There are several different problems that interfere with the accuracy of the eye tracker, each of which I will consider in turn. The different problems, when singled out, are not always major, but the combination of all of them means that in its current state, the eye tracker cannot be used for serious applications.

6.1 Face localization in 3D

The most obvious and visible problem is that of doing face localization in 3D; that is to say, using POSIT to obtain a transformation matrix from the original face model to the real world. The implementation uses the pupils and the center of the face in order to do this. In practice, there are several problems with this approach:

- The center of the detected face shifts as the face turns. This means that it is actually not possible to make the point track an actual feature on the face, such as the tip of the nose. While it may correspond to this feature in one face position, it will no longer do so when the face turns even a few degrees.
- The face detector does not work when turning too far to the left or right. This is inherent in using this boosted cascade classifier, and therefore unavoidable when using that approach, but the effects were stronger than anticipated. Depending on the direction from which the largest amount of light originates, it is sometimes not possible to turn more than approximately 10 degrees in the opposite direction, as the side of the face being turned away is also in shadow, and the face is therefore no longer recognized.
- Both the center of the face and the pupils move irrespective of actual facial movement. The center of the face moves because the rectangle indicated as the face by the face detector shifts a few pixels every frame, even if the actual user's face remains quite still. The pupils move, of course, due to eye movement, but also because of noise in the webcam image influencing the isocenter detector. The combination of these movements means that the assumption that these three points can be treated as a rigid body is violated. As a result, the POSIT estimation of the pose also shifts very frequently.
- The POSIT pose estimation is unreliable. The cause of this lies partially in the previous point, but quite apart from that, the three points given to it seem to be insufficient for it to make a reliable estimate of the rotation of the face in particular. While the translation vector it deduces is usually reasonably correct, the rotation is not. It is not known exactly what causes this problem. It may be a problem in the OpenCV implementation, but this seems unlikely given its ubiquitous usage.

6.2 Eye corner detection

Another problem is that of locating the eye corners. For one thing, the theory behind this is unclear: to date, there has been no decisive explanation as to why the eye corners are present as isocenters. Several possible reasons include the curvature of the face around the eyes, the curvature of the opposite side of the pupil, and the shape of the tear glands. The latter especially might have an effect, given that the inner eye corners are detected more often than the outer ones.

Regardless of how it works, the fact remains that this approach also introduces a problem: the eye corners that are detected are often already a few pixels off at the moment they are detected as isocenters. This is not very significant when attempting to determine the horizontal position of the pupil, but for the vertical position it makes much more of a difference, as the visible area of the eye in the vertical direction is simply much smaller, and the effect of these few pixels is all the more noticeable.

The next problem is that the Lucas Kanade tracking is prone to letting the corners glide along the bottom of the eyes, especially when they were not entirely in the eye's corner to begin with. This can easily be explained by the fact that the window used by the tracking is so small that, when the point it is tracking is not exactly in the corner, it will only pick up the difference between the eye and the bottom eyelid as the defining characteristic of that feature. This characteristic occurs in roughly the same way along the entire bottom edge of the eye. A similar thing happens if the corner is initially detected a little bit outside of the eye, in which case the tracking moves it along the side of the head (until it passes the geometrical constraint boundary, at which point it is reset).

6.3 Pupil detection

The pupil detection, too, sometimes produces wrong results, selecting one of the eye's corners instead of the pupil as the most prominent isocenter. This leads to a wrong estimate of the location of the pupil, which massively throws off the gaze direction towards one of the two corners of the eye. Clearly, this should not be allowed to happen.

7 Proposed Improvements

In order to fix the problems outlined in the previous section, several solutions are proposed here. Some of the problems might also be solved by using camera equipment with a higher resolution, or stereo vision, but the first brings extra monetary costs, and the second has been treated extensively in the literature (e.g.
[12, 14]), and in addition poses new problems related to calibration and correlation of the two images. Disregarding these options, however, there are still several ways in which the current result may be improved.

7.1 Alternative face localization

In order to improve the localization of the face, several steps may be taken:

- Use a different pose estimation algorithm. POSIT is not the only algorithm available for 3D reconstruction, and the origins of POSIT date back to 1989 [4]. By now, various refinements and alternatives have been proposed, e.g. [21, 3, 24].
- Use a more stable feature set, such as the average of the two eye corners for each eye, and the average over the last 3 frames for the nose. Using these averages, the points fed into POSIT will vary less, which should help stabilize the rotation matrix found by POSIT.
- Use POSIT with more features, such as the mouth or the ears (if available). This would allow a more robust estimation of the rotation by POSIT, which would also help the accuracy of the transformation matrix obtained through POSIT.
- Use a different classifier for the face position in 2D that is not as noisy as the boosted cascade classifier. There are many possible ways of doing this (e.g. [18, 22]). Using something other than the boosted cascade classifier may prove profitable in terms of stability, though care must be taken to retain the speed of the current implementation.

7.2 Alternative eye corner detection and improvements in pupil detection

In order to reliably detect eye corners, several other well-known options are available instead of the makeshift isocenter approach used for this thesis, e.g. [25]. Additionally, some methods of face detection use actual face models, from which it would be possible to infer the eye corner positions as well. Alternatively, because we know the position of at least one point in the eye (namely the pupil, found using the isocenter approach), it would be possible to use more naive (and faster) methods to infer the corners of the eyes from there, such as edge detection in the area between the two pupils.

Finally, it may be possible to use a more geometrically oriented approach to finding the right isocenters corresponding to the corners and the pupil. Because we know the relative ordering of the three centers, and can estimate the distance between the two eye corners from the size of the detected face as well as from data from previous frames, it may be possible to use these to score different combinations of isocenters more elaborately, and select the optimal combination. This approach may improve over the current situation because it combines the finding of the different points, ensuring a little more consistency in the data.

8 Conclusion

I have outlined and implemented a method to do eye tracking using just an ordinary webcam. The approach uses a boosted cascade classifier for face detection, and isocenters for locating the pupils and eye corners. The pupil and face data are combined to do 3D reconstruction and obtain a transformation matrix from the face model to the camera model. The pupil and eye corner data are combined to obtain a gaze direction, which is transformed using the aforementioned matrix so as to obtain a vector for the user's gaze in the camera model. Using this vector, it is possible to calculate an intersection point on the screen, and display this point to the user. This method proved to suffer from various problems, including severe problems in determining the 3D rotation of the user's face, and in accurately localizing the eye corners and pupils.
Several suggestions to resolve these issues have been proposed, such as using alternative algorithms for 3D reconstruction, using more data points, or using alternative methods to process the isocenter data so as to obtain more accurate locations for the pupils and eye corners.

Although the aim of obtaining a fully functional eye tracker using only a webcam was not achieved within the timeline of this thesis, promising steps have been made in the development of such a system. Using the suggestions for improvement outlined in the previous section, I am confident that it would in fact be possible to do eye tracking using just a webcam.

References

[1] G. Bradski. The OpenCV Library. Dr. Dobb's Journal, November.
[2] R. H. S. Carpenter. Movements of the eye. Pion, London.
[3] P. David, D. DeMenthon, R. Duraiswami, and H. Samet. Simultaneous pose and correspondence determination using line features. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, volume 2.
[4] D. F. DeMenthon and L. S. Davis. New exact and approximate solutions of the three-point perspective problem. University of Maryland Tech Notes, October.
[5] D. F. DeMenthon and L. S. Davis. Model-based object pose in 25 lines of code. International Journal of Computer Vision, 15.
[6] Frédéric Devernay and Olivier Faugeras. Straight lines have to be straight. Machine Vision and Applications, 13:14-24.
[7] Andrew T. Duchowski. Eye Tracking Methodology. Springer, second edition.
[8] A. S. Glassner. An Introduction to Ray Tracing. Morgan Kaufmann.
[9] J. Heikkila and O. Silven. A four-step camera calibration procedure with implicit image correction. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition.
[10] Ian P. Howard and Brian J. Rogers. Binocular Vision and Stereopsis. Oxford University Press, USA.
[11] Apple Inc. Using overlay windows. In Quartz Programming Guide for QuickDraw Developers, chapter 7. Apple Inc.
[12] Shinjiro Kawato and Nobuji Tetsutani. Detection and tracking of eyes for gaze-camera control. Image and Vision Computing, 22, October.
[13] Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of Imaging Understanding Workshop.
[14] Y. Matsumoto and A. Zelinsky. An algorithm for real-time stereo vision implementation of head pose and gaze direction measurement. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, page 499.
[15] C. H. Morimoto, D. Koons, A. Amir, and M. Flickner. Pupil detection and tracking using multiple light sources. Image and Vision Computing, 18, March.
[16] Takehiko Ohno and Naoki Mukawa. A free-head, simple calibration, gaze tracking system that enables gaze-based interaction. In Proceedings of the 2004 symposium on Eye tracking research & applications, San Antonio, Texas. ACM.


Face Model Fitting on Low Resolution Images Face Model Fitting on Low Resolution Images Xiaoming Liu Peter H. Tu Frederick W. Wheeler Visualization and Computer Vision Lab General Electric Global Research Center Niskayuna, NY, 1239, USA {liux,tu,wheeler}@research.ge.com

More information

In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data.

In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data. MATHEMATICS: THE LEVEL DESCRIPTIONS In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data. Attainment target

More information

Wii Remote Calibration Using the Sensor Bar

Wii Remote Calibration Using the Sensor Bar Wii Remote Calibration Using the Sensor Bar Alparslan Yildiz Abdullah Akay Yusuf Sinan Akgul GIT Vision Lab - http://vision.gyte.edu.tr Gebze Institute of Technology Kocaeli, Turkey {yildiz, akay, akgul}@bilmuh.gyte.edu.tr

More information

Robot Perception Continued

Robot Perception Continued Robot Perception Continued 1 Visual Perception Visual Odometry Reconstruction Recognition CS 685 11 Range Sensing strategies Active range sensors Ultrasound Laser range sensor Slides adopted from Siegwart

More information

We can display an object on a monitor screen in three different computer-model forms: Wireframe model Surface Model Solid model

We can display an object on a monitor screen in three different computer-model forms: Wireframe model Surface Model Solid model CHAPTER 4 CURVES 4.1 Introduction In order to understand the significance of curves, we should look into the types of model representations that are used in geometric modeling. Curves play a very significant

More information

A PHOTOGRAMMETRIC APPRAOCH FOR AUTOMATIC TRAFFIC ASSESSMENT USING CONVENTIONAL CCTV CAMERA

A PHOTOGRAMMETRIC APPRAOCH FOR AUTOMATIC TRAFFIC ASSESSMENT USING CONVENTIONAL CCTV CAMERA A PHOTOGRAMMETRIC APPRAOCH FOR AUTOMATIC TRAFFIC ASSESSMENT USING CONVENTIONAL CCTV CAMERA N. Zarrinpanjeh a, F. Dadrassjavan b, H. Fattahi c * a Islamic Azad University of Qazvin - nzarrin@qiau.ac.ir

More information

For example, estimate the population of the United States as 3 times 10⁸ and the

For example, estimate the population of the United States as 3 times 10⁸ and the CCSS: Mathematics The Number System CCSS: Grade 8 8.NS.A. Know that there are numbers that are not rational, and approximate them by rational numbers. 8.NS.A.1. Understand informally that every number

More information

NEW MEXICO Grade 6 MATHEMATICS STANDARDS

NEW MEXICO Grade 6 MATHEMATICS STANDARDS PROCESS STANDARDS To help New Mexico students achieve the Content Standards enumerated below, teachers are encouraged to base instruction on the following Process Standards: Problem Solving Build new mathematical

More information

L 2 : x = s + 1, y = s, z = 4s + 4. 3. Suppose that C has coordinates (x, y, z). Then from the vector equality AC = BD, one has

L 2 : x = s + 1, y = s, z = 4s + 4. 3. Suppose that C has coordinates (x, y, z). Then from the vector equality AC = BD, one has The line L through the points A and B is parallel to the vector AB = 3, 2, and has parametric equations x = 3t + 2, y = 2t +, z = t Therefore, the intersection point of the line with the plane should satisfy:

More information

The Olympus stereology system. The Computer Assisted Stereological Toolbox

The Olympus stereology system. The Computer Assisted Stereological Toolbox The Olympus stereology system The Computer Assisted Stereological Toolbox CAST is a Computer Assisted Stereological Toolbox for PCs running Microsoft Windows TM. CAST is an interactive, user-friendly,

More information

A QUICK GUIDE TO THE FORMULAS OF MULTIVARIABLE CALCULUS

A QUICK GUIDE TO THE FORMULAS OF MULTIVARIABLE CALCULUS A QUIK GUIDE TO THE FOMULAS OF MULTIVAIABLE ALULUS ontents 1. Analytic Geometry 2 1.1. Definition of a Vector 2 1.2. Scalar Product 2 1.3. Properties of the Scalar Product 2 1.4. Length and Unit Vectors

More information

Exam 1 Sample Question SOLUTIONS. y = 2x

Exam 1 Sample Question SOLUTIONS. y = 2x Exam Sample Question SOLUTIONS. Eliminate the parameter to find a Cartesian equation for the curve: x e t, y e t. SOLUTION: You might look at the coordinates and notice that If you don t see it, we can

More information

11.1. Objectives. Component Form of a Vector. Component Form of a Vector. Component Form of a Vector. Vectors and the Geometry of Space

11.1. Objectives. Component Form of a Vector. Component Form of a Vector. Component Form of a Vector. Vectors and the Geometry of Space 11 Vectors and the Geometry of Space 11.1 Vectors in the Plane Copyright Cengage Learning. All rights reserved. Copyright Cengage Learning. All rights reserved. 2 Objectives! Write the component form of

More information

CS 4204 Computer Graphics

CS 4204 Computer Graphics CS 4204 Computer Graphics 3D views and projection Adapted from notes by Yong Cao 1 Overview of 3D rendering Modeling: *Define object in local coordinates *Place object in world coordinates (modeling transformation)

More information

Jiří Matas. Hough Transform

Jiří Matas. Hough Transform Hough Transform Jiří Matas Center for Machine Perception Department of Cybernetics, Faculty of Electrical Engineering Czech Technical University, Prague Many slides thanks to Kristen Grauman and Bastian

More information

LINES AND PLANES CHRIS JOHNSON

LINES AND PLANES CHRIS JOHNSON LINES AND PLANES CHRIS JOHNSON Abstract. In this lecture we derive the equations for lines and planes living in 3-space, as well as define the angle between two non-parallel planes, and determine the distance

More information

Face Locating and Tracking for Human{Computer Interaction. Carnegie Mellon University. Pittsburgh, PA 15213

Face Locating and Tracking for Human{Computer Interaction. Carnegie Mellon University. Pittsburgh, PA 15213 Face Locating and Tracking for Human{Computer Interaction Martin Hunke Alex Waibel School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Abstract Eective Human-to-Human communication

More information

PASSIVE DRIVER GAZE TRACKING WITH ACTIVE APPEARANCE MODELS

PASSIVE DRIVER GAZE TRACKING WITH ACTIVE APPEARANCE MODELS PASSIVE DRIVER GAZE TRACKING WITH ACTIVE APPEARANCE MODELS Takahiro Ishikawa Research Laboratories, DENSO CORPORATION Nisshin, Aichi, Japan Tel: +81 (561) 75-1616, Fax: +81 (561) 75-1193 Email: tishika@rlab.denso.co.jp

More information

Chapter 19. General Matrices. An n m matrix is an array. a 11 a 12 a 1m a 21 a 22 a 2m A = a n1 a n2 a nm. The matrix A has n row vectors

Chapter 19. General Matrices. An n m matrix is an array. a 11 a 12 a 1m a 21 a 22 a 2m A = a n1 a n2 a nm. The matrix A has n row vectors Chapter 9. General Matrices An n m matrix is an array a a a m a a a m... = [a ij]. a n a n a nm The matrix A has n row vectors and m column vectors row i (A) = [a i, a i,..., a im ] R m a j a j a nj col

More information

Math 215 HW #6 Solutions

Math 215 HW #6 Solutions Math 5 HW #6 Solutions Problem 34 Show that x y is orthogonal to x + y if and only if x = y Proof First, suppose x y is orthogonal to x + y Then since x, y = y, x In other words, = x y, x + y = (x y) T

More information

Orthogonal Projections

Orthogonal Projections Orthogonal Projections and Reflections (with exercises) by D. Klain Version.. Corrections and comments are welcome! Orthogonal Projections Let X,..., X k be a family of linearly independent (column) vectors

More information

Section 1.4. Lines, Planes, and Hyperplanes. The Calculus of Functions of Several Variables

Section 1.4. Lines, Planes, and Hyperplanes. The Calculus of Functions of Several Variables The Calculus of Functions of Several Variables Section 1.4 Lines, Planes, Hyperplanes In this section we will add to our basic geometric understing of R n by studying lines planes. If we do this carefully,

More information

Study of the Human Eye Working Principle: An impressive high angular resolution system with simple array detectors

Study of the Human Eye Working Principle: An impressive high angular resolution system with simple array detectors Study of the Human Eye Working Principle: An impressive high angular resolution system with simple array detectors Diego Betancourt and Carlos del Río Antenna Group, Public University of Navarra, Campus

More information

Anamorphic Projection Photographic Techniques for setting up 3D Chalk Paintings

Anamorphic Projection Photographic Techniques for setting up 3D Chalk Paintings Anamorphic Projection Photographic Techniques for setting up 3D Chalk Paintings By Wayne and Cheryl Renshaw. Although it is centuries old, the art of street painting has been going through a resurgence.

More information

(a) We have x = 3 + 2t, y = 2 t, z = 6 so solving for t we get the symmetric equations. x 3 2. = 2 y, z = 6. t 2 2t + 1 = 0,

(a) We have x = 3 + 2t, y = 2 t, z = 6 so solving for t we get the symmetric equations. x 3 2. = 2 y, z = 6. t 2 2t + 1 = 0, Name: Solutions to Practice Final. Consider the line r(t) = 3 + t, t, 6. (a) Find symmetric equations for this line. (b) Find the point where the first line r(t) intersects the surface z = x + y. (a) We

More information

521493S Computer Graphics. Exercise 2 & course schedule change

521493S Computer Graphics. Exercise 2 & course schedule change 521493S Computer Graphics Exercise 2 & course schedule change Course Schedule Change Lecture from Wednesday 31th of March is moved to Tuesday 30th of March at 16-18 in TS128 Question 2.1 Given two nonparallel,

More information

How To Fuse A Point Cloud With A Laser And Image Data From A Pointcloud

How To Fuse A Point Cloud With A Laser And Image Data From A Pointcloud REAL TIME 3D FUSION OF IMAGERY AND MOBILE LIDAR Paul Mrstik, Vice President Technology Kresimir Kusevic, R&D Engineer Terrapoint Inc. 140-1 Antares Dr. Ottawa, Ontario K2E 8C4 Canada paul.mrstik@terrapoint.com

More information

Introduction. www.imagesystems.se

Introduction. www.imagesystems.se Product information Image Systems AB Main office: Ågatan 40, SE-582 22 Linköping Phone +46 13 200 100, fax +46 13 200 150 info@imagesystems.se, Introduction Motion is the world leading software for advanced

More information

Processing the Image or Can you Believe what you see? Light and Color for Nonscientists PHYS 1230

Processing the Image or Can you Believe what you see? Light and Color for Nonscientists PHYS 1230 Processing the Image or Can you Believe what you see? Light and Color for Nonscientists PHYS 1230 Optical Illusions http://www.michaelbach.de/ot/mot_mib/index.html Vision We construct images unconsciously

More information

Factoring Patterns in the Gaussian Plane

Factoring Patterns in the Gaussian Plane Factoring Patterns in the Gaussian Plane Steve Phelps Introduction This paper describes discoveries made at the Park City Mathematics Institute, 00, as well as some proofs. Before the summer I understood

More information

Introduction to Lensometry Gregory L. Stephens, O.D., Ph.D. College of Optometry, University of Houston 2010

Introduction to Lensometry Gregory L. Stephens, O.D., Ph.D. College of Optometry, University of Houston 2010 Introduction to Lensometry Gregory L. Stephens, O.D., Ph.D. College of Optometry, University of Houston 2010 I. Introduction The focimeter, lensmeter, or Lensometer is the standard instrument used to measure

More information

Tracking Moving Objects In Video Sequences Yiwei Wang, Robert E. Van Dyck, and John F. Doherty Department of Electrical Engineering The Pennsylvania State University University Park, PA16802 Abstract{Object

More information

Arrangements And Duality

Arrangements And Duality Arrangements And Duality 3.1 Introduction 3 Point configurations are tbe most basic structure we study in computational geometry. But what about configurations of more complicated shapes? For example,

More information

Copyright 2011 Casa Software Ltd. www.casaxps.com. Centre of Mass

Copyright 2011 Casa Software Ltd. www.casaxps.com. Centre of Mass Centre of Mass A central theme in mathematical modelling is that of reducing complex problems to simpler, and hopefully, equivalent problems for which mathematical analysis is possible. The concept of

More information

Epipolar Geometry. Readings: See Sections 10.1 and 15.6 of Forsyth and Ponce. Right Image. Left Image. e(p ) Epipolar Lines. e(q ) q R.

Epipolar Geometry. Readings: See Sections 10.1 and 15.6 of Forsyth and Ponce. Right Image. Left Image. e(p ) Epipolar Lines. e(q ) q R. Epipolar Geometry We consider two perspective images of a scene as taken from a stereo pair of cameras (or equivalently, assume the scene is rigid and imaged with a single camera from two different locations).

More information

One-Way Pseudo Transparent Display

One-Way Pseudo Transparent Display One-Way Pseudo Transparent Display Andy Wu GVU Center Georgia Institute of Technology TSRB, 85 5th St. NW Atlanta, GA 30332 andywu@gatech.edu Ali Mazalek GVU Center Georgia Institute of Technology TSRB,

More information

Polarization of Light

Polarization of Light Polarization of Light References Halliday/Resnick/Walker Fundamentals of Physics, Chapter 33, 7 th ed. Wiley 005 PASCO EX997A and EX999 guide sheets (written by Ann Hanks) weight Exercises and weights

More information

GeoGebra. 10 lessons. Gerrit Stols

GeoGebra. 10 lessons. Gerrit Stols GeoGebra in 10 lessons Gerrit Stols Acknowledgements GeoGebra is dynamic mathematics open source (free) software for learning and teaching mathematics in schools. It was developed by Markus Hohenwarter

More information

Solutions to old Exam 1 problems

Solutions to old Exam 1 problems Solutions to old Exam 1 problems Hi students! I am putting this old version of my review for the first midterm review, place and time to be announced. Check for updates on the web site as to which sections

More information

Ultra-High Resolution Digital Mosaics

Ultra-High Resolution Digital Mosaics Ultra-High Resolution Digital Mosaics J. Brian Caldwell, Ph.D. Introduction Digital photography has become a widely accepted alternative to conventional film photography for many applications ranging from

More information

Automatic Labeling of Lane Markings for Autonomous Vehicles

Automatic Labeling of Lane Markings for Autonomous Vehicles Automatic Labeling of Lane Markings for Autonomous Vehicles Jeffrey Kiske Stanford University 450 Serra Mall, Stanford, CA 94305 jkiske@stanford.edu 1. Introduction As autonomous vehicles become more popular,

More information

A Learning Based Method for Super-Resolution of Low Resolution Images

A Learning Based Method for Super-Resolution of Low Resolution Images A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 emre.ugur@ceng.metu.edu.tr Abstract The main objective of this project is the study of a learning based method

More information

Taking Inverse Graphics Seriously

Taking Inverse Graphics Seriously CSC2535: 2013 Advanced Machine Learning Taking Inverse Graphics Seriously Geoffrey Hinton Department of Computer Science University of Toronto The representation used by the neural nets that work best

More information

RESEARCH ON SPOKEN LANGUAGE PROCESSING Progress Report No. 29 (2008) Indiana University

RESEARCH ON SPOKEN LANGUAGE PROCESSING Progress Report No. 29 (2008) Indiana University RESEARCH ON SPOKEN LANGUAGE PROCESSING Progress Report No. 29 (2008) Indiana University A Software-Based System for Synchronizing and Preprocessing Eye Movement Data in Preparation for Analysis 1 Mohammad

More information

Instructions for Creating a Poster for Arts and Humanities Research Day Using PowerPoint

Instructions for Creating a Poster for Arts and Humanities Research Day Using PowerPoint Instructions for Creating a Poster for Arts and Humanities Research Day Using PowerPoint While it is, of course, possible to create a Research Day poster using a graphics editing programme such as Adobe

More information

2.2 Creaseness operator

2.2 Creaseness operator 2.2. Creaseness operator 31 2.2 Creaseness operator Antonio López, a member of our group, has studied for his PhD dissertation the differential operators described in this section [72]. He has compared

More information

Spatial location in 360 of reference points over an object by using stereo vision

Spatial location in 360 of reference points over an object by using stereo vision EDUCATION Revista Mexicana de Física E 59 (2013) 23 27 JANUARY JUNE 2013 Spatial location in 360 of reference points over an object by using stereo vision V. H. Flores a, A. Martínez a, J. A. Rayas a,

More information

Mean-Shift Tracking with Random Sampling

Mean-Shift Tracking with Random Sampling 1 Mean-Shift Tracking with Random Sampling Alex Po Leung, Shaogang Gong Department of Computer Science Queen Mary, University of London, London, E1 4NS Abstract In this work, boosting the efficiency of

More information

Solutions to Practice Problems

Solutions to Practice Problems Higher Geometry Final Exam Tues Dec 11, 5-7:30 pm Practice Problems (1) Know the following definitions, statements of theorems, properties from the notes: congruent, triangle, quadrilateral, isosceles

More information

Monitoring Head/Eye Motion for Driver Alertness with One Camera

Monitoring Head/Eye Motion for Driver Alertness with One Camera Monitoring Head/Eye Motion for Driver Alertness with One Camera Paul Smith, Mubarak Shah, and N. da Vitoria Lobo Computer Science, University of Central Florida, Orlando, FL 32816 rps43158,shah,niels @cs.ucf.edu

More information

Shape Measurement of a Sewer Pipe. Using a Mobile Robot with Computer Vision

Shape Measurement of a Sewer Pipe. Using a Mobile Robot with Computer Vision International Journal of Advanced Robotic Systems ARTICLE Shape Measurement of a Sewer Pipe Using a Mobile Robot with Computer Vision Regular Paper Kikuhito Kawasue 1,* and Takayuki Komatsu 1 1 Department

More information

Inner Product Spaces

Inner Product Spaces Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and

More information

9 Multiplication of Vectors: The Scalar or Dot Product

9 Multiplication of Vectors: The Scalar or Dot Product Arkansas Tech University MATH 934: Calculus III Dr. Marcel B Finan 9 Multiplication of Vectors: The Scalar or Dot Product Up to this point we have defined what vectors are and discussed basic notation

More information

A Short Introduction to Computer Graphics

A Short Introduction to Computer Graphics A Short Introduction to Computer Graphics Frédo Durand MIT Laboratory for Computer Science 1 Introduction Chapter I: Basics Although computer graphics is a vast field that encompasses almost any graphical

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Solving Geometric Problems with the Rotating Calipers *

Solving Geometric Problems with the Rotating Calipers * Solving Geometric Problems with the Rotating Calipers * Godfried Toussaint School of Computer Science McGill University Montreal, Quebec, Canada ABSTRACT Shamos [1] recently showed that the diameter of

More information

INTRODUCTION TO RENDERING TECHNIQUES

INTRODUCTION TO RENDERING TECHNIQUES INTRODUCTION TO RENDERING TECHNIQUES 22 Mar. 212 Yanir Kleiman What is 3D Graphics? Why 3D? Draw one frame at a time Model only once X 24 frames per second Color / texture only once 15, frames for a feature

More information

Geometry 1. Unit 3: Perpendicular and Parallel Lines

Geometry 1. Unit 3: Perpendicular and Parallel Lines Geometry 1 Unit 3: Perpendicular and Parallel Lines Geometry 1 Unit 3 3.1 Lines and Angles Lines and Angles Parallel Lines Parallel lines are lines that are coplanar and do not intersect. Some examples

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

Geometry of Vectors. 1 Cartesian Coordinates. Carlo Tomasi

Geometry of Vectors. 1 Cartesian Coordinates. Carlo Tomasi Geometry of Vectors Carlo Tomasi This note explores the geometric meaning of norm, inner product, orthogonality, and projection for vectors. For vectors in three-dimensional space, we also examine the

More information

Geometric Transformations

Geometric Transformations Geometric Transformations Definitions Def: f is a mapping (function) of a set A into a set B if for every element a of A there exists a unique element b of B that is paired with a; this pairing is denoted

More information

Build Panoramas on Android Phones

Build Panoramas on Android Phones Build Panoramas on Android Phones Tao Chu, Bowen Meng, Zixuan Wang Stanford University, Stanford CA Abstract The purpose of this work is to implement panorama stitching from a sequence of photos taken

More information

A HYBRID APPROACH FOR AUTOMATED AREA AGGREGATION

A HYBRID APPROACH FOR AUTOMATED AREA AGGREGATION A HYBRID APPROACH FOR AUTOMATED AREA AGGREGATION Zeshen Wang ESRI 380 NewYork Street Redlands CA 92373 Zwang@esri.com ABSTRACT Automated area aggregation, which is widely needed for mapping both natural

More information

Lecture 14: Section 3.3

Lecture 14: Section 3.3 Lecture 14: Section 3.3 Shuanglin Shao October 23, 2013 Definition. Two nonzero vectors u and v in R n are said to be orthogonal (or perpendicular) if u v = 0. We will also agree that the zero vector in

More information

TWO-DIMENSIONAL TRANSFORMATION

TWO-DIMENSIONAL TRANSFORMATION CHAPTER 2 TWO-DIMENSIONAL TRANSFORMATION 2.1 Introduction As stated earlier, Computer Aided Design consists of three components, namely, Design (Geometric Modeling), Analysis (FEA, etc), and Visualization

More information

Solution Guide III-C. 3D Vision. Building Vision for Business. MVTec Software GmbH

Solution Guide III-C. 3D Vision. Building Vision for Business. MVTec Software GmbH Solution Guide III-C 3D Vision MVTec Software GmbH Building Vision for Business Machine vision in 3D world coordinates, Version 10.0.4 All rights reserved. No part of this publication may be reproduced,

More information

Making Machines Understand Facial Motion & Expressions Like Humans Do

Making Machines Understand Facial Motion & Expressions Like Humans Do Making Machines Understand Facial Motion & Expressions Like Humans Do Ana C. Andrés del Valle & Jean-Luc Dugelay Multimedia Communications Dpt. Institut Eurécom 2229 route des Crêtes. BP 193. Sophia Antipolis.

More information

Section 8.8. 1. The given line has equations. x = 3 + t(13 3) = 3 + 10t, y = 2 + t(3 + 2) = 2 + 5t, z = 7 + t( 8 7) = 7 15t.

Section 8.8. 1. The given line has equations. x = 3 + t(13 3) = 3 + 10t, y = 2 + t(3 + 2) = 2 + 5t, z = 7 + t( 8 7) = 7 15t. . The given line has equations Section 8.8 x + t( ) + 0t, y + t( + ) + t, z 7 + t( 8 7) 7 t. The line meets the plane y 0 in the point (x, 0, z), where 0 + t, or t /. The corresponding values for x and

More information

a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2.

a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2. Chapter 1 LINEAR EQUATIONS 1.1 Introduction to linear equations A linear equation in n unknowns x 1, x,, x n is an equation of the form a 1 x 1 + a x + + a n x n = b, where a 1, a,..., a n, b are given

More information

Problem Set 5 Due: In class Thursday, Oct. 18 Late papers will be accepted until 1:00 PM Friday.

Problem Set 5 Due: In class Thursday, Oct. 18 Late papers will be accepted until 1:00 PM Friday. Math 312, Fall 2012 Jerry L. Kazdan Problem Set 5 Due: In class Thursday, Oct. 18 Late papers will be accepted until 1:00 PM Friday. In addition to the problems below, you should also know how to solve

More information

Field Application Note

Field Application Note Field Application Note Reverse Dial Indicator Alignment RDIA Mis-alignment can be the most usual cause for unacceptable operation and high vibration levels. New facilities or new equipment installations

More information

Vectors 2. The METRIC Project, Imperial College. Imperial College of Science Technology and Medicine, 1996.

Vectors 2. The METRIC Project, Imperial College. Imperial College of Science Technology and Medicine, 1996. Vectors 2 The METRIC Project, Imperial College. Imperial College of Science Technology and Medicine, 1996. Launch Mathematica. Type

More information

HANDS-FREE PC CONTROL CONTROLLING OF MOUSE CURSOR USING EYE MOVEMENT

HANDS-FREE PC CONTROL CONTROLLING OF MOUSE CURSOR USING EYE MOVEMENT International Journal of Scientific and Research Publications, Volume 2, Issue 4, April 2012 1 HANDS-FREE PC CONTROL CONTROLLING OF MOUSE CURSOR USING EYE MOVEMENT Akhil Gupta, Akash Rathi, Dr. Y. Radhika

More information

Section 2.4: Equations of Lines and Planes

Section 2.4: Equations of Lines and Planes Section.4: Equations of Lines and Planes An equation of three variable F (x, y, z) 0 is called an equation of a surface S if For instance, (x 1, y 1, z 1 ) S if and only if F (x 1, y 1, z 1 ) 0. x + y

More information