Computer Vision Techniques for Man-Machine Interaction

James L. Crowley
I.N.P. Grenoble
46 Ave Félix Viallet
38031 Grenoble, France
email: jlc@imag.fr

Abstract

Computer vision offers many new possibilities for making machines aware of and responsive to man. Manipulating objects is a natural means by which man operates on and shapes his environment. By tracking the hands of a person manipulating objects, computer vision allows any convenient object, including a finger, to be used as a computer input device. In the first part of this paper we describe experiments with techniques for watching the hands and recognizing gestures.

Vision of the face is an important aspect of human-to-human communication. Computer vision makes it possible to track and recognize faces, detect speech acts, estimate focus of attention and recognize emotions through facial expressions. When combined with real time image processing and active control of camera parameters, these techniques can greatly reduce the communications bandwidth required for video-phone and video-conference communications. In the second part of this paper we describe techniques for detecting, tracking, and interpreting faces.

By following the movements of humans within a room or building, computer vision makes it possible to automatically migrate communications and access to information technology. In the final section we mention work on detecting and tracking full body motion. All of these systems use relatively simple techniques to describe the appearance of people in images. Such techniques are easily programmed to operate in real time on widely available personal computers. Each of the techniques has been integrated into a continuously operating system using a reactive architecture.

1 Looking at people: perception for man-machine interaction

One of the effects of the continued exponential growth in available computing power has been an exponential decrease in the cost of hardware for real time computer vision.
This trend has been accelerated by the recent integration of image acquisition and processing hardware for multi-media applications in personal computers. Lowered cost has meant more widespread experimentation in real time computer vision, creating a rapid evolution in robustness and reliability and the development of architectures for integrated vision systems [Crowley et al 1994].

Man-machine interaction provides a fertile applications domain for this technological evolution. The barrier between physical objects (paper, pencils, calculators) and their electronic counterparts limits both the integration of computing into human tasks and the population willing to adapt to the required input devices. Computer vision, coupled with video projection using low cost devices, makes it possible for a human to use any convenient object, including a finger, as a digital input device. Computer vision can permit a machine to track, identify and watch the face of a user. This offers the possibility of reducing bandwidth for video-telephone applications, of following the attention of a user by tracking his fixation point, and of exploiting facial expression as an additional information channel between man and machine.

Traditional computer vision techniques have been oriented toward using contrast contours (edges) to describe polyhedral objects. This approach has proved fragile even for man-made
objects in a laboratory environment, and inappropriate for watching deformable non-polyhedral objects such as hands and faces. Thus man-machine interaction requires computer vision scientists to "go back to basics" and design techniques adapted to the problem. The following sections describe experiments with techniques for watching hands and faces.

2 Looking at hands: gesture as an input device

Human gesture serves three functional roles [Cadoz 94]: semiotic, ergotic, and epistemic. The semiotic function of gesture is to communicate meaningful information. The structure of a semiotic gesture is conventional and commonly results from shared cultural experience. The good-bye gesture, American Sign Language, the operational gestures used to guide airplanes on the ground, and even the vulgar finger each illustrate the semiotic function of gesture. The ergotic function of gesture is associated with the notion of work. It corresponds to the capacity of humans to manipulate the real world, to create artifacts, or to change the state of the environment by direct manipulation. Shaping pottery from clay and wiping dust result from ergotic gestures. The epistemic function of gesture allows humans to learn from the environment through tactile experience. By moving your hand over an object, you appreciate its structure, and you may discover the material it is made of, as well as other properties.

All three functions may be augmented using an instrument or tool. Examples include a handkerchief for the semiotic good-bye gesture, a turn-table for the ergotic gesture of shaping pottery, or a dedicated artifact to explore the world. In Human Computer Interaction, gesture has been primarily exploited for its ergotic function: typing on a keyboard, moving a mouse and clicking buttons.
The epistemic role of gesture has emerged effectively from pen computing and virtual reality: ergotic gestures applied to an electronic pen, to a data-glove or to a body-suit are transformed into meaningful expressions for the computer system. Special purpose interaction languages have been defined, typically 2-D pen gestures as in the Apple Newton, or 3-D hand gestures to navigate in virtual spaces or to control objects remotely. Mice, data-gloves, and body-suits are artificial add-ons that wire the user to the computer. They are not real end-user instruments (as a hammer would be), but convenient tricks for computer scientists to sense human gesture.

Computer vision can transform ordinary artifacts and even body parts into effective input devices. We are exploring the integration of appearance-based computer vision techniques to observe human gesture non-intrusively, in a fast and robust manner. As a working problem, we are studying such techniques in the context of a digital desk [Wellner et al 93]. A digital desk, illustrated in figure 1, combines a projected computer screen with a real physical desk. The projection is easily obtained using a liquid-crystal "datashow" panel on a standard overhead projector. A video-camera is set up to watch the workspace such that the surface of the projected image and the surface of the imaged area coincide. The transformation between the projected image and the observed image is a projection between two planes, and thus is easily described by an affine transformation.

The workspace is populated by a manipulating "hand" and a number of physical and virtual objects which can be manipulated. Both physical and virtual devices act as tools whose manipulation is a communication channel between the user and the computer. The identity of the object which is manipulated carries a strong semiotic message. The manner in which the object is manipulated provides both semiotic and ergotic information.
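The plane-to-plane mapping between the projected image and the observed image can be estimated off-line from a few projected calibration marks and their observed positions. The following sketch in Python with numpy is our own illustration of a least-squares affine fit, not the calibration procedure of the system described above; the function names are hypothetical:

```python
import numpy as np

def fit_affine(screen_pts, camera_pts):
    """Fit a 2-D affine map taking camera coordinates to screen coordinates.

    screen_pts, camera_pts: arrays of shape (N, 2), N >= 3 corresponding
    points (e.g. projected calibration marks and their observed positions).
    Returns a 2x3 matrix A such that screen ~= A @ [x, y, 1]^T.
    """
    camera_pts = np.asarray(camera_pts, dtype=float)
    screen_pts = np.asarray(screen_pts, dtype=float)
    n = camera_pts.shape[0]
    # Homogeneous design matrix: one row [x, y, 1] per observed point.
    X = np.hstack([camera_pts, np.ones((n, 1))])
    # Least-squares solve for the 6 affine parameters (2 rows of 3).
    B, *_ = np.linalg.lstsq(X, screen_pts, rcond=None)
    return B.T  # 2x3

def camera_to_screen(A, pt):
    """Map one camera point (x, y) into screen coordinates."""
    x, y = pt
    return A @ np.array([x, y, 1.0])
```

Three correspondences determine the affine map exactly; more correspondences average out localization error in the observed marks.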
Virtual objects are generated internally by the system and projected onto the workspace. Examples include cursors and other visual feedback symbols as well as shapes and words which may have meaning to the user. Virtual objects are easily created, but they lack the tactile (epistemic) feedback provided by physical objects. The vision system must be able to detect,
track and recognize the user's hands as well as the tools which he manipulates. Tools may be virtual objects, or any convenient physical object which has been previously presented to the system. The system must also be able to extract meaning from the way in which tools are manipulated.

Figure 1. A Digital Desk combines a video projector and a computer vision system to integrate manipulation of virtual and physical objects.

To recognize gestures, we describe the space of possible appearances which the hands and tools can present. This space is reduced by registering and normalizing a window around the hand and tool using multiple visual processes. The space is further reduced by compressing the set of all possible appearances to a subspace defined by its principal components. Registering and normalizing the images of the hand and objects requires tracking. We have experimented with a number of different approaches to tracking hands and recognizing their gestures, including the techniques described below for face tracking. These include color histogram matching, cross-correlation with images, and active contours. As with faces, by combining several techniques in such a way that they can be dynamically initiated and controlled, we arrive at a system which is both flexible and robust.

Figure 2. A photo of a hand on a desk (left), the probability of skin (middle), the largest connected component of the thresholded probability of skin (right).

A gesture can be defined as a combination of a trajectory in a physical work-space and a sequence of configurations of the hand and tool. The physical trajectory of a hand is the time-sampled sequence of positions of the center of the hand in the workspace, (x(t), y(t)). To express the sequence of hand and tool configurations as a trajectory requires defining a parametric representation for such configurations.
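One way to obtain such a parametric representation is to project each normalized image window onto the principal components of a training set of windows, diagonalizing the matrix of inner products between image pairs rather than the much larger pixel covariance. The following Python/numpy sketch is an illustration under our own simplifying assumptions; the function names and the `energy` threshold parameter are hypothetical:

```python
import numpy as np

def build_appearance_space(images, energy=0.95):
    """Build an appearance space from normalized image windows.

    images: array of shape (K, H, W) -- K registered, normalized windows.
    Returns (mean, basis, values): the average image, the M principal
    component images (M, H*W), and their normalized principal values.
    """
    K, H, W = images.shape
    X = images.reshape(K, H * W).astype(float)
    mean = X.mean(axis=0)
    X = X - mean
    # Eigenvectors of the K x K matrix of inner products between
    # image pairs, instead of the huge pixel-covariance matrix.
    G = X @ X.T
    vals, vecs = np.linalg.eigh(G)          # ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]  # descending
    vals = np.clip(vals, 0.0, None)
    norm_vals = vals / vals.sum()
    # Keep the M largest components whose normalized values sum to `energy`.
    M = int(np.searchsorted(np.cumsum(norm_vals), energy)) + 1
    basis = (X.T @ vecs[:, :M]).T           # back-project to pixel space
    basis /= np.linalg.norm(basis, axis=1, keepdims=True)
    return mean, basis, norm_vals[:M]

def project(window, mean, basis):
    """Express one normalized window as a point in the appearance space."""
    return basis @ (window.ravel().astype(float) - mean)
```

The sequence of vectors returned by `project` for successive windows is exactly the trajectory in appearance space discussed below.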
Such a representation is provided by the principal components of a large collection of possible appearances. Appearance based vision provides a technique to express the hand-tool configuration in terms of a multi-dimensional parameter vector. For each tool to be manipulated, an "appearance space" is built up from a large sample of images of the hand manipulating the tool, as for example in Figure 2. The appearance space contains one dimension for each pixel. In order to align the dimensions, the images are normalized in position and scale, which requires a reliable tracking process as described above. The possible set of appearances for a hand manipulating a tool is described as a manifold in the appearance space. The appearance space
can be reduced to a minimal subspace composed of an average image and a set of orthogonal basis images using principal components analysis (PCA), as shown in Figure 3.

Figure 3. A set of hand configurations.

The principal components of an appearance space are the eigenvectors of a square matrix which is formed by computing the inner product of all pairs of images in the set. The information provided by each principal component is given by the associated principal value. The principal values can be normalized so that their sum is 1. A convenient method to determine how many dimensions to retain is to choose the M largest components such that the sum of their normalized principal values is above a threshold, such as 0.95. The threshold determines the percentage of information conserved.

Figure 4. Principal components of the images from Figure 3.

Figure 5. A hand gesture as a trajectory in a space defined by two principal components.

Figure 5 gives an example of a gesture in the appearance space defined as a trajectory in principal components space. To create this figure, the 11 dimensions which represent the view space have been projected onto the first two principal components. The projection of each observed image is represented by a cross. The sequence of hand-tool configurations is described as a trajectory in appearance space. A gesture space may be defined by the concatenation of the trajectory in appearance space and the physical work space. Alternatively, in many cases, the trajectory in physical workspace may be
used to express ergotic parameters, while the trajectory in appearance space provides semiotic "meaning". Among the advantages of an appearance based approach using PCA is that it can be applied to deformable objects, such as hands and faces. We have found that trajectories in principal component space provide a simple and robust means for encoding hand gestures, facial expressions, and lip movements.

3. Looking at faces: making the machine aware of the user

Locating and normalizing a face is a process of tracking. A variety of methods for detecting and registering the position and scale of a face have been demonstrated in laboratory environments. However, each of these methods can fail in naturally occurring circumstances. A reliable tracking system for registration can be obtained by integrating and coordinating several complementary tracking processes [Crowley-Berard 97]. Integration and coordination are performed using a synchronous architecture in which a supervisor activates and controls visual processes in a cyclic manner [Crowley et al 1994]. In collaboration with France Telecom, we have developed a system that achieves robust face tracking by integrating a number of simple visual processes. Control of individual processes is made possible by the inclusion of a confidence factor accompanying each observation. Fusion of the results is made possible by the determination of an error estimate (a covariance matrix) for each observation. Composing a system from a redundant ensemble of processes permits the overall system to automatically adapt to a variety of operational circumstances.

The system initializes itself by detecting the blinking of eyes, as shown in Figure 6. Because eyes blink in unison, blinking provides a simple means to detect and locate the presence of a face in an image. Blink detection operates intermittently, succeeding for one image every 5 to 10 seconds.
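A blink detector of this kind can be sketched from image differencing: two small regions that change simultaneously, side by side at a plausible eye spacing, signal a blink. The following Python/numpy illustration uses our own thresholds and a simple flood-fill labelling of changed regions; it is a sketch of the idea, not the system's actual parameters:

```python
import numpy as np
from collections import deque

def changed_regions(prev, curr, thresh=30):
    """Bounding boxes of connected regions that differ between two frames."""
    diff = np.abs(curr.astype(int) - prev.astype(int)) > thresh
    seen = np.zeros_like(diff, dtype=bool)
    boxes = []
    h, w = diff.shape
    for i in range(h):
        for j in range(w):
            if diff[i, j] and not seen[i, j]:
                # BFS flood fill to collect one connected component.
                q = deque([(i, j)])
                seen[i, j] = True
                r0 = r1 = i
                c0 = c1 = j
                while q:
                    r, c = q.popleft()
                    r0, r1 = min(r0, r), max(r1, r)
                    c0, c1 = min(c0, c), max(c1, c)
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < h and 0 <= cc < w and diff[rr, cc] and not seen[rr, cc]:
                            seen[rr, cc] = True
                            q.append((rr, cc))
                boxes.append((r0, c0, r1, c1))
    return boxes

def detect_blink(prev, curr, max_eye_size=6, max_spacing=20):
    """Return the midpoint between two blink candidates, or None."""
    small = [b for b in changed_regions(prev, curr)
             if b[2] - b[0] < max_eye_size and b[3] - b[1] < max_eye_size]
    for a in small:
        for b in small:
            if a is b:
                continue
            # Roughly the same row, plausible horizontal eye spacing.
            if abs(a[0] - b[0]) < max_eye_size and 0 < abs(a[1] - b[1]) < max_spacing:
                return ((a[0] + b[0]) // 2, (a[1] + a[3] + b[1] + b[3]) // 4)
    return None
```

In practice the differencing would be run on low-resolution images, so the exhaustive scan remains cheap.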
However, when a blink is detected, detection provides a reliable and precise determination of the location and size of a face. Thus blink detection is ideally suited to initializing other means of tracking, such as color histogram detection of skin and correlation masks for tracking of the mouth and eyes. Blink detection also provides a simple means to confirm (or contradict) the detection of a face.

Figure 6. A blink face image with the low resolution difference image, bounding boxes of changed regions, and the detected face position.

Once a face has been detected, it is possible to follow it by a number of techniques. One such technique is to detect the pixels with the same color (hue and saturation) as skin. Such detection can be performed using color histogram matching. Color histograms have been used in image processing for decades, particularly for segmenting multi-spectral satellite images and medical images. Schiele and Waibel [Schiele-Waibel 95] have demonstrated that skin can be reliably detected by normalizing the color vector, dividing out the luminance component. A 2-D joint histogram of the luminance normalized color components (r, g) can be computed from a patch of an image known to be a sample of skin. For color components (R, G, B):
r = R / (R + G + B)      g = G / (R + G + B)

The histogram of normalized color gives the number of occurrences of each normalized color pair (r, g). This histogram must be periodically re-initialized to compensate for changes in ambient light, or for differences in skin color between users. Such a color sample is captured automatically whenever an eye blink has been detected with sufficient confidence. A normalized color histogram h(r, g), based on a sample of N pixels, gives the conditional probability of observing a color vector given that the pixel is an image of skin. Using Bayes rule, we convert this to the conditional probability of skin given the color vector, p(skin | C). This allows us to construct a probability image in which each pixel is replaced by the probability that it is the projection of skin, as shown in Figure 7. The center of gravity of the probability of skin gives the estimate of the position of the face. The bounding rectangle gives an estimate of size. A confidence factor is then computed by comparing the detected bounding box to an ideal width and height, using a Normal probability law. The average width and height and the covariance matrix are obtained from test observations selected by hand.

Figure 7a. Bounding box and center of gravity superimposed on an image of a person in front of a computer screen. Figure 7b. In this image, each pixel from image 7a has been replaced by the probability that the pixel is skin. Figure 7c. The bounding box of the face region and the center of gravity.

Detection of a face by color is fast and reliable, but not always precise. Detection by blinking is precise, but requires capturing an image pair during a blink. Correlation can be used to complement these two techniques and to hold the face centered in the image as the head moves. Energy normalized cross-correlation tracking can be shown to be optimal in the presence of additive Gaussian noise [Crowley-Martin 95].
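Returning briefly to the color process: the normalized-color histogram and the Bayes-rule conversion to a skin-probability image can be sketched as follows in Python/numpy. The bin count, the uniform color prior taken from a whole-image histogram, and the assumed prior p(skin) are our own illustrative simplifications:

```python
import numpy as np

BINS = 32  # illustrative quantization of the (r, g) plane

def rg_indices(image):
    """Quantized luminance-normalized (r, g) indices for an RGB image."""
    rgb = image.astype(float)
    s = rgb.sum(axis=2)
    s[s == 0] = 1.0  # avoid division by zero on black pixels
    r = rgb[..., 0] / s
    g = rgb[..., 1] / s
    ri = np.minimum((r * BINS).astype(int), BINS - 1)
    gi = np.minimum((g * BINS).astype(int), BINS - 1)
    return ri, gi

def skin_histogram(patch):
    """Normalized histogram h(r, g) from a patch of pixels: p(C | skin)."""
    ri, gi = rg_indices(patch)
    h = np.zeros((BINS, BINS))
    np.add.at(h, (ri.ravel(), gi.ravel()), 1.0)
    return h / h.sum()

def skin_probability(image, h_skin, h_all, p_skin=0.1):
    """Probability image p(skin | C) via Bayes rule.

    h_all is a histogram of all pixels, standing in for the color
    prior p(C); p_skin is an assumed prior that a pixel is skin.
    """
    ri, gi = rg_indices(image)
    p_c_given_skin = h_skin[ri, gi]
    p_c = np.maximum(h_all[ri, gi], 1e-9)
    return np.clip(p_c_given_skin * p_skin / p_c, 0.0, 1.0)
```

The skin patch would be grabbed from between the eyes whenever blink detection succeeds, which is also when the histogram is re-initialized.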
The dominant noise in the case of face detection is neither Gaussian nor additive. However, when assisted by other detection processes, correlation tracking provides a technique which is inexpensive, relatively reliable, and formally analyzable. Correlation tracking processes are initiated by blink detection. The template for correlation is taken from the estimated position of the eyes. The search region for each tracker is estimated from the expected speed of the user's movements measured in pixels per frame. This value can be kept quite small if the frame rate is kept high. Each reference template is a small neighborhood, W(m, n), of the image P(i, j) obtained during initialization just after blink detection. In subsequent images Pk, the reference template is compared to the image neighborhood at each position (i, j). This comparison can be made by an energy normalized inner product (NCC) or a sum of squared differences (SSD):

SSD(i, j) = Sum_{m=0}^{N} Sum_{n=0}^{N} ( Pk(i+m, j+n) - W(m, n) )^2

Comparison by SSD is observed to give slightly better results, but is more sensitive to changes in background lighting. This sensitivity is not a problem because the system continually re-initializes the template whenever a blink is detected.
The estimated position of the target is determined by finding the position (i, j) at which the SSD measure is closest to zero. The actual center position is determined by adding the half size of the mask to the corner position (i, j). By keeping the search region small, the system can operate at a processing rate of 25 Hz. Figure 8 shows a typical map of the SSD values obtained when a template for the eye is convolved with a face. The local image of SSD values is inverted to provide the confidence factor and covariance. The covariance of the detection is estimated from the second moment of the inverted SSD values. A sharp correlation peak gives a small covariance, while a broader peak gives a larger covariance. The confidence factor is estimated from the value of the minimum SSD. When this confidence measure drops below a threshold, the tracking process is halted or re-initialized.

Figure 8a. The correlation template is taken from the eye (detected by blink detection). Figure 8b. The eye is detected in later images by sum of squared differences (SSD). Figure 8c. Map of values from the sum of squared differences (white represents 0).

The perceptual processes of eye blink detection, color histogram matching, correlation tracking, and sound localization are complementary. Each process fails under different circumstances, and produces a different precision for a different computational cost. The fact that each process produces a confidence factor makes it possible to coordinate the different processes in order to maximize confidence and precision while minimizing computing cost. Figure 9 shows an example of the number of calls per second for the processes as the tracking moves through different states (labeled 1, 2 and 3).

Figure 9. Cycles per second, for each of three processes, as the supervisor steps through the 3 states. Square is blink detection, o is correlation, + is histogram.
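The SSD search described above can be sketched as follows in Python/numpy. The window extraction and the search-region bookkeeping are our own illustrative choices; a real tracker would size the search radius from the expected inter-frame motion, as discussed in the text:

```python
import numpy as np

def ssd_map(image, template):
    """Sum-of-squared-differences between a template and every
    template-sized window of `image` (exhaustive search)."""
    th, tw = template.shape
    ih, iw = image.shape
    out = np.empty((ih - th + 1, iw - tw + 1))
    t = template.astype(float)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            d = image[i:i+th, j:j+tw].astype(float) - t
            out[i, j] = np.sum(d * d)
    return out

def track(image, template, prev_pos, radius=8):
    """Find the template near prev_pos = (i, j); return the new corner
    position and the minimum SSD value (an inverse confidence)."""
    th, tw = template.shape
    i0 = max(prev_pos[0] - radius, 0)
    j0 = max(prev_pos[1] - radius, 0)
    i1 = min(prev_pos[0] + radius + th, image.shape[0])
    j1 = min(prev_pos[1] + radius + tw, image.shape[1])
    scores = ssd_map(image[i0:i1, j0:j1], template)
    k = np.argmin(scores)
    di, dj = np.unravel_index(k, scores.shape)
    return (i0 + di, j0 + dj), scores[di, dj]
```

As in the text, the target center is the returned corner plus half the template size, and the minimum SSD value can be thresholded to halt or re-initialize the process.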
Computer vision provides a means to greatly reduce the bandwidth requirements for video communications while enhancing ease of use, by permitting the system to keep the user centered in the image. Images of a talking face, transmitted for video-communications, are highly repetitive. In such images, the face undergoes a limited set of deformations as the subject speaks and gestures. These deformations can be captured to form an orthogonal "basis space" of images. Such a space permits each individual image to be coded and transmitted as a relatively small vector of coefficients. As few as 15 such coefficients
(coded as 60 bytes) can be sufficient for quite realistic reconstruction of a talking face, provided that the face is registered and normalized in position and size.

4. Looking at walking

Walking is a periodic movement. The boundary contour of a walking person deforms in a cyclic and predictable manner. This deformation can be described with respect to a vector of boundary points using principal components analysis (PCA). With this technique, the principal components of the covariance matrix of the vector of boundary points are used to define the axes of an orthogonal space. A rhythmic movement, such as a person walking, can be described as a closed contour in this space. This contour captures the sequence of configurations of boundary points typical of walking.

An example of a real time system for detecting and tracking walking humans using PCA of the contour points has been developed at the University of Leeds [Baumberg-Hogg 94]. This system begins processing by subtracting an average background image from each image. Subtraction brings out regions where individuals are moving in the scene. However, a region where an image differs from the background can result from many sources: a passing cloud, rain, a passing car, a dog or a walking person. Each such source can result in a very large number of possible forms, making it difficult to detect only walking figures. The Leeds system solves this problem by comparing the contour of a detected region to a model for a walking human using principal component analysis. The system isolates regions of a scene by subtracting each image from an average "background image". The subtracted image is then compared to a threshold to produce a binary image (composed of 1's and 0's). Objects which were not present in the background image appear as regions of pixels with the value 1. Such regions are described as a vector of boundary points.
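The background-subtraction front end just described can be sketched in a few lines of Python/numpy. The running-average update rule and the threshold value are our own illustrative choices, not those of the Leeds system:

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Maintain the average background image as a slow running mean."""
    return (1.0 - alpha) * background + alpha * frame.astype(float)

def foreground_mask(background, frame, thresh=25):
    """Binary image (1's and 0's) of pixels differing from the background."""
    diff = np.abs(frame.astype(float) - background)
    return (diff > thresh).astype(np.uint8)
```

The connected regions of 1's in the mask are then traced to obtain the vector of boundary points that feeds the PCA contour model.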
An example of the reconstructed contour superimposed on the images of a person walking is shown in figure 10.

Figure 10. Reconstructed contour superimposed on an image of a person walking. The contour is reconstructed as a weighted sum of principal components (figures provided by D. Hogg, U. Leeds).

Any rhythmic movement, such as walking, running, sitting or crawling, can be described by the projection of sequences of boundary point sets into a principal components space. Although the mathematical foundations of the technique are somewhat unusual, the technique itself is easily implemented on a standard workstation or personal computer and can operate in real time if the computer is equipped with a sufficiently fast image acquisition card.

5 Conclusions

For many years, applications of computer vision and image processing have been held back by the high cost and limited availability of hardware with sufficient digitizing bandwidth and computing power. In addition, restrictions on computing power led some researchers to employ edge-based techniques which ultimately proved fragile and unreliable. This situation is now changing dramatically with the widespread introduction of computer workstations, and even
personal computers, equipped with on-board image digitizers and with sufficient computing power for real time processing. This has made possible the use of appearance based approaches to real time computer vision [Murase-Nayar 95], [Turk-Pentland 91]. Appearance based computer vision is expensive in memory and computing power. However, these techniques rest on a solid theoretical foundation which makes it possible to predict failure rates and failure conditions. Furthermore, these techniques tend to have constant or linear computational complexity, which makes it possible to employ them in real time systems. The continued exponential growth in computer power has now reached the levels necessary to make the use of such techniques inexpensive.

Research in detecting, tracking and understanding the actions of people for surveillance applications has turned out to have important spin-offs for human computer interaction. Techniques originally designed to monitor movements in a secure site can also be used to provide communications and access to information technology, to reduce bandwidth requirements for video-telephones, and to allow computer users to employ any convenient physical object as a computer entry device.

References

[Baumberg-Hogg 94] A. M. Baumberg and D. Hogg, "Learning Flexible Models from Image Sequences", European Conference on Computer Vision (ECCV '94), Stockholm, May 1994.
[Cadoz 94] C. Cadoz, Les réalités virtuelles, Dominos, Flammarion, 1994.
[Crowley et al 1994] J. L. Crowley and J. M. Bedrune, "Integration and Control of Reactive Visual Processes", European Conference on Computer Vision (ECCV '94), Stockholm, May 1994.
[Crowley-Martin 95] J. L. Crowley and J. Martin, "Experimental Comparison of Correlation Techniques", IAS-4, International Conference on Intelligent Autonomous Systems, Karlsruhe, 1995.
[Crowley-Berard 97] J. L. Crowley and F.
Berard, "Multi-Modal Tracking of Faces for Video Communications", IEEE Conference on Computer Vision and Pattern Recognition (CVPR '97), San Juan, Puerto Rico, June 1997.
[Crowley-Martin 97] J. L. Crowley and J. Martin, "Visual Processes for Tracking and Recognition of Hand Gestures", International Workshop on Perceptual User Interfaces, Banff, Canada, October 1997.
[Krueger 91] M. Krueger, Artificial Reality II, Addison Wesley, 1991.
[Murase-Nayar 95] H. Murase and S. K. Nayar, "Visual Learning and Recognition of 3-D Objects from Appearance", IJCV, Vol. 14, No. 1, January 1995.
[Schiele-Waibel 95] B. Schiele and A. Waibel, "Gaze Tracking Based on Face Color", International Workshop on Face and Gesture Recognition, Zurich, 1995.
[Turk-Pentland 91] M. Turk and A. P. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, 3(1):71-86, 1991.
[Wellner et al 93] P. Wellner, W. Mackay and R. Gold, "Computer-augmented environments: back to the real world", Special Issue of Communications of the ACM, Vol. 36, No. 7, 1993.