Head and Facial Animation Tracking using Appearance-Adaptive Models and Particle Filters


F. Dornaika and F. Davoine
CNRS HEUDIASYC, Compiègne University of Technology, 625 Compiègne Cedex, FRANCE
{dornaika,

Abstract

This paper introduces two frameworks for head and facial animation tracking. The first is a particle-filter tracker capable of tracking the 3D head pose using a statistical facial texture model. The second is an appearance-adaptive tracker capable of tracking the 3D head pose and the facial animations in real time; it combines the merits of both deterministic and stochastic approaches. It consists of an online adaptive observation model of the face texture together with an adaptive transition motion model, the latter based on a registration technique between the appearance model and the incoming observation. This second framework extends the concept of Online Appearance Models to the tracking of 3D non-rigid face motion (3D head pose and facial animations). Tracking long video sequences demonstrated the effectiveness of the developed methods: accurate tracking was obtained even in the presence of perturbing factors such as illumination changes, significant head pose and facial expression variations, and occlusions.

1. Introduction

Object tracking is required by many vision applications, especially in video technology and visual interface systems. The ability to track facial motion is useful in applications such as face-based biometric person authentication, expression analysis, and human-computer interaction (HCI). Visual tracking of faces can also be used to recognize basic activities from videos. Detecting and tracking faces in video sequences is a challenging task because faces are non-rigid and their images have a high degree of variability. A large research effort has already been devoted to the detection and tracking of facial features in 2D and 3D (e.g., [3, 4, 6, 9]).
Visual tracking can be performed by feature-based or appearance-based approaches. Moreover, tracking approaches can be divided into two groups: deterministic and stochastic. Probabilistic video analysis has recently gained significant attention in the computer vision community through the use of stochastic sampling techniques. Visual tracking based on probabilistic analysis is formulated, from a Bayesian perspective, as the problem of estimating a degree of belief in the state of an object at the current time given a sequence of observations. Particle filtering methods [2] (also known as Sequential Monte Carlo (SMC) methods) were proposed independently by several research groups. These algorithms provide flexible tracking frameworks: they are neither limited to linear systems nor require the noise to be Gaussian, and they have proved more robust to distracting clutter, since the randomly sampled particles maintain several competing hypotheses of the hidden state. These algorithms have gained prevalence in the video tracking literature due in part to the CONDENSATION algorithm [8]. Classical 3D vision techniques provide tools for computing the 3D pose and the possible facial animations from images. However, such trackers very often suffer from drift (error accumulation), since facial features do not have a sufficiently stable local appearance. Within deterministic approaches, appearance-based techniques have been widely used in object tracking. They are easy to implement and are generally more robust than feature-based methods. To cope with appearance changes, recent works on faces adopted statistical facial textures. For example, Active Appearance Models have been proposed as a powerful tool for analyzing facial images [5, 11]. In [1], the author proposed a framework that tracks the face and facial features using Active Appearance Models.
While statistical appearance-based tracking methods are promising in some respects, they depend on the imaging conditions under which the learning is performed. When these conditions change, the whole learning process must be repeated, which can be very tedious. Recently, 2D tracking approaches have adopted online appearance models (OAMs) [10, 16]. In [16], a combined deterministic and stochastic approach was developed to track the 2D motion of faces using an affine transform. That approach adopts adaptive observation models, since the object appearance is learned during the tracking. Thus, unlike tracking approaches based on statistical texture modelling, OAMs are expected to offer a lot of flexibility.

This paper has two main contributions. First, we develop a particle filter-based approach for tracking the 3D head pose using a statistical facial texture model. Second, we propose a framework for tracking the 3D head pose and the facial animations in real time using an online appearance model in which both the observation and transition models are adaptive. The second framework extends the concept of OAMs to the tracking of 3D non-rigid face motion (3D head pose and facial animation).

The rest of the paper is organized as follows. Section 2 describes the deformable 3D face model. Section 3 states the tracking problem. Section 4 describes particle filter-based 3D head tracking. Section 5 describes head and facial animation tracking using appearance-adaptive models and particle filters. Section 6 presents experimental results.

2. Modelling faces

2.1. A deformable 3D model

Building a generic 3D face model is a challenging task: such a model should account for the differences between specific human faces as well as between facial expressions. This modelling has been explored in the computer graphics, computer vision, and model-based image coding communities. In our study, we use the 3D face model Candide.
This 3D deformable wireframe model was first developed for model-based image coding and computer animation. The 3D shape of the wireframe model is directly recorded in coordinate form: the model is given by the 3D coordinates of its vertices P_i, i = 1, ..., n, where n is the number of vertices. Thus, up to a global scale, the shape can be fully described by the 3n-vector g, the concatenation of the 3D coordinates of all vertices P_i. The vector g can be written as:

g = \bar{g} + S τ_s + A τ_a    (1)

where \bar{g} is the standard shape of the model, and the columns of S and A are the Shape and Animation Units, respectively. A Shape Unit provides a way to deform the 3D wireframe, e.g., to adapt the eye width, the head width, or the eye separation distance. Thus, the term S τ_s accounts for shape variability (inter-person variability), while the term A τ_a accounts for facial animation (intra-person variability). For practical purposes, both kinds of variability are approximated well enough by this linear relation, and we assume the two kinds of variability are independent. In this study, we use 12 modes for the Shape Units matrix and six modes for the Animation Units matrix. Without loss of generality, we have chosen the following Animation Units: 1) jaw drop, 2) lip stretcher, 3) lip corner depressor, 4) upper lip raiser, 5) eyebrow lowerer, 6) outer eyebrow raiser. These units are enough to cover the most common facial actions (mouth and eyebrow movements).

In Equation (1), the 3D shape is expressed in a local coordinate system, which must be related to the image coordinate system. To this end, we adopt the weak perspective projection model: perspective effects are neglected, since the depth variation of the face can be considered small compared to its absolute depth.
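As a concrete illustration of Eqs. (1) and (2), the sketch below composes a Candide-style shape from the standard shape plus Shape/Animation Units, then projects a vertex with a weak-perspective 2×4 matrix M. All dimensions and numerical values here (vertex count, random unit matrices, the scale and translation in M) are illustrative assumptions, not the paper's actual model data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                              # hypothetical vertex count
g_bar = rng.normal(size=3 * n)       # standard shape (3n-vector)
S = rng.normal(size=(3 * n, 12))     # 12 Shape Units (inter-person)
A = rng.normal(size=(3 * n, 6))      # 6 Animation Units (intra-person)

def compose_shape(tau_s, tau_a):
    """Eq. (1): g = g_bar + S tau_s + A tau_a."""
    return g_bar + S @ tau_s + A @ tau_a

def project(M, vertex):
    """Eq. (2): (u, v)^T = M (X, Y, Z, 1)^T."""
    return M @ np.append(vertex, 1.0)

# Zero parameters recover the standard shape.
g = compose_shape(np.zeros(12), np.zeros(6))
vertices = g.reshape(n, 3)

# Illustrative weak-perspective matrix: scale 0.5, identity rotation,
# image-centre translation (160, 120).
M = np.hstack([0.5 * np.eye(3)[:2, :], np.array([[160.0], [120.0]])])
uv = project(M, np.array([10.0, 20.0, 5.0]))
```

With this M, the point (10, 20, 5) lands at (165, 130): the depth coordinate is dropped by the weak-perspective model, which is exactly why the 2×4 form suffices.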
Therefore, the mapping between the 3D face model and the image is given by a 2×4 matrix M encapsulating both the 3D head pose and the camera parameters. A 3D vertex P_i = (X_i, Y_i, Z_i)^T ∈ g is projected onto the image point p_i = (u_i, v_i)^T given by:

(u_i, v_i)^T = M (X_i, Y_i, Z_i, 1)^T    (2)

For a given person, τ_s is constant; estimating it can be carried out using either feature-based or featureless approaches. Thus, the state of the 3D model is given by the 3D head pose (three rotations and three translations) and the control vector τ_a, collected in the state vector b:

b = [θ_x, θ_y, θ_z, t_x, t_y, t_z, τ_a^T]^T    (3)

2.2. Shape-free facial images

A face texture is represented as a shape-free texture (a geometrically normalized image). The geometry of this image is obtained by projecting the standard shape \bar{g} (wireframe) with a standard 3D pose (frontal view) onto an image of a given resolution. The texture of this geometrically normalized image is obtained by mapping texture from the triangular 2D mesh in the input image using a piece-wise affine transform W. Mathematically, the warping process applied to an input image

y is denoted by:

x(b) = W(y, b)    (4)

where x denotes the shape-free texture and b denotes the geometric parameters. Here, images are represented as one-dimensional vectors. Two resolution levels have been used for the shape-free textures, encoded by 131 and 5392 pixels. Figure 1 illustrates the warping process applied to two input images; in this example, the resolution of the shape-free images is 5392 pixels. Regarding photometric transformations, a zero-mean-unit-variance normalization is used to partially compensate for contrast variations. The complete image transformation is implemented as follows: (i) transfer the texture y using the piece-wise affine transform associated with the geometric parameters b = [θ_x, θ_y, θ_z, t_x, t_y, t_z, τ_a^T]^T, and (ii) perform zero-mean-unit-variance normalization on the obtained patch.

Figure 1: (a) Two input images with correct adaptation. (b) The corresponding shape-free facial images.

3. The tracking problem

Given a video sequence depicting a moving face, tracking consists in estimating, for each frame, the 3D head pose as well as the facial animations encoded by the control vector τ_a. In other words, one would like to estimate the vector b (Eq. (3)) for each frame t. In a tracking context (deterministic or stochastic), the model parameters associated with the current frame are handed over to the next frame.

4. 3D head tracking with a particle filter

In this Section, we are interested in tracking the 3D head pose only. The state vector is therefore b = [θ_x, θ_y, θ_z, t_x, t_y, t_z]^T; in this particular case, the animation parameters τ_a can be set to zero. Almost all previous works using particle filters have focused on tracking 2D objects or motions, such as 2D contours, 2D blobs, 2D affine transforms [18], and 2D ellipses [13, 15]. Here, we propose a particle filtering method that tracks the six degrees of freedom of the head motion, where the face model is the Candide model. Particle filtering (or the Sequential Monte Carlo method) is an inference process which can be considered a generalization of the Kalman filter. It aims at estimating the unknown state b_t from a set of noisy observations (images) y_{1:t} = {y_1, ..., y_t} arriving in a sequential fashion. Two important components of this approach are the state transition and observation models, whose most general forms are:

State transition model:  b_t = F_t(b_{t-1}, U_t)    (5)
Observation model:       y_t = G_t(b_t, V_t)    (6)

where U_t is the system noise, F_t models the kinematics, V_t is the observation noise, and G_t models the observer. The particle filter approximates the posterior distribution p(b_t | y_{1:t}) by a set of weighted particles {b_t^(j), w_t^(j)}_{j=1}^J. Each element b_t^(j) represents a hypothetical state of the object, and w_t^(j) is the corresponding discrete probability. The state estimate can then be set to the minimum mean square error (MMSE) estimate or the maximum a posteriori (MAP) estimate. We use the following simple state transition model:

b_t = b_{t-1} + U_t    (7)

In this model, U_t is a random vector with a centred normal distribution, N(0, Σ). The covariance matrix Σ is learned off-line from the state vector differences b_t − b_{t-1} associated with previously tracked video sequences. Since the image data y are represented as shape-free textures x (the warped texture), we can set the observation likelihood p(y_t | b_t) to p(x_t | b_t). The observation likelihood p(x_t | b_t) quantifies the consistency of the texture x(b_t) with the statistical texture model represented by texture modes (eigenvectors). For this purpose, we use a likelihood measure such as the

one proposed in [12]:

p(x_t | b_t) = c exp( -(1/2) Σ_{i=1}^{M} ξ_i² / λ_i ) exp( -e / (2ρ) )    (8)

where e is the reconstruction error, the λ_i are the eigenvalues associated with the first M eigenvectors, and ρ is the arithmetic average of the remaining eigenvalues. This likelihood measure takes into account two distances: (i) the distance from feature space (DFFS), and (ii) the distance in feature space (DIFS). Maximizing this likelihood is equivalent to minimizing the Mahalanobis distance over the original textures.

The particle filter algorithm proceeds as follows:

Initialize a sample set S_0 = {b_0^(j), 1}_{j=1}^J according to some prior distribution p(b_0).
For t = 1, 2, ...
  For j = 1, 2, ..., J
    Resample S_{t-1} to obtain a sample (b_{t-1}^(j), 1).
    Predict: obtain a new sample b_t^(j) from b_{t-1}^(j) by drawing U_t^(j) according to Eq. (7).
    Compute the geometrically normalized texture x(b_t^(j)) according to Eq. (4).
    Update the weight: w_t^(j) = p(x_t | b_t^(j)) according to Eq. (8).
  End
  Normalize the weights: w_t^(j) = w_t^(j) / Σ_{i=1}^{J} w_t^(i), for j = 1, 2, ..., J.
End

During filtering, due to the resampling step, samples with a high weight may be chosen several times, while others with relatively low weights may not be chosen at all. Note that the initial distribution p(b_0) can be either a Dirac or a Gaussian distribution centred on a solution provided by a detection algorithm or specified manually.

5. Appearance-adaptive models and particle filters

In this Section, we consider the 3D head pose as well as the facial animations; that is, the state vector is b = [θ_x, θ_y, θ_z, t_x, t_y, t_z, τ_a^T]^T.

5.1. Motivations

The efficiency of the stochastic tracking algorithm presented above depends on many factors; however, the main limiting factor is the lack of suitable state transition models. There are two ways of handling the transition model. (i) The first is to learn state transition models directly from training videos.
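The algorithm above can be sketched as a minimal sampling-importance-resampling (SIR) loop. For readability the sketch uses a 1-D toy state rather than the paper's six-degree-of-freedom pose, and a simple Gaussian observation likelihood stands in for the eigenspace likelihood of Eq. (8); all noise levels and particle counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter(observations, n_particles=500, sigma_u=0.5, sigma_v=0.5):
    """Minimal SIR particle filter on a 1-D random-walk state."""
    particles = rng.normal(0.0, 1.0, n_particles)   # prior p(b_0)
    estimates = []
    for y in observations:
        # Predict: random-walk transition b_t = b_{t-1} + U_t (Eq. (7))
        particles = particles + rng.normal(0.0, sigma_u, n_particles)
        # Weight: Gaussian observation likelihood, standing in for Eq. (8)
        w = np.exp(-0.5 * ((y - particles) / sigma_v) ** 2)
        w /= w.sum()
        # Estimate: MMSE (weighted mean); the MAP would take argmax(w)
        estimates.append(np.sum(w * particles))
        # Resample with replacement proportionally to the weights
        particles = rng.choice(particles, size=n_particles, p=w)
    return estimates

# Synthetic ground truth and noisy observations.
true_states = np.cumsum(rng.normal(0.0, 0.3, 50))
obs = true_states + rng.normal(0.0, 0.5, 50)
est = particle_filter(obs)
```

The resampling line is what makes high-weight particles survive in several copies while low-weight ones disappear, exactly the behaviour discussed after the algorithm.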
For example, in [14] the authors use Expectation-Maximization, based on the CONDENSATION algorithm, to learn the multi-class dynamics associated with a juggled ball. However, such models may not succeed on test videos featuring different types of motion. (ii) The second is to use, for simplicity, a fixed model with fixed noise variance: the predicted state is simply the previous state (or a shifted version of it) to which random noise with fixed variance is added (the methodology adopted in Section 4). If the variance is very small, rapid movements are hard to model; if the variance is large, the approach is computationally inefficient, since many more particles are needed to accommodate the large noise variance. In addition to the problems associated with the state transition model, the observation model has its own limitations. For example, if the observation model (observation likelihood) is built upon a statistical texture model, any significant change in the imaging conditions will render the learned observation model useless, and a new observation model based on a new statistical texture model must be built. For all these reasons, we develop a new tracking framework capable of coping with the limitations mentioned above. Our approach is to make both the observation and state transition models adaptive within a particle filter framework, with built-in provisions for handling outliers. The main features of the developed approach are:

Adaptive observation model. We adopt an appearance-based approach using the concept of the online appearance model (OAM), where the appearance is learned online from the tracked video sequence [16, 17, 10]. In our case, we extend this concept to the tracking of 3D non-rigid face motion (3D head pose and facial animation). The observation model is therefore adaptive, as is the appearance of the texture.

Adaptive state transition model.
Instead of using a fixed state transition model, we use an adaptive-velocity model, where the adaptive motion velocity is predicted using a registration technique between the incoming observation and the current appearance configuration. We also use an adaptive noise component whose magnitude is a function of the registration error, and we vary the number of particles based on this noise component.
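The adaptive-velocity prediction can be sketched as follows: the deterministic part is the previous estimate plus the registration shift, and both the noise magnitude and the particle count grow with the registration error. The scaling constants and state dimension below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_particles(b_hat, delta_b, registration_error,
                     base_sigma=0.01, base_particles=50, max_particles=400):
    """Draw particles around b_hat + delta_b with error-adaptive noise."""
    sigma = base_sigma * (1.0 + registration_error)            # adaptive noise
    n = min(max_particles,
            int(base_particles * (1.0 + registration_error)))  # adaptive count
    mean = b_hat + delta_b                                     # deterministic part
    return mean + rng.normal(0.0, sigma, size=(n, b_hat.size))

b_hat = np.zeros(6)          # previous pose estimate (toy 6-DOF state)
delta_b = np.full(6, 0.1)    # shift returned by the registration step
particles = sample_particles(b_hat, delta_b, registration_error=2.0)
```

When the registration fits well (small error), the cloud of particles stays tight and cheap; when registration struggles, the cloud widens and more hypotheses are spent, which is the trade-off the fixed-variance model cannot make.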

Handling occlusion. Occlusions and large image variations are handled using robust statistics: we robustify the likelihood measurement and the adaptive velocity estimate by downweighting outlier pixels.

5.2. Adaptive observation model

Consider the shape-free texture x_t; recall that x_t is a transformed version of the observation y_t (see Eq. (4)). The appearance model at time t, A_t, is time-varying: it models the appearances present in all observations x up to time t−1. For each frame, the observation is simply the warped texture associated with the computed geometric parameters b_t. We use the hat symbol for tracked parameters and textures: for a given frame t, b̂_t denotes the computed geometric parameters and x̂_t the corresponding texture patch, that is, x̂_t = x(b̂_t) = W(y_t, b̂_t). The appearance model A_t obeys a Gaussian with centre µ and variance σ. Note that µ and σ are vectors of d pixels (d is the size of x) that are assumed to be independent of each other. In summary, the observation likelihood is written as

p(y_t | b_t) = p(x_t | b_t) = Π_{i=1}^{d} N(x_i; µ_i, σ_i)    (9)

where N(x; µ_i, σ_i) is a normal density:

N(x; µ_i, σ_i) = (2π σ_i²)^{-1/2} exp[ -ρ( (x − µ_i) / σ_i ) ]    (10)

with ρ(x) = (1/2) x². We assume that A_t summarizes the past observations under an exponential envelope with forgetting factor α. Once the appearance has been tracked for the current input image, i.e. the texture x̂_t is available, we can compute the updated appearance and use it for tracking in the next frame. It can be shown that the appearance model parameters µ and σ can be updated using the following equations (see [10] for more details on OAMs):

µ_{t+1} = α µ_t + (1 − α) x̂_t    (11)
σ²_{t+1} = α σ²_t + (1 − α) (x̂_t − µ_t)²    (12)

In the above equations, all µ's and σ²'s are vectors and the operations are element-wise.
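The recursive update of Eqs. (11)-(12) is a one-liner per parameter; the sketch below applies it to a tiny 4-pixel patch with an illustrative forgetting factor.

```python
import numpy as np

def update_appearance(mu, var, x_hat, alpha=0.95):
    """Element-wise recursive update of the appearance model,
    following Eqs. (11)-(12): exponential forgetting with factor alpha."""
    mu_new = alpha * mu + (1.0 - alpha) * x_hat
    var_new = alpha * var + (1.0 - alpha) * (x_hat - mu) ** 2
    return mu_new, var_new

# Toy 4-pixel appearance model and one incoming warped patch.
mu = np.zeros(4)
var = np.ones(4)
x_hat = np.array([1.0, 2.0, 3.0, 4.0])
mu, var = update_appearance(mu, var, x_hat, alpha=0.9)
```

Because the update needs only the running mean and variance, no history of past patches is stored, which is why the paper can call the scheme time-efficient and real-time friendly.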
This technique, also called recursive filtering, is simple and time-efficient and is therefore suitable for real-time applications. Note that µ is initialized with the first patch x. However, Eq. (12) is not used until the number of frames reaches a certain value (e.g., the first 4 frames); for these frames, the classical variance is used, that is, Eq. (12) with α set to 1 − 1/t.

5.3. Adaptive transition model

Instead of using a fixed function F to predict the transition from the state at time t−1 to the state at time t, we use the following adaptive transition model:

b_t = b̂_{t-1} + Δb_t + U_t    (13)

where Δb_t is the shift in the geometric parameters and U_t is random noise. Our basic idea for recovering the shift Δb_t, i.e. the deterministic part of Eq. (13), is to use region-based registration: the current input image y_t is registered with the current appearance model A_t. For this purpose, we minimize the Mahalanobis distance between the warped texture and the current appearance mean:

min_{b_t} e(b_t) = min_{b_t} D(x(b_t), µ_t) = Σ_{i=1}^{d} ( (x_i − µ_i) / σ_i )²    (14)

Note that the appearance parameters µ_t and σ_t are known. The above criterion can be minimized using an iterative first-order linear approximation.

Gradient-descent registration. We assume that there exists a b_t = b̂_{t-1} + Δb_t such that the warped texture is very close to the appearance mean, i.e., W(y_t, b_t) ≈ µ_t. Approximating W(y_t, b_t) by a first-order Taylor series expansion around b̂_{t-1} yields

W(y_t, b_t) ≈ W(y_t, b̂_{t-1}) + G_t (b_t − b̂_{t-1})

where G_t is the gradient matrix. Combining the two equations above gives

µ_t = W(y_t, b̂_{t-1}) + G_t (b_t − b̂_{t-1})

Therefore, the shift in parameter space is given by:

Δb_t = b_t − b̂_{t-1} = −G_t† (W(y_t, b̂_{t-1}) − µ_t)    (15)

where G_t† denotes the pseudo-inverse of G_t. In practice, the solution b_t (or equivalently the shift Δb_t) is estimated by running several iterations until the error can no longer be improved. We proceed as follows. Starting from b = b̂_{t-1}, we compute the error vector W(y_t, b̂_{t-1}) − µ_t and the corresponding Mahalanobis

distance e(b) (given by Eq. (14)). We find a shift Δb by multiplying the error vector by the negative pseudo-inverse of the gradient matrix, using Eq. (15). The vector Δb gives a displacement in the search space for which the error e can be minimized. We compute a new parameter vector and a new error:

b' = b + ρ Δb    (16)
e' = e(b')

where ρ is a positive real number. If e' < e, we update b according to Eq. (16) and iterate the process until convergence. If e' ≥ e, we try smaller update steps in the same direction (i.e., a smaller ρ). Convergence is declared when the error can no longer be improved.

Computing the gradient matrix. The gradient matrix is given by:

G = ∂W(y_t, b_t) / ∂b = ∂x_t / ∂b

It is approximated by numerical differences. Once the solution b̂_t becomes available for a given frame, it is possible to compute the gradient matrix from the associated input image. The j-th column of G (j = 1, ..., dim(b)),

G_j = ∂W(y_t, b_t) / ∂b_j

can be estimated using differences:

G_j ≈ ( W(y_t, b_t + δ q_j) − W(y_t, b_t) ) / δ

where δ is a suitable step size and q_j is a vector whose elements are all zero except the j-th, which equals one. To gain accuracy, the j-th column of G is estimated using several steps around the current value b_j and averaging:

G_j = (1/K) Σ_{k=−K/2, k≠0}^{K/2} ( W(y_t, b_t + k δ_j q_j) − W(y_t, b_t) ) / (k δ_j)

where δ_j is the smallest perturbation associated with the parameter b_j and K is the number of steps (in our experiments, K is set to 8). Note that the computation of the gradient matrix G_t at time t is carried out using the estimated geometric parameters b̂_{t-1} and the associated input image y_{t-1}, since the adaptation for time t has not yet been computed. It is worth noting that the gradient matrix is recomputed at each time step. The advantage is twofold. First, a varying gradient matrix is able to accommodate appearance changes.
Second, it is closer to the exact gradient matrix, since it is computed for the current geometric configuration (3D head pose and facial animations), whereas a fixed gradient matrix can be a source of error for some kinds of motion, such as out-of-plane motions.

5.4. Handling outliers and occlusions

We assume that occlusions and large image differences can be treated as outliers. Outlier pixels cannot be explained by the underlying process (the current appearance model A_t), and their influence on the estimation process should be reduced. Robust statistics provide such mechanisms [7]. The mechanism affects three items: (i) the likelihood measure, (ii) the gradient-descent method, and (iii) the update of the online appearance model A_t. We use the function ρ̂ defined as follows:

ρ̂(x) = (1/2) x²            if |x| ≤ c
ρ̂(x) = c |x| − (1/2) c²    if |x| > c

where x is the value of a pixel in the patch x, normalized by the mean and standard deviation of the appearance at the same pixel, i.e. µ_i and σ_i. The constant c controls the outlier rate; in our experiments we take c = 3, based on experimental experience. If |x| > c, the corresponding pixel is declared an outlier.

Likelihood measure. To make the likelihood measure robust, we replace the one-dimensional normal density N(x; µ_i, σ_i) by

N̂(x; µ_i, σ_i) = (2π σ_i²)^{-1/2} exp[ -ρ̂( (x − µ_i) / σ_i ) ]

Gradient method. To downweight the influence of outlier pixels in the registration technique, we introduce a d×d diagonal matrix L_t whose i-th diagonal element is L_t(i) = η(x_i), where x_i is the i-th element of the difference image W(y_t, b̂_{t-1}) − µ_t normalized by the corresponding σ_i, and

η(x) = (1/x) dρ̂(x)/dx = 1    if |x| ≤ c
η(x) = c / |x|               if |x| > c

Therefore, the shift used in the gradient-descent registration becomes

Δb_t = b_t − b̂_{t-1} = −G_t† L_t (W(y_t, b̂_{t-1}) − µ_t)    (17)

Note that the quadratic error function (Eq. (14)) can also be replaced by the robust function ρ̂.
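The robust registration step of Eq. (17) can be sketched end to end: a finite-difference gradient matrix, Huber-type weights η on the normalized residuals, and the weighted pseudo-inverse shift. The `warp` function below is a hypothetical stand-in for the piece-wise affine warp W(y_t, b) (a small smooth nonlinear map, so the Gauss-Newton behaviour is visible); all numbers are illustrative.

```python
import numpy as np

C = 3.0  # outlier threshold, as in the paper

def eta(x):
    """Per-pixel weight (1/x) d(rho_hat)/dx: 1 for inliers, C/|x| beyond C."""
    ax = np.abs(x)
    return np.where(ax <= C, 1.0, C / ax)

def warp(b):
    """Hypothetical stand-in for W(y_t, b), producing a 3-pixel patch."""
    return np.array([b[0] ** 2, b[0] * b[1], np.sin(b[1])])

def gradient_matrix(b, deltas, K=8):
    """Column j = average of K finite differences over steps k*delta_j."""
    d, p = warp(b).shape[0], len(b)
    G = np.zeros((d, p))
    steps = [k for k in range(-K // 2, K // 2 + 1) if k != 0]
    for j in range(p):
        q = np.zeros(p); q[j] = 1.0
        cols = [(warp(b + k * deltas[j] * q) - warp(b)) / (k * deltas[j])
                for k in steps]
        G[:, j] = np.mean(cols, axis=0)
    return G

def robust_shift(b, mu, sigma, deltas):
    """Eq. (17): delta_b = -G^+ L (W(y, b) - mu), with Huber weights on L."""
    G = gradient_matrix(b, deltas)
    r = (warp(b) - mu) / sigma          # normalized residuals
    L = np.diag(eta(r))
    return -np.linalg.pinv(G) @ L @ (warp(b) - mu)

# Register the toy warp against a target appearance mean.
b_true = np.array([1.2, 0.6])
mu, sigma = warp(b_true), np.ones(3)
b = np.array([1.0, 0.5])
for _ in range(10):
    b = b + robust_shift(b, mu, sigma, np.array([1e-4, 1e-4]))
```

With all residuals below C, the weights are 1 and the loop reduces to plain Gauss-Newton; pixels occluded by, say, a hand would get weights C/|x| < 1 and contribute proportionally less to the shift.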

Appearance update. Once the solution b_t is available, the corresponding patch x is used to update the appearance. For non-outlier pixels, the update equations are Eqs. (11) and (12); for outlier pixels, the corresponding means and variances are not updated. This mechanism is very useful for preventing occlusions from deteriorating the online appearance model.

Occlusions. An occlusion is declared if the percentage of outliers exceeds a certain threshold.

5.5. A tracking algorithm

Tracking the 3D head pose and the facial animations is performed as follows. Starting from the solution b̂_{t-1} associated with the previous frame, we predict the state using Eq. (13), in which the deterministic part of the prediction, i.e. b̂_{t-1} + Δb_t, is computed by the registration technique, and the noise variance is set as a monotonically increasing function of the registration error obtained at convergence. Once a set of particles is obtained, the MAP of Eq. (9) is chosen as the solution for the current frame. Note that, unlike classical particle filtering, the propagation concerns only the MAP solution and not the whole particle set. Although the deterministic and stochastic parts of Eq. (13) use the same observation model, there are differences: the deterministic solution is obtained by a directed continuous search starting from the solution associated with the previous frame, while the stochastic solution is obtained by diffusing the deterministic solution in order to obtain a possible refinement.

6. Experimental results

6.1. 3D head tracking with a particle filter

Figure 2 displays the tracking results for several frames of a long test sequence, obtained with the particle-filter-based tracking algorithm using a statistical texture model. The number of particles is set to 3. For each frame in this figure, only the MAP solution is displayed.
The statistical facial texture model is built from 33 training face images, and the number of principal eigenvectors is 2. The bottom plot of Figure 2 displays the weights associated with the bottom-right image.

Figure 2: Particle-filter-based 3D head tracking with a statistical facial texture model. The bottom plot displays the weights associated with the bottom-right image.

6.2. Head and facial animation tracking with OAMs

Figure 3 displays the head and facial animation tracking results for an 8-frame-long sequence (only four frames are shown). These results correspond to the real-time tracker based on the appearance-adaptive model described in Section 5. The sequence features quite large head pose variations as well as large facial animations; its resolution is 64x48 pixels. As can be seen, with very little prior information, the 3D motion of the face as well as the facial actions associated with the mouth and the eyebrows are accurately recovered. The upper left corner shows the current appearance (µ_t) and the current shape-free texture (x̂_t). Figure 4 displays the estimated values of the 3D head pose parameters (the three rotations and the three translations), as well as the lower lip depressor and the inner brow lowerer, as a function of the frame index for the same sequence. Figure 5 displays the face and facial animation results for a 62-frame-long sequence. The left column displays the tracking results of the real-time tracker using the appearance-adaptive model with a

fixed gradient matrix computed at the first video frame. The right column displays the tracking results when a time-varying gradient matrix is used. We noticed, for this sequence, that whenever the face performs an out-of-plane motion, the tracker with a time-varying gradient is more accurate than the one using a fixed gradient. Moreover, in other experiments, the tracker using a fixed gradient matrix lost track entirely. Figure 6 displays the face and facial animation results for another 42-frame-long sequence featuring two occlusions (only the second occlusion is displayed). Both occlusions are caused by a hand placed in front of the face; frames 218 and 265 show the start and the end of the second occlusion, respectively. On a 2.2 GHz PC, non-optimized C code of the developed appearance-adaptive algorithm runs at around 14 frames per second, assuming a patch resolution of 131 pixels, K = 8, and an average number of particles of 6. By dropping the stochastic diffusion stage, the algorithm runs at 2 frames per second.

7. Conclusion

In this paper, we have proposed two tracking methods. The first is fully stochastic and uses a particle filter with an observation likelihood based on statistical facial textures; it has been used for 3D head tracking. The second combines the merits of stochastic and deterministic methods and is capable of tracking the head and the facial animation. It employs an online appearance model in which both the observation and transition models are adaptive. The deterministic part exploits a directed continuous search aimed at minimizing the discrepancy between the upcoming observation and the current appearance model. Tracking long video sequences demonstrated the effectiveness of the developed methods.
Accurate tracking was obtained even in the presence of perturbing factors such as illumination changes, significant head pose and facial expression variations, and occlusions. We are currently investigating the recognition of facial expressions and gestures from the tracked parameters.

Figure 3: Our real-time framework for tracking the 3D head pose and the facial animations with an appearance-adaptive model. The sequence length is 8 frames.

Figure 4: The tracked parameters as a function of time for the 8-frame-long sequence. The first six plots display the six degrees of freedom of the 3D head pose (pitch, yaw, and roll in degrees; scale; X and Y translations in pixels). The two bottom plots display the lower lip depressor and inner brow lowerer parameters, respectively.

Figure 5: Frames 94, 21, 376, and 569 (from top to bottom). The left column displays head and facial animation tracking using the appearance-adaptive framework with a fixed gradient. The right column displays the tracking results when a time-varying gradient is used (see text).

Figure 6: Tracking another test sequence featuring two occlusions; frames 218, 265, and 392 are displayed. Only the second occlusion is shown: frames 218 and 265 show its start and end, respectively. As can be seen, the head is accurately tracked even during a sustained occlusion.

References

[1] J. Ahlberg. An active model for facial feature tracking. EURASIP Journal on Applied Signal Processing, 2002(6), June 2002.
[2] S. Arulampalam, S. R. Maskell, N. J. Gordon, and T. Clapp. A tutorial on particle filters for on-line nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2), 2002.
[3] S. Birchfield. Elliptical head tracking using intensity gradients and color histograms. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1998.
[4] M. L. Cascia, S. Sclaroff, and V. Athitsos. Fast, reliable head tracking under varying illumination: An approach based on registration of texture-mapped 3D models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(4), 2000.
[5] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6), 2001.
[6] S. B. Gokturk, J. Y. Bouguet, and R. Grzeszczuk. A data-driven model for monocular face tracking. In Proc. IEEE International Conference on Computer Vision, 2001.
[7] P. J. Huber. Robust Statistics. Wiley, 2003.
[8] M. Isard and A. Blake. Contour tracking by stochastic propagation of conditional density. In Proc. European Conference on Computer Vision, 1996.
[9] T. S. Jebara and A. Pentland. Parameterized structure from motion for 3D adaptive feedback tracking of faces. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1997.
[10] A. D. Jepson, D. J. Fleet, and T. F. El-Maraghi. Robust online appearance models for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10), 2003.
[11] I. Matthews and S. Baker. Active appearance models revisited. Technical Report CMU-RI-TR-03-02, The Robotics Institute, Carnegie Mellon University, 2002.
[12] B. Moghaddam and A. Pentland. Probabilistic visual learning for object representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):696-710, 1997.
[13] H. Nait-Cherif and S. J. McKenna. Head tracking and action recognition in a smart meeting room. In IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, March 2003.
[14] B. North, A. Blake, M. Isard, and J. Rittscher. Learning and classification of complex dynamics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9), 2000.
[15] K. Nummiaro, E. Koller-Meier, and L. Van Gool. Object tracking with an adaptive color-based particle filter. In Symposium for Pattern Recognition of the DAGM, September 2002.
[16] S. Zhou, R. Chellappa, and B. Moghaddam. Adaptive visual tracking and recognition using particle filters. In IEEE International Conference on Multimedia and Expo, 2003.
[17] S. Zhou, R. Chellappa, and B. Moghaddam. Adaptive visual tracking and recognition using appearance-adaptive models in particle filters. IEEE Transactions on Image Processing, 2004.
[18] S. Zhou, V. Krueger, and R. Chellappa. Probabilistic recognition of human faces from video. Computer Vision and Image Understanding, 91(1-2), 2003.


More information

Switching Gaussian Process Dynamic Models for Simultaneous Composite Motion Tracking and Recognition

Switching Gaussian Process Dynamic Models for Simultaneous Composite Motion Tracking and Recognition Switching Gaussian Process Dynamic Models for Simultaneous Composite Motion Tracking and Recognition Jixu Chen Minyoung Kim Yu Wang Qiang Ji Department of Electrical,Computer and System Engineering Rensselaer

More information

Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 269 Class Project Report

Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 269 Class Project Report Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 69 Class Project Report Junhua Mao and Lunbo Xu University of California, Los Angeles mjhustc@ucla.edu and lunbo

More information

Automatic Calibration of an In-vehicle Gaze Tracking System Using Driver s Typical Gaze Behavior

Automatic Calibration of an In-vehicle Gaze Tracking System Using Driver s Typical Gaze Behavior Automatic Calibration of an In-vehicle Gaze Tracking System Using Driver s Typical Gaze Behavior Kenji Yamashiro, Daisuke Deguchi, Tomokazu Takahashi,2, Ichiro Ide, Hiroshi Murase, Kazunori Higuchi 3,

More information

System Identification for Acoustic Comms.:

System Identification for Acoustic Comms.: System Identification for Acoustic Comms.: New Insights and Approaches for Tracking Sparse and Rapidly Fluctuating Channels Weichang Li and James Preisig Woods Hole Oceanographic Institution The demodulation

More information

Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

More information

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian

More information

A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking

A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking 174 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 50, NO. 2, FEBRUARY 2002 A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking M. Sanjeev Arulampalam, Simon Maskell, Neil

More information