Face Alignment through Subspace Constrained Mean-Shifts

Jason M. Saragih, Simon Lucey, Jeffrey F. Cohn
The Robotics Institute, Carnegie Mellon University
Pittsburgh, PA 15213, USA
{jsaragih,slucey,jeffcohn}@cs.cmu.edu

Abstract

Deformable model fitting has been actively pursued in the computer vision community for over a decade. As a result, numerous approaches have been proposed with varying degrees of success. A class of approaches that has shown substantial promise is one that makes independent predictions regarding locations of the model's landmarks, which are combined by enforcing a prior over their joint motion. A common theme in innovations to this approach is the replacement of the distribution of probable landmark locations, obtained from each local detector, with simpler parametric forms. This simplification substitutes the true objective with a smoothed version of itself, reducing sensitivity to local minima and outlying detections. In this work, a principled optimization strategy is proposed where a nonparametric representation of the landmark distributions is maximized within a hierarchy of smoothed estimates. The resulting update equations are reminiscent of mean-shift but with a subspace constraint placed on the shape's variability. This approach is shown to outperform other existing methods on the task of generic face fitting.

1. Introduction

Deformable model fitting is the problem of registering a parametrized shape model to an image such that its landmarks correspond to consistent locations on the object of interest. It is a difficult problem as it involves an optimization in high dimensions, where appearance can vary greatly between instances of the object due to lighting conditions, image noise, resolution and intrinsic sources of variability. Many approaches have been proposed for this with varying degrees of success. Of these, one of the most promising is one that uses a patch-based representation and assumes image observations made for each landmark are conditionally independent [2, 3, 4, 5, 16].
This leads to better generalization with limited data compared to holistic representations [10, 11, 14, 15], since it needs only account for local correlations between pixel values. However, it suffers from detection ambiguities as a direct result of its local representation. As such, care should be taken in combining detection results from the various local detectors in order to steer optimization towards the desired solution.

Our key contribution in this paper lies in the realization that a number of popular optimization strategies are all, in some way, simplifying the distribution of landmark locations obtained from each local detector using a parametric representation. The motivation of this simplification is to ensure that the approximate objective function: (i) exhibits properties that make optimization efficient and numerically stable, and (ii) still approximately preserves the true certainty/uncertainty associated with each local detector. The question then remains: how should one simplify these local distributions in order to satisfy (i) and (ii)? We address this by using a nonparametric representation that leads to an optimization in the form of subspace constrained mean-shifts.

2. Background

2.1. Constrained Local Models

Most fitting methods employ a linear approximation to how the shape of a non-rigid object deforms, coined the point distribution model (PDM) [2]. It models non-rigid shape variations linearly and composes them with a global rigid transformation, placing the shape in the image frame:

$$x_i = s R (\bar{x}_i + \Phi_i q) + t, \tag{1}$$

where $x_i$ denotes the 2D-location of the PDM's $i$-th landmark, $\bar{x}_i$ is its mean location, $\Phi_i$ is the corresponding block of the basis of non-rigid variation, and $p = \{s, R, t, q\}$ denotes the parameters of the PDM, which consist of a global scaling $s$, a rotation $R$, a translation $t$ and a set of non-rigid parameters $q$.

In recent years, an approach that utilizes an ensemble of local detectors (see [2, 3, 4, 5, 16]) has attracted some interest as it circumvents many of the drawbacks of holistic approaches, such as modeling complexity and sensitivity to lighting changes. In this work, we will refer to these methods collectively as constrained local models (CLM)$^1$.
$^1$ This term should not be confused with the work in [5], which is a particular instance of CLM in our nomenclature.
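The point distribution model of Equation (1) can be sketched in a few lines. This is only an illustrative sketch: the function name `pdm_landmarks`, the representation of the rotation $R$ by a single in-plane angle `theta`, and the toy three-landmark shape with one non-rigid mode are assumptions made here, not part of the original formulation.

```python
import math

def pdm_landmarks(mean_shape, basis, q, s, theta, t):
    """Place PDM landmarks in the image frame: x_i = s R (xbar_i + Phi_i q) + t.

    mean_shape: list of n (x, y) mean landmark positions (xbar_i).
    basis:      basis[i][k] is the (dx, dy) effect of non-rigid mode k on
                landmark i, i.e. column k of the 2 x m block Phi_i.
    q:          list of m non-rigid parameters.
    s, theta, t: global scale, in-plane rotation angle (radians), translation.
    """
    c, sn = math.cos(theta), math.sin(theta)
    out = []
    for (mx, my), modes in zip(mean_shape, basis):
        # Non-rigid deformation in the model frame.
        dx = mx + sum(qk * m[0] for qk, m in zip(q, modes))
        dy = my + sum(qk * m[1] for qk, m in zip(q, modes))
        # Similarity transform into the image frame.
        out.append((s * (c * dx - sn * dy) + t[0],
                    s * (sn * dx + c * dy) + t[1]))
    return out

# Hypothetical toy shape: three landmarks and one mode that widens the shape.
mean_shape = [(-1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
basis = [[(-1.0, 0.0)], [(1.0, 0.0)], [(0.0, 0.0)]]
shape = pdm_landmarks(mean_shape, basis, q=[0.5], s=2.0, theta=0.0, t=(10.0, 20.0))
print(shape)
```

With these toy values the two base landmarks move apart under the non-rigid mode before the whole shape is scaled and translated, which is the composition order that Equation (1) prescribes.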
Figure 1. Illustration of CLM fitting and its two components: (i) an exhaustive local search for feature locations to get the response maps $\{p(l_i = \text{aligned} \mid I, x)\}_{i=1}^n$, and (ii) an optimization strategy to maximize the responses of the PDM constrained landmarks.

All instantiations of CLMs can be considered to be pursuing the same two goals: (i) perform an exhaustive local search for each PDM landmark around its current estimate using some kind of feature detector, and (ii) optimize the PDM parameters such that the detection responses over all of its landmarks are jointly maximized. Figure 1 illustrates the components of CLM fitting.

Exhaustive Local Search: In the first step of CLM fitting, a likelihood map is generated for each landmark position by applying local detectors to constrained regions around the current estimate. A number of feature detectors have been proposed for this purpose. One of the simplest, proposed in [16], is the linear logistic regressor, which gives the following response map for the $i$-th landmark$^2$:

$$p(l_i = \text{aligned} \mid I, x) = \frac{1}{1 + \exp\{\alpha\, C_i(I; x) + \beta\}}, \tag{2}$$

where $l_i$ is a discrete random variable denoting whether the $i$-th landmark is correctly aligned or not, $I$ is the image, $x$ is a 2D location in the image, and $C_i$ is a linear classifier:

$$C_i(I; x) = w_i^T [\, I(y_1); \ldots; I(y_m) \,] + b_i, \tag{3}$$

with $\{y_i\}_{i=1}^m \in \Omega_x$ (i.e. an image patch). An advantage of using this classifier is that the map can be computed using efficient convolution operations. Other feature detectors have also been used to great effect, such as the Gaussian likelihood [2] and the Haar-based boosted classifier [3].

Optimization: Once the response maps for each landmark have been found, by assuming conditional independence, optimization proceeds by maximizing:

$$p(\{l_i = \text{aligned}\}_{i=1}^n \mid p) = \prod_{i=1}^n p(l_i = \text{aligned} \mid x_i) \tag{4}$$

with respect to the PDM parameters $p$, where $x_i$ is parameterized as in Equation (1) and dependence on the image $I$ is dropped for succinctness. It should be noted that some forms of CLMs pose Equation (4) as minimizing the summation of local energy responses (see §2.2).

The main difficulty in this optimization is how to avoid local optima whilst affording an efficient evaluation. Treating Equation (4) as a generic optimization problem, one may be tempted to utilize general purpose optimization strategies here. However, as the responses are typically noisy, these optimization strategies have a tendency to be unstable. The simplex based method used in [4] has been shown to perform reasonably for this task since it is a gradient-free generic optimizer, which renders it somewhat insensitive to measurement noise. However, convergence may be slow when using this method, especially for a complex PDM with a large number of parameters.

$^2$ Not all CLM instances require a probabilistic output from the local detectors. Some, for example [2, 5], only require a similarity measure or a match score. However, these matching scores can be interpreted as the result of applying a monotonic function to the generating probability. For example, the Mahalanobis distance used in [2] is the negative log of the Gaussian likelihood. In the interest of clarity and succinctness, discussions in this work assume that responses are probabilities.

2.2. Optimization Strategies

In this section, a review of current methods for CLM optimization is presented. These methods entail replacing the true response maps, $\{p(l_i \mid x)\}_{i=1}^n$, with simpler parametric forms and performing optimization over these instead of the original response maps. As these parametric density estimates are a kind of smoothed version of the original responses, sensitivity to local minima is generally reduced.

Active Shape Models: The simplest optimization strategy for CLM fitting is that used in the Active Shape Model (ASM) [2]. The method entails first finding the location within each response map for which the maximum was attained: $\mu = [\mu_1; \ldots; \mu_n]$.
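As a concrete illustration of the exhaustive local search of Equations (2) and (3), and of the peak selection just described, the following sketch evaluates a logistic response on a small search window and returns its maximum. The function names, the (row, column) indexing, the toy image and the single-pixel template are hypothetical; the negative value of `alpha` is an assumption chosen so that, under the sign convention of Equation (2), larger classifier scores yield larger probabilities.

```python
import math

def response_map(image, w, b, alpha, beta, center, half):
    """Logistic response p(l = aligned | I, x) on a (2*half+1)^2 search window.

    image:  2D list of grey values, indexed as image[row][col].
    w:      square 2D list of patch weights with odd side length.
    center: (row, col) of the current landmark estimate.
    Returns a dict mapping (row, col) to the response at that location.
    """
    r = len(w) // 2
    out = {}
    for y in range(center[0] - half, center[0] + half + 1):
        for x in range(center[1] - half, center[1] + half + 1):
            # Linear classifier C(I; x) = w^T patch + b, as in Equation (3).
            c = b + sum(w[i][j] * image[y - r + i][x - r + j]
                        for i in range(len(w)) for j in range(len(w)))
            # Logistic link of Equation (2).
            out[(y, x)] = 1.0 / (1.0 + math.exp(alpha * c + beta))
    return out

def peak(resp):
    """Location of the maximum response (the ASM's mu_i)."""
    return max(resp, key=resp.get)

image = [[0.0] * 7 for _ in range(7)]
image[4][3] = 5.0  # a single bright pixel stands in for the true landmark
w = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]  # hypothetical template
resp = response_map(image, w, b=0.0, alpha=-1.0, beta=0.0, center=(3, 3), half=1)
print(peak(resp))  # (4, 3)
```

A practical implementation would compute the classifier scores with a convolution rather than nested loops, as noted in §2.1; the loops are kept here only to mirror the equations.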
The objective of the optimization procedure is then to minimize the weighted least squares difference between the PDM and the coordinates of the peak responses:

$$Q(p) = \sum_{i=1}^n w_i \, \| x_i - \mu_i \|^2, \tag{5}$$

where the weights $\{w_i\}_{i=1}^n$ reflect the confidence over peak response coordinates and are typically set to some function of the responses at $\{\mu_i\}_{i=1}^n$. This makes the fit more resistant to effects such as partial occlusion, since occluded landmarks will be more weakly weighted.

Equation (5) is iteratively minimized by taking a first order Taylor expansion of the PDM's landmarks:

$$x_i \approx x_i^c + J_i \Delta p, \tag{6}$$

and solving for the parameter update:

$$\Delta p = \left( \sum_{i=1}^n w_i J_i^T J_i \right)^{-1} \sum_{i=1}^n w_i J_i^T (\mu_i - x_i^c), \tag{7}$$

which is then applied additively to the current parameters: $p \leftarrow p + \Delta p$. Here, $J = [J_1; \ldots; J_n]$ is the Jacobian and $x^c = [x_1^c; \ldots; x_n^c]$ is the current shape estimate.

From the probabilistic perspective introduced in §2.1, the ASM's optimization procedure is equivalent to approximating the response maps with an isotropic Gaussian estimator:

$$p(l_i = \text{aligned} \mid x) \approx \mathcal{N}(x; \mu_i, \sigma_i^2 I), \tag{8}$$

where $w_i = \sigma_i^{-2}$. With this approximation, taking the negative log of the likelihood in Equation (4) results in the objective in Equation (5).

Convex Quadratic Fitting: Although the approximation described above is simple and efficient, in some cases it may be a poor estimate of the true response map. Firstly, the landmark detectors, such as the linear classifier described in §2.1, are usually imperfect in the sense that the maximum of the response may not always coincide with the correct landmark location. Secondly, as the features used in detection consist of small image patches, they often contain limited structure, leading to detection ambiguities. The simplest example of this is the aperture problem, where detection confidence across the edge is better than along it (see example response maps for the nose bridge and chin in Figure 2).

To account for these problems, a method coined convex quadratic fitting (CQF) has been proposed recently [16]. The method fits a convex quadratic function to the negative log of the response map. This is equivalent to approximating the response map with a full covariance Gaussian:

$$p(l_i = \text{aligned} \mid x) \approx \mathcal{N}(x; \mu_i, \Sigma_i). \tag{9}$$

The mean and covariance are maximum likelihood estimates given the response map:

$$\Sigma_i = \sum_{x \in \Psi_{x_i^c}} \alpha^i_x (x - \mu_i)(x - \mu_i)^T; \qquad \mu_i = \sum_{x \in \Psi_{x_i^c}} \alpha^i_x \, x, \tag{10}$$

where $\Psi_{x_i^c}$ is a 2D-rectangular grid centered at the current landmark estimate $x_i^c$ (i.e. the search window), and:

$$\alpha^i_x = \frac{p(l_i = \text{aligned} \mid x)}{\sum_{y \in \Psi_{x_i^c}} p(l_i = \text{aligned} \mid y)}. \tag{11}$$

With this approximation, the objective can be written as the minimization of:

$$Q(\Delta p) = \sum_{i=1}^n \| x_i^c + J_i \Delta p - \mu_i \|^2_{\Sigma_i^{-1}}, \tag{12}$$

the solution of which is given by:

$$\Delta p = \left( \sum_{i=1}^n J_i^T \Sigma_i^{-1} J_i \right)^{-1} \sum_{i=1}^n J_i^T \Sigma_i^{-1} (\mu_i - x_i^c). \tag{13}$$

Figure 2. Response maps, $p(l_i = \text{aligned} \mid x)$, and their approximations used in various methods, for the outer left eye corner, the nose bridge and chin. Red crosses on the response maps denote the true landmark locations. The GMM approximation has five cluster centers. The KDE approximations are shown for $\sigma^2 \in \{20, 5, 1\}$.

A Gaussian Mixture Model Estimate: Although the response map approximation in CQF may overcome some of the drawbacks of ASM, its process of estimation can be poor in some cases. In particular, when the response map is strongly multimodal, such an approximation smooths over the various modes (see the example response map for the eye corner in Figure 2). To account for this, in [8] a Gaussian mixture model (GMM) was used to approximate the response maps:

$$p(l_i = \text{aligned} \mid x) \approx \sum_{k=1}^{K_i} \pi_{ik} \, \mathcal{N}(x; \mu_{ik}, \Sigma_{ik}), \tag{14}$$

where $K_i$ denotes the number of modes and $\{\pi_{ik}\}_{k=1}^{K_i}$ are the mixing coefficients for the GMM of the $i$-th PDM landmark. Treating the mode membership for each landmark, $\{z_i\}_{i=1}^n$, as hidden variables, the maximum likelihood solution can be found using the expectation-maximization (EM) algorithm, which maximizes:

$$p(\{l_i\}_{i=1}^n \mid p) = \prod_{i=1}^n \sum_{k=1}^{K_i} p_i(z_i = k, l_i \mid x_i). \tag{15}$$

The E-step of the EM algorithm involves computing the posterior distribution over the latent variables $\{z_i\}_{i=1}^n$:

$$p(z_i = k \mid l_i, x_i) = \frac{p(z_i = k)\, p(l_i \mid z_i = k, x_i)}{\sum_{j=1}^{K_i} p(z_i = j)\, p(l_i \mid z_i = j, x_i)}, \tag{16}$$

where $p(z_i = k) = \pi_{ik}$ and:

$$p(l_i = \text{aligned} \mid z_i = k, x_i) = \mathcal{N}(x_i; \mu_{ik}, \Sigma_{ik}). \tag{17}$$
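The E-step of Equations (16) and (17) amounts to a normalized product of mixing weights and Gaussian densities. The sketch below computes these posteriors for a single landmark; the function names and the two-mode toy mixture are illustrative assumptions, and the 2x2 covariance is inverted in closed form to keep the example free of external libraries.

```python
import math

def gauss2d(x, mu, cov):
    """Density of a 2D Gaussian N(x; mu, cov) with cov = [[a, b], [b, d]]."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Mahalanobis distance using the closed-form inverse of a 2x2 matrix.
    maha = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))

def responsibilities(x, pis, mus, covs):
    """E-step of Equation (16): posterior over the modes for one landmark."""
    joint = [p * gauss2d(x, m, c) for p, m, c in zip(pis, mus, covs)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical mixture with two equally weighted, unit-covariance modes.
pis = [0.5, 0.5]
mus = [(0.0, 0.0), (4.0, 0.0)]
covs = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
w = responsibilities((0.0, 0.0), pis, mus, covs)
print(w)
```

A landmark sitting on the first mode is assigned almost entirely to it, while a landmark halfway between the two modes would receive equal posteriors.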
In the M-step, the expectation of the negative log of the complete data is minimized:

$$Q(p) = E_{q(z)} \left[ -\log \left\{ \prod_{i=1}^n p(l_i = \text{aligned}, z_i \mid x_i) \right\} \right], \tag{18}$$

where $q(z) = \prod_{i=1}^n p_i(z_i \mid l_i = \text{aligned}, x_i)$. Linearizing the shape model as in Equation (6), this Q-function takes the form:

$$Q(\Delta p) \propto \sum_{i=1}^n \sum_{k=1}^{K_i} w_{ik} \, \| J_i \Delta p - y_{ik} \|^2_{\Sigma_{ik}^{-1}} + \text{const}, \tag{19}$$

where $w_{ik} = p_i(z_i = k \mid l_i = \text{aligned}, x_i)$ and $y_{ik} = \mu_{ik} - x_i^c$, the solution of which is given by:

$$\Delta p = \left( \sum_{i=1}^n \sum_{k=1}^{K_i} w_{ik} J_i^T \Sigma_{ik}^{-1} J_i \right)^{-1} \sum_{i=1}^n \sum_{k=1}^{K_i} w_{ik} J_i^T \Sigma_{ik}^{-1} y_{ik}. \tag{20}$$

Although the GMM is a better approximation of the response map compared to the Gaussian approximation in CQF, it exhibits two major drawbacks. Firstly, the process of estimating the GMM parameters from the response maps is a nonlinear optimization in itself. It is only locally convergent and requires the number of modes to be chosen a priori. As GMM fitting is required for each PDM landmark, it constitutes a large computational overhead. Although some approximations can be made, they are generally suboptimal. For example, in [8], the modes are chosen as the $K$ largest responses in the map. The covariances are parametrized isotropically, with their variance heuristically set as the scaled distance to the closest mode in the previous iteration of the CLM fitting algorithm. Such an approximation allows an efficient estimate of the GMM parameters without the need for a costly EM procedure, at the cost of a poorer approximation of the true response map.

The second drawback of the GMM response map approximation is that the approximated objective in Equation (15) is multimodal. As such, CLM fitting with the GMM simplification is prone to terminating in local optima. Although good results were reported in [8], in that work the PDM was parameterized using a mixture model as opposed to the more typical Gaussian parameterization, which places a stronger prior on the way the shape can vary.

3. Subspace Constrained Mean-Shifts

Rather than approximating the response maps for each PDM landmark using parametric models, we consider here the use of a nonparametric representation.
In particular, we propose the use of a homoscedastic kernel density estimate (KDE) with an isotropic Gaussian kernel:

$$p(l_i = \text{aligned} \mid x) \approx \sum_{\mu_i \in \Psi_{x_i^c}} \alpha^i_{\mu_i} \, \mathcal{N}(x; \mu_i, \sigma^2 I), \tag{21}$$

where $\alpha^i_{\mu_i}$ is the normalized true detector response defined in Equation (11). With this representation the kernel centers are fixed as defined through $\Psi_{x_i^c}$ (i.e. the grid nodes of the search window). The mixing weights, $\alpha^i_{\mu_i}$, can be obtained directly from the true response map. Since the response is an estimate of the probability that a particular location in the image is the aligned landmark location, such a choice for the mixing coefficients is reasonable.

Compared to parametric representations, KDE has the advantage that no nonlinear optimization is required to learn the parameters of its representation. The only remaining free parameter is the variance of the Gaussian kernel, $\sigma^2$, which regulates the smoothness of the approximation. Since one of the main problems with a GMM based representation is the computational complexity and suboptimal nature of fitting a mixture model to the response maps, if $\sigma^2$ is set a priori, then optimizing over the KDE can be expected to be more stable and efficient.

Maximizing the objective in Equation (4) with a KDE representation is nontrivial as the objective is nonlinear and typically multimodal. However, in the case where no shape prior is placed on the way the PDM's landmarks can vary, the problem reverts to independent maximizations of the KDE for each landmark location separately. This is because the landmark detections are assumed to be independent, conditioned on the PDM's parameterization. A common approach for maximization over a KDE is to use the well known mean-shift algorithm [1]. It consists of fixed point iterations of the form:

$$x_i^{(\tau+1)} \leftarrow \sum_{\mu_i \in \Psi_{x_i^c}} \frac{\alpha^i_{\mu_i} \, \mathcal{N}(x_i^{(\tau)}; \mu_i, \sigma^2 I)}{\sum_{y \in \Psi_{x_i^c}} \alpha^i_y \, \mathcal{N}(x_i^{(\tau)}; y, \sigma^2 I)} \, \mu_i, \tag{22}$$

where $\tau$ denotes the time-step in the iterative process. This fixed point iteration scheme finds a mode of the KDE, where an improvement is guaranteed at each step by virtue of its interpretation as a lower bound maximization [6].
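The fixed point iteration of Equation (22) is short enough to write out directly. The following is a minimal sketch for a single landmark without any shape constraint; the function name, the stopping tolerance and the toy 5x5 response map with two isolated peaks are assumptions made for illustration.

```python
import math

def mean_shift(x, centers, alphas, sigma2, iters=100, tol=1e-9):
    """Fixed-point iterations of Equation (22) for one landmark.

    centers: kernel centres (grid nodes of the search window).
    alphas:  normalised responses at those nodes.
    """
    for _ in range(iters):
        # Unnormalised kernel weights; the Gaussian's constant factor cancels.
        w = [a * math.exp(-((x[0] - m[0]) ** 2 + (x[1] - m[1]) ** 2) / (2.0 * sigma2))
             for a, m in zip(alphas, centers)]
        total = sum(w)
        new = (sum(wi * m[0] for wi, m in zip(w, centers)) / total,
               sum(wi * m[1] for wi, m in zip(w, centers)) / total)
        if math.hypot(new[0] - x[0], new[1] - x[1]) < tol:
            return new
        x = new
    return x

# Hypothetical 5x5 search window whose response has two isolated peaks.
centers = [(float(gx), float(gy)) for gy in range(-2, 3) for gx in range(-2, 3)]
alphas = [0.6 if c == (-2.0, 0.0) else 0.4 if c == (2.0, 0.0) else 0.0
          for c in centers]
mode = mean_shift((1.0, 0.5), centers, alphas, sigma2=0.5)
print(mode)
```

With this small kernel variance the iteration settles on whichever peak is closer to the starting point, which illustrates that mean-shift on its own finds a mode of the KDE rather than its global maximum.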
Compared to other optimization strategies, mean-shift is an attractive choice as it does not use a step size parameter or a line search. Equation (22) is simply applied iteratively until some convergence criterion is met.

To incorporate the shape model constraint into the optimization procedure, one might consider a two step strategy: (i) compute the mean-shift update for each landmark, and (ii) constrain the mean-shifted landmarks to adhere to the PDM's parameterization using a least-squares fit:

$$Q(p) = \sum_{i=1}^n \| x_i - x_i^{(\tau+1)} \|^2. \tag{23}$$

This is reminiscent of the ASM optimization strategy, where the location of the response map's peak is replaced with the mean-shifted estimate. Although such a strategy is attractive in its simplicity, it is unclear how it relates to the global objective in Equation (4).

Given the form of the KDE representation in Equation (21), one can treat it simply as a GMM. As such, the discussions in §2.2 on GMMs are directly applicable here, replacing the number of candidates $K_i$ with the number of grid nodes in the search window $\Psi_{x_i^c}$, the mixture weights $\pi_{ik}$ with $\alpha^i_{\mu_i}$, and the covariances $\Sigma_{ik}$ with the scaled identity $\sigma^2 I$. When using the linearized shape model in Equation (6) and maximizing the global objective in Equation (4) using the EM algorithm, the solution for the so called Q-function of the M-step takes the form:

$$\Delta p = J^{\dagger} \left[ x_1^{(\tau+1)} - x_1^c; \; \ldots; \; x_n^{(\tau+1)} - x_n^c \right], \tag{24}$$

where $J^{\dagger}$ denotes the pseudo-inverse of $J$, and $x_i^{(\tau+1)}$ is the mean-shifted update for the $i$-th landmark given in Equation (22). This is simply the Gauss-Newton update for the least squares PDM constraint in Equation (23). As such, under a linearized shape model, the two step strategy for maximizing the objective in Equation (4) with a KDE representation shares the properties of a general EM optimization, namely: provably improving and convergent. The complete fitting procedure, which we will refer to as subspace constrained mean-shifts (SCMS), is outlined in Algorithm 1. In the following, two further innovations are proposed, which address difficulties regarding local optima and the computational expense of kernel evaluations.

Figure 3. Illustration of the use of a precomputed grid for efficient mean-shift. Kernel evaluations are precomputed between $c$ and all other nodes in the grid. To approximate the true kernel evaluation, $x_i$ is assumed to coincide with $c$ and the likelihood of any response map grid location can be attained by a table lookup.

Algorithm 1 Subspace Constrained Mean-Shifts
Require: $I$ and $p$.
 1: while not converged($p$) do
 2:   Compute responses {Eqn. (2)}
 3:   Linearize shape model {Eqn. (6)}
 4:   Precompute pseudo-inverse of Jacobian ($J^{\dagger}$)
 5:   Initialize parameter updates: $\Delta p \leftarrow 0$
 6:   while not converged($\Delta p$) do
 7:     Compute mean-shifted landmarks {Eqn. (22)}
 8:     Apply subspace constraint {Eqn. (24)}
 9:   end while
10:   Update parameters: $p \leftarrow p + \Delta p$
11: end while
12: return $p$
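The inner loop of Algorithm 1 (steps 6 to 9) can be sketched as follows for a shape model that has already been linearized. This is a sketch under stated assumptions rather than the authors' implementation: the helper names are hypothetical, a fixed iteration count stands in for the convergence test, the pseudo-inverse is applied through the normal equations (which assumes $J$ has full column rank), and the toy model only allows a common translation of two landmarks.

```python
import math

def mean_shift_step(x, centers, alphas, sigma2):
    """One update of Equation (22) for a single landmark at position x."""
    w = [a * math.exp(-((x[0] - m[0]) ** 2 + (x[1] - m[1]) ** 2) / (2.0 * sigma2))
         for a, m in zip(alphas, centers)]
    total = sum(w)
    return [sum(wi * m[0] for wi, m in zip(w, centers)) / total,
            sum(wi * m[1] for wi, m in zip(w, centers)) / total]

def solve(A, b):
    """Solve a small square system A z = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def scms_update(xc, J, centers, alphas, sigma2, iters=20):
    """Inner loop of Algorithm 1 under the linearised model x = xc + J dp.

    xc:      flattened current shape [x1, y1, ..., xn, yn].
    J:       2n x m Jacobian as a list of rows (assumed full column rank).
    centers[i], alphas[i]: KDE nodes and normalised weights of landmark i.
    """
    m = len(J[0])
    JtJ = [[sum(row[a] * row[b] for row in J) for b in range(m)] for a in range(m)]
    dp = [0.0] * m
    for _ in range(iters):
        # Landmark positions implied by the current parameter update.
        cur = [x + sum(jk * d for jk, d in zip(row, dp)) for x, row in zip(xc, J)]
        # Step 1: independent mean-shift for every landmark (Equation 22).
        target = []
        for i in range(len(xc) // 2):
            target += mean_shift_step(cur[2 * i:2 * i + 2], centers[i], alphas[i], sigma2)
        # Step 2: subspace constraint dp = pinv(J) (target - xc) (Equation 24),
        # computed here through the normal equations.
        Jtr = [sum(row[a] * (t - x) for row, t, x in zip(J, target, xc)) for a in range(m)]
        dp = solve(JtJ, Jtr)
    return dp

# Toy problem: two landmarks whose shape model only allows a common translation.
xc = [0.0, 0.0, 4.0, 0.0]
J = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
centers = [[(1.0, 1.0), (-3.0, -3.0)], [(5.0, -1.0), (8.0, 3.0)]]
alphas = [[0.9, 0.1], [0.9, 0.1]]
dp = scms_update(xc, J, centers, alphas, sigma2=1.0)
print([round(v, 3) for v in dp])
```

In this toy problem the two landmarks are individually pulled in conflicting vertical directions, and the subspace constraint resolves the conflict by keeping only the motion that the shape model can express, here a horizontal shift of about one pixel.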
Kernel Width Relaxation: The response map approximations discussed in §2.2 can be thought of as a form of smoothing. This explains the relative performance of the various methods. The Gaussian approximations smooth the most but approximate the true response map the poorest, whereas the smoothing effected by the GMM is not as aggressive but exhibits a degree of sensitivity towards local optima. One might consider using the Gaussian and GMM approximations in tandem, where the Gaussian approximation is used to get within the convergence basin of the GMM approximation. However, such an approach is inelegant and affords no guarantee that the mode of the Gaussian approximation lies within the convergence basin of the GMM's.

With the KDE approximation in SCMS a more elegant approach can be devised, whereby the complexity of the response map estimate is directly controlled by the variance of the Gaussian kernel (see Figure 2). The guiding principle here is similar to that of optimizing on a Gaussian pyramid. It can be shown that when using Gaussian kernels, there exists a $\sigma^2 < \infty$ such that the KDE is unimodal, regardless of the distribution of samples [13]. As $\sigma^2$ is reduced, modes divide and the smoothness of the objective's terrain decreases. However, it is likely that the optimum of the objective at a larger $\sigma^2$ is closest to the desired mode of the objective with a smaller $\sigma^2$, promoting its convergence to the correct mode. As such, the policy under which $\sigma^2$ is reduced acts to guide optimization towards the global optimum of the true objective.

Drawing parallels with existing methods, as $\sigma^2 \to \infty$ the SCMS update approaches the solution of a homoscedastic Gaussian approximated objective function. As $\sigma^2$ is reduced, the KDE approximation resembles a GMM approximation, where the approximation for smaller $\sigma^2$ settings is similar to a GMM approximation with more modes.

Precomputed Grid: In the KDE representation of the response maps, the kernel centers are placed at the grid nodes defined by the search window. From the perspective of GMM fitting, these kernels represent candidates for the true landmark locations.
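Returning briefly to the kernel width relaxation described above, its effect can be seen on a one-dimensional toy problem. The sketch below is illustrative only: it omits the shape constraint, works along a single axis, and uses hypothetical function names and a two-node response; the schedule of variances is the one reported in §4.

```python
import math

def mean_shift_1d(x, centers, alphas, sigma2, iters=200, tol=1e-10):
    """Mean-shift iterations (Equation 22) restricted to one dimension."""
    for _ in range(iters):
        w = [a * math.exp(-(x - m) ** 2 / (2.0 * sigma2))
             for a, m in zip(alphas, centers)]
        new = sum(wi * m for wi, m in zip(w, centers)) / sum(w)
        if abs(new - x) < tol:
            return new
        x = new
    return x

def annealed_mean_shift(x, centers, alphas, schedule=(20.0, 10.0, 5.0, 1.0)):
    """Run mean-shift over decreasing variances, warm-starting each stage."""
    for sigma2 in schedule:
        x = mean_shift_1d(x, centers, alphas, sigma2)
    return x

# Hypothetical response: a weak spurious peak at -3 and a strong peak at +3.
centers, alphas = [-3.0, 3.0], [0.2, 0.8]
direct = mean_shift_1d(-2.5, centers, alphas, 1.0)
annealed = annealed_mean_shift(-2.5, centers, alphas)
print(round(direct, 3), round(annealed, 3))
```

Started next to the weak peak, mean-shift with the smallest variance alone stays there, whereas the relaxed schedule first moves towards the single mode of the heavily smoothed estimate and then follows it to the strong peak as the variance is reduced.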
Although no optimization is required for determining the number of modes, their centers and mixing coefficients, the number of candidates used here is much larger than what would typically be used in a general GMM estimate (i.e. GMM based representations typically use $K < 10$, whereas the search window typically has more than 100 nodes). As such, the computation of the posterior in Equation (16) will be more costly. However, if the variance $\sigma^2$ is known a priori, then some approximations can be made to significantly reduce computational complexity.

The main overhead when computing the mean-shift update is in evaluating the Gaussian kernel between the current landmark estimate and every grid node in the response map. Since the grid locations are fixed and $\sigma^2$ is assumed to be known, one might choose to precompute the kernel for various settings of $x_i$. In particular, a simple choice would be to precompute these values along a grid sampled at or above the resolution of the response map grid $\Psi_{x_i^c}$. During fitting one simply finds the location in this grid closest to the current estimate of a PDM landmark and estimates the kernel evaluations by assuming the landmark is actually placed at that node (see Figure 3). This only involves a table lookup and can be performed efficiently. The finer the grid, the better the approximation will be, at the cost of greater storage requirements but without a significant increase in computational complexity.

Although such an approximation ruins the strictly improving properties of EM, we empirically show in §4 that accurate fitting can still be achieved with this approximation. In our implementation, we found that such an approximation reduced the average fitting time by one half.

Figure 4. Fitting curves for the ASM, CQF, GMM and KDE optimization strategies on the MultiPie and XM2VTS databases.
[Plots: (a) MultiPie fitting curve, (b) XM2VTS fitting curve. Horizontal axis: shape RMS error (0 to 10); vertical axis: proportion of images (0 to 1). Legend with average fitting times: (a) ASM (88 ms), CQF (98 ms), GMM (2410 ms), KDE (121 ms); (b) ASM (84 ms), CQF (93 ms), GMM (2313 ms), KDE (117 ms).]

4. Experiments

Database Specific Experiments: We compared the various CLM optimization strategies discussed above on the problem of generic frontal face fitting on two databases: (i) the CMU Pose, Illumination and Expression Database (MultiPie) [7], and (ii) the XM2VTS database [12]. The MultiPie database is annotated with a 68-point markup used as ground truth landmarks. We used 762 frontal face images of 339 subjects.
The XM2VTS database consists of 2360 frontal face images of 295 subjects for which ground truth annotations are publicly available but different from the 68-point markup we have for MultiPie. XM2VTS contains neutral expressions only, whereas MultiPie contains significant expression variations. A 4-fold cross validation was performed on both MultiPie and XM2VTS, separately, where the images were partitioned into four sets of nonoverlapping subject identities. In each trial, three partitions were used for training and the remainder for testing.

On these databases we compared four types of optimization strategies: (i) ASM [2], (ii) CQF [16], (iii) GMM [8], and (iv) the KDE method proposed in §3. For GMM, we empirically set $K = 5$ and used the EM algorithm to estimate the parameters of the mixture model. For KDE, we used a variance relaxation policy of $\sigma^2 = \{20, 10, 5, 1\}$ and a grid spacing of 0.1 pixels in its efficient approximation. In all cases the linear logistic regressor described in §2.1 was used. The local experts were $(11 \times 11)$ pixels in size and the exhaustive local search was performed over a $(15 \times 15)$-pixel window. As such, the only difference between the various methods compared here is their optimization strategy. In all cases, the scale and location of the model were initialized by an off-the-shelf face detector, the rotation and nonrigid parameters in Equation (1) were set to zero (i.e. the mean shape), and the model was fit until the optimization converged.

Results of these experiments can be found in Figure 4, where the graphs (fitting curves) show the proportion of images at which various levels of maximum perturbation were exhibited, measured as the root-mean-squared (RMS) error between the ground truth landmarks and the resulting fit. The average fitting times for the various methods on a 2.5GHz Intel Core 2 Duo processor are shown in the legend.

The results show a consistent trend in the relative performance of the four methods. Firstly, CQF has the capacity to significantly outperform ASM. As discussed in §2.2, this is due to CQF's ability to account for directional uncertainty in the response maps as well as being more robust towards outlying responses. However, CQF has a tendency to over-smooth the response maps, leading to limited convergence accuracy. GMM shows an improvement in accuracy over CQF, as shown by the larger number of samples that converged to smaller shape RMS errors. However, it has the tendency to terminate in local optima due to its multimodal objective. This can be seen by its poorer performance than CQF for reconstruction errors above 4.2 pixels RMS in MultiPie and 5 pixels RMS in XM2VTS. In contrast, KDE is capable of attaining even better accuracies than GMM but still retains a degree of robustness towards local optima, where its performance over grossly misplaced initializations is comparable to CQF. Finally, despite the significant improvement in performance, KDE exhibits only a modest increase in computational complexity compared to ASM and CQF. This is in contrast to GMM, which requires much longer fitting times, mainly due to the complexity of fitting a mixture model to the response maps.

Figure 5. Top row: Tracking results on the FGNet Talking Face database for frames {0, 1230, 4200}. Clockwise from top left are fitting results for ASM, CQF, KDE and GMM. Bottom: Plot of shape RMS error from ground truth annotations throughout the sequence.
[Plot: shape RMS error (0 to 20) against frame number (0 to 5000) for ASM, GMM, CQF and KDE.]

Out-of-Database Experiments: Testing the performance of fitting algorithms on images outside of a particular database is more meaningful as it gives a better indication of how well the method generalizes. However, this is rarely conducted as it requires the tedious process of annotating new images with the PDM configuration of the training set. Here, we utilize the freely available FGNet talking face sequence$^3$. Quantitative analysis on this sequence is possible since ground truth annotations are available in the same format as that in XM2VTS.
We initialize the model using a face detector in the first frame and fit consecutive frames using the PDM's configuration in the previous frame as an initial estimate. The same model used in the database-specific experiments was used here, except that it was trained on all images in XM2VTS. In Figure 5, the shape RMS error for each frame is plotted for the four optimization strategies being compared. The relative performance of the various strategies is similar to that in the database-specific experiments, with KDE yielding the best performance. ASM and GMM are particularly unstable on this sequence, with GMM losing track at around frame 4200 and failing to recover until the end of the sequence.

Finally, we performed a qualitative analysis of KDE's performance on the Faces in the Wild database [9]. It contains images taken under varying lighting, resolution, image noise and partial occlusion. As before, the model was initialized using a face detector and fit using the XM2VTS trained model. Some fitting results are shown in Figure 6. Results suggest that KDE exhibits a degree of robustness towards variations typically encountered in real images.

$^3$ http://www-prima.inrialpes.fr/fgnet/data/01-TalkingFace/talking_face.html

Figure 6. Example fitting results on the Faces in the Wild database using a model trained using the XM2VTS database. Top row: male subjects. Middle row: female subjects. Bottom row: partially occluded faces.

5. Conclusion

The optimization strategy for deformable model fitting was investigated in this work. Various existing methods were posed within a consistent probabilistic framework where they were shown to make different parametric approximations to the true likelihood maps of landmark locations. A new approximation was then proposed that uses a nonparametric representation. Two further innovations were proposed in order to reduce computational complexity and avoid local optima. The proposed method was shown to outperform three other optimization strategies on the task of generic face fitting. Future work will involve investigations into the effects of different local detector types and shape priors on the optimization strategies.

References

[1] Y. Cheng. Mean Shift, Mode Seeking, and Clustering. PAMI, 17(8):790-799, 1995.
[2] T. F. Cootes and C. J. Taylor. Active Shape Models - Smart Snakes. In BMVC, pages 266-275, 1992.
[3] D. Cristinacce and T. Cootes. Boosted Active Shape Models. In BMVC, volume 2, pages 880-889, 2007.
[4] D. Cristinacce and T. F. Cootes. A Comparison of Shape Constrained Facial Feature Detectors. In FG, pages 375-380, 2004.
[5] D. Cristinacce and T. F. Cootes. Feature Detection and Tracking with Constrained Local Models. In EMCV, pages 929-938, 2004.
[6] M. Fashing and C. Tomasi. Mean Shift as a Bound Optimization. PAMI, 27(3), 2005.
[7] R. Gross, I. Matthews, S. Baker, and T. Kanade. The CMU Multiple Pose, Illumination and Expression (MultiPIE) Database. Technical report, Robotics Institute, Carnegie Mellon University, 2007.
[8] L. Gu and T. Kanade. A Generative Shape Regularization Model for Robust Face Alignment. In ECCV'08, 2008.
[9] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[10] X. Liu. Generic Face Alignment using Boosted Appearance Model. In CVPR, pages 1-8, 2007.
[11] I. Matthews and S. Baker. Active Appearance Models Revisited. IJCV, 60:135-164, 2004.
[12] K. Messer, J. Matas, J. Kittler, J. Lüttin, and G. Maitre. XM2VTSDB: The Extended M2VTS Database. In AVBPA, pages 72-77, 1999.
[13] M. Á. Carreira-Perpiñán and C. K. I. Williams. On the Number of Modes of a Gaussian Mixture. Lecture Notes in Computer Science, 2695:625-640, 2003.
[14] M. H. Nguyen and F. De la Torre Frade. Local Minima Free Parameterized Appearance Models. In CVPR, 2008.
[15] J. Saragih and R. Goecke. A Nonlinear Discriminative Approach to AAM Fitting. In ICCV, 2007.
[16] Y. Wang, S. Lucey, and J. Cohn. Enforcing Convexity for Improved Alignment with Constrained Local Models. In CVPR, 2008.