Deformable Part Descriptors for Fine-grained Recognition and Attribute Prediction

Size: px
Start display at page:

Download "Deformable Part Descriptors for Fine-grained Recognition and Attribute Prediction"


1 Deformable Part Descrptors for Fne-graned Recognton and Attrbute Predcton Nng Zhang Ryan Farrell,2 Forrest Iandola Trevor Darrell ICSI / UC Berkeley 2 Brgham Young Unversty 2 Abstract Recognzng objects n fne-graned domans can be extremely challengng due to the subtle dfferences between subcategores. Dscrmnatve markngs are often hghly localzed, leadng tradtonal object recognton approaches to struggle wth the large pose varaton often present n these domans. Pose-normalzaton seeks to algn tranng exemplars, ether pecewse by part or globally for the whole object, effectvely factorng out dfferences n pose and n vewng angle. Pror approaches reled on computatonally-expensve fter ensembles for part localzaton and requred extensve supervson. Ths paper proposes two pose-normalzed descrptors based on computatonally-effcent deformable part models. The frst leverages the semantcs nherent n strongly-supervsed DPM parts. The second explots weak semantc annotatons to learn cross-component correspondences, computng pose-normalzed descrptors from the latent parts of a weakly-supervsed DPM. These representatons enable poolng across pose and vewpont, n turn factatng tasks such as fne-graned recognton and attrbute predcton. Experments conducted on the Caltech-UCSD Brds 2 dataset and Berkeley Human Attrbute dataset demonstrate sgnfcant mprovements over state-of-art algorthms.. Introducton Despte the many mportant applcatons and domans under nvestgaton, fne-graned recognton remans very challengng. As descrbed n [23], what often dfferentates basc-level categores s the presence or absence of parts (e.g. an elephant has 4 legs and a trunk, whereas subordnate categores are more often dscrmnated by subtle varatons n the shape, sze and/or appearance propertes of these parts (e.g. elephant speces can be dstngushed by localzed cues such as ear shape and sze. Localzng and descrbng the object s parts therefore becomes central to uncoverng ts fne-graned dentty. Several approaches have been proposed for localzng and descrbng object parts n fne-graned domans. Posenormalzaton was proposed for fne-graned recognton by Farrell et al. [23] and extended n Zhang et al. [47]. Ths paradgm seeks to dscount varatons n pose, artculaton and camera vewng angle by localzng semantc object parts and extractng appearance features wth respect to those localzed parts. In these approaches part localzaton was accomplshed usng Poselets [, ], whch requre computatonally-expensve fter ensembles for part localzaton as well as extensve supervson. Parkh et al. [36, 37] provde an alternate way of detectng basc-level category objects such as dogs and cats by usng a Deformable Part Model [24] traned specfcally on the head. Once a head has been detected n a test mage, ths regon s used to ntalze a grab-cut [38] segmentaton and obtan a mask/shouette of the entre object. Classfcaton s then performed usng a combnaton of the head shape and features extracted from wthn the head/body mask. Whe ths method s relatvely effectve for front-facng cats and dogs, subjects n other domans (such as brds and people often exhbt greater varaton n pose and appearance and can be far more dffcult to segment from ther surroundngs (see examples n Fgure. In ths work, we ntroduce deformable part descrptors (DPD, a robust and effcent framework for posenormalzed descrpton based on DPM parts and demonstrate ts effectveness for both fne-graned recognton and attrbute predcton (see Fgure 2. We propose two deformable part descrptors: the DPD-strong leverages the semantcs nherent n strongly-supervsed DPM parts; the DPD-weak explots semantc annotatons to learn crosscomponent correspondences, computng pose-normalzed descrptors from weakly-supervsed DPM parts. We present state-of-the-art results n evaluatng our approach on standard fne-graned and doman-specfc attrbute datasets ncludng the Caltech-UCSD brds dataset [3] and Human Attrbutes dataset []. An end-to-end open-source mplementaton of our method has been released wth ths paper and s avaable at

2 Part-based One-vs-One Features (POOF proposed by Berg and Belhumeur [6] whch leverages robust keypont predcton learnng a descrptor coordnate frame for each par of keyponts. Approaches such as [4, 36, 37] use regon-level cues to estmate object segmentatons whch factate fne-graned classfcaton. Some technques use humans-n-the-loop, askng human annotators to clck on object parts, answer questons regardng object attrbutes [3], or mark the regon that best dfferentates two confusng categores [7]. Template-based methods have also been nvestgated by Yao et al. [45] and by Yang et al. [43]. Such approaches use a set of fxed-poston templates, thereby mtgatng much of the computatonal cost of sldng wndow approaches, yet lack the spatal flexbty to deal wth substantal pose varaton Attrbute Predcton Fgure. Robustness of Deformable Part Descrptors (DPDs. Whe they work well for generally homogeneous objects such as dogs and cats, head-seeded segmentaton-based descrptors often fa for other domans such as brds and humans. Examples of ths are shown n the mddle column. The ground truth segmentaton s overlad n red; the grab-cut based segmentaton [38], seeded wth the head as foreground and everythng outsde the boundng box as background, s overlad n whte. The rght column shows our part localzaton results usng the strongly-supervsed DPM. Best vewed n color. 2. Background 2.. Fne-graned categorzaton Fne-graned classfcaton has recently emerged as a topc of great nterest. A growng body of lterature has proposed varous technques and has addressed recognton on a large number of fne-graned domans. These domans nclude: dog breed classfcaton [26, 29, 37], subordnate categores of flowers [2,, 33, 34] and plants [4, 39], recognton of nvertebrates [32], fne-graned classfcaton on subsets of ImageNet [6] such as fung [5], and speces-level categorzaton of brds [23, 46]. We antcpate more work n the near future on man-made categores such as cars [4] and arplanes [3]. Followng recent work by Belhumeur et al. [5] on localzng fducal ponts n faces, Lu et al. [29] present an alternate approach to bud an exemplar-based geometrc and appearance models for detectng dog faces and localzng the facal keyponts. Another work n ths ven s that of Sfar et al. [39] who classfy leaves and flowers by descrbng appearance wth multple coordnate or vantage frames. A very recent and closely related contrbuton s the Attrbute-based representatons [22, 27, 28] present another promsng drecton, offerng the possbty of recognzng a novel category wth only a category descrpton (no magery. Relevant subsequent work on attrbutes ncludes that of Parkh and Grauman [35] explorng relatve attrbute strength and that of Berg et al. [7] and Duan et al. [9] propose automatc dscover of attrbutes to ad n fne-graned classfcaton. Perhaps the most closely related work on attrbute predcton s Bourdev et al. [], whch uses features extracted from Poselet [, 9] actvatons to predct nne bnary attrbutes on mages of humans. Another promsng alternatve to Poselet-based approaches for effectve transfer of pose annotaton from tranng mages to test mages s the Exemplar SVM proposed by Malsewcz et al. [3] Deformable Parts Model Inspred by the Pctoral Structures work of Fschler and Elschlager [25], the Deformable Parts Model (DPM of Felzenszwalb et al. [24] has become one of the most effectve and wdely-used object detecton approaches to date. The object s represented by a coarse root HOG fter and several hgher resoluton part fters. The DPM model uses a mxture of components to capture varaton n vewpont and/or pose (e.g. for a car, the three components mght correspond roughly to the front, sde and three-quarter vews. Strongly-supervsed DPM Whe the lmted supervson requred for the DPM s advantageous, the latent parts provde no semantc nformaton about the object whch makes pose-normalzaton challengng. Related work has used strong supervson to tran DPMs for human pose estmaton [4, 44], part localzaton [2] and object detecton [3]. Whe these methods focus on detecton, our objectve s dfferent, poolng semantc part features across components to derve a pose-normalzed representaton.

3 Test Image Part Localzaton Pose-normalzaton Strongly-supervsed DPM Classfcaton Class CN... Class C2 Class C Predcted Class: Lncoln Sparrow Ψpn Weakly-supervsed DPM Attrb αn... Attrb α2 Attrb α P(α[sMale] =.2 P(α2[hasHat] = P(αN[hasShorts] =.38 Fgure 2. Overvew. Both of the proposed Deformable Part Descrptors (DPDs are depcted above. The frst descrptor (top row apples a strongly-supervsed DPM [3] for part localzaton and then performs pose-normalzaton by poolng features from these nherently semantc parts. The pose-normalzed features Ψpn n Equaton 3 can be used for ether fne-graned recognton or attrbute predcton (as ndcated by the dotted arrows. The second descrptor employs a weakly-supervsed DPM [24] for part localzaton and then uses a learned semantc correspondence model to pool features from the component-specfc localzed latent parts nto semantc regons (the pose-normalzed descrptor Ψpn shared by all components. In the part to regon poolng (the black, yellow, magenta, and cyan lnes, wder lnes ndcate hgher weght. Best vewed n color. In ths work, we adopt a varant of the stronglysupervsed DPM [3]. We use object part annotatons and am to localze the semantc parts for pose-normalzed descrptors. Instead of ntalzng the fxed-sze part fters heurstcally (energy maxmzaton, we ntalze the part fters by usng semantc part annotatons. As n [3], we ntalze the mxture components by clusterng the annotated poses nstead of usng the aspect rato. After tranng the mxture of root fters, we process the tranng mages to get the component assgnment for each tranng example. Then, for each component c, the part fters of ths component are ntalzed at the average relatve locatons wth average sze among all the examples wth component assgnment c. Unlke [3], we do not mpose constrants on part overlap durng tranng. Notaton For both types of DPM, we use a latent SVM to tran the model parameters. The matchng score of a model β for a gven mage I s n X Spart (p, I, β + b S(I, β = Fβ Φ(I, p + = Spart (p, I, β = Fβ Φ(I, p d Φd (dx, dy ( where Fβ s the root fter, Fβ s the part fter, b s the bas term, Φ(I, p s the mage feature vector of the sub-wndow located at poston p, (dx, dy and d are respectvely s the dsplacement and dsplacement weght of the -th part, and Φd (dx, dy = (dx, dy, dx2, dy 2 s used to penalze the part dsplacement[24]. The overall score s computed usng the best possble placements,.e. fβ = maxz Z(I S(I, β where Z(I s the set of all possble latent values. The model s traned by teratvely assgnng the best possble placements and learnng the parameter β by stochastc gradent descent. Then, the part locatons are predcted by Z(I = argmax(z,β S(I, β s.t. O(p (I, bbox > δ Z(I = (c(i, p (I,, pn (I (2 where c(i s the component assgnment, p (I s the predcted boundng box, p (I s the predcted locaton of part p and O(p, bbox s the overlap between the predcted boundng box and ground truth boundng box. δ s a threshold to control predctons and we set t to be.65 n our experments. Gven the part localzatons, we show below how to form DPM-based pose-normalzed descrptors and pool them across components, n a manner analogous to what the Pose-poolng Kernel (PPK [47] does va Poselets. 3. Deformable Part Descrptors (DPD We use the deformable part model (DPM s demonstrated effectveness for detectng objects as a foundaton for our pose-normalzed approach to fne-graned recognton and attrbute predcton. We do ths as a faster and more relable alternatve to Poselets whch were prevously proposed for pose-normalzaton n both fne-graned recognton [23, 47] and attrbute predcton []. As noted earler,

4 we are not the frst to consder usng DPMs for fne-graned recognton. Parkh et al. [36, 37] traned a DPM model to locate a cat s or a dog s head and then used ths detecton both to descrbe the head appearance and to seed a segmentaton whch would recover the rest of the body (See Fgure. Our goal s to use DPM to localze the parts and pool the pose-normalzed mage features nduced by the part locatons. 3.. Strongly-supervsed DPD The underlyng prncple n pose-normalzaton s that one can decompose an object s appearance as observed n one mage and compare t to the same object (or object category as observed n a dfferent mage. Ths decomposton generally equates to localzng semantc parts and then descrbng them. From the DPM models, suppose that we have a total of C components ncludng mrror pars C = {c (, c (2,..., c (C }, where for odd values of j, c (j+ s the mrror component of c (j. Each component c (j has a set of parts P (j = {p (j, p(j 2,..., p(j P }, p(j denotng the th part for component c (j. We now defne a posenormalzed representaton wth R semantc poolng regons R = {r, r 2,..., r R }. We defne the pose-normalzed representaton as Ψ pn (I = [Ψ (I, r, Ψ (I, r,..., Ψ (I, r R ]. (3 where I, r l s the pooled mage feature for semantc regon r l and I, r s the mage feature nsde the root fter or boundng box and we concatenate the mage features together to get the fnal pose-normalzed representaton Ψ pn. To derve the pose-normalzed representaton for a gven detecton Z(I n Equaton 2 (from component c (j = c(i, we must fgure out a mappng S (j : P (j R. For each part p (j, we seek to determne whch pose-normalzed poolng regon or regons {r l } the features I, p (j should be mapped nto. For strongly-supervsed DPM, we use the semantc part annotatons and t s straghtforward to pool the correspondng part descrptor across dfferent components by settng the semantc regons the same as the semantc parts from the strong DPM,.e. R = P (j for all j {,..., C}. I, r l = I, p (j l j {,..., C} ( Weakly-supervsed DPD Usng a weak DPM, the latent parts of dfferent components have no explct semantc correlaton, nor s such correspondence guaranteed to exst. We have explored two optons for dealng wth ths: one based on a per-component appearance representaton wth no semantc parts; the other leverages addtonal annotatons to estmate semantc correspondence of the latent parts across components. The per-component representaton s very straghtforward. The tranng and test sets are parttoned nto subsets accordng to whch component fres on them. A separate classfcaton model s traned for each component usng the features extracted on that component s subset of tranng mages. Ths creates reasonable classfers, however, there are two shortcomngs. Frst, the parts carry no semantc nformaton (though ths s not nherently necessary. Second, the tranng data s fragmented across the C components, so models wl be accordngly weaker. Tranng fragmentaton s partcularly problematc for fne-graned recognton where you may only have 5-3 tranng examples total for each category. Experments usng ths model are ncluded n Secton 4.3. The second representaton, however, solves both of these problems. By provdng semantc annotatons at tranng tme, we can model semantc correspondence between the latent parts of dfferent components. In effect, we learn the semantc dentty of each latent part n the model and can thus pool features to a global pose-normalzed model, ndependent of whch component fres durng detecton. We model the pose-normalzaton as a weghted bpartte graph G = (P, R, W where w (j W ndcates the, the -th part of component c (j con- degree to whch p (j trbutes to r l. Ψ (I, r l = N P = w (j ( Ψ I, p (j where N = P w(j s used for normalzaton. W s a three dmensonal matrx wth sze P R C and w (j s modeled as a functon of the annotatons A. Annotatons a k A could be keyponts (as used to tran Poselets, rectangular regons (as used to tran Strong DPM or any other type of semantc labels. In our example, we use the keypont annotatons ncluded wth the CUB2 dataset [42] and H3D dataset []. More precsely, we defne w (j as: w (j = k= A ( ρ kl overlap a k, p (j where ρ kl [, ] ndcates a specfed semantc relevance for annotaton a k to regon r l (e.g.( left ear s relevant to head, left knee s not. The overlap a k, p (j functon encodes the dstrbuton of annotaton a k wthn part p (j. Let I jk be the set of tranng mages that have semantc annotaton a k and where c (j s the component whch fres. We can formally defne the fractonal overlap ( overlap a k, p (j = {I I jk a k (I p (j }. (7 I jk It s worth notng that the tranng set need not be exhaustvely labeled wth semantc annotatons. Fewer annotatons (5 (6

5 just mean smaller I jk and thus coarser predctons of the weghts w (j Classfcaton Gven the pose-normalzed representaton Ψ pn (I n Equaton 3, we employ a lnear SVM for the fnal classfcaton. For the strongly-supervsed DPD, we use the annotated semantc parts to get Ψ pn for tranng and part localzaton from Equaton 2 for testng; for the weaklysupervsed DPD, we utze Ψ pn from Equaton 2 for both tranng and testng. Due to the sparsty of tranng examples for certan poses, there wl be some test mages for whch a correct detecton cannot be found even gven a specfed object boundng box. Ths means that we don t have predctons for the part locatons n Equaton 2. In such cases, we can use the classfer traned on the feature descrptor nsde the boundng box I, r wthout any pose-normalzaton. 4. Experments In ths secton, we wl present a comparatve performance evaluaton of our proposed method. We conduct experments on the commonly used fne-graned benchmark Caltech-UCSD brd dataset [42] as well as the Berkeley Human Attrbute dataset from []. Our system can process several mages per second, leveragng the avaable fast DPM mplementaton n [2]. An open-source verson of our end-to-end DPD mplementaton s avaable at In contrast, the fastest avaable mplementaton of Poselet-based [47] reles on a C++ remplementaton of [] and takes approxmately seconds per frame on a comparable machne. 4.. Implementaton Detas Image Features and Classfcaton After predctng the part regons va DPM, we use kernel descrptors [8] to extract feature vectors for fnal classfcaton. Specfcally, we use four types of kernel descrptors: gradent, local bnary pattern (lbp, rgb color, and normalzed rgb color. Followng the settng n [8], we compute kernel descrptors on local mage patches of sze 6 x 6 over a dense regular grd of step sze 8 and then apply a spatal pyramd on top. We use vector quantzaton of these descrptors wth a -element codebook, concatenatng per-regon descrptor hstograms nto a sngle vector per mage. These vectors are provded as nput to a lnear support vector machne (lblnear wth power scaled features. Effcent Weak Poolng In Equaton 6, whe a fully optmal formulaton would lkely learn the best convex combnaton of ρ kl and utze sophstcated dstrbutons for all For results usng deep convolutonal features, see [8] for more detas. pars (a k, p (j, we make two smplfyng assumptons for effcency and smplcty. Frst, we defne ρ kl as an ndcator such that ρ kl {,, } and l ρ kl = ; n other words, each semantc part a k s assocated wth one or more regons r l (n a few cases t s shared between two, such as the shoulder between head and torso regons. Ths represents parttonng the semantc parts A amongst the poolng regons R. Second, to evaluate the overlap(a k, p (j functon n Equaton 7, we evaluate the weakly-supervsed detector on the semantcally annotated data, notng for each detecton where the varous parts {a k } fall. Across the annotated data we accumulate dstrbutons for each part p (j (we only nclude contrbutons toward p (j when c (j fres. Whe the optmal method would lkely model these spatal dstrbutons wth respect to p (j, we relax ths and smply record the fracton of semantcally annotated mages whch have a k, for whch c (j fres and where a k has hgh overlap wth (or falls wthn p (j Caltech-UCSD Brds-2 Dataset We conduct our experments on the 2-category Caltech UCSD brd dataset, one of the most wdely used and compettve fne-graned classfcaton benchmarks. Followng [23], we use two semantc parts for the brd dataset: head and body. We utze both the earler (CUB2-2 and current (CUB2-2 versons of the brd dataset for better comparson. CUB2-2 The ntal verson of the CUB2 dataset contans 633 mages of 2 dfferent brd speces n North Amerca. There are around 3 mages per class. We use the provded tran-test splt whch uses 5 mages per class for tranng and the balance for testng. Each mage n the dataset has a boundng box annotaton but no other annotaton s provded. In order to tran the strong DPM wth semantc parts, we manually annotate the part locatons for the tranng mages,.e. head and body boxes. If one of the parts s not vsble, we mark ts vsbty state as. Gven those annotatons on tranng mages, we tran a strong DPM wth 5 components, each wth these two parts. Due to the lack of addtonal annotatons, we are unable to evaluate the DPD-weak approach on ths dataset. Table shows the mean accuracy of our method on the CUB2-2 dataset. We also compare wth other publshed methods, ncludng MKL [3], random forest [46], TrCos [4], template matchng [43], segmentaton [] and the recently-publshed Bubblebank method [7]. One baselne s to use the same mage descrptors used by our methods nsde the boundng box but wthout any pose normalzaton,.e. KDES [8], whch yelds 26.4% mean accuracy, better than some prevous methods. Our approach acheves 34.5% mean accuracy, whch outperforms the best prevous

6 Method MKL [3] Random Forest [46] KDES [8] TrCos [4] Template matchng [43] Segmentaton [] Bubblebank [7] DPD-strong-2 Mean Accuracy(% %.7% 32.7% 2.5% 3.3% 3.% 2.6% HEAD 7.5% 2.7% 25.5% 2.8% 5.4%.% 3.9% 4.% BODY 2 Table. Results on CUB2-2 dataset. We note that [7] uses addtonal human labels. The best performance s acheved by the two-part strongly-supervsed DPD model (DPD-strong-2. Method KDES [8] Template matchng[43] Oracle DPD-weak-8 DPD-strong-2 DPD-strong-2-no-head DPD-strong-2-no-body Mean Accuracy(% Table 2. Results on CUB2-2 dataset usng KDES features. The best performance s acheved by the 8-part weakly-supervsed DPD model (DPD-weak-8. Oracle s akn to DPD-strong but uses the ground truth head and body locatons. result [7], and that requres human annotatons obtaned usng a crowdsourced onlne game to fnd the most dscrmnatve regons. CUB2-2 The CUB2-2 s the latest verson of the dataset, whch ncludes more hgh-qualty mages and has 5 part annotatons, e.g. beak, crest, throat, left-eye, left-wng, nape, etc. Ths dataset contans,788 mages of the same 2 brd speces as CUB2-2. We use the default tranng/test splt, whch gves us around 3 tranng examples per class. We tran both a weak DPM and a strong DPM to factate part localzaton. For the strong DPM, we tran a mxture of fve components, each wth two parts and we partton the keypont annotatons to generate the head and body part annotatons. Specfcally, we choose the semantc part annotatons as the mnmum rectangular regon to cover the part s specfed subset of keyponts. Ths strong DPM enables part localzaton wth the followng accuraces: the head part wth 45.% precson and 43.5% recall, the body part wth 77.5% precson and 75.2% recall. Here we mark a correct predcton when the predcted part box and the ground truth part box have an overlap over. For the weak DPM, we tran a mxture of three components, each wth 8 parts and usng poolng as descrbed n Equaton 5. The poolng weghts learned for the brd model are HEAD %.% 9.6%.2%.%.%.% TORSO 2.4% 2.6%.2% 2.9% 27.2%.3% 26.3%.2% LEGS.%.2% 32.%.2%.6% 8.7% 8.2% 29.9% Fgure 3. Learned Per-component Poolng Weghts. Pctured are examples of the weghtng models that are learned for the weaklysupervsed brd (above and human (below models (DPD-weak. Shown s just one of the components for each model. In each weght matrx, rows correspond to the semantc regons that the eght latent parts (columns are pooled to. We emphasze that the weghts are learned automatcally. Best vewed n color. vsualzed n the top part of Fgure 3. Table 2 presents our classfcaton results on the CUB2-2 dataset usng KDES [8] features. To compute the baselne method, we use only the boundng box regon, yeldng 42.53% mean accuracy. We also evaluate the template matchng method of [43], usng code released by the authors. Our DPD-strong acheves 5.5% mean accuracy and DPD-weak acheves 5.98% on ths dataset, outperformng other state-of-art methods. To understand the contrbuton of ndvdual DPM parts to the classfcaton accuracy, we blnd a DPD-strong to ndvdual semantc parts n an ablaton study. These results are ncluded n Table 2. We fnd that localzng the head part of the brd s especally mportant: removng the head part, the CUB2-2 accuracy falls to 43.5%. Addtonal experments on the mportance of ndvdual parts are avaable on our DPD project webpage. In terms of other publshed work on ths dataset, PPK [47] makes use of Poselet actvatons to generate pose-normalzed representaton and acheves a

7 Attrbute Freq SPM [] Poselets [] Per-component DPD-weak-8 DPD-strong-3 s male has long har has glasses has hat has t-shrt has long sleeves has shorts has jeans has long pants Mean AP Table 3. Results on the Human Attrbutes dataset. Freq s the label frequency (fracton of specfed attrbutes that are postve and Per-component gves the results when usng the per-component poolng method dscussed n the begnnng of Secton 3.2. The values above denote mean average precson. For the full Precson-Recall curves, see Fgure 4. Another baselne usng the same mage features but consderng only the boundng box regon acheves 66.58% mean AP. mean accuracy of 28.8%. The authors of POOF [6] recently reported performance of 56.89% enabled both by accurate part (keypont localzaton and through the tranng of thousands of classfers. We have expermented wth replacng the KDES features wth deep convolutonal features and have reported 64.96% performance (please see [8] for detas Human Attrbutes Dataset The human attrbutes dataset [] contans 835 mages collected from the H3D [] and PASCAL VOC 2 [2] datasets. There are nne attrbutes (e.g. has t-shrt, has jeans, and each one has a label of {-,, } respectvely meanng absent, present and unspecfed. We use the H3D data to tran a strong DPM wth 3 components and 3 semantc parts (head, torso and legs and a weak DPM wth 3 components and 8 parts. Example poolng weghts learned for the weak DPM are vsualzed n the lower part of Fgure 3. Smar to the CUB2-2 dataset, the semantc part annotatons are generated from the keypont annotatons avaable from H3D. We then test on both the tranng and test mages of the human attrbutes dataset to localze the parts and get pose-normalzed mage descrptors for attrbute predcton. Table 3 shows predcton results on these nne attrbutes. Two baselnes are ncluded for comparson: SPM whch uses spatal pyramd match nsde the boundng box and the Poselet approach of Bourdev, et al.; both results are taken from []. We also nclude the results of usng percomponent classfers, measurng average precson under the precson-recall curve. Precson-Recall curves for each of the nne attrbutes are shown n Fgure 4. We use mean AP nstead of mean accuracy for ths dataset because the percentage of postve examples for each attrbute s qute vared. From the table, we see that both DPD methods outperform pose-normalzaton usng Poselets and moreover, our method s approxmately 3 tmes more effcent. AP: 73.4 AP: 7.3 AP: 6. s male AP: 83.7 AP: 82.9 AP: 82.4 has hat has shorts AP: 64. AP: 59.4 AP: 45.5 has long har AP: 72.5 AP: 7. AP: 67.8 has t-shrt has jeans AP: 5.2 AP: 49.8 AP: 46. AP: 78. AP: 77. AP: 54.7 AP: 78. AP: 76.5 AP: 74.2 has glasses AP: 55.6 AP: 4.7 AP: 38. has long sleeves has long pants AP: 93.5 AP: 93. AP: 9.3 Fgure 4. Attrbute Predcton Precson-Recall Curves. Results are shown for each gven attrbute. The blue curve and mean average precson (AP scores are for the DPD-strong. The green curve and mean AP scores are for the DPD-weak. The red curve and mean AP scores are for the prevous start-of-the art technque of Bourdev et al. []. The dashed lne ndcates the frequency (fracton of postves. Best vewed n color. 5. Concluson In ths paper we have proposed Deformable Part Descrptors (DPDs, a pose-normalzed representaton based on DPMs. We descrbed two such pose-normalzed methods, respectvely applcable to strongly-supervsed and weakly-supervsed varants of deformable part models. The frst method explots the semantcs nherent n the stronglysupervsed DPM s parts, poolng them drectly to form a pose-normalzed descrptor. The second uses semantc an-

8 notatons to learn cross-component correspondences between parts of the weakly-supervsed DPM. These correspondences are then used to generate a pose-normalzed descrptor. We have evaluated the proposed DPD methods, surpassng the prevous state-of-the-art performance on standard datasets for both fne-graned recognton and attrbute predcton. In concluson, we outlne some drectons for future work. Frst, we suggest that a greater number of supervsed parts (as used n Azzpour et al. [3] would ncrease the descrptve power of the DPD model. However, to do ths, we would need to address the ssue of self-occluson. Second, learnng of cross-component part correspondences could be enhanced by consderng unconstraned convex combnatons for the semantc relevance coeffcents ρ kl and more optmal overlap( functons that address the spatal dstrbutons of the semantc annotatons wthn parts, nstead of smply consderng occurrence frequency. Acknowledgments Ths research s funded by NSF grants IIS-64 and IIS-22928, by DARPA s Mnds Eye and MSEE programs, and by the Toyota Motor Corporaton. The thrd author s supported by the NDSEG Fellowshp program. References [] A. Angelova and S. Zhu. Effcent object detecton and segmentaton for fnegraned recognton. In CVPR, 23. [2] A. Angelova, S. Zhu, and Y. Ln. Image segmentaton for large-scale subcategory flower recognton. In WACV, 23. [3] H. Azzpour and I. Laptev. Object Detecton Usng Strongly-Supervsed Deformable Part Models. In ECCV, 22. [4] P. N. Belhumeur, D. Chen, S. Fener, D. Jacobs, W. J. Kress, H. Lng, I. Lopez, R. Ramamoorth, S. Sheorey, S. Whte, and L. Zhang. Searchng the Worlds Herbara: a System for Vsual Identfcaton of Plant Speces. In ECCV, 28. [5] P. N. Belhumeur, D. W. Jacobs, D. J. Kregman, and N. Kumar. Localzng Parts of Faces Usng a Consensus of Exemplars. In CVPR, 2. [6] T. Berg and P. N. Belhumeur. Poof: Part-based one-vs-one features for fnegraned categorzaton, face vercaton, and attrbute estmaton. In CVPR, 23. [7] T. Berg, A. Berg, and J. Shh. Automatc Attrbute Dscovery and Characterzaton from Nosy Web Data. In ECCV, 2. [8] L. Bo, X. Ren, and D. Fox. Kernel Descrptors for Vsual Recognton. In NIPS, 2. [9] L. Bourdev, S. Maj, T. Brox, and J. Malk. Detectng People Usng Mutually Consstent Poselet Actvatons. In ECCV, 2. [] L. Bourdev, S. Maj, and J. Malk. Descrbng People: Poselet-Based Approach to Attrbute Classfcaton. In ICCV, 2. [] L. Bourdev and J. Malk. Poselets: Body Part Detectors Traned Usng 3D Human Pose Annotatons. In ICCV, 29. [2] S. Branson, P. Perona, and S. Belonge. Strong Supervson From Weak Annotaton: Interactve Tranng of Deformable Part Models. In ICCV, 2. [3] S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welnder, P. Perona, and S. Belonge. Vsual Recognton wth Humans n the Loop. In ECCV, 2. [4] Y. Cha, E. Rahtu, V. Lemptsky, L. V. Gool, and A. Zsserman. Trcos: A trevel class-dscrmnatve co-segmentaton method for mage classfcaton. In ECCV, 22. [5] J. Deng, A. Berg, K. L, and L. Fe-Fe. What Does Classfyng More Than, Image Categores Tell Us? In ECCV, 2. [6] J. Deng, W. Dong, R. Socher, L.-J. L, K. L, and L. Fe-Fe. ImageNet: A Large-Scale Herarchcal Image Database. In CVPR, 29. [7] J. Deng, J. Krause, and L. Fe-Fe. Fne-Graned Crowdsourcng for Fne- Graned Recognton. In CVPR, 23. [8] J. Donahue, Y. Ja, O. Vnyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep Convolutonal Actvaton Feature for Generc Vsual Recognton. In arxv, 23. [9] K. Duan, D. Parkh, D. Crandall, and K. Grauman. Dscoverng Localzed Attrbutes for Fne-graned Recognton. In CVPR, 22. [2] C. Dubout and F. Fleuret. Exact Acceleraton of Lnear Object Detectors. In ECCV, 22. [2] M. Everngham, L. Van Gool, C. K. I. Wlams, J. Wnn, and A. Zsserman. The Pascal Vsual Object Classes (VOC Challenge. IJCV, 88(2, June 2. [22] A. Farhad, I. Endres, D. Hoem, and D. Forsyth. Descrbng Objects by ther Attrbutes. In CVPR, 29. [23] R. Farrell, O. Oza, N. Zhang, V. I. Moraru, T. Darrell, and L. S. Davs. Brdlets: Subordnate Categorzaton usng Volumetrc Prmtves and Pose-normalzed Appearance. In ICCV, 2. [24] P. Felzenszwalb, R. Grshck, D. McAllester, and D. Ramanan. Object Detecton wth Dscrmnatvely Traned Part Based Models. PAMI, 32(3, 2. [25] M. A. Fschler and R. A. Elschlager. The representaton and matchng of pctoral structures. IEEE Transactons on Computers, 22(:67 92, January 973. [26] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fe-Fe. Novel dataset for fnegraned mage categorzaton. In FGVC Workshop, CVPR, 2. [27] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attrbute and Sme Classfers for Face Verfcaton. In ICCV, 29. [28] C. H. Lampert, H. Ncksch, and S. Harmelng. Learnng to Detect Unseen Object Classes by Between-Class Attrbute Transfer. In CVPR, 29. [29] J. Lu, A. Kanazawa, D. Jacobs, and P. Belhumeur. Dog Breed Classfcaton Usng Part Localzaton. In ECCV, 22. [3] S. Maj, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedald. Fne-Graned Vsual Classfcaton of Arcraft. Techncal report, 23. [3] T. Malsewcz, A. Gupta, and A. A. Efros. Ensemble of Exemplar-SVMs for Object Detecton and Beyond. In ICCV, 2. [32] G. Martnez-Munoz, N. Laros, E. Mortensen, W. Zhang, A. Yamamuro, R. Paasch, N. Payet, D. Lytle, L. Shapro, S. Todorovc, A. Moldenke, and T. Detterch. Dctonary-free categorzaton of very smar objects va stacked evdence trees. In CVPR, 29. [33] M.-E. Nsback and A. Zsserman. A Vsual Vocabulary for Flower Classfcaton. In CVPR, 26. [34] M.-E. Nsback and A. Zsserman. Automated Flower Classfcaton over a Large Number of Classes. In ICVGIP, 28. [35] D. Parkh and K. Grauman. Relatve Attrbutes. In ICCV, 2. [36] O. M. Parkh, A. Vedald, C. V. Jawahar, and A. Zsserman. The Truth About Cats and Dogs. In ICCV, 2. [37] O. M. Parkh, A. Vedald, A. Zsserman, and C. V. Jawahar. Cats and Dogs. In CVPR, 22. [38] C. Rother, V. Kolmogorov, and A. Blake. GrabCut -Interactve Foreground Extracton usng Iterated Graph Cuts. In SIGGRAPH, 24. [39] A. R. Sfar, N. Boujemaa, and D. Geman. Vantage Feature Frames For Fne- Graned Categorzaton. In CVPR, 23. [4] M. Stark, J. Krause, B. Pepk, D. Meger, J. J. Lttle, B. Schele, and D. Koller. Fne-Graned Categorzaton for 3D Scene Understandng. In BMVC, 22. [4] M. Sun and S. Savarese. Artculated Part-based Model for Jont Object Detecton and Pose Estmaton. In ICCV, 2. [42] P. Welnder, S. Branson, T. Mta, C. Wah, F. S. chroff, S. Belonge, and P. Perona. Caltech-UCSD Brds 2. Techncal Report CNS-TR-2-, Calforna Insttute of Technology, 2. [43] S. Yang, L. Bo, J. Wang, and L. Shapro. Unsupervsed Template Learnng for Fne-Graned Object Recognton. In NIPS, 22. [44] Y. Yang and D. Ramanan. Artculated Pose Estmaton usng Flexble Mxtures of Parts. In CVPR, 2. [45] B. Yao, G. Bradsk, and L. Fe-Fe. A Codebook-Free and Annotaton-Free Approach for Fne-graned Image Categorzaton. In CVPR, 22. [46] B. Yao, A. Khosla, and L. Fe-Fe. Combnng Randomzaton and Dscrmnaton for Fne-graned Image Categorzaton. In CVPR, 2. [47] N. Zhang, R. Farrell, and T. Darrell. Pose poolng kernels for sub-category recognton. In CVPR, 22.

Dropout: A Simple Way to Prevent Neural Networks from Overfitting

Dropout: A Simple Way to Prevent Neural Networks from Overfitting Journal of Machne Learnng Research 15 (2014) 1929-1958 Submtted 11/13; Publshed 6/14 Dropout: A Smple Way to Prevent Neural Networks from Overfttng Ntsh Srvastava Geoffrey Hnton Alex Krzhevsky Ilya Sutskever

More information

Person Re-identification by Probabilistic Relative Distance Comparison

Person Re-identification by Probabilistic Relative Distance Comparison Person Re-dentfcaton by Probablstc Relatve Dstance Comparson We-Sh Zheng 1,2, Shaogang Gong 2, and Tao Xang 2 1 School of Informaton Scence and Technology, Sun Yat-sen Unversty, Chna 2 School of Electronc

More information

Who are you with and Where are you going?

Who are you with and Where are you going? Who are you wth and Where are you gong? Kota Yamaguch Alexander C. Berg Lus E. Ortz Tamara L. Berg Stony Brook Unversty Stony Brook Unversty, NY 11794, USA {kyamagu, aberg, leortz, tlberg}

More information

Algebraic Point Set Surfaces

Algebraic Point Set Surfaces Algebrac Pont Set Surfaces Gae l Guennebaud Markus Gross ETH Zurch Fgure : Illustraton of the central features of our algebrac MLS framework From left to rght: effcent handlng of very complex pont sets,

More information

Human Tracking by Fast Mean Shift Mode Seeking

Human Tracking by Fast Mean Shift Mode Seeking JOURAL OF MULTIMEDIA, VOL. 1, O. 1, APRIL 2006 1 Human Trackng by Fast Mean Shft Mode Seekng [10 font sze blank 1] [10 font sze blank 2] C. Belezna Advanced Computer Vson GmbH - ACV, Venna, Austra Emal:

More information

Face Alignment through Subspace Constrained Mean-Shifts

Face Alignment through Subspace Constrained Mean-Shifts Face Algnment through Subspace Constraned Mean-Shfts Jason M. Saragh, Smon Lucey, Jeffrey F. Cohn The Robotcs Insttute, Carnege Mellon Unversty Pttsburgh, PA 15213, USA {jsaragh,slucey,jeffcohn}

More information

A Study of the Cosine Distance-Based Mean Shift for Telephone Speech Diarization

A Study of the Cosine Distance-Based Mean Shift for Telephone Speech Diarization T-ASL-0460-13 1 A Study of the Cosne Dstance-Based Mean Shft for Telephone Speech Darzaton Mohammed Senoussaou, Patrck Kenny, Themos Stafylaks and Perre Dumouchel Abstract Speaker clusterng s a crucal

More information

Mean Field Theory for Sigmoid Belief Networks. Abstract

Mean Field Theory for Sigmoid Belief Networks. Abstract Journal of Artæcal Intellgence Research 4 è1996è 61 76 Submtted 11è95; publshed 3è96 Mean Feld Theory for Sgmod Belef Networks Lawrence K. Saul Tomm Jaakkola Mchael I. Jordan Center for Bologcal and Computatonal

More information

Documentation for the TIMES Model PART I

Documentation for the TIMES Model PART I Energy Technology Systems Analyss Programme Documentaton for the TIMES Model PART I Aprl 2005 Authors: Rchard Loulou Uwe Remne Amt Kanuda Antt Lehtla Gary Goldsten 1 General

More information



More information

Boosting as a Regularized Path to a Maximum Margin Classifier

Boosting as a Regularized Path to a Maximum Margin Classifier Journal of Machne Learnng Research 5 (2004) 941 973 Submtted 5/03; Revsed 10/03; Publshed 8/04 Boostng as a Regularzed Path to a Maxmum Margn Classfer Saharon Rosset Data Analytcs Research Group IBM T.J.

More information

SVO: Fast Semi-Direct Monocular Visual Odometry

SVO: Fast Semi-Direct Monocular Visual Odometry SVO: Fast Sem-Drect Monocular Vsual Odometry Chrstan Forster, Mata Pzzol, Davde Scaramuzza Abstract We propose a sem-drect monocular vsual odometry algorthm that s precse, robust, and faster than current

More information

MANY of the problems that arise in early vision can be

MANY of the problems that arise in early vision can be IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 26, NO. 2, FEBRUARY 2004 147 What Energy Functons Can Be Mnmzed va Graph Cuts? Vladmr Kolmogorov, Member, IEEE, and Ramn Zabh, Member,

More information

As-Rigid-As-Possible Image Registration for Hand-drawn Cartoon Animations

As-Rigid-As-Possible Image Registration for Hand-drawn Cartoon Animations As-Rgd-As-Possble Image Regstraton for Hand-drawn Cartoon Anmatons Danel Sýkora Trnty College Dubln John Dnglana Trnty College Dubln Steven Collns Trnty College Dubln source target our approach [Papenberg

More information



More information

(Almost) No Label No Cry

(Almost) No Label No Cry (Almost) No Label No Cry Gorgo Patrn,, Rchard Nock,, Paul Rvera,, Tbero Caetano,3,4 Australan Natonal Unversty, NICTA, Unversty of New South Wales 3, Ambata 4 Sydney, NSW, Australa {namesurname}@anueduau

More information

Sector-Specific Technical Change

Sector-Specific Technical Change Sector-Specfc Techncal Change Susanto Basu, John Fernald, Jonas Fsher, and Mles Kmball 1 November 2013 Abstract: Theory mples that the economy responds dfferently to technology shocks that affect the producton

More information

Do Firms Maximize? Evidence from Professional Football

Do Firms Maximize? Evidence from Professional Football Do Frms Maxmze? Evdence from Professonal Football Davd Romer Unversty of Calforna, Berkeley and Natonal Bureau of Economc Research Ths paper examnes a sngle, narrow decson the choce on fourth down n the

More information

Support vector domain description

Support vector domain description Pattern Recognton Letters 20 (1999) 1191±1199 Support vector doman descrpton Davd M.J. Tax *,1, Robert P.W. Dun Pattern Recognton Group, Faculty of Appled Scence, Delft Unversty

More information

Ensembling Neural Networks: Many Could Be Better Than All

Ensembling Neural Networks: Many Could Be Better Than All Artfcal Intellgence, 22, vol.37, no.-2, pp.239-263. @Elsever Ensemblng eural etworks: Many Could Be Better Than All Zh-Hua Zhou*, Janxn Wu, We Tang atonal Laboratory for ovel Software Technology, anng

More information

As-Rigid-As-Possible Shape Manipulation

As-Rigid-As-Possible Shape Manipulation As-Rgd-As-Possble Shape Manpulaton akeo Igarash 1, 3 omer Moscovch John F. Hughes 1 he Unversty of okyo Brown Unversty 3 PRESO, JS Abstract We present an nteractve system that lets a user move and deform

More information

Sequential DOE via dynamic programming

Sequential DOE via dynamic programming IIE Transactons (00) 34, 1087 1100 Sequental DOE va dynamc programmng IRAD BEN-GAL 1 and MICHAEL CARAMANIS 1 Department of Industral Engneerng, Tel Avv Unversty, Ramat Avv, Tel Avv 69978, Israel E-mal:

More information

Mean Value Coordinates for Closed Triangular Meshes

Mean Value Coordinates for Closed Triangular Meshes Mean Value Coordnates for Closed Trangular Meshes Tao Ju, Scott Schaefer, Joe Warren Rce Unversty (a) (b) (c) (d) Fgure : Orgnal horse model wth enclosng trangle control mesh shown n black (a). Several

More information

Assessing health efficiency across countries with a two-step and bootstrap analysis *

Assessing health efficiency across countries with a two-step and bootstrap analysis * Assessng health effcency across countres wth a two-step and bootstrap analyss * Antóno Afonso # $ and Mguel St. Aubyn # February 2007 Abstract We estmate a sem-parametrc model of health producton process

More information

DP5: A Private Presence Service

DP5: A Private Presence Service DP5: A Prvate Presence Servce Nkta Borsov Unversty of Illnos at Urbana-Champagn, Unted States George Danezs Unversty College London, Unted Kngdom Ian Goldberg Unversty

More information

can basic entrepreneurship transform the economic lives of the poor?

can basic entrepreneurship transform the economic lives of the poor? can basc entrepreneurshp transform the economc lves of the poor? Orana Bandera, Robn Burgess, Narayan Das, Selm Gulesc, Imran Rasul, Munsh Sulaman Aprl 2013 Abstract The world s poorest people lack captal

More information

Journal of International Economics

Journal of International Economics Journal of Internatonal Economcs 79 (009) 31 41 Contents lsts avalable at ScenceDrect Journal of Internatonal Economcs journal homepage: Composton and growth effects of the current

More information

Complete Fairness in Secure Two-Party Computation

Complete Fairness in Secure Two-Party Computation Complete Farness n Secure Two-Party Computaton S. Dov Gordon Carmt Hazay Jonathan Katz Yehuda Lndell Abstract In the settng of secure two-party computaton, two mutually dstrustng partes wsh to compute

More information

Verification by Equipment or End-Use Metering Protocol

Verification by Equipment or End-Use Metering Protocol Verfcaton by Equpment or End-Use Meterng Protocol May 2012 Verfcaton by Equpment or End-Use Meterng Protocol Verson 1.0 May 2012 Prepared for Bonnevlle Power Admnstraton Prepared by Research Into Acton,

More information

From Computing with Numbers to Computing with Words From Manipulation of Measurements to Manipulation of Perceptions

From Computing with Numbers to Computing with Words From Manipulation of Measurements to Manipulation of Perceptions IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: FUNDAMENTAL THEORY AND APPLICATIONS, VOL. 45, NO. 1, JANUARY 1999 105 From Computng wth Numbers to Computng wth Words From Manpulaton of Measurements to Manpulaton

More information