An Algorithm for Data-Driven Bandwidth Selection



IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 25, NO. 2, FEBRUARY 2003

An Algorithm for Data-Driven Bandwidth Selection

Dorin Comaniciu, Member, IEEE

Abstract—The analysis of a feature space that exhibits multiscale patterns often requires kernel estimation techniques with locally adaptive bandwidths, such as the variable-bandwidth mean shift. Proper selection of the kernel bandwidth is, however, a critical step for superior space analysis and partitioning. This paper presents a mean shift-based approach for local bandwidth selection in the multimodal, multivariate case. Our method is based on a fundamental property of normal distributions regarding the bias of the normalized density gradient. We demonstrate that, within the large sample approximation, the local covariance is estimated by the matrix that maximizes the magnitude of the normalized mean shift vector. Using this property, we develop a reliable algorithm which takes into account the stability of local bandwidth estimates across scales. The validity of our theoretical results is proven in various space partitioning experiments involving the variable-bandwidth mean shift.

Index Terms—Variable-bandwidth mean shift, bandwidth selection, multiscale analysis, Jensen-Shannon divergence, feature space.

1 INTRODUCTION

THE objective of variable-bandwidth kernel estimation is to improve the performance of kernel estimators by adapting the kernel bandwidth to the local data statistics. It can be shown that the estimation bias of sample point density estimators [1] decreases in comparison to that of fixed-bandwidth estimators, while the covariance remains the same. Only recently have these estimators been used in computer vision applications, such as histogram construction from color invariants [9]. We have introduced the variable-bandwidth mean shift as an adaptive estimator of the density's normalized gradient and applied it to mode detection in complex feature spaces [5]. Although theoretically promising, variable-bandwidth methods rely heavily on the selection of the local bandwidths.
When the bandwidth is not properly selected, the performance is suboptimal and often worse than that of fixed-bandwidth methods. Data-driven bandwidth selection for multivariate data is a complex problem, largely unanswered by the current techniques [28, p. 109], [11]. Depending on the prior knowledge on the input data, we distinguish two classes of problems. If the data statistics are homogeneous, then one global bandwidth suffices for the analysis. If the data statistics are, however, changing across the feature space, local bandwidths should be computed. Unfortunately, most of the tasks encountered in autonomous vision reduce to the latter class of problems, i.e., the input is represented by multidimensional features whose properties (scales) are variable in space and might change in time. Examples of such tasks are background modeling, tracking, or segmentation.

One can identify two general approaches to bandwidth selection: statistical analysis-based and task-oriented methods. Statistical methods compute the global bandwidth by balancing the bias and variance of the density estimate obtained with that bandwidth, over the entire space. Asymptotic approximations are used to express the quality of the density estimate. A reliable method for univariate data is the plug-in rule [24], shown superior to least-squares cross-validation and biased cross-validation [4], [2], [26, p. 46].

Footnote 1: The terms bandwidth and scale will be considered equivalent in this paper. Bandwidth will be preferred when used in conjunction with a kernel, while scale will be employed to underline the idea of size.

Footnote 2: D. Comaniciu is with the Real-Time Vision and Modeling Department, Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540. E-mail: comanici@scr.siemens.com. Manuscript received 8 Mar. 2002; revised 9 July 2002; accepted 25 July 2002. Recommended for acceptance by S. Sclaroff. For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number 609.
The global bandwidth, however, is not effective when the data exhibit multiscale patterns. In addition, for the multivariate case, the optimal bandwidth formula [25, p. 85], [28, p. 99] is of little practical use, since it depends on the Laplacian of the unknown density being estimated. The most often used method for local bandwidth adaptation follows Abramson's rule, which takes the bandwidth proportional to the inverse of the square root of a first approximation of the local density [1]. The proportionality constant is an important choice of the method [26, p. 46].

Task-oriented methods for bandwidth selection typically rely on the stability of the feature space partitioning. The bandwidth is taken as the center of the largest operating range over which the same number of partitions is obtained for the given data [8, p. 54]. This strategy is also implemented within the framework of scale-space theory [7]. Nevertheless, it assumes that the space is homogeneous, i.e., all the partitions should have roughly the same scale, which is not always true. In a related class of techniques, the best bandwidth maximizes an objective function that expresses the quality of the space partitioning and is called an index of cluster validity. The objective function compares inter- versus intra-cluster variability [3], [5] or evaluates the isolation and connectivity of the delineated clusters [22]. See [20] for an evaluation of a large set of such indices.

This paper presents a new and effective approach to local bandwidth selection for multimodal and multivariate data. The method estimates for each data point the covariance matrix which is the most stable across scales. The analysis is unsupervised and the only assumption is that the range of scales at which structures appear in the data is known. In almost all vision scenarios, this information is available from prior geometric, camera, or dynamical constraints. The selected bandwidth matrices are employed in the variable-bandwidth mean shift for adaptive mode detection and feature space partitioning, as shown in Fig. 1.
The paper is organized as follows: A more general form of the variable-bandwidth mean shift, including fully parameterized bandwidth matrices, is introduced in Section 2. Section 3 presents the theoretical criterion for bandwidth selection based on the normalized mean shift vector. Section 4 details the proposed algorithm and shows bandwidth selection experiments. In Section 5, we apply the variable-bandwidth mean shift to partition feature spaces. Discussions are presented in Section 6.

2 VARIABLE-BANDWIDTH MEAN SHIFT

Let x_i, i = 1...n, be a set of d-dimensional points in the space R^d and assume that a symmetric, positive definite d \times d bandwidth matrix H_i is defined for each data point x_i. The matrix H_i quantifies the uncertainty associated with x_i [2]. The sample point density estimator with d-variate normal kernel, computed at the point x, is given by

\hat{f}_v(x) = \frac{1}{n(2\pi)^{d/2}} \sum_{i=1}^{n} \frac{1}{|H_i|^{1/2}} \exp\left(-\frac{1}{2} D^2(x, x_i, H_i)\right),   (1)

where

D^2(x, x_i, H_i) \equiv (x - x_i)^\top H_i^{-1} (x - x_i)   (2)

is the Mahalanobis distance from x to x_i. Let H_h(x) be the data-weighted harmonic mean of the bandwidth matrices computed at x,

H_h(x) = \left[\sum_{i=1}^{n} w_i(x) H_i^{-1}\right]^{-1},   (3)

where the weights

w_i(x) = \frac{|H_i|^{-1/2} \exp\left(-\frac{1}{2} D^2(x, x_i, H_i)\right)}{\sum_{i=1}^{n} |H_i|^{-1/2} \exp\left(-\frac{1}{2} D^2(x, x_i, H_i)\right)}   (4)

satisfy \sum_{i=1}^{n} w_i(x) = 1.
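The paper contains no code; the sample point estimator (1)-(2) and the data-weighted harmonic mean (3)-(4) can be sketched in Python/NumPy as follows (function names are ours, not the paper's):

```python
import numpy as np

def mahalanobis_sq(x, xi, Hi):
    """Squared Mahalanobis distance D^2(x, x_i, H_i) of Eq. (2)."""
    diff = x - xi
    return float(diff @ np.linalg.solve(Hi, diff))

def variable_bw_density(x, points, bandwidths):
    """Sample point density estimator of Eq. (1), with a d-variate normal
    kernel and one bandwidth matrix H_i per data point."""
    n, d = points.shape
    total = sum(np.exp(-0.5 * mahalanobis_sq(x, xi, Hi)) / np.sqrt(np.linalg.det(Hi))
                for xi, Hi in zip(points, bandwidths))
    return total / (n * (2.0 * np.pi) ** (d / 2.0))

def harmonic_mean_bandwidth(x, points, bandwidths):
    """Data-weighted harmonic mean H_h(x) of Eqs. (3)-(4)."""
    w = np.array([np.exp(-0.5 * mahalanobis_sq(x, xi, Hi)) / np.sqrt(np.linalg.det(Hi))
                  for xi, Hi in zip(points, bandwidths)])
    w = w / w.sum()  # weights of Eq. (4); they sum to 1
    Hh_inv = sum(wi * np.linalg.inv(Hi) for wi, Hi in zip(w, bandwidths))
    return np.linalg.inv(Hh_inv)
```

A quick sanity check: when all H_i equal a common H, the harmonic mean (3) reduces to H and the estimator (1) reduces to the fixed-bandwidth estimator, as the text notes below.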

Fig. 1. Feature space analysis with variable bandwidth. For the initial step, the fixed-bandwidth mean shift procedure [5] is applied with different analysis scales and, at each scale, each data point is classified into a local mode. The trajectory points and mean shift vectors are then used to fit a normal surface to the density surrounding each mode. For each data point, the most stable covariance matrix across scales is then selected using a specialized version of the Jensen-Shannon divergence. Finally, the covariance matrices are used in the variable-bandwidth mean shift.

An estimator of the gradient of the true density is the gradient of \hat{f}_v,

\hat{\nabla} f_v(x) \equiv \nabla \hat{f}_v(x) = \frac{1}{n(2\pi)^{d/2}} \sum_{i=1}^{n} \frac{H_i^{-1}(x_i - x)}{|H_i|^{1/2}} \exp\left(-\frac{1}{2} D^2(x, x_i, H_i)\right).   (5)

By multiplying (5) to the left with H_h(x) and using (1), it results that

H_h(x) \hat{\nabla} f_v(x) = \hat{f}_v(x) m_v(x),   (6)

where

m_v(x) \equiv H_h(x) \left[\sum_{i=1}^{n} w_i(x) H_i^{-1} x_i\right] - x   (7)

is the variable-bandwidth mean shift vector. From (6), we also have

m_v(x) = H_h(x) \frac{\hat{\nabla} f_v(x)}{\hat{f}_v(x)},   (8)

which shows that the variable-bandwidth mean shift vector is an adaptive estimator of the normalized gradient of the underlying density.

If the bandwidth matrices H_i are all equal to a fixed matrix H, called the analysis bandwidth, the sample point estimator (1) reduces to the simple multivariate density estimator with normal kernel,

\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|2\pi H|^{1/2}} \exp\left(-\frac{1}{2} D^2(x, x_i, H)\right).   (9)

Equation (8) becomes

m(x) = H \frac{\hat{\nabla} f(x)}{\hat{f}(x)},   (10)

where

m(x) \equiv \frac{\sum_{i=1}^{n} x_i \exp\left(-\frac{1}{2} D^2(x, x_i, H)\right)}{\sum_{i=1}^{n} \exp\left(-\frac{1}{2} D^2(x, x_i, H)\right)} - x   (11)

is the fixed-bandwidth mean shift vector.

A mode seeking algorithm can be derived by iteratively computing the fixed- or variable-bandwidth mean shift vector [4], [5]. The partition of the feature space is obtained by grouping together all the data points that converged to the same mode. Theoretically, the partition quality in the variable-bandwidth case is better; however, it depends on the selected bandwidth matrices H_i. The next sections are devoted to the proper computation of these matrices.
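The fixed-bandwidth mode seeking procedure described above can be sketched as follows: starting from a point, (11) is iterated until the shift becomes negligible; the fixed point is a mode of (9). This is a rough Python/NumPy illustration, not the paper's implementation:

```python
import numpy as np

def mean_shift_mode(x0, points, H, tol=1e-8, max_iter=500):
    """Mode seeking by iterating the fixed-bandwidth mean shift vector of
    Eq. (11): x is replaced by the weighted mean of the data, with
    normal-kernel weights exp(-D^2(x, x_i, H)/2)."""
    Hinv = np.linalg.inv(H)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        diff = points - x
        w = np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, Hinv, diff))
        x_new = w @ points / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x
```

Grouping the data points by the mode they converge to yields the feature space partition described in the text.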
3 BANDWIDTH SELECTION THEOREM

This section exploits a fundamental property of the normalized gradient of normal distributions, whose estimate is proportionally downward biased [27]. The direct consequence of this property is that, within the large sample approximation, the estimation bias can be canceled, allowing the estimation of the true local covariance of the underlying distribution.

Our assumption is that, in the neighborhood of location x, the data is distributed multivariate normal with unknown mean \mu and covariance matrix \Sigma. The direct estimation of \Sigma is generally difficult since, to locally fit a normal, one needs a priori knowledge of the neighborhood size in which the fitting parameters are to be estimated. If the estimation is performed for several neighborhood sizes, a scale-invariant measure of the goodness of fit is needed. The following theorem, however, presents an elegant solution to the problem. It is valid when the number of available samples is large.

Theorem. Assume that the true distribution f is N(\mu, \Sigma) and the fixed-bandwidth mean shift is computed with a normal kernel K_H. The bandwidth normalized norm of the mean shift vector is maximized when the analysis bandwidth H is equal to \Sigma.

Proof. Since the true distribution f is normal with covariance matrix \Sigma, it follows that the mean of \hat{f}(x), E[\hat{f}(x)] = \phi(x; \mu, \Sigma + H), is also a normal surface with covariance \Sigma + H [27]. Likewise, since the gradient is a linear operator, we have E[\nabla \hat{f}(x)] = \nabla \phi(x; \mu, \Sigma + H). When the large sample approximation is valid, the variances of the means are relatively small. By employing (10), this implies that

plim m(x) = H \frac{\nabla \phi(x; \mu, \Sigma + H)}{\phi(x; \mu, \Sigma + H)} = H (\Sigma + H)^{-1} (\mu - x),   (12)

where plim denotes the probability limit with H held constant. The norm of the bandwidth normalized mean shift is given by

m(x; H) \equiv \left\| H^{-1/2} \, plim \, m(x) \right\| = \left\| H^{1/2} (\Sigma + H)^{-1} (x - \mu) \right\|.   (13)

It is shown in Appendix A that m(x; H) is maximized iff H = \Sigma.

The Theorem leads to an interesting scale selection criterion: The underlying distribution has the local covariance equal to the analysis bandwidth that maximizes the magnitude of the normalized mean shift vector.
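The criterion is easy to check numerically in one dimension. With H = h^2 and \Sigma = \sigma^2, (13) reduces to m(x; h) = h (\sigma^2 + h^2)^{-1} |x - \mu|, whose maximizer over h is h = \sigma. A small NumPy sketch (the illustrative values are ours, chosen to echo the N(2, 4) example below):

```python
import numpy as np

# For f = N(mu, sigma^2) and analysis bandwidth H = h^2, Eq. (13) gives
#   m(x; h) = h * (sigma^2 + h^2)^(-1) * |x - mu|,
# which should peak exactly at h = sigma.
mu, sigma = 2.0, 4.0   # illustrative values, cf. the N(2, 4) example
x = 7.0                # any location with |x - mu| > 0
h = np.linspace(0.5, 12.0, 4601)
m = h / (sigma**2 + h**2) * abs(x - mu)
h_star = h[np.argmax(m)]  # empirical maximizer, approximately sigma
```

Note that the maximizer does not depend on the location x, only the height of the curve does, which is exactly the pairing of curves visible in Fig. 2b.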
The main idea of this property is underlined in Fig. 2 for a unidimensional case. Given the input data drawn from N(2, 4), we computed the magnitude of the normalized mean shift for different locations and using different bandwidths. Each curve in Fig. 2b represents the results for one location. Since the locations were chosen on both sides of the mean, the curves appear in pairs. The upper curves are for the points located far from the mean. Observe that each curve is maximum when the analysis bandwidth is h_0 = 4, indicating, according to the Theorem, that the standard deviation of the input data is equal to 4.

Fig. 2. Local mean shift-based scale selection. (a) Input data, N(2, 4), n = 2,000. (b) Each curve represents the magnitude of the normalized mean shift computed at one location, but with different analysis bandwidths. The maxima of the curves correctly indicate that the standard deviation of the input data is equal to 4.

Since the theorem is valid in the neighborhood of each mode, a more global solution (least squares) can be obtained by using multiple measurements represented by the mean shift trajectories of all data points converging to the same mode. Note also that the input data might be multimodal with asymmetric structures, while neighboring structures might contaminate each other. In this case, the normality assumption of the Theorem is not valid and the result will depend on the analysis bandwidth H. To solve this problem, we propose a procedure which selects the most stable bandwidth across scales. These ideas are discussed in the next section.

4 ALGORITHM FOR BANDWIDTH SELECTION

We derive in the sequel a least-squares solution for covariance matrix estimation and show how to choose the most stable result across scales. Then, the bandwidth selection algorithm is summarized and experiments are presented.

4.1 Least-Squares Solution

Let us denote by x_i, i = 1...n_u, all the data points associated with the uth mode and by y_i, i = 1...t_u, the locations of all trajectory points associated with the same mode. The partition is obtained using the mean shift procedure with analysis bandwidth H. Assume that (\mu, \Sigma) are the mean and covariance of the underlying structure. We note that the mean and covariance of the points x_i, i = 1...n_u, are not reliable estimates of (\mu, \Sigma). The reason is that the data partitioning is nonparametric, based on the peaks and valleys of the probability density function of the entire data set.
As a result, the set x_i, i = 1...n_u, is an incomplete sample from the local underlying distribution. It can be asymmetric (depending on the neighboring structures) and it might not contain the tail. Hence, the sample mean and variance differ from (\mu, \Sigma). The solution is to fit a normal surface to the density values computed at the trajectory points associated with the mode. The fitting problem is easily solved by using the mean shift vector. For each trajectory point y_i, we apply (12) to obtain

m_i \equiv m(y_i) = H (\Sigma + H)^{-1} (\mu - y_i),   (14)

where (\mu, \Sigma) are the mean and covariance of the true distribution. By fixing the mean as the local peak in the density surface (see Fig. 3), we can derive a least-squares solution for the covariance matrix. If H = h^2 I and \Sigma = \sigma^2 I, the least-squares solution for \sigma^2 is

\sigma^2 = h^2 \left[ \frac{\sum_{i=1}^{t_u} m_i^\top (\mu - y_i)}{\sum_{i=1}^{t_u} \|m_i\|^2} - 1 \right].   (15)

If H = diag[h_1^2 ... h_d^2] and \Sigma = diag[\sigma_1^2 ... \sigma_d^2], then

\sigma_v^2 = h_v^2 \left[ \frac{\sum_{i=1}^{t_u} m_{iv} (\mu_v - y_{iv})}{\sum_{i=1}^{t_u} m_{iv}^2} - 1 \right],   (16)

where the subindex v = 1...d denotes the vth component of a vector. Although a fully parameterized covariance matrix can be computed using (14), this is not necessarily advantageous [28, p. 107] and, for dimensions d > 2, the number of parameters introduced is too large to make reliable decisions. We will therefore use in the sequel only (15) and (16).

4.2 Multiscale Analysis

When the underlying data distribution is normal, the analysis bandwidth H does not influence the computation of (\mu, \Sigma). When the underlying structure deviates from normality, H affects the estimation. Therefore, in the final step of the algorithm, we test the stability of (\mu, \Sigma) against the variation of the analysis bandwidth. The simplest test is to take H = h^2 I and vary h on a logarithmic scale with constant step. Let H_1 = h_1^2 I, ..., H_b = h_b^2 I be a set of analysis bandwidths generated as above. The range of these bandwidths is assumed known a priori. Denote by (\mu_1, \Sigma_1), ..., (\mu_b, \Sigma_b) the corresponding set of estimates and by p_1 ... p_b the associated normal distributions.
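The least-squares step (15) is a one-line computation once the trajectory points and their mean shift vectors are available; a minimal Python/NumPy sketch (function name ours):

```python
import numpy as np

def sigma2_least_squares(mu, ys, ms, h):
    """Least-squares covariance estimate of Eq. (15) for H = h^2 I and
    Sigma = sigma^2 I. ys are the trajectory points y_i of one mode and
    ms their mean shift vectors m_i = m(y_i) from Eq. (14)."""
    num = sum(float(m @ (mu - y)) for y, m in zip(ys, ms))
    den = sum(float(m @ m) for m in ms)
    return h**2 * (num / den - 1.0)
```

As a consistency check, feeding in vectors that satisfy (14) exactly, m_i = h^2 (\sigma^2 + h^2)^{-1} (\mu - y_i), recovers the true \sigma^2.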
The stability test for distribution p_j involves the computation of the overall dissimilarity between p_j and its neighbors across scale, p_{j-w} ... p_{j-1}, p_{j+1} ... p_{j+w}. The simplest choice is w = 1. The dissimilarity is measured using a specialized version of the Jensen-Shannon divergence, which is defined for the d-variate normal distributions p_j, j = 1...r, as given in (17).

Fig. 3. Fitting a normal surface to the density values computed at the trajectory points. Observe that, even for asymmetric regions, the mean of the normal surface should be taken equal to the mode of the density.

Fig. 4. Bandwidth selection example. (a) Histogram of input data drawn with equal probability from the two normals N(4, 0.5) and N(7, 1), with total n = 200. (b) Bandwidth selection for each data point using the proposed algorithm. For presentation, the data point index increases with location.

JS(p_1 ... p_r) = \frac{1}{2} \log \frac{|\bar{\Sigma}|}{\sqrt[r]{\prod_{j=1}^{r} |\Sigma_j|}} + \frac{1}{2r} \sum_{j=1}^{r} (\mu_j - \bar{\mu})^\top \bar{\Sigma}^{-1} (\mu_j - \bar{\mu}),   (17)

with \bar{\mu} = \frac{1}{r} \sum_{j=1}^{r} \mu_j and \bar{\Sigma} = \frac{1}{r} \sum_{j=1}^{r} \Sigma_j. This formula is derived in Appendix B. Observe that, for r = 2, the specialized Jensen-Shannon divergence reduces to the well-known Bhattacharyya distance [8, p. 99].

4.3 Bandwidth Selection Summary

The proposed algorithm solves the bandwidth selection problem in two stages. The first stage is defined at the partition level and determines a mean and covariance matrix for each mode detected through multiscale analysis. The second stage is defined at the data level and selects for each data point the most stable mean and covariance across the analysis scales. The algorithm is presented below.

Bandwidth Matrix Selection. Given n data points x_i, i = 1...n, and a set of analysis matrices H_1 = h_1^2 I, ..., H_b = h_b^2 I constructed on a logarithmic scale:

A. Evaluate the bandwidth at the partition level. For each H_j, j = 1...b:
1. Partition the data using the mean shift procedure.
2. Compute (\mu_{ju}, \Sigma_{ju}) for each mode u of the partition, using the location of the mode for the mean and (15) or (16) for the covariance.
3. Associate to each data point x_i the mean and covariance of its mode.

B. Evaluate the bandwidth at the data level. For each data point x_i:
1. Based on the set of estimates (\mu_1, \Sigma_1) ... (\mu_b, \Sigma_b), define the normal distributions p_1 ... p_b.
2. Select the most stable pair (\mu, \Sigma) by minimizing the Jensen-Shannon divergence between neighboring distributions across scales. \Sigma represents the selected bandwidth for x_i.

The complexity of the algorithm is b times larger than the complexity of data partitioning using mean shift analysis with one scale.
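The specialized Jensen-Shannon divergence used in step B.2 has the closed form (17), which is straightforward to evaluate; a short Python/NumPy sketch (function name ours):

```python
import numpy as np

def js_normals(mus, sigmas):
    """Specialized Jensen-Shannon divergence of Eq. (17) between r
    d-variate normals N(mu_j, Sigma_j)."""
    r = len(mus)
    mu_bar = sum(mus) / r        # mean of the homogeneous model
    sig_bar = sum(sigmas) / r    # covariance of the homogeneous model
    log_term = 0.5 * (np.log(np.linalg.det(sig_bar))
                      - np.mean([np.log(np.linalg.det(S)) for S in sigmas]))
    quad = sum(float((m - mu_bar) @ np.linalg.solve(sig_bar, m - mu_bar))
               for m in mus)
    return log_term + quad / (2.0 * r)
```

For r = 2, this quantity coincides with the Bhattacharyya distance between the two normals, as noted above; for identical distributions it is zero.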
Direct implementation of mean shift analysis with one scale has a complexity of O(n^2), where n is the number of data points. However, by selecting a set of q representative data points using an irregular tessellation of the space and only computing the trajectories of those points, the complexity of mean shift analysis can be decreased to O(qn), with q much smaller than n [3].

4.4 Sample Size

While the large sample approximation is not critical for (12), sparse data needs attention. The local sample size should be sufficiently large for inference. The approach we take is based on the Effective Sample Size [10], which computes the kernel-weighted count of the number of points in each window,

ESS(x, H) = \sum_{i=1}^{n} \frac{K_H(x - x_i)}{K_H(0)} = \sum_{i=1}^{n} \exp\left(-\frac{1}{2} D^2(x, x_i, H)\right).   (18)

Using the binomial rule of thumb, we cancel the inference when ESS(x, H) < 5.

4.5 Bandwidth Selection Examples

A first example, for a bimodal data set generated with equal probability from N(4, 0.5) and N(7, 1), is presented in Fig. 4. The standard deviation of each distribution (measured before amalgamating the data) is 0.53 and 0.92, respectively. Our algorithm resulted in 0.58 and 0.93, respectively. We used eight analysis bandwidths in the range 0.3-1.42, with a ratio of 1.25 between two consecutive bandwidths. For all the experiments presented henceforth, we will use the same ratio of 1.25 between two consecutive bandwidths. The specialized Jensen-Shannon divergence was computed with r = 3 (three consecutive bandwidths). No other additional information was used.

For the next example, the data is drawn with equal probability from N(8, 2), N(25, 4), N(50, 8), and N(100, 16). The data histogram is shown in Fig. 5a, while our bandwidth selection is shown in Fig. 5b. We used 12 analysis bandwidths in the range 1.5-17.46.

Another example is shown in Fig. 6 for bivariate data. We ran the algorithm with six analysis bandwidths in the range 0.5-1.5. The algorithm detected three classes of bandwidths: 0.96, 1.04, and 1.08. In Fig. 6b, the bandwidth associated with each data point is indicated by the bullet size (smallest bullets for 0.96, largest bullets for 1.08).
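The Effective Sample Size (18) is simply the sum of the K_H(0)-normalized kernel weights at x; a minimal sketch (function name ours):

```python
import numpy as np

def effective_sample_size(x, points, H):
    """Effective Sample Size of Eq. (18): the kernel-weighted count of
    points in the window centered at x, with weights K_H(x - x_i)/K_H(0)."""
    Hinv = np.linalg.inv(H)
    diff = points - x
    return float(np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, Hinv, diff)).sum())
```

Points coinciding with x each contribute 1, distant points contribute nearly 0; per the rule of thumb above, the inference at x would be canceled whenever this count falls below 5.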
The allocated bandwidths are very close to the true data scale, which is equal to 1.

5 FEATURE SPACE PARTITIONING

This section presents results for feature space partitioning using the variable-bandwidth mean shift with bandwidth selection. Only the range of analysis scales is provided for each experiment.

5.1 Nonlinear Structures with Multiple Scales

For the data shown in Fig. 7a, the algorithm was run with six analysis bandwidths in the range 0.1-0.3. This time, we used expression (16) to estimate a diagonal form for the covariance matrix associated with each data point. The results are presented in Fig. 7c for the scales associated with the x coordinate and Fig. 7d for the scales associated with the y coordinate of each data point.

Fig. 5. Bandwidth selection example. (a) Histogram of input data drawn with equal probability from the four normals N(8, 2), N(25, 4), N(50, 8), and N(100, 16), with total n = 400. (b) Bandwidth selection for each data point using the proposed algorithm. For presentation, the data point index increases with location.

Fig. 6. Bandwidth selection example. (a) Bivariate data drawn with equal probability from N([0, 1], I), N([2.5, 2], I), and N([5, 1], I), with total n = 250. (b) Bandwidth selection for each data point using the proposed algorithm. Three classes of bandwidth were detected. See text for details.

Fig. 7. Nonlinear data analysis. (a) Input containing structures at different scales (n = 400). (b) Final decomposition obtained through variable-bandwidth mean shift. Each structure is marked differently. (c) Scale selection for the x coordinate of each data point. (d) Scale selection for the y coordinate of each data point.

Observe that the elongated structure of the data is reflected in a larger bandwidth for the x coordinate. Also, each graph contains two distinct groups of scale values corresponding to the two scales in the data. The spurious peaks represent points located on the border between two structures. Note also that, for both coordinates, the smaller scale is approximately half of the larger scale, similar to the data characteristics.

Fig. 8 shows the specialized Jensen-Shannon divergence for points from the large structures (Fig. 8a) and small structures (Fig. 8b). As one can observe, in the case of large structures, the estimation is most stable (small divergence) for the analysis scales from the middle. On the contrary, in the case of small structures, the estimation is most stable for the smallest analysis scale.

Fig. 8. Jensen-Shannon divergence for data from Fig. 7. (a) Points from large structures. (b) Points from small structures.

The last step involves the application of the variable-bandwidth mean shift with the bandwidths shown in Fig. 7c and Fig. 7d. The algorithm detected four modes and the resulting partitioning is shown in Fig. 7b. Note that most of the algorithms using one analysis bandwidth are prone to fail for this type of data. If the bandwidth is large, the two small structures will be joined together. If the bandwidth is small, each of the two large structures will be divided.

5.2 Color Clustering

We tested the new algorithm for the task of color clustering in the three-dimensional L*u*v* space. The selected examples contain large and elongated clusters in the vicinity of small clusters, a difficult scenario for fixed-bandwidth analysis. A first test image is shown in Fig. 9a. The sky, ocean, and waves generate compact and small clusters, while the texture from the land generates a large cluster (Fig. 9b). Using six analysis bandwidths in the range 3-9, our algorithm correctly obtained the four clusters shown in Fig. 9d, which correspond to the segmentation from Fig. 9c.

Fig. 9. Color clustering experiment 1. (a) Original image, 500 x 333 pixels. (b) L*u*v* color space containing 166,500 points. (c) Segmented image in pseudogray levels. (d) Obtained clusters. The position of each cluster is shifted to show the delineation.
The same analysis bandwidths have been employed for processing the color data coming from the test image shown in Fig. 10a. Observe again the presence of a large cluster in the vicinity of small clusters (Fig. 10b). The algorithm identified three clusters (Fig. 10d) which are associated with the main structures in the image, as can be seen in the corresponding segmented image (Fig. 10c).

Fig. 10. Color clustering experiment 2. (a) Original image, 500 x 333 pixels. (b) L*u*v* color space containing 166,500 points. (c) Segmented image in pseudogray levels. (d) Obtained clusters. The position of each cluster is shifted to show the delineation.

6 DISCUSSION

It is useful to contrast the proposed algorithm against some classical alternatives. The EM algorithm [23] also assumes a mixture of normal structures and finds iteratively the maximum-likelihood estimates of the a priori probabilities, means, and covariances. However, the EM needs the specification of the number of clusters, needs a good initialization, and does not deal with non-normal structures. In addition, its convergence is difficult when the number of clusters is large, which increases the number of parameters. See [7] for a discussion and modifications of the EM to overcome some of these limitations. Our algorithm is not affected by the number of clusters, since it does not employ a global criterion that should be optimized. We only require a priori knowledge of a range of viable scales, which is a very practical criterion. In almost all situations, the user has this knowledge. In addition, the normality assumption is used only for bandwidth selection, while the overall algorithm maintains the ability to analyze complex, non-normal structures. The only limitation of our method comes with the dimensionality of the data. It is known that nonparametric techniques are not reliable in high-dimensional spaces.

Let us also contrast the proposed algorithm with methods based on multiscale analysis. From this point of view and according to our knowledge, this is the first method which tests the stability of the second order statistics derived from the data. Up to now, stability testing was limited to first order statistics such as the mean, the mode, or direction vectors (see, for example, [2]). By checking the stability of the covariance matrix through the specialized Jensen-Shannon divergence, we increase the amount of information involved in the test. Finally, the method can be improved by replacing the least-squares estimation with a robust method. This work mostly presented the theory related to the new algorithm.
The algorithm is useful for scenarios involving multiscale patterns, such as feature space partitioning in tracking, background modeling, and segmentation. An interesting subject of future research is to analyze the relation between the proposed method and scale selection techniques for image features [19].

APPENDIX A

THE MAGNITUDE OF THE BANDWIDTH NORMALIZED MEAN SHIFT VECTOR m(x; H) IS MAXIMIZED WHEN H = Σ

Recall that the magnitude of the bandwidth normalized mean shift vector is given by

    m(x; H) = \|H^{1/2} (\Sigma + H)^{-1} (x - \mu)\|.    (A.19)

We assume that H and Σ are symmetric, positive definite matrices and that the magnitude of x − μ is strictly positive. We will show that

    m(x; \Sigma)^2 - m(x; H)^2 \ge 0,    (A.20)

with equality iff H = Σ. The left side of (A.20) becomes

    m(x; \Sigma)^2 - m(x; H)^2
      = \tfrac{1}{4}\|\Sigma^{-1/2}(x-\mu)\|^2 - \|H^{1/2}(\Sigma+H)^{-1}(x-\mu)\|^2
      = \tfrac{1}{4}(x-\mu)^\top \left[\Sigma^{-1} - 4(\Sigma+H)^{-1} H (\Sigma+H)^{-1}\right](x-\mu)
      = \tfrac{1}{4}(x-\mu)^\top (\Sigma+H)^{-1} \Sigma^{1/2} \left(\Sigma^{-1/2} H \Sigma^{-1/2} - I\right)^2 \Sigma^{1/2} (\Sigma+H)^{-1} (x-\mu),    (A.21)

where I is the d x d identity matrix. Within the stated conditions, all the matrices in the last term of (A.21) are positive definite, except \left(\Sigma^{-1/2} H \Sigma^{-1/2} - I\right)^2, which is positive semidefinite and equal to 0 iff H = Σ.

APPENDIX B

OVERALL DISSIMILARITY OF A SET OF MULTIVARIATE NORMAL DISTRIBUTIONS

One of the few measures of the overall difference of more than two distributions is the generalized Jensen-Shannon divergence [18].
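Before the formal definition, the generalized divergence is easy to compute directly for discrete distributions: it is the entropy of the equal-weight mixture minus the average of the individual entropies. The sketch below is ours, for intuition only; it assumes NumPy, and the function names are illustrative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum p log p of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 log 0 = 0 by convention
    return -np.sum(p * np.log(p))

def js_divergence(dists):
    """Generalized Jensen-Shannon divergence with equal weights:
    H(mean of the p_j) minus the mean of the H(p_j)."""
    dists = np.asarray(dists, dtype=float)
    return entropy(dists.mean(axis=0)) - np.mean([entropy(p) for p in dists])

identical = js_divergence([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])   # -> 0
disjoint = js_divergence([[1.0, 0.0], [0.0, 1.0]])                # -> log 2
```

The divergence is zero for identical distributions and reaches log r for r distributions with disjoint supports, matching the properties stated below.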

Given r probability distributions p_j, j = 1, ..., r, their Jensen-Shannon divergence is defined as

    JS(p_1, \ldots, p_r) = H\!\left(\frac{1}{r}\sum_{j=1}^r p_j\right) - \frac{1}{r}\sum_{j=1}^r H(p_j),    (B.22)

where

    H(p(x)) = -\int p(x) \log p(x)\, dx    (B.23)

is the entropy of p(x). This divergence is positive and equal to zero iff all the p_j are equal. Using (B.23) in (B.22), we obtain

    JS(p_1, \ldots, p_r) = \frac{1}{r}\sum_{j=1}^r \int p_j(x) \log \frac{p_j(x)}{q(x)}\, dx  \quad  with  \quad  q(x) = \frac{1}{r}\sum_{j=1}^r p_j(x).    (B.24)

For the d-variate normal case, the distributions p_j are defined by

    p_j(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_j)^\top \Sigma_j^{-1} (x-\mu_j)\right).    (B.25)

A specialized version of the Jensen-Shannon divergence can be obtained by taking q(x) as the most likely normal source for the homogeneous model \frac{1}{r}\sum_{j=1}^r p_j, having the mean \bar\mu = \frac{1}{r}\sum_{j=1}^r \mu_j and covariance \bar\Sigma = \frac{1}{r}\sum_{j=1}^r \Sigma_j [6]. The new measure is equivalent to a goodness-of-fit test between the empirical distributions p_j, j = 1, ..., r, and the homogeneous model \frac{1}{r}\sum_{j=1}^r p_j. To derive a closed-form expression, we use (B.25) and the identity x^\top A x = \mathrm{tr}[A x x^\top] to obtain [16, p. 189]

    \log \frac{p_i(x)}{q(x)} = \frac{1}{2}\log\frac{|\bar\Sigma|}{|\Sigma_i|} - \frac{1}{2}\mathrm{tr}\!\left[\Sigma_i^{-1}(x-\mu_i)(x-\mu_i)^\top\right] + \frac{1}{2}\mathrm{tr}\!\left[\bar\Sigma^{-1}(x-\bar\mu)(x-\bar\mu)^\top\right]    (B.26)

for i = 1, ..., r, where tr denotes the trace of a matrix. Performing the integration yields

    \int p_i(x) \log \frac{p_i(x)}{q(x)}\, dx = \frac{1}{2}\log\frac{|\bar\Sigma|}{|\Sigma_i|} + \frac{1}{2}\mathrm{tr}\!\left[\bar\Sigma^{-1}\Sigma_i\right] - \frac{d}{2} + \frac{1}{2}\mathrm{tr}\!\left[\bar\Sigma^{-1}(\mu_i-\bar\mu)(\mu_i-\bar\mu)^\top\right].    (B.27)

Summing (B.27) for i = 1, ..., r, dividing by r, and substituting \bar\mu = \frac{1}{r}\sum_{j=1}^r \mu_j, we have

    JS(p_1, \ldots, p_r) = \frac{1}{2}\log\frac{\left|\frac{1}{r}\sum_{j=1}^r \Sigma_j\right|}{\sqrt[r]{\prod_{j=1}^r |\Sigma_j|}} + \frac{1}{2r}\sum_{j=1}^r (\mu_j-\bar\mu)^\top \left(\frac{1}{r}\sum_{j=1}^r \Sigma_j\right)^{-1} (\mu_j-\bar\mu),    (B.28)

since the averaged trace term \frac{1}{2r}\sum_i \mathrm{tr}[\bar\Sigma^{-1}\Sigma_i] = \frac{d}{2} cancels with -\frac{d}{2}.

REFERENCES

[1] I. Abramson, "On Bandwidth Variation in Kernel Estimates - A Square Root Law," The Annals of Statistics, vol. 10, no. 4, pp. 1217-1223, 1982.
[2] N. Ahuja, "A Transform for Multiscale Image Segmentation by Integrated Edge and Region Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 12, pp. 1211-1235, Dec. 1996.
[3] D. Comaniciu and P. Meer, "Distribution Free Decomposition of Multivariate Data," Pattern Analysis and Applications, vol. 2, pp. 22-30, 1999.
[4] D. Comaniciu and P.
Meer, "Mean Shift: A Robust Approach toward Feature Space Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603-619, May 2002.
[5] D. Comaniciu, V. Ramesh, and P. Meer, "The Variable Bandwidth Mean Shift and Data-Driven Scale Selection," Proc. Eighth Int'l Conf. Computer Vision, vol. I, pp. 438-445, July 2001.
[6] R. El-Yaniv, S. Fine, and N. Tishby, "Agnostic Classification of Markovian Sequences," Advances in Neural Information Processing Systems, vol. 10, pp. 465-471, 1997.
[7] M. Figueiredo and A. Jain, "Unsupervised Learning of Finite Mixture Models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 381-396, Mar. 2002.
[8] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed. Academic Press, 1990.
[9] T. Gevers, "Robust Histogram Construction from Color Invariants," Proc. Int'l Conf. Computer Vision, vol. I, pp. 615-620, July 2001.
[10] F. Godtliebsen, J. Marron, and P. Chaudhuri, "Significance in Scale Space for Density Estimation," unpublished manuscript, available at www.stat.unc.edu/faculty/marron/marron_papers.html, 1999.
[11] P. Hall, T. Hu, and J. Marron, "Improved Variable Window Kernel Estimates of Probability Densities," The Annals of Statistics, vol. 23, no. 1, pp. 1-10, 1995.
[12] M. Irani and P. Anandan, "Factorization with Uncertainty," Proc. Sixth European Conf. Computer Vision, pp. 539-553, 2000.
[13] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Prentice Hall, 1988.
[14] M. Jones, J. Marron, and S. Sheather, "A Brief Survey of Bandwidth Selection for Density Estimation," J. Am. Statistical Assoc., vol. 91, pp. 401-407, 1996.
[15] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. J. Wiley & Sons, 1990.
[16] S. Kullback, Information Theory and Statistics. Dover, 1997.
[17] Y. Leung, J. Zhang, and Z. Xu, "Clustering by Scale-Space Filtering," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1396-1410, Dec. 2000.
[18] J. Lin, "Divergence Measures Based on the Shannon Entropy," IEEE Trans. Information Theory, vol. 37, pp. 145-151, 1991.
[19] T.
Lindeberg, "Feature Detection with Automatic Scale Selection," Int'l J. Computer Vision, vol. 30, no. 2, pp. 79-116, 1998.
[20] G. Milligan and M. Cooper, "An Examination of Procedures for Determining the Number of Clusters in a Data Set," Psychometrika, vol. 50, pp. 159-179, 1985.
[21] B. Park and J. Marron, "Comparison of Data-Driven Bandwidth Selectors," J. Am. Statistical Assoc., vol. 85, pp. 66-72, 1990.
[22] E.J. Pauwels and G. Frederix, "Finding Salient Regions in Images," Computer Vision and Image Understanding, vol. 75, pp. 73-85, 1999.
[23] R. Redner and H. Walker, "Mixture Densities, Maximum Likelihood and the EM Algorithm," SIAM Review, vol. 26, pp. 195-239, 1984.
[24] S. Sheather and M. Jones, "A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation," J. Royal Statistical Soc. B, vol. 53, pp. 683-690, 1991.
[25] B.W. Silverman, Density Estimation for Statistics and Data Analysis. Chapman & Hall, 1986.
[26] J. Simonoff, Smoothing Methods in Statistics. Springer-Verlag, 1996.
[27] T. Stoker, "Smoothing Bias in Density Derivative Estimation," J. Am. Statistical Assoc., vol. 88, no. 423, pp. 855-863, 1993.
[28] M.P. Wand and M. Jones, Kernel Smoothing. Chapman & Hall, 1995.

ACKNOWLEDGMENTS

The author would like to thank Yakup Genc from Siemens Corporate Research for valuable discussions on this work.