Mean shift-based clustering

Pattern Recognition (7)

Kuo-Lung Wu (a), Miin-Shen Yang (b)
(a) Department of Information Management, Kun Shan University of Technology, Yung-Kang, Tainan 7, Taiwan, ROC
(b) Department of Applied Mathematics, Chung Yuan Christian University, Chung-Li, Taiwan, ROC

Received June 6; received in revised form February 7; accepted 9 February 7

Abstract

In this paper, a mean shift-based clustering algorithm is proposed. The mean shift is a kernel-type weighted mean procedure. We first discuss three classes of kernels (Gaussian, Cauchy and generalized Epanechnikov) together with their shadows. The robustness properties of the mean shift based on these three kernels are then investigated. Following the mountain function concept, we propose a graphical method of correlation comparisons to estimate the defined stabilization parameter. The proposed method addresses the bandwidth selection problem from a different point of view. Numerical examples and comparisons demonstrate the superiority of the proposed method in terms of computational complexity, cluster validity and improvements of mean shift in large continuous and discrete data sets. We finally apply the mean shift-based clustering algorithm to image segmentation.

7 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Kernel functions; Mean shift; Robust clustering; Generalized Epanechnikov kernel; Bandwidth selection; Parameter estimation; Mountain method; Noise

1. Introduction

Kernel-based methods are widely used in many applications [,]. There are two ways of implementing kernel-based methods in supervised and unsupervised learning. One way is to transform the data space into a high-dimensional feature space F where the inner products in F can be represented by a Mercer kernel function defined on the data space (see Refs. [ ]). An alternative way is to find a kernel density estimate on the data space and then search for the modes of the estimated density [6].
The mean shift [7,8] and the mountain method [9] are two simple techniques that can be used to find the modes of a kernel density estimate. Fukunaga and Hostetler [7] proposed the mean shift procedure based on asymptotic unbiasedness, consistency and uniform consistency of a nonparametric density function gradient estimate using a generalized kernel approach. This technique has been applied in image analysis [,], texture segmentation [,], object tracking [ 6] and data fusion [7].

Cheng [8] clarified the relationship between mean shift and optimization by introducing the concept of shadows. He showed that a mean shift is an instance of gradient ascent with an adaptive step size. Cheng [8] also proved some of the convergence properties of a blurring mean shift procedure and showed some peculiar behaviors of mean shift in cluster analysis with the most commonly used Gaussian kernel. Moreover, Fashing and Tomasi [8] showed that mean shift is a bound optimization and is equivalent to Newton's method in the case of piecewise constant kernels.

(Corresponding author: M.-S. Yang. doi:.6/j.patcog.7..6)

Yager and Filev [9] proposed mountain methods to find the approximate modes of the data set via the mountain function, which is equivalent to the kernel density estimate defined on the grid nodes. Chiu [9] modified the mountain method by defining the mountain function not on the grid nodes but on the data set. Recently, Yang and Wu [] proposed a modified mountain method to identify the number of modes of the mountain function. Applications of the mountain method can be found in Refs. [,].

In this paper, we propose a mean shift-based clustering method (MSCM) that uses the concept of mountain functions to solve the bandwidth selection problem in the mean shift procedure. The bandwidth selection for a kernel function directly affects the performance of the density estimation. It also heavily

influences the performance of the mean shift procedure. The modes found by the mean shift do not adequately represent the dense area of the data set if a poor bandwidth estimate is used. Comaniciu and Meer [] summarized four different techniques according to statistical analysis-based and task-oriented points of view. One practical bandwidth selection technique is related to the stability of decomposition for a density shape estimate. The bandwidth is taken as the center of the largest operating range over which the same number of clusters is obtained for the given data []. Similar concepts are used by Beni and Liu [] to estimate the resolution parameter of a least biased fuzzy clustering method, and also to solve the cluster validity problem. In this paper, we offer an alternative method of solving this problem. We first normalize the kernel function and then estimate the defined stabilization parameter. This estimation method is discussed in Section 3 based on the mountain function concept.

The properties of the mean shift procedures are reviewed in Section 2.1. We discuss some special kernels with their shadows and then define generalized Epanechnikov kernels in Section 2.2. Note that most mean shift-based algorithms are less likely to include the property of robustness [] that is often employed in clustering [6 ]. In Section 3.1, we discuss the relation between the mean shift and robust statistics based on the discussed kernel functions. In Section 3.2, we propose an alternative technique that approaches the bandwidth selection problem from a different point of view. The purpose of bandwidth selection is to find a suitable bandwidth (covariance) for a kernel function of a data point so that a suitable kernel function will induce a good density estimate. Our technique assigns a fixed sample variance for the kernel function so that a suitable stabilization parameter can induce a satisfactory density estimate.
We then propose a technique to estimate the stabilization parameter using an adaptive mountain function in Section 3.3. According to our analysis, we propose the mean shift-based clustering algorithm based on the defined generalized Epanechnikov kernel in Section 3.4. Some numerical examples, comparisons and applications are given in Section 4. These include the computational complexity, cluster validity, improvements in large continuous and discrete data sets and image segmentation. Finally, conclusions are given in Section 5.

2. Mean shift

Let $X = \{x_1, \ldots, x_n\}$ be a data set in an s-dimensional Euclidean space $R^s$. Camastra and Verri [] and Girolami [] recently considered kernel-based clustering for X in the feature space, where the data space is transformed to a high-dimensional feature space F and the inner products in F are represented by a kernel function. On the other hand, kernel density estimation with the modes of the density estimate over X is another kernel-based clustering method that works in the data space [6]. The modes of a density estimate correspond to the locations of the densest areas of the data set, and these locations can serve as satisfactory cluster center estimates. In kernel density estimation, the mean shift is a simple gradient technique used to find the modes of the kernel density estimate. We first review the mean shift procedures in the next subsection.

2.1. Mean shift procedures

Mean shift procedures are techniques for finding the modes of a kernel density estimate. Let $H: X \to R$ be a kernel with $H(x) = h(\|x - x_j\|^2)$. The kernel density estimate is given by

$$\hat f_H(x) = \sum_{j=1}^{n} h(\|x - x_j\|^2)\, w(x_j), \tag{1}$$

where $w(x_j)$ is a weight function. Based on a uniform weight, Fukunaga and Hostetler [7] first gave the statistical properties, including the asymptotic unbiasedness, consistency and uniform consistency, of the gradient of the density estimate given by

$$\nabla \hat f_H(x) = 2 \sum_{j=1}^{n} (x - x_j)\, h'(\|x - x_j\|^2)\, w(x_j).$$

Suppose that there exists a kernel $K: X \to R$ with $K(x) = k(\|x - x_j\|^2)$ such that $h'(r) = -c\,k(r)$, where c is a constant.
The kernel H is termed a shadow of kernel K (see Ref. [8]). Then

$$\nabla \hat f_H(x) = 2c \sum_{j=1}^{n} k(\|x - x_j\|^2)(x_j - x)\, w(x_j) = 2c \left[\sum_{j=1}^{n} k(\|x - x_j\|^2)\, w(x_j)\right] \left[\frac{\sum_{j=1}^{n} k(\|x - x_j\|^2)\, w(x_j)\, x_j}{\sum_{j=1}^{n} k(\|x - x_j\|^2)\, w(x_j)} - x\right] = 2c\, \hat f_K(x)\,[m_K(x) - x]. \tag{2}$$

The term $m_K(x) - x = \nabla \hat f_H(x) / (2c\, \hat f_K(x))$ is called the generalized mean shift, which is proportional to the density gradient estimate. Formulation (2) was first remarked upon in Ref. [] for the uniform weight case. Setting the gradient estimator $\nabla \hat f_H(x)$ to zero, we derive a mode estimate as

$$x = m_K(x) = \frac{\sum_{j=1}^{n} k(\|x - x_j\|^2)\, w(x_j)\, x_j}{\sum_{j=1}^{n} k(\|x - x_j\|^2)\, w(x_j)}, \tag{3}$$

where H is a shadow of kernel K. Eq. (3) is also called the weighted sample mean with kernel K.

The mean shift has three kinds of implementation procedures. The first is to set all data points as initial values and then update each data point $x_j$ with $m_K(x_j)$, $j = 1, \ldots, n$ (i.e. $x_j \leftarrow m_K(x_j)$). This procedure is called a blurring mean shift. In the blurring process, each data point $x_j$ is updated in every iteration, so the density estimate $\hat f_H(x)$ also changes. The second also sets all data points as initial values, but only the current point x is updated with $m_K(x)$ (i.e. $x \leftarrow m_K(x)$). This procedure is called a nonblurring mean shift. In this process, the data set and the density estimate are not updated. The third one is to choose c initial values, where c may be greater or smaller

than n, and then update the data point x with $m_K(x)$ (i.e. $x \leftarrow m_K(x)$). This procedure is called a general mean shift. Cheng [8] proved some convergence properties of the blurring mean shift process using Eq. (3). Moreover, Comaniciu and Meer [] also gave some properties for discrete data. They also discussed its relation to the Nadaraya-Watson estimator from the kernel regression and robust M-estimator points of view. If there is a shadow H of kernel K, then the mean shift procedure can locate the modes of a known density estimate $\hat f_H(x)$. This technique can be used to directly find the modes of the density shape estimate. If a shadow of kernel K does not exist or has not been found yet, the mean shift can still be used to estimate alternative modes (cluster centers) of the given data set with an unknown density function.

Fig. 1. The kernel functions with different p values: (a) Gaussian kernels, (b) Cauchy kernels and (c) generalized Epanechnikov kernels.

2.2. Some special kernels and their shadows

In this subsection, we investigate some special kernels and their shadows. The most commonly used kernels that are their own shadows are the Gaussian kernels $G^p(x)$ defined as

$$G^p(x) = [g(\|x - x_j\|^2)]^p = [\exp\{-\|x - x_j\|^2/\beta\}]^p$$

with their shadows $S_G^p$ defined as $S_G^p(x) = G^p(x)$, $p > 0$. This means that the mean shift procedures with $m_{G^p}(x_j)$ are used to find the modes of the density estimate $\hat f_{S_G^p}(x)$. Cheng [8] showed some behaviors of mean shift in cluster analysis with a Gaussian kernel. The maximum entropy clustering algorithm [,] is a Gaussian kernel-based mean shift with a special weight function. Chen and Zhang [] used the Gaussian kernel-induced distance measure to implement the spatially constrained fuzzy c-means (FCM) [] as a robust image segmentation method. Yang and Wu [] directly used $\hat f_{S_G^p}(x)$ as a total similarity objective function.
They then derived a similarity-based clustering method (SCM) that could self-organize the cluster number and volume according to the structure of the data.

Another important class of kernels is the Cauchy kernels defined as

$$C^p(x) = [c(\|x - x_j\|^2)]^p = [(1 + \|x - x_j\|^2/\beta)^{-1}]^p,$$

which are based on the Cauchy density function $f(x) = (1/\pi)\,(1 + x^2)^{-1}$, $-\infty < x < \infty$, with their shadows defined as $S_C^p(x) = C^{p-1}(x)$, $p > 1$. The mean shift process with $m_{C^p}(x_j)$ is used to find the modes of the density estimate $\hat f_{S_C^p}(x)$. The Cauchy kernels have seen fewer applications. To remedy the weakness of FCM in a noisy environment, Krishnapuram and Keller [6] first considered relaxing the constraint that the fuzzy c-partition memberships sum to 1 and then proposed the so-called possibilistic c-means (PCM) clustering algorithm. These possibilistic membership functions turn out to be Cauchy kernels. This is the only application of the Cauchy kernels to clustering that we could find.

The simplest kernel is the flat kernel defined as

$$F(x) = \begin{cases} 1 & \text{if } \|x - x_j\|^2 \le 1, \\ 0 & \text{if } \|x - x_j\|^2 > 1, \end{cases}$$

with the Epanechnikov kernel E(x) as its shadow:

$$E(x) = \begin{cases} 1 - \|x - x_j\|^2 & \text{if } \|x - x_j\|^2 \le 1, \\ 0 & \text{if } \|x - x_j\|^2 > 1. \end{cases}$$

Moreover, the Epanechnikov kernel has the biweight kernel B(x) as its shadow:

$$B(x) = \begin{cases} (1 - \|x - x_j\|^2)^2 & \text{if } \|x - x_j\|^2 \le 1, \\ 0 & \text{if } \|x - x_j\|^2 > 1. \end{cases}$$

For a more general presentation, we extend the Epanechnikov kernel E(x) to the generalized Epanechnikov kernels $K^p(x)$ with the parameter p, defined as

$$K^p(x) = [k_E(\|x - x_j\|^2)]^p = \begin{cases} (1 - \|x - x_j\|^2/\beta)^p & \text{if } \|x - x_j\|^2 \le \beta, \\ 0 & \text{if } \|x - x_j\|^2 > \beta. \end{cases}$$

Thus, the generalized Epanechnikov kernels $K^p(x)$ have the corresponding shadows $S_K^p(x)$ defined as $S_K^p(x) = K^{p+1}(x)$, $p > 0$. Note that we have $K^0(x) = F(x)$, $K^1(x) = E(x)$ and $K^2(x) = B(x)$ when $\beta = 1$. The mean shift process with $m_{K^p}(x_j)$

is used to find the modes of the density estimate $\hat f_{S_K^p}(x)$.

In total, we present three kernel classes with their shadows. These are the Gaussian kernels $G^p(x)$ with their shadows $S_G^p(x) = G^p(x)$, the Cauchy kernels $C^p(x)$ with their shadows $S_C^p(x) = C^{p-1}(x)$ and the generalized Epanechnikov kernels $K^p(x)$ with their shadows $S_K^p(x) = K^{p+1}(x)$. The behaviors of these kernels with different p are shown in Fig. 1. The mean shift procedures using these three classes of kernels can easily find the modes of their corresponding density estimates. The parameters $\beta$ and p, called the normalization and stabilization parameters, respectively, greatly influence the performance of the mean shift procedures for the kernel density estimate. We discuss these in the next section.

3. Mean shift as a robust clustering

A suitable clustering method should be able to tolerate noise and detect outliers in the data set. Many criteria such as the breakdown point, local-shift sensitivity, gross error sensitivity and influence functions [] can be used to measure the level of robustness. Comaniciu and Meer [] discussed some relationships between the mean shift and the nonparametric M-estimator. Here, we provide a more detailed analysis of the robustness of the mean shift procedures.

3.1. Analysis of robustness

The mean shift mode estimate $m_K(x)$ can be related to the location M-estimator

$$\hat\theta = \arg\min_\theta \sum_{j=1}^{n} \rho(x_j - \theta),$$

where $\rho$ is an arbitrary loss measure function. $\hat\theta$ can be obtained by solving the equation

$$\frac{\partial}{\partial\theta} \sum_{j=1}^{n} \rho(x_j - \theta) = \sum_{j=1}^{n} \varphi(x_j - \theta) = 0,$$

where $\varphi(x_j - \theta) = (\partial/\partial\theta)\,\rho(x_j - \theta)$. If the kernel H is a reasonable similarity measure which takes values on the interval [0, 1], then $(1 - H)$ is a reasonable loss function. Thus, a location M-estimator can be found by

$$\hat\theta = \arg\min_\theta \sum_{j=1}^{n} \rho(x_j - \theta) = \arg\min_\theta \sum_{j=1}^{n} [1 - h(\|\theta - x_j\|^2)] = \arg\max_\theta \sum_{j=1}^{n} h(\|\theta - x_j\|^2) = \arg\max_\theta \hat f_H(\theta).$$

This means that $m_K(x)$ is a location M-estimator if K is a kernel with its shadow H.
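The link between the mean shift fixed point and the maximizer of the shadow density can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes one-dimensional data, uniform weights, and the generalized Epanechnikov kernel $K^p$ with shadow $K^{p+1}$; the function names and the stopping tolerance are hypothetical.

```python
def gen_epanechnikov(sq_dist, beta, p):
    # K^p evaluated at a squared distance: (1 - d/beta)^p on the support, 0 outside.
    return (1.0 - sq_dist / beta) ** p if sq_dist <= beta else 0.0


def shadow_density(theta, data, beta, p):
    # Density estimate built from the shadow S_K^p = K^(p+1), uniform weights.
    return sum(gen_epanechnikov((theta - xj) ** 2, beta, p + 1) for xj in data)


def mean_shift_1d(theta, data, beta, p, tol=1e-10, max_iter=500):
    # Iterate theta <- m_K(theta), the weighted sample mean of Eq. (3).
    # tol and max_iter are illustrative choices, not values from the paper.
    for _ in range(max_iter):
        weights = [gen_epanechnikov((theta - xj) ** 2, beta, p) for xj in data]
        total = sum(weights)
        if total == 0.0:  # no data point inside the kernel support
            return theta
        new_theta = sum(w * xj for w, xj in zip(weights, data)) / total
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta


data = [0.0, 0.2, 0.4, 5.0]   # three nearby points and one distant point
mode = mean_shift_1d(0.5, data, beta=4.0, p=2)
```

In this toy setting the distant point lies outside the kernel support of the start value, so it receives zero weight and the iteration settles at the center of the three nearby points, where the shadow density is locally maximal.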
The influence curve (IC) helps us assess the relative influence of an individual observation on the value of an estimate. In the location problem, the influence of an M-estimate is

$$IC(y; F, \theta) = \frac{\varphi(y - \theta)}{\int \varphi'(y - \theta)\, dF_Y(y)},$$

where $F_Y(y)$ denotes the distribution function of Y. For an M-estimator, $IC(y; F, \theta)$ is proportional to its $\varphi$ function []. If the influence function of an estimator is unbounded, an outlier might cause trouble; the $\varphi$ function is used to denote the degree of influence. Suppose that $\{x_1, \ldots, x_n\}$ is a data set in the real number space $R^s$ and $S_G^p(x)$ is a shadow of $G^p(x)$. Then $m_{G^p}(x)$ is a location M-estimator with $\varphi$ function defined to be

$$\varphi_G(x - x_j) = \frac{d}{dx}\,[1 - S_G^p(x)] = \frac{2p}{\beta}\,(x - x_j)\, G^p(x).$$

By applying L'Hospital's rule, we derive $\lim_{x_j \to \pm\infty} \varphi_G(x - x_j) = 0$. Thus, the influence curve of $m_{G^p}(x)$ satisfies $IC(x_j; F, x) = 0$ when $x_j$ tends to positive or negative infinity. This means that the influence curve is bounded and the influence of an extremely large or small $x_j$ on the mode estimator $m_{G^p}(x)$ is very small. We can also find the location of $x_j$ which has the maximum influence on $m_{G^p}(x)$ by solving $(\partial/\partial x_j)\,\varphi_G(x - x_j) = 0$. Note that both $m_{C^p}(x)$ and $m_{K^p}(x)$ also have these properties, where their $\varphi$ functions are as follows:

$$\varphi_C(x - x_j) = \frac{2(p-1)}{\beta}\,(x - x_j)\, C^p(x),$$

$$\varphi_K(x - x_j) = \begin{cases} \dfrac{2(p+1)}{\beta}\,(x - x_j)\, K^p(x) & \text{if } (x - x_j)^2 \le \beta, \\ 0 & \text{if } (x - x_j)^2 > \beta. \end{cases}$$

The $\varphi$ functions for the mean shift-based mode-seeking estimators with $\beta = 1$ are shown in Fig. 2. The influence of an individual $x_j$ on $m_{K^p}(x)$ is zero if $(x - x_j)^2 \ge \beta$, so an extremely large or small $x_j$ has no effect on $m_{K^p}(x)$. In contrast, the influence of an extremely large or small $x_j$ on $m_{G^p}(x)$ and $m_{C^p}(x)$ is a monotone decreasing function of the stabilization parameter p; an outlier has no influence when p becomes large. In this paper, we choose the generalized Epanechnikov kernels for our mean shift-based clustering algorithm because they can suitably fit the data sets even with extremely large or small data points.
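The three $\varphi$ functions can be evaluated directly to see the bounded-influence behavior described above. This is a small sketch under the assumption of one-dimensional data, with `u` standing for $x - x_j$; the function names are illustrative and not taken from the paper.

```python
import math


def phi_gaussian(u, beta, p):
    # (2p / beta) * u * G^p, where G^p = exp(-p * u^2 / beta)
    return (2.0 * p / beta) * u * math.exp(-p * u * u / beta)


def phi_cauchy(u, beta, p):
    # (2(p - 1) / beta) * u * C^p, where C^p = (1 + u^2 / beta)^(-p)
    return (2.0 * (p - 1) / beta) * u * (1.0 + u * u / beta) ** (-p)


def phi_gen_epanechnikov(u, beta, p):
    # (2(p + 1) / beta) * u * K^p on the support, exactly 0 outside it
    if u * u > beta:
        return 0.0
    return (2.0 * (p + 1) / beta) * u * (1.0 - u * u / beta) ** p


# For the Gaussian case, setting the derivative of u * exp(-p u^2 / beta)
# to zero gives the point of maximum influence u* = sqrt(beta / (2p)).
beta, p = 1.0, 2.0
u_star = math.sqrt(beta / (2.0 * p))
```

A far-away observation yields a vanishing value for the Gaussian and Cauchy cases and an exact zero for the generalized Epanechnikov case, which matches the reasoning for preferring $K^p$.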
3.2. Bandwidth selection and stabilization parameter

Bandwidth selection greatly influences the performance of kernel density estimation. Comaniciu and Meer [] summarized four different techniques according to statistical analysis-based and task-oriented points of view. Here, we take a different approach: we first normalize the distance measure by dividing by the normalization parameter $\beta$ and then focus on estimating the stabilization parameter p. The normalization parameter $\beta$ is set to be the sample variance

$$\beta = \frac{\sum_{j=1}^{n} \|x_j - \bar x\|^2}{n}, \quad \text{where } \bar x = \frac{\sum_{j=1}^{n} x_j}{n}.$$
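The normalization parameter is a one-line computation. The sketch below assumes the data set is given as a list of equal-length tuples; the function name is hypothetical.

```python
def normalization_parameter(data):
    # beta = (1/n) * sum_j ||x_j - mean||^2, with the mean taken per coordinate.
    n = len(data)
    dim = len(data[0])
    mean = [sum(x[d] for x in data) / n for d in range(dim)]
    beta = sum(sum((x[d] - mean[d]) ** 2 for d in range(dim)) for x in data) / n
    return beta, mean


# Four corners of a square with side 2: each point is at squared distance 2
# from the center (1, 1), so beta equals 2.
beta, mean = normalization_parameter([(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)])
```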

Fig. 2. The $\varphi$ functions with different p values: (a) Gaussian kernels, (b) Cauchy kernels and (c) generalized Epanechnikov kernels.

Fig. 3. (a) The histogram of a three-cluster normal mixture data. (b), (c) and (d) are its corresponding density estimates $\hat f_{S_G^p}(x)$, $\hat f_{S_C^p}(x)$ and $\hat f_{S_K^p}(x)$ with different p.

We have the properties of

$$\lim_{p \to 0} m_{G^p}(x) = \lim_{p \to 0} \frac{\sum_{j=1}^{n} G^p(x)\, x_j}{\sum_{j=1}^{n} G^p(x)} = \frac{\sum_{j=1}^{n} x_j}{n} = \bar x \tag{4}$$

and

$$\lim_{p \to 0} m_{G^p}(x) = \lim_{p \to 0} m_{C^p}(x) = \lim_{p \to 0} m_{K^p}(x) = \bar x. \tag{5}$$

This means that when p tends to zero, the kernel density estimate has only one mode, located at the sample mean. Fig. 3 shows the histogram of a three-cluster normal mixture data with its corresponding density estimates $\hat f_{S_G^p}(x)$, $\hat f_{S_C^p}(x)$ and $\hat f_{S_K^p}(x)$. In fact, a small p leads to only one mode in the estimated density, as shown in Fig. 3 for the case of p = . However, too large a p causes the density estimate to have too many modes, as shown in Fig. 3 for the case of p = . This can be explained as follows. We denote $\hat G(x) = \max_j G(x)$ and $\tilde G(x) = G(x)/\hat G(x)$. We then have

$$\lim_{p \to \infty} m_{G^p}(x) = \lim_{p \to \infty} \frac{\sum_{j=1}^{n} G^p(x)\, x_j}{\sum_{j=1}^{n} G^p(x)} = \lim_{p \to \infty} \frac{\sum_{j=1}^{n} [\tilde G(x)]^p\, x_j}{\sum_{j=1}^{n} [\tilde G(x)]^p} = \frac{\sum_{\tilde G(x) = 1} x_j}{\sum_{\tilde G(x) = 1} 1}. \tag{6}$$

This means that, as p tends to infinity, the data point which is the closest to the initial value becomes the peak. Hence, we have

$$\lim_{p \to \infty} m_{G^p}(x) = \lim_{p \to \infty} m_{C^p}(x) = \lim_{p \to \infty} m_{K^p}(x). \tag{7}$$

In this case, each data point becomes a mode in the blurring and nonblurring mean shift procedures. The stabilization parameter p therefore controls the performance of the density estimate. This situation is somewhat similar to bandwidth selection, but with different merits. The purpose of the bandwidth selection is to find a suitable bandwidth

(covariance) for a kernel function of a data point so that a suitable kernel function can induce a suitable density estimate. However, our new technique assigns a fixed sample variance for the kernel function and then uses a suitable p to induce a suitable density estimate. As shown in Fig. 3, different p values correspond to different shapes of the density estimates. A suitable p corresponds to a satisfactory kernel density estimate, so that the modes found by the mean shift represent the dense areas of the data set.

Fig. 4. The correlation values of $\{\hat f_{S_K^p}(x_1), \ldots, \hat f_{S_K^p}(x_n)\}$ and $\{\hat f_{S_K^{p+M}}(x_1), \ldots, \hat f_{S_K^{p+M}}(x_n)\}$, where (a) M = , (b) M = , (c) M = and (d) M = . Each panel plots the correlation value against p for the stated increased shift M and marks the suitable estimate.

3.3. A stabilization parameter estimation method

A practical bandwidth selection technique is related to the decomposition stability of the density shape estimates. The bandwidth is taken as the center of the largest operating range over which the same number of clusters is obtained for the given data []. This means that the shapes of the estimated density are unchanged over this operating range. Although this technique can yield a suitable bandwidth estimate, it needs to find all cluster centers (modes) for each bandwidth over the chosen operating range. It thus requires a large amount of computation. The operating range must also be selected case by case. In our stabilization parameter estimation method, we adopt the above concept but skip the step of finding all modes for each p. We describe the proposed method as follows. We first define the following function:

$$\hat f_H(x_i) = \sum_{j=1}^{n} h(\|x_i - x_j\|^2), \quad i = 1, \ldots, n. \tag{8}$$

This function gives the value of the estimated density shape at the data points and is similar to the mountain function proposed by Yager and Filev [9], which is used to obtain the initial values for a clustering method. Note that the original mountain function is defined on the grid nodes, whereas we define Eq. (8) on the data points. In Eq. (8), we can set H equal to $S_G^p$, $S_C^p$ or $S_K^p$. We now explain how to use Eq. (8) to find a suitable stabilization parameter p. Suppose that the density shape estimates are unchanged between p = and p = ; then the correlation value of $\{\hat f_H(x_1), \ldots, \hat f_H(x_n)\}$ between p = and will be very close to 1. In this situation, p = is a suitable parameter estimate. In this way, we can skip the step of finding the modes of the density estimate. In our experiments, a good operating range for p always falls between and . This is because we normalize the kernel function by dividing by the normalization parameter $\beta$.

Example 1. We use the previously described graphical method of correlation comparisons to find a suitable p for the data set shown in Fig. 3. The correlation values of $\{\hat f_{S_K^p}(x_1), \ldots, \hat f_{S_K^p}(x_n)\}$ and $\{\hat f_{S_K^{p+M}}(x_1), \ldots, \hat f_{S_K^{p+M}}(x_n)\}$ are shown in Fig. 4. Figs. 4(a)-(d) give the results for the increased shifts M = , , and , respectively. In Fig. 4(a), the y-coordinate of the first solid circle point denotes the correlation value between p = and . The y-coordinate of the second solid circle point denotes the correlation value between p = and . The 1st, 2nd, 3rd solid circle points, etc., denote the correlation values of the respective pairs (p = , p = ), (p = , p = ) and

(p = , p = ), etc. In Fig. 4(b), the y-coordinate of the first solid circle point denotes the correlation value between p = and p = + = . The y-coordinate of the second solid circle point denotes the correlation value between p = and p = . The 1st, 2nd, 3rd solid circle points, etc., denote the correlation values of the respective pairs (p = , p = ), (p = , p = ) and (p = , p = 7), etc. The others can be read similarly. Since the density shape is unchanged when the correlation value is close to 1, we may choose the p whose solid circle point is close to the dotted line of value 1. In Fig. 4(a), with the increased shift M = , the th solid circle point is very close to the dotted line (i.e. the correlation value between p = and p = is very close to 1). Hence, p = is a suitable estimate. We explain later why we do not choose the point that lies on the dotted line. In Fig. 4(b), with the increased shift M = , the sixth solid circle point is close to the dotted line (i.e. the correlation value between p = and is close to 1). Hence, p = is a suitable estimate. In Fig. 4(c), with the increased shift M = , the correlation value between p = and 6 is very close to 1. Hence, p = is a suitable estimate. Similarly, p = is a suitable estimate as shown in Fig. 4(d). Fig. 5 illustrates the density estimates using $\hat f_{S_K^p}(x)$ with p = . We find that these density estimates closely match the histogram of the data. The following is another example with two-dimensional data.

Fig. 5. The data histograms and density estimates using the generalized Epanechnikov kernel with p = .

Example 2. We use a two-dimensional data set in this example. Fig. 6(a) shows a 6-cluster data set where the data points in each group are uniformly generated from rectangles. Figs. 6(b)-(d) are the correlation values with the increased shift M = , and , respectively. In Fig. 6(b), the th point is close to the dotted line and hence p = is a good estimate. In Fig. 6(c), the 7th point is close to the dotted line and hence we choose p = . Fig.
6(d) also indicates that p = is a suitable estimate. In this example, our graphical method of correlation comparisons gives the same result for every increased shift M. The density estimates using $S_K^p$ with p = , , and are shown in Figs. 7(a), (b), (c) and (d), respectively. In Section 3.2, Eqs. (4) and (5) show that a small p value causes the kernel density estimate to have only one mode at the sample mean. Figs. 3 and 7(a) also verify this point. The selected p = lies between p = and , whose density shapes are shown in Figs. 7(c) and (d) and match the original data structure well.

Our graphical method of correlation comparisons can find a good density estimate. It accomplishes the same task as the bandwidth selection method, but it skips the step of finding all modes of the density estimate, so it is much simpler and requires less computation. We estimate a suitable stabilization parameter p whose operating range is always located between and . Note that if the increased shift M is large, such as M = or , the method may miss a good estimate for p. However, too small an increased shift M, such as M < , may take too much computational time. We suggest that taking M = for the graphical method of correlation comparisons should perform well in most simulations.

We now explain why we did not choose the point that lies on the dotted line, for the following two reasons.

(1) The stabilization parameter p is similar to the number of bars drawn in the data histogram. The density shape with a large p corresponds to a histogram with a large number of bars and hence has too many modes. Eqs. (6) and (7) also show this property.

(2) The stabilization parameter p is the power of the kernel function, which takes values between 0 and 1. With a large p, the value of Eq. (8) at each data point is close to 1, and hence the correlation value becomes large. Figs. 4 and 6 also show this tendency in the curvilinear tail.
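The correlation comparisons described in this subsection can be sketched as follows. This is an illustrative reading of the method, not the authors' code: it evaluates Eq. (8) with $H = S_K^p = K^{p+1}$ on the data points and computes the Pearson correlation between the density vectors at p and p + M. The function names, the starting value `p_start` and the upper limit `p_max` are assumptions made for the example; the paper selects p by inspecting the plotted curve.

```python
import math


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def shadow_density_on_data(data, beta, p):
    # Eq. (8) with H = S_K^p = K^(p+1), evaluated at every data point.
    values = []
    for xi in data:
        total = 0.0
        for xj in data:
            d = sq_dist(xi, xj)
            if d <= beta:
                total += (1.0 - d / beta) ** (p + 1)
        values.append(total)
    return values


def pearson(u, v):
    # Plain Pearson correlation; returns NaN if either vector is constant.
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0.0 or sv == 0.0:
        return float("nan")
    return cov / (su * sv)


def correlation_curve(data, beta, shift, p_start, p_max):
    # Returns [(p, corr(f_p, f_{p+shift})), ...] for p = p_start, p_start + shift, ...
    curve = []
    p = p_start
    current = shadow_density_on_data(data, beta, p)
    while p + shift <= p_max:
        nxt = shadow_density_on_data(data, beta, p + shift)
        curve.append((p, pearson(current, nxt)))
        p, current = p + shift, nxt
    return curve


# Hypothetical one-dimensional data with three loose groups.
data = [(0.0,), (0.3,), (0.5,), (4.0,), (4.2,), (4.9,), (9.0,), (9.1,), (9.8,)]
curve = correlation_curve(data, beta=14.0, shift=1, p_start=1, p_max=10)
```

Plotting the second component of `curve` against the first reproduces the kind of graph read off in Figs. 4 and 6; the choice of p is then made where the curve first comes very close to 1, rather than on the flat tail.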
For these two reasons, we suggest choosing the estimate of p at the point that is very close to, but not on, the dotted line.

3.4. A mean shift-based clustering algorithm

Since the generalized Epanechnikov kernel $K^p$ has the robustness property analyzed in Section 3.1, we use the $K^p$ kernel in the mean shift procedure. Combining it with the graphical method of correlation comparisons for estimating the stabilization parameter p, we propose a mean shift-based clustering method (MSCM) with the $K^p$ kernel, called MSCM($K^p$). The proposed MSCM($K^p$) procedure consists of four steps: (1) select the kernel $K^p$; (2) estimate the stabilization parameter p; (3) use the mean shift procedure; (4) identify the clusters. Its diagram is shown in Fig. 8 and it is summarized as follows.

The MSCM($K^p$) procedure:
Step 1. Choose the $K^p$ kernel.
Step 2. Use the graphical method of correlation comparisons to estimate p.
Step 3. Use the nonblurring mean shift procedure.
Step 4. Identify the clusters.

We first implement the MSCM($K^p$) procedure for the data set shown in Fig. 3. According to Fig. 4(a), p = is a suitable estimate for the data set in Fig. 3(a). The results of the MSCM($K^p$) procedure are shown in Fig. 9. The curve

represents the density estimate using $\hat f_{S_K^p}$ with p = . The locations of the data points are illustrated by the histograms. Figs. 9(a)-(f) show these data histograms after implementing the MSCM($K^p$) procedure with the iterative times T = , , , , and 77. The MSCM($K^p$) procedure converges when T = 77.

Fig. 6. (a) The 6-cluster data set. (b), (c) and (d) present the correlation values with the increased shift M = , and , respectively.

Fig. 7. The density estimates of the 6-cluster data set where (a), (b), (c) and (d) are the $\hat f_{S_K^p}$ values of each data point with p = , , and , respectively.

The algorithm is terminated when

the locations of all data points are unchanged or when the iterative time reaches the maximum T = . Fig. 9(f) shows that all data points are centralized to three locations, which are the modes of the density estimate $\hat f_{S_K^p}$. The MSCM($K^p$) procedure finds three clusters that do indeed match the data structure.

Fig. 8. The diagram of the proposed MSCM($K^p$) procedure. Data points -> Select a kernel: (1) choose the kernel $K^p$ (used in this paper); (2) choose the kernel $G^p$ or $C^p$ -> Estimate the stabilization parameter p: (1) use the proposed graphical method of correlation comparisons (used in this paper); (2) directly assign a value -> Use the mean shift procedure: (1) nonblurring mean shift (used in this paper); (2) general mean shift; (3) blurring mean shift -> Identify clusters: (1) merge data; (2) other methods (discussed in Section 4).

Fig. 9. The density estimates using the MSCM($K^p$) procedure with p = and the histograms of the data points where (a) T = , (b) T = , (c) T = , (d) T = , (e) T = and (f) T = 77.

We also implement the MSCM($K^p$) procedure on the data set in Fig. 6(a). According to the graphical method of correlation comparisons as shown in Fig. 6(b), p = is a suitable estimate. The MSCM($K^p$) results after the iterative times T = , and are shown in Figs. 10(a), (b) and (c), respectively. The data histograms show that all data points are centralized to 6 locations, which match the data structure, when the iterative time T = . We mention that no initialization problems occurred with our method because the stabilization parameter p is estimated by the graphical method of correlation comparisons. We can also solve the cluster validity problem by merging the data points, similarly to many progressive clustering methods [,7,8].
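Steps 3 and 4 of the procedure (nonblurring mean shift followed by merging) can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: uniform weights, a stabilization parameter p that has already been chosen, and a hypothetical `merge_tol` radius for deciding that two converged locations are the same mode.

```python
def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def gen_epanechnikov(d, beta, p):
    # K^p at squared distance d: (1 - d/beta)^p on the support, 0 outside.
    return (1.0 - d / beta) ** p if d <= beta else 0.0


def nonblurring_mean_shift(data, beta, p, tol=1e-8, max_iter=500):
    # Every data point is an initial value; the data set itself stays fixed.
    final = []
    for start in data:
        y = list(start)
        for _ in range(max_iter):
            w = [gen_epanechnikov(sq_dist(y, xj), beta, p) for xj in data]
            total = sum(w)
            if total == 0.0:  # defensive guard: no point inside the support
                break
            new_y = [sum(wj * xj[d] for wj, xj in zip(w, data)) / total
                     for d in range(len(y))]
            moved = sq_dist(new_y, y)
            y = new_y
            if moved < tol ** 2:
                break
        final.append(y)
    return final


def merge_locations(locations, merge_tol):
    # Greedy merge: a location joins the first center within merge_tol,
    # otherwise it opens a new cluster. merge_tol is an assumed parameter.
    centers, labels = [], []
    for loc in locations:
        for idx, c in enumerate(centers):
            if sq_dist(loc, c) <= merge_tol ** 2:
                labels.append(idx)
                break
        else:
            centers.append(loc)
            labels.append(len(centers) - 1)
    return centers, labels


# Two well-separated one-dimensional groups (hypothetical data).
data = [(0.0,), (0.2,), (0.4,), (10.0,), (10.2,), (10.4,)]
locations = nonblurring_mean_shift(data, beta=25.0, p=2)
centers, labels = merge_locations(locations, merge_tol=0.01)
```

In this toy case every point moves to the center of its own group, and merging the converged locations yields both the number of clusters and the labels without specifying the cluster number in advance.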
We have shown that the mean shift can be a robust clustering method by the definition of the M-estimator, and we have also discussed the $\varphi$ function that denotes the degree of influence of an individual observation for the kernels $G^p$, $C^p$ and $K^p$. In terms of computational time, our graphical method of correlation comparisons for finding the stabilization parameter p is much faster than the bandwidth selection method. Finding all cluster centers for the bandwidth selection method requires the computational complexity $O(n^2 s t)$ for the first selected bandwidth, where s and t denote the data dimension and the algorithm iteration number, respectively. Suppose that

we choose the maximum bandwidths of the operating range for the graphical method of correlation comparisons; then this would require the computational complexity $O(n^2 s \sum_i t_i)$ for finding a suitable bandwidth, where $t_i$ is the iteration number for the ith bandwidth. In the MSCM($K^p$) procedure, the stabilization parameter p always lies in the interval [ , ] because we normalize the kernel function. Thus, it requires $O(n^2 s)$ to find a suitable p when the increased shift M = . The MSCM($K^p$) procedure should be even faster if we set the increased shift M to be greater than .

Fig. 10. (a), (b) and (c) are the histograms of the data locations after using the MSCM($K^p$) procedure with iterative time T = , and , respectively.

We finally discuss the MSCM($K^p$) procedure further. The first step is to choose a kernel function, where we recommend $K^p$. The kernels $G^p$ and $C^p$ may also be chosen. The second step is to estimate the stabilization parameter p using the proposed graphical method of correlation comparisons. Of course, the user can also directly assign a value for p. In our experiments, a suitable p always fell between and . The third step is to implement the mean shift procedure. Any of the three kinds of mean shift procedures mentioned in Section 2.1 can be used. We suggest the nonblurring mean shift. The fourth step is to classify all data points. If the data set only contains round-shaped clusters, the nonblurring mean shift can label the data points by merging them. However, if the data set contains other cluster shapes such as line or circle structures, then the data points may not be centralized to a few small locations. In this case, merging the data points will not work well. The next section discusses this problem and provides some numerical examples and comparisons. We then apply these to image processing.

4. Examples and applications

Some numerical examples, comparisons and applications are presented in this section.
We also consider the computational complexity, cluster validity and improvements of the mean shift for large continuous and discrete data sets, together with its application to image segmentation.

Numerical examples with comparisons

Example. Fig. (a) presents a data set with a large number of clusters, to which we also add some uniformly distributed noisy points. The graphical plot of the correlation comparisons for the data set is shown in Fig. (b), from which a suitable estimate of p is read off. The MSCM(K) results at four values of the iteration count T are shown in Figs. (c), (d), (e) and (f), respectively. We can see that all data points are centralized to locations that suitably match the structure of the data, and that the results are not affected by the noisy points. The cluster number can easily be found by merging the data points that are centralized to the same locations. Thus, the MSCM(K) procedure can be a simple and useful unsupervised clustering method. The superiority of our proposed method over other clustering methods is discussed below.

We also use the well-known FCM clustering algorithm [39] to cluster the same data set shown in Fig. (a). Suppose that we do not know the cluster number of the data set. We adopt validity indexes, such as the partition coefficient (PC) [40], Fukuyama and Sugeno (FS) [41], and Xie and Beni (XB) [42], to solve the validity problem when using FCM. Therefore, we need to run the FCM algorithm for each candidate cluster number c. However, the first problem is to assign the locations of the initial values in FCM. We simulate two cases of assignment: random initial values and designed initial values. Figs. (a)-(c) present the PC, FS and XB validity indexes when the random initial values are assigned. In this case FCM cannot detect the separated clusters very effectively, but it can when the designed initial values are assigned, as shown in Fig. . Note that the PC index tends to take large values when the cluster number is small. Thus, a local optimal value of the PC index curve may present a good cluster number estimate [].
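For reference, the partition coefficient mentioned above is commonly defined as the mean of the squared fuzzy memberships. The short sketch below uses that standard definition; the function name and the layout of the membership matrix `u` (clusters as rows, data points as columns) are assumptions made for illustration and may differ from the notation of the cited references.

```python
def partition_coefficient(u):
    """Partition coefficient of a fuzzy membership matrix.

    u[i][j] is the membership of data point j in cluster i, and every
    column is assumed to sum to 1. The value lies between 1/c (completely
    fuzzy partition) and 1 (crisp partition), where c is the cluster number.
    """
    n = len(u[0])  # number of data points
    return sum(uij ** 2 for row in u for uij in row) / n
```

Because the index equals 1 for any crisp partition regardless of c, it has to be read as a curve over the candidate cluster numbers rather than as a single value, which is consistent with the tendency noted in the text.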
In the case of the designed initial values, all validity measures should attain their optimum at the correct cluster number. However, this result is obtained only when the FCM algorithm can divide the data into well-separated clusters. If the initializations are not properly chosen (for example, if no centers are initialized in some of these rectangles), we cannot ensure that good partitions will be found by the FCM algorithm. In fact, most clustering algorithms have this initialization problem, and it becomes more difficult in the high-dimensional case. Here we find that our proposed MSCM(K) procedure has the necessary robustness to initialization. We mentioned that the computational complexity of finding the stabilization parameter estimate in MSCM(K) is $O(n^2 s)$. After estimating p, the computational complexity of implementing the MSCM(K) procedure is $O(n^2 s t_K)$, where $t_K$ denotes the number of iterations. In FCM, the computational complexity is $O(n c s t_C)$, where $t_C$ is the number of iterations and c is the cluster number. To solve the validity problem, we need to run FCM for every candidate cluster number up to $c = C_{\max}$, where $C_{\max}$ is the maximum number of clusters. Thus, the computational complexity of FCM is $O(n\, C_{\max}\, s \sum_{c} t_C)$, with the sum taken over the cluster numbers tried. Although the complexity of the MSCM(K) procedure is high when the data set is large, it can directly solve the validity problem in a simple way. In order to reduce the complexity of the MSCM(K) procedure for large data sets, we will propose a technique to deal with it later.

Fig. (a) The data set. (b) The graph of correlation comparisons, from which a suitable estimate of p is read off. (c), (d), (e) and (f) are the data locations after using the MSCM(K) procedure at four values of the iteration count T.

Fig. (a), (b) and (c) present the PC, FS and XB validity indexes for the FCM clustering algorithm when the random initial values are assigned.

Fig. (a), (b) and (c) present the PC, FS and XB validity indexes for the FCM clustering algorithm when the designed initial values are assigned.

Fig. (a) The data set. (b) The graph of correlation comparisons, from which a suitable estimate of p is read off. (c) The data locations after using the MSCM(K) procedure.

Fig. (a) The hierarchical tree of the data set in Fig. (c). (b) The three identified clusters, shown with different symbols.

Cluster validity and identified cluster for the MSCM(K) procedure

After obtaining the MSCM(K) clustering results, we have so far identified the clusters and the cluster number simply by merging the data points that are centralized to the same location. However, if the data set contains clusters of different shapes, such as a line or a circle structure, the data points may not be centralized to the same small region. Moreover, the estimated density may be flat around the modes. In such cases, merging data points may not be an effective means of finding the clusters. If we instead apply the agglomerative hierarchical clustering (AHC) algorithm with a linkage method such as single linkage, complete linkage or Ward's method, we can identify clusters of different shapes from the MSCM(K) clustering results. We demonstrate this identification method with AHC as follows.

Example. Fig. (a) shows a three-cluster data set with different cluster shapes. The graphical plot of the correlation comparisons is shown in Fig. (b), from which a suitable stabilization parameter estimate is read off. The locations of all data points after the MSCM(K) procedure are shown in Fig. (c). In this situation, identifying clusters by merging data points may cause difficulties. Since the mean shift procedure can be seen as a method that shifts similar data points toward the same location, AHC with single linkage can help us find the clusters, because similar data points are easily linked into the same cluster.
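The single-linkage step can be illustrated with a small sketch. It relies on a standard property of single linkage: cutting the hierarchical tree at a given height produces exactly the connected components of the graph that joins every pair of points no farther apart than that height. The function name, the cut height and the union-find implementation are illustrative assumptions, not part of the original procedure; the input is assumed to be the point locations after mean shift.

```python
import math

def single_linkage_cut(points, height):
    """Label points by the single-linkage clusters obtained at a cut height.

    Two points share a label when a chain of points connects them in which
    consecutive points are at most `height` apart.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Find the representative of i with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= height:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    # Relabel the representatives as 0, 1, 2, ... in order of appearance.
    roots, labels = {}, []
    for i in range(n):
        labels.append(roots.setdefault(find(i), len(roots)))
    return labels
```

Points that the mean shift has spread along a line remain chained together in one cluster under this rule, whereas merging only coincident locations would split them, which is the motivation given in the text for combining the mean shift with AHC.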
We apply AHC with single linkage to the MSCM(K) clustering results shown in Fig. (c) and obtain the hierarchical clustering tree shown in Fig. (a). The hierarchical tree shows that the data set contains three well-separated clusters. These clusters are shown in Fig. (b) with different symbols. In fact, most clustering algorithms are likely to fail when they are applied to the data set shown in Fig. (a). By combining MSCM(K) with AHC, clusters of different shapes can be detected for most data sets. The combination may also decrease the convergence time of the MSCM(K) procedure. Note that the estimated density may be flat around the modes, so that the MSCM(K) procedure becomes too time consuming, especially for a large data set. Since the AHC method connects close data points into the same cluster, we can stop the MSCM(K) procedure once the data points have shifted to locations near the modes. Therefore, we can set a larger stopping threshold or a smaller maximum number of iterations for the MSCM(K) procedure.

Implementation for large continuous data sets

One important property of the MSCM(K) procedure is that it takes all data points as initial values, so that these data points are centralized to the locations of the modes. The cluster validity problems can be solved simultaneously in this way. However, this process takes too much time for a large data set. One way to reduce the number of data points for the MSCM(K) procedure is to first merge similar data points from the original data set. Although this technique reduces the number of data points, the MSCM(K) clustering results will then differ from those of the original nonblurring mean shift procedure.

Since a mode represents the location where most data are located, most data points shift to the mode within a few mean shift iterations. However, these data points have to continue with the mean shift procedure until all data points are centralized to the modes, because the mean shift procedure is treated as a batch-type procedure. Although all data points are taken to be initial values, there are no constraints on these initial values when we run the mean shift clustering algorithm. This means that we can treat the mean shift as a sequential clustering algorithm: the second data point is input only when the first data point has shifted to its mode, the third data point is input only when the second has shifted to its mode, and so on for the remaining points. In this sequential-type mean shift procedure, only the data points that are far away from the modes require many iterations, whereas most data points (those near the modes) require few. The computational complexity of this sequential-type mean shift procedure is $O(n s \sum_{i=1}^{n} t_i)$, where $t_i$ denotes the iteration count of data point i. In the batch-type mean shift procedure, by contrast, every data point is iterated as many times as the slowest one, so the computational complexity is $O(n^2 s t_K)$ with $t_K = \max\{t_1, \ldots, t_n\}$.

Fig. 6. (a) The Lenna image. (b), (c) and (d) are the results using the MSCM(K) procedure at three values of the iteration count T, the last being T = 98.
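The difference between the batch and sequential schedules can be made concrete with a small one-dimensional sketch. It is only an illustration under assumptions: a Gaussian profile `exp(-r/beta)` with a hand-picked `beta` stands in for the kernel, the weights are 1, and the function names are hypothetical. Each point is iterated until its own shift falls below the tolerance, and its iteration count $t_i$ is recorded.

```python
import math

def shift_once(x, data, beta):
    """One nonblurring mean shift update of x against the fixed data set."""
    w = [math.exp(-((x - xj) ** 2) / beta) for xj in data]
    return sum(wi * xj for wi, xj in zip(w, data)) / sum(w)

def sequential_mean_shift(data, beta, tol=1e-6, max_iter=500):
    """Shift the points one after another, each with its own stopping time.

    Returns the final locations and the per-point iteration counts t_i.
    """
    finals, counts = [], []
    for x in data:
        t = 0
        while t < max_iter:
            nxt = shift_once(x, data, beta)
            t += 1
            moved = abs(nxt - x)
            x = nxt
            if moved < tol:
                break
        finals.append(x)
        counts.append(t)
    return finals, counts

def update_counts(counts):
    """Number of single-point updates under each schedule."""
    sequential = sum(counts)            # sum of t_i
    batch = len(counts) * max(counts)   # n * max(t_1, ..., t_n)
    return sequential, batch
```

Each single-point update costs on the order of n·s operations, so the two counts returned by `update_counts` correspond to the $\sum_i t_i$ and $n\,t_K$ factors in the complexities above; the sequential count can never exceed the batch count.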
Note that this technique is only feasible for the nonblurring mean shift procedure, such as the proposed MSCM(K) procedure. Since the blurring mean shift algorithm updates the data points at each iteration, the location of each data point changes, so the sequential type is not feasible for the blurring mean shift procedure.

Implementation for large discrete data sets

Suppose that the data set $\{x_1, \ldots, x_n\}$ only takes values on the set $\{y_1, \ldots, y_m\}$ with corresponding counts $\{n_1, \ldots, n_m\}$. That is, there are $n_1$ observations of $\{x_1, \ldots, x_n\}$ with value $y_1$, $n_2$ observations with value $y_2$, etc. For example, a 128 × 128 gray-level image has 16384 data points (or pixels). However, it only takes values on the gray-level set {0, 1, ..., 255}. The mean shift procedure has large computational complexity in this n = 16384 case. The following theorem gives a technique that can greatly reduce the computational complexity.

Theorem. Suppose that $x \in \{x_1, \ldots, x_n\}$, $y \in \{y_1, \ldots, y_m\}$ and $x = y$. The mean shift of y is defined by

\[ m_K(y) = \frac{\sum_{i=1}^{m} k(\|y - y_i\|^2)\, w(y_i)\, n_i\, y_i}{\sum_{i=1}^{m} k(\|y - y_i\|^2)\, w(y_i)\, n_i}. \qquad (9) \]

Then we have $m_K(x) = m_K(y)$. That is, the mean shift of x computed over all n data points can be replaced by the mean shift of y using Eq. (9), with equivalent results.

Proof. Since $\{x_1, \ldots, x_n\}$ only takes values on the set $\{y_1, \ldots, y_m\}$ with corresponding counts $\{n_1, \ldots, n_m\}$, we have $\sum_{i=1}^{m} k(\|y - y_i\|^2) w(y_i) n_i = \sum_{j=1}^{n} k(\|x - x_j\|^2) w(x_j)$ and $\sum_{i=1}^{m} k(\|y - y_i\|^2) w(y_i) n_i y_i = \sum_{j=1}^{n} k(\|x - x_j\|^2) w(x_j) x_j$. Thus, $m_K(x) = m_K(y)$.

We can take $\{y_1, \ldots, y_m\}$ to be the initial values and then run the mean shift procedure using Eq. (9) for the MSCM(K) procedure. The final locations of $\{y_1, \ldots, y_m\}$ using Eq. (9) will be equivalent to the final locations of $\{x_1, \ldots, x_n\}$ using the original mean shift equation. In this discrete data case, the computational complexity of the MSCM(K) procedure becomes $O(m^2 s t)$, where t denotes the iteration count and m < n. The previously discussed sequential type for the continuous data case can also be used to reduce the computational complexity further in this discrete data case.
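The equivalence stated in the theorem can be checked numerically. The sketch below is a hedged illustration, assuming one-dimensional data, unit weights w, and a Gaussian profile standing in for k; the function names are hypothetical. It computes the mean shift once over all n observations and once over the m distinct values weighted by their counts, as in Eq. (9).

```python
import math
from collections import Counter

def profile(r, beta):
    """ASSUMED profile k(r) = exp(-r / beta); any nonnegative profile works."""
    return math.exp(-r / beta)

def mean_shift_full(x, data, beta):
    """Mean shift of x computed over all n observations."""
    w = [profile((x - xj) ** 2, beta) for xj in data]
    return sum(wi * xj for wi, xj in zip(w, data)) / sum(w)

def mean_shift_counts(y, values, counts, beta):
    """Mean shift of y over the m distinct values, weighted by counts n_i."""
    w = [profile((y - yi) ** 2, beta) * ni for yi, ni in zip(values, counts)]
    return sum(wi * yi for wi, yi in zip(w, values)) / sum(w)

def histogram(data):
    """Distinct values y_1..y_m and their counts n_1..n_m."""
    hist = Counter(data)
    values = sorted(hist)
    return values, [hist[v] for v in values]
```

For a gray-level image, `histogram` would be applied to the flattened pixel list, after which only the m distinct levels need to be shifted instead of all n pixels.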
The following is a simple application to image segmentation.

Application in image segmentation

Example. Fig. 6(a) is the well-known Lenna image of size 128 × 128. This image contains 16384 pixels whose values lie between the minimum and maximum gray levels present in the image, so that it takes only m distinct values; the count frequency of these pixel values is shown in Fig. 7(a). The computational complexity of the original nonblurring mean shift procedure is $O(n^2 s t) = O(16384^2 s t)$. After applying the described technique using Eq. (9), the computational complexity of the nonblurring mean shift procedure in this discrete case becomes $O(m^2 s t)$, which greatly reduces the computational cost. The graph of correlation comparisons for the MSCM(K) procedure is shown in Fig. 7(b), from which a suitable estimate of p is read off. The estimated density shape with this p is shown in Fig. 7(c), which quite closely matches the pixel counts with four modes. The MSCM(K) clustering results at three values of the iteration count, the last being T = 98, are shown in Figs. 6(b), (c) and (d), respectively. Following convergence, all pixel values are centralized to four locations, indicated by the arrows on the x-coordinate in Fig. 7(d). Note that, for most c-means based clustering algorithms, c = 8 is mostly used as the cluster number of the Lenna data set. If we combine a validity index to search for a good cluster number estimate for these data, the computational complexity becomes very large. In the MSCM(K) procedure, a suitable cluster number estimate is found automatically, and the computational complexity for this large discrete data set can be reduced by using the above-mentioned technique. For color image data, each pixel is a three-dimensional data point in which each dimension takes values on the set {0, 1, ..., 255}, and hence the number of possible pixel values is the cube of the number of levels per dimension. However, for a 128 × 128 color image, the worst situation is to have 128 × 128 different pixel values. In general, image data contain many repeated pixel values, and hence the technique can also significantly reduce the computational complexity even for color images.

Fig. 7. (a) The counts of the pixel values of the Lenna image. (b) The graph of correlation comparisons, from which a suitable estimate of p is read off. (c) The density estimate with this p. (d) All pixel values are centralized to four locations after using the MSCM(K) procedure with iteration count T = 98.

Fig. 8. (a) Histogram of the data set. (b) The graph of correlation comparisons. (c) The hierarchical tree of the data after implementing the MSCM(K) algorithm.

Comparisons to other kernel-based clustering algorithms

Comaniciu [10] proposed the variable-bandwidth mean shift with data-driven bandwidth selection.
To demonstrate this bandwidth selection method, Comaniciu [10] used a data set drawn with equal probability from four normal distributions. Fig. 8(a) shows the histogram of this data set. Comaniciu [10] used a range of analysis bandwidths, in which four classes of bandwidth were detected (see Ref. [10]). Based on our proposed MSCM(K), the graph of correlation comparisons for the data set of Fig. 8(a) yields the estimate of p shown in Fig. 8(b), and four clusters are found, as shown in Fig. 8(c). Furthermore, we use the data set of nonlinear structures with multiple scales in Comaniciu [10] for our next comparison, as shown in Fig. 9(a). The graph of correlation comparisons, the hierarchical tree of the data after implementing the MSCM(K) algorithm, and the four identified clusters presented with different symbols are shown in Figs. 9(b)-(d), respectively. These results with four identified clusters from MSCM(K) coincide with those of the variable-bandwidth mean shift in Comaniciu [10]. Note that the variable-bandwidth mean shift in Comaniciu [10] is a good technique for finding a suitable bandwidth for each data point, but it requires implementing the fixed-bandwidth mean shift with many different selected bandwidths. Hence, it is time consuming. In our MSCM(K) method, we do not focus on estimating a good bandwidth for each data point, but on estimating the stabilization parameter p so that the MSCM(K) with this suitable p can provide suitable final results. In this case, the mean shift process needs to be implemented only once. The kernel k-means proposed by Girolami [4] is another kernel-based clustering algorithm. The cluster numbers of the data sets Iris, Wine and Crabs are correctly detected by kernel k-means (see Ref. [4]).

Fig. 9. (a) Data set. (b) The graph of correlation comparisons. (c) The hierarchical tree of the data after implementing the MSCM(K) algorithm. (d) Four identified clusters presented with different symbols.

Fig. (a) The graph of correlation comparisons and (b) the hierarchical tree after implementing the MSCM(K) algorithm for the Iris data.

Fig. (a) The graph of correlation comparisons and (b) the hierarchical tree after implementing the MSCM(K) algorithm for the Wine data.

Fig. (a) The graph of correlation comparisons and (b) the hierarchical tree after implementing the MSCM(K) algorithm for the Crabs data.

Fig. (a) Estimated p for the Cmc data. (b), (c) and (d) are the results of single linkage, complete linkage and Ward's method for the Cmc data after implementing the MSCM(K) algorithm.
We implement the proposed method on the data sets Iris, Wine and Crabs; the results for these real data sets from MSCM(K) are shown in the three corresponding figures. The MSCM(K) algorithm with the estimated p indicates that c = 2 is suitable for the Iris data. However, there are some clues in Fig. (b) that c = 3 is also suitable for the Iris data. This is because one of the clusters of Iris is separable from the other two overlapping clusters. The results for the Wine data from MSCM(K), shown in Fig. (b), give several possible cluster numbers. The results from MSCM(K) for the Crabs data, shown in Fig. (b), give the same cluster number as kernel k-means. Although the mean shift can move data points to the data-dense regions, these regions may be very flat, so that the proposed MSCM(K) with a single linkage method may not give a clear cluster number estimate. This situation appears in some real data applications. We recommend using more linkage methods, such as complete linkage and Ward's method, to obtain more clustering information.

In our final example, we use the Cmc data set from the UCI machine learning repository [44] to compare the proposed MSCM(K) with kernel k-means. The Cmc data set contains 1473 observations in three clusters. The MSCM(K) results for the Cmc data set are shown in Fig. , where single linkage, complete linkage and Ward's method indicate that the data set contains three clusters, in agreement with the data structure. Although kernel k-means using the eigenvalue decomposition of the kernel matrix proposed by Girolami [4] can estimate the correct cluster number for the Iris, Wine and Crabs data sets, the estimates of the cluster number depend strongly on the selected radial basis function (RBF) kernel width. The eigenvalue decomposition results based on the kernel matrix with different RBF kernel widths for the Cmc data set are shown in Fig. . The correct result c = 3 shown in Fig. (a) depends on a suitably chosen RBF kernel width. With the other two widths, as shown in Figs. (b) and (c), the cluster number c = 3 cannot be detected.

Fig. The eigenvalue decomposition of the kernel matrix to estimate the cluster number of the Cmc data. (a), (b) and (c) are the results for three different RBF kernel widths.
Finally, we mention that merging data points using AHC after the mean shift process MSCM(K) not only provides the estimated cluster number, but also gives more information about the data structure. For example, one observation of the Wine data in Fig. (b) and one observation of the Crabs data in Fig. (b) can be seen as abnormal observations. Fig. 8(c) also shows an observation that is far away from the other data points. In unsupervised clustering, our objective is not only to detect the correct cluster number for a data set with an unknown cluster number, but also to discover a reasonable cluster structure for the data set.

Conclusions

In this paper, we proposed a mean shift-based clustering procedure, called MSCM(K). The proposed MSCM(K) is robust in three facets. In the first facet, since we combined the mountain function to estimate the stabilization parameter, the bandwidth selection problem for the density estimation can be solved in a graphical way. Also, the operating range of the stabilization parameter is always fixed for various data sets. This makes the proposed method robust to initialization. In the second facet, we discussed the problems that the mean shift procedure faces in cluster validity and cluster identification. According to the properties of the nonblurring mean shift procedure, we suggested combining AHC with single linkage to identify clusters of different shapes. This technique can also reduce the computational complexity of the MSCM(K) procedure by setting a larger stopping threshold or a smaller maximum iteration count. Thus, the proposed method can be robust to different cluster shapes in the data set. In the third facet, we analyzed the robustness properties of the mean shift procedure according to the nonparametric M-estimator and the φ functions of the three kernel classes. We then focused on the generalized Epanechnikov kernel, for which extremely large or small data points have no influence on the mean shift procedure.
Our demonstrations showed that the proposed method is not influenced by noise and hence is robust to noise and outliers. We provided some numerical examples, comparisons and applications to illustrate the superiority of the proposed MSCM(K) procedure, including the computational complexity, cluster validity, improvements of the mean shift for large continuous and discrete data sets, and image segmentation.

Acknowledgments

The authors are grateful to the anonymous referees for their critical and constructive comments and suggestions to improve the presentation of the paper. This work was supported in part by the National Science Council of Taiwan, under Kuo-Lung Wu's Grant: NSC-9-8-M-68- and Miin-Shen Yang's Grant: NSC-9-8-M---MY.

References

[1] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[2] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, Cambridge.
[3] F. Camastra, A. Verri, A novel kernel method for clustering, IEEE Trans. Pattern Anal. Mach. Intell. 7 () 8 8.
[4] M. Girolami, Mercer kernel based clustering in feature space, IEEE Trans. Neural Networks ().
[5] K.R. Muller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf, An introduction to kernel-based learning algorithm, IEEE Trans. Neural Networks () 8.
[6] B.W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall, New York, 1998.
[7] K. Fukunaga, L.D. Hostetler, The estimation of the gradient of a density function with applications in pattern recognition, IEEE Trans. Inf. Theory (97).
[8] Y. Cheng, Mean shift, mode seeking, and clustering, IEEE Trans. Pattern Anal. Mach. Intell. 7 (99).
[9] R.R. Yager, D.P. Filev, Approximate clustering via the mountain method, IEEE Trans. Syst. Man Cybern. (99).
[10] D. Comaniciu, An algorithm for data-driven bandwidth selection, IEEE Trans. Pattern Anal. Mach. Intell. ().
[11] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell. ().
[12] K.I. Kim, K. Jung, J.H. Kim, Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm, IEEE Trans. Pattern Anal. Mach. Intell. ().
[13] X. Yang, J. Liu, Unsupervised texture segmentation with one-step mean shift and boundary Markov random fields, Pattern Recognition Lett. () 7 8.
[14] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell. ().
[15] N.S. Peng, J. Yang, Z. Liu, Mean shift blob tracking with kernel histogram filtering and hypothesis testing, Pattern Recognition Lett. 6 () 6 6.
[16] O. Debeir, P.V. Ham, R. Kiss, C. Decaestecker, Tracking of migrating cells under phase-contrast video microscopy with combined mean-shift processes, IEEE Trans. Med. Imaging ().
[17] H. Chen, P. Meer, Robust fusion of uncertain information, IEEE Trans. Syst. Man Cybern. B Cybern. ().
[18] M. Fashing, C. Tomasi, Mean shift is a bound optimization, IEEE Trans. Pattern Anal. Mach. Intell. 7 () 7 7.
[19] S.L. Chiu, Fuzzy model identification based on cluster estimation, J. Intell. Fuzzy Syst. (99).
[20] M.S. Yang, K.L. Wu, A modified mountain clustering algorithm, Pattern Anal. Appl. 8 () 8.
[21] N.R. Pal, D. Chakraborty, Mountain and subtractive clustering method: improvements and generalizations, Int. J. Intell. Syst. () 9.
[22] B.P. Velthuizen, L.O. Hall, L.P. Clarke, M.L. Silbiger, An investigation of mountain method clustering for large data sets, Pattern Recognition (1997).
[23] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, New York, 99.
[24] G. Beni, X. Liu, A least biased fuzzy clustering method, IEEE Trans. Pattern Anal. Mach. Intell. 6 (99).
[25] P.J. Huber, Robust Statistics, Wiley, New York, 98.
[26] X. Zhuang, T. Wang, P. Zhang, A highly robust estimator through partially likelihood function modeling and its application in computer vision, IEEE Trans. Pattern Anal. Mach. Intell. (99) 9.
[27] K.L. Wu, M.S. Yang, Alternative c-means clustering algorithms, Pattern Recognition ().
[28] C.V. Stewart, Minpran: a new robust estimator for computer vision, IEEE Trans. Pattern Anal. Mach. Intell. 7 (99).
[29] R.N. Dave, R. Krishnapuram, Robust clustering methods: a unified view, IEEE Trans. Fuzzy Syst. (1997) 7 9.
[30] H. Frigui, R. Krishnapuram, A robust competitive clustering algorithm with applications in computer vision, IEEE Trans. Pattern Anal. Mach. Intell. (1999) 6.
[31] K. Rose, E. Gurewitz, G.C. Fox, Statistical mechanics and phase transitions in clustering, Phys. Rev. Lett. 6 (99).
[32] K. Rose, E. Gurewitz, G.C. Fox, A deterministic annealing approach to clustering, Pattern Recognition Lett. (99).
[33] S. Chen, D. Zhang, Robust image segmentation using FCM with spatial constraints based on new kernel-induced distance measure, IEEE Trans. Syst. Man Cybern. B Cybern. ().
[34] M.N. Ahmed, S.M. Yamany, N. Mohamed, A.A. Farag, T. Moriarty, A modified fuzzy c-means algorithm for bias field estimation and segmentation of MRI data, IEEE Trans. Med. Imaging ().
[35] M.S. Yang, K.L. Wu, A similarity-based robust clustering method, IEEE Trans. Pattern Anal. Mach. Intell. 6 () 8.
[36] R. Krishnapuram, J.M. Keller, A possibilistic approach to clustering, IEEE Trans. Fuzzy Syst. (99) 98.
[37] R. Krishnapuram, H. Frigui, O. Nasraoui, Fuzzy and possibilistic shell clustering algorithm and their application to boundary detection and surface approximation, IEEE Trans. Fuzzy Syst. (99) 9 6.
[38] H. Frigui, R. Krishnapuram, Clustering by competitive agglomeration, Pattern Recognition (1997).
[39] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithm, Plenum Press, New York, 98.
[40] J.C. Bezdek, Cluster validity with fuzzy sets, J. Cybern. (97) 8 7.
[41] Y. Fukuyama, M. Sugeno, A new method of choosing the number of clusters for fuzzy c-means method, in: Proceedings of the Fuzzy System Symposium, 1989 (in Japanese).
[42] X.L. Xie, G. Beni, A validity measure for fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. (99).
[43] K.L. Wu, M.S. Yang, A cluster validity index for fuzzy clustering, Pattern Recognition Lett. 6 () 7 9.
[44] C.L. Blake, C.J. Merz, UCI repository of machine learning databases, a huge collection of artificial and real-world data sets, 1998. Available from: htt:// mlearn/mlreository.html.

About the Author

KUO-LUNG WU received the BS degree in mathematics in 1997, and the MS and PhD degrees in applied mathematics, all from the Chung Yuan Christian University, Chungli, Taiwan. He has since been an Assistant Professor in the Department of Information Management at Kun Shan University of Technology, Tainan, Taiwan. He is a member of the Phi Tau Phi Scholastic Honor Society of Taiwan. His research interests include fuzzy theorem, cluster analysis, pattern recognition and neural networks.

About the Author

MIIN-SHEN YANG received the BS degree in mathematics from the Chung Yuan Christian University, Chungli, Taiwan, in 1977, the MS degree in applied mathematics from the National Chiao-Tung University, Hsinchu, Taiwan, and the PhD degree in statistics from the University of South Carolina, Columbia, USA, in 1989. In 1989, he was an Associate Professor of the Department of Applied Mathematics at the Chung Yuan Christian University. He later became a Professor at the same university, where he also served as the Chairman of the Department of Applied Mathematics. He has also been a Visiting Professor with the Department of Industrial Engineering, University of Washington, Seattle, USA. His current research interests include applications of statistics, fuzzy clustering, neural fuzzy systems, pattern recognition and machine learning. Dr. Yang is an Associate Editor of the IEEE Transactions on Fuzzy Systems.