A Novel Fuzzy Clustering Method for Outlier Detection in Data Mining


Binu Thomas (1) and Raju G (2)
(1) Research Scholar, Mahatma Gandhi University, Kerala, India. binumarian@rediffmail.com
(2) SCMS School of Technology & Management, Cochin, Kerala, India. kurupgrau@rediffmail.com

Abstract: In data mining, the conventional clustering algorithms have difficulties in handling the challenges posed by the collection of natural data, which is often vague and uncertain. Fuzzy clustering methods have the potential to manage such situations efficiently. This paper introduces the limitations of conventional clustering methods through k-means and fuzzy c-means clustering and demonstrates the drawbacks of these algorithms in handling outlier points. We propose a new fuzzy clustering method which is more efficient in handling outlier points than the conventional fuzzy c-means algorithm. The new method excludes outlier points by giving them extremely small membership values in existing clusters, while the fuzzy c-means algorithm tends to give them outsized membership values. The new algorithm also incorporates the positive aspects of the k-means algorithm in calculating the new cluster centers more efficiently than the c-means method.

Index Terms: fuzzy clustering, outlier points, knowledge discovery, c-means algorithm

I. INTRODUCTION

The process of finding useful patterns and information from raw data is often known as knowledge discovery in databases, or KDD. Data mining is a particular step in this process, involving the application of specific algorithms for extracting patterns (models) from data [5]. Cluster analysis is a technique for breaking data down into related components in such a way that patterns and order become visible. It aims at sifting through large volumes of data in order to reveal useful information in the form of new relationships, patterns, or clusters, for decision-making by a user.
Clusters are natural groupings of data items based on similarity metrics or probability density models. Clustering algorithms map a new data item into one of several known clusters. In fact, cluster analysis has the virtue of strengthening the exposure of patterns and behavior as more and more data becomes available [7]. A cluster has a center of gravity, which is basically the weighted average of the cluster. Membership of a data item in a cluster can be determined by measuring the distance from each cluster center to the data point [6]. The data item is added to the cluster for which this distance is a minimum.

This paper provides an overview of the crisp clustering technique, the advantages and limitations of fuzzy c-means clustering, and a new fuzzy clustering method which is simple and superior to c-means clustering in handling outlier points. Section 2 describes the basic notions of clustering and also introduces the k-means clustering algorithm. In Section 3 we explain the concept of vagueness and uncertainty in natural data. Section 4 introduces the fuzzy clustering method and describes how it can handle vagueness and uncertainty with the concept of overlapping clusters with partial membership functions. The same section also introduces the most common fuzzy clustering algorithm, namely the c-means algorithm, and ends with its limitations. Section 5 proposes the new fuzzy clustering method. Section 6 demonstrates the concepts presented in the paper. Finally, Section 7 concludes the paper.

II. CRISP CLUSTERING TECHNIQUES

Traditional clustering techniques attempt to segment data by grouping related attributes in uniquely defined clusters. Each data point in the sample space is assigned to only one cluster. The k-means algorithm and its different variations are the most well-known and commonly used partitioning methods. The value k stands for the number of cluster seeds initially provided for the algorithm.
This algorithm takes the input parameter k and partitions a set of m objects into k clusters [7]. The technique works by computing the distance between a data point and the cluster center to add an item into one of the clusters, so that intra-cluster similarity is high but inter-cluster similarity is low. A common method to find the distance is to calculate the sum of the squared differences as follows, which is known as the Euclidean distance [10] (exp. 1):

$d_k = \sqrt{\sum_{i=1}^{n} (X_{ki} - C_{ji})^2}$    (1)

where
d_k : is the distance of the k-th data point from C_j
n : is the number of attributes of a data point

With this definition of the distance of a data point from the cluster centers, the k-means algorithm is fairly simple. The cluster centers are randomly initialized, and each data point x_i is assigned to the cluster to which it has minimum distance. When all the data points have been assigned to clusters, new cluster centers are calculated by finding the average of all data points in each cluster. The cluster center calculation causes the previous centroid location to move towards the center of the cluster set. This is continued until there is no change in the cluster centers.
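The k-means steps described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the sample points, the initial centers, and the choice to leave an empty cluster's center unchanged are all assumptions made for the example.

```python
import math


def euclidean(x, c):
    """Euclidean distance between a data point x and a cluster center c (exp. 1)."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))


def k_means(points, centers, max_iter=100):
    """Crisp k-means: assign each point to its nearest center, then recompute centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: every point joins exactly one cluster.
        clusters = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda j: euclidean(x, centers[j]))
            clusters[nearest].append(x)
        # Update step: the new center is the mean of the cluster's members.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    tuple(sum(p[d] for p in members) / len(members) for d in range(dim))
                )
            else:
                new_centers.append(centers[j])  # assumption: empty cluster keeps its center
        if new_centers == centers:  # no change in cluster centers: stop
            break
        centers = new_centers
    return centers, clusters


# Hypothetical data: two tight groups plus one distant point at (30, 30).
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0), (30.0, 30.0)]
centers, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centers)
print(clusters)
```

In this sketch the distant point is still assigned to the second cluster and drags its center away from the three nearby points, which is the crisp-assignment behaviour discussed in the next subsection.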

A. Limitations of k-means algorithm

The main limitation of the algorithm comes from its crisp nature in assigning cluster membership to data points. Depending on the minimum distance, a data point always becomes a member of one of the clusters. This works well with highly structured data. Real-world data, however, is almost never arranged in clear-cut groups. Instead, clusters have ill-defined boundaries that smear into the data space, often overlapping the perimeters of surrounding clusters [4]. In most cases real-world data also contains apparently extraneous data points that clearly do not belong to any of the clusters; these are called outlier points. The k-means algorithm is not capable of dealing with overlapping clusters and outlier points, since it has to include every data point in one of the existing clusters. Because of this, even extreme outlier points will be included in a cluster based on the minimum distance.

III. FUZZY LOGIC

The modeling of imprecise and qualitative knowledge, as well as the handling of uncertainty at various stages, is possible through the use of fuzzy sets. Fuzzy logic is capable of supporting, to a reasonable extent, human-type reasoning in natural form by allowing partial membership for data items in fuzzy subsets [2]. Integration of fuzzy logic with data mining techniques has become one of the key constituents of soft computing in handling the challenges posed by the massive collection of natural data [1]. Fuzzy logic is the logic of fuzzy sets. A fuzzy set has, potentially, an infinite range of truth values between one and zero [3]. Propositions in fuzzy logic have a degree of truth, and membership in fuzzy sets can be fully inclusive, fully exclusive, or some degree in between [3]. A fuzzy set is distinct from a crisp set in that it allows its elements to have a degree of membership.
The core of a fuzzy set is its membership function: a function which defines the relationship between a value in the set's domain and its degree of membership in the fuzzy set (exp. 2). The relationship is functional because it returns a single degree of membership for any value in the domain [].

$\mu = f(s, x)$    (2)

Here,
µ : is the fuzzy membership value for the element
s : is the fuzzy set
x : is the value from the underlying domain.

Fuzzy sets provide a means of defining a series of overlapping concepts for a model variable, since they represent degrees of membership. The values from the complete universe of discourse for a variable can have memberships in more than one fuzzy set.

IV. FUZZY CLUSTERING METHODS

The central idea in fuzzy clustering is the non-unique partitioning of the data in a collection of clusters. The data points are assigned membership values for each of the clusters. The fuzzy clustering algorithms allow the clusters to grow into their natural shapes [5]. In some cases the membership value may be zero, indicating that the data point is not a member of the cluster under consideration. Many crisp clustering techniques have difficulties in handling extreme outliers, but fuzzy clustering algorithms tend to give them a very small membership degree in surrounding clusters [4]. The non-zero membership values, with a maximum of one, show the degree to which the data point represents a cluster. Thus fuzzy clustering provides a flexible and robust method for handling natural data with vagueness and uncertainty. In fuzzy clustering, each data point has an associated degree of membership for each cluster. The membership value is in the range zero to one and indicates the strength of its association with that cluster.

A. C-means fuzzy clustering algorithm [10]

Fuzzy c-means clustering involves two processes: the calculation of cluster centers and the assignment of points to these centers using a form of Euclidean distance. This process is repeated until the cluster centers stabilize. The algorithm is similar to k-means clustering in many ways, but it assigns a membership value to the data items for the clusters within a range of 0 to 1. So it incorporates the fuzzy set's concept of partial membership and forms overlapping clusters to support it. The algorithm needs a fuzzification parameter m in the range [1, n] which determines the degree of fuzziness in the clusters. When m reaches the value of 1 the algorithm works like a crisp partitioning algorithm, and for larger values of m the clusters tend to overlap more. The algorithm calculates the membership value µ with the formula

$\mu_j(x_i) = \frac{(1/d_{ji})^{1/(m-1)}}{\sum_{k=1}^{p} (1/d_{ki})^{1/(m-1)}}$    (3)

where
µ_j(x_i) : is the membership of x_i in the j-th cluster
d_ji : is the distance of x_i in cluster c_j
m : is the fuzzification parameter
p : is the number of specified clusters
d_ki : is the distance of x_i in cluster C_k

The new cluster centers are calculated with these membership values using exp. 4.

$c_j = \frac{\sum_{i} [\mu_j(x_i)]^m x_i}{\sum_{i} [\mu_j(x_i)]^m}$    (4)

where
C_j : is the center of the j-th cluster
x_i : is the i-th data point
µ_j : is the function which returns the membership
m : is the fuzzification parameter

This is a special form of weighted average. We modify the degree of fuzziness in x_i's current membership and multiply this by x_i. The product obtained is divided by the sum of the fuzzified memberships. The first loop of the algorithm calculates membership values for the data points in the clusters, and the second loop recalculates the cluster centers using these membership values. When the cluster centers stabilize (when there is no change), the algorithm ends.

B. Limitations of the algorithm

The fuzzy c-means approach to clustering suffers from several constraints that affect its performance [10]. The main drawback comes from the restriction that the sum of the membership values of a data point x_i in all the clusters must be one, as in Expression (5), and this tends to give high membership values to the outlier points. So the algorithm has difficulty in handling outlier points. Secondly, the membership of a data point in a cluster depends directly on its membership values in the other clusters, and this sometimes produces undesirable results.

$\sum_{j=1}^{p} \mu_j(x_i) = 1$    (5)

In the fuzzy c-means method a point has partial membership in all the clusters. Exp. (4) for calculating the new cluster centers finds a special form of weighted average of all the data points. The third limitation of the algorithm is that, due to the influence (partial membership) of all the data members, the cluster centers tend to move towards the center of all the data points [10]. The fourth constraint of the algorithm is its inability to calculate the membership value if the distance of a data point is zero (exp. 3).

TABLE 1. FUZZY C-MEANS ALGORITHM
initialize p = number of clusters
initialize m = fuzzification parameter
initialize C_j (cluster centers)
Repeat
    For i = 1 to n : Update µ_j(x_i) applying (3)
    For j = 1 to p : Update C_j with (4) using current µ_j(x_i)
Until C_j estimates stabilize

V. THE NEW FUZZY CLUSTERING METHOD

The new fuzzy clustering method we propose removes the restriction imposed by exp. (5). Due to this constraint the c-means algorithm tends to give higher membership values to the outlier points. In the c-means algorithm, the membership of a point in a cluster is calculated based on its membership in the other clusters. Many limitations of the algorithm arise from this, and in the new method the membership of a point in a cluster depends only on its distance in that cluster. For calculating the membership values, we use a new, simple expression as given in exp. (6).

$\mu_j(x_i) = \frac{\mathrm{Max}(d_j) - d_{ji}}{\mathrm{Max}(d_j)}$    (6)

where
µ_j(x_i) : is the membership of x_i in the j-th cluster
d_ji : is the distance of x_i in cluster c_j
Max(d_j) : is the maximum distance in the cluster c_j

Since Max(d_j) / Max(d_j) = 1, the above membership function (exp. 6) generates values closer to one for smaller distances d_ji and a membership value of zero for the maximum distance. If the distance of a data point is zero, the function returns a membership value of one, and thus it overcomes the fourth constraint of the c-means algorithm. The membership values are calculated based only on the distance of a data member in the cluster, and because of this the method does not suffer from the first and second constraints of the c-means algorithm.

To overcome the third limitation of the c-means algorithm in calculating new cluster centers, the new method inherits features of the k-means algorithm. For this purpose, a data point is completely assigned to the cluster in which it has maximum membership, and the point is used only for the calculation of the new cluster center of that cluster. This way the influence of a data point on the calculation of all the cluster centers can be avoided. The point is used for the calculation only if its membership value is at or above a threshold value α. This way we can ensure that outlier points are not considered in the new cluster center calculations.
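To make the contrast concrete, the following Python sketch evaluates the c-means membership of expression (3), the center update of expression (4), and the proposed membership of expression (6). It is an illustration under stated assumptions, not the authors' code: the function names and the sample distances are hypothetical, m is assumed to be greater than 1 so that the exponent 1/(m-1) is defined, and Max(d_j) is read here as the largest distance of any data point from center c_j.

```python
def fcm_memberships(distances, m=2.0):
    """Expression (3): memberships of one point in each of the p clusters.

    distances[k] is the distance of the point from cluster center k; assumes m > 1.
    """
    if any(d == 0 for d in distances):
        # The fourth limitation: 1/d is undefined for a zero distance.
        raise ZeroDivisionError("expression (3) is undefined for a zero distance")
    weights = [(1.0 / d) ** (1.0 / (m - 1.0)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


def fcm_center(points, memberships, m=2.0):
    """Expression (4): membership-weighted average of all points for one cluster."""
    weights = [u ** m for u in memberships]
    total = sum(weights)
    dim = len(points[0])
    return tuple(sum(w * p[k] for w, p in zip(weights, points)) / total
                 for k in range(dim))


def new_membership(d, max_d):
    """Expression (6): depends only on the point's distance in this one cluster."""
    return (max_d - d) / max_d


# Hypothetical outlier, far from both of two cluster centers.
outlier_distances = [300.0, 200.0]
# c-means: the two memberships are forced to sum to one.
print(fcm_memberships(outlier_distances))
# New method: the point at the maximum distance in each cluster gets zero.
print([new_membership(d, d) for d in outlier_distances])
```

Because expression (3) normalizes over all clusters, even a very distant point receives memberships that add up to one, whereas expression (6) evaluates each cluster separately and can return zero for every cluster.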
Like crisp clustering algorithms such as k-means, most fuzzy clustering algorithms are sensitive to the selection of initial centroids. The effectiveness of the algorithms depends on the initial cluster seeds [7]. For better results we suggest using the k-means algorithm to obtain the initial cluster seeds. This helps the algorithm converge to good results. In the new method, since the membership values vary strictly from 0 to 1, we find that initializing the threshold value α to 0.5 produces better results in the cluster center calculation.
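The procedure described in this section can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the function names, the sample data, the tolerance used to decide that the centers have stabilized, and the rule that a cluster with no qualifying points keeps its previous center are assumptions. Max(d_j) is again taken as the largest distance of any data point from center c_j, and the initial centers are passed in directly rather than computed with k-means.

```python
import math


def distance(x, c):
    """Euclidean distance between a point and a cluster center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))


def memberships(points, centers):
    """Expression (6); result[j][i] is the membership of point i in cluster j."""
    result = []
    for c in centers:
        d = [distance(x, c) for x in points]
        max_d = max(d)  # assumption: maximum over all data points
        result.append([(max_d - di) / max_d if max_d > 0 else 1.0 for di in d])
    return result


def new_fuzzy_clustering(points, centers, alpha=0.5, max_iter=100, tol=1e-9):
    """Sketch of the proposed method: fuzzy memberships, crisp center update."""
    centers = [tuple(c) for c in centers]
    dim = len(points[0])
    for _ in range(max_iter):
        mu = memberships(points, centers)
        new_centers = []
        for k in range(len(centers)):
            total, count = [0.0] * dim, 0
            for i, x in enumerate(points):
                # A point contributes only to the cluster where its membership
                # is maximal, and only if that membership reaches alpha.
                best = max(range(len(centers)), key=lambda j: mu[j][i])
                if best == k and mu[k][i] >= alpha:
                    total = [t + v for t, v in zip(total, x)]
                    count += 1
            # Assumption: a cluster with no qualifying points keeps its center.
            new_centers.append(tuple(t / count for t in total) if count else centers[k])
        moved = max(distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved < tol:  # centers have stabilized
            break
    return centers, memberships(points, centers)


# Hypothetical data: two tight groups plus one distant point at (30, 30).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (30, 30)]
centers, mu = new_fuzzy_clustering(points, [(0.0, 0.0), (10.0, 10.0)])
print(centers)
print(mu[0][-1], mu[1][-1])  # memberships of the distant point
```

On this made-up data the distant point is the farthest point from both centers, so its membership is zero in both clusters and it never enters a center calculation.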

TABLE 2. THE NEW FUZZY CLUSTERING ALGORITHM
initialize p = number of clusters
initialize C_j (cluster centers)
initialize α (threshold value)
Repeat
    For i = 1 to n : Update µ_j(x_i) applying (6)
    For k = 1 to p :
        Sum = 0
        Count = 0
        For i = 1 to n :
            If µ(x_i) is maximum in C_k then
                If µ(x_i) >= α
                    Sum = Sum + x_i
                    Count = Count + 1
        C_k = Sum / Count
Until C_j estimates stabilize

The fuzzy membership values found with expression 6 are used in the new fuzzy clustering algorithm given in Table 2. The algorithm stops when the cluster centers stabilize. The algorithm is more efficient in handling data with outlier points and in overcoming the other constraints imposed by the c-means algorithm. The new method we propose is also superior in the calculation of new cluster centers, which we demonstrate in the next section with two numerical examples.

VI. ILLUSTRATIONS

In order to assess the effectiveness of the new algorithm, we applied it to a small synthetic data set to demonstrate the functioning of the algorithm in detail. The algorithm was also tested with real data collected for Bhutan's Gross National Happiness (GNH) program.

A. With synthetic data

Consider the sales of an item from different shops at different periods of the year, as shown in Figure 1. There is an outlier point at (2,400). If we start data exploration with two random cluster centers at C1(.5,50) and C2(8,50), the algorithm ends with the cluster centers at C1(.52,66.20) and C2(5.49,72.). We also applied the c-means algorithm to the same data set and found that it converges to cluster centers C1(2.79,47.65) and C2(3.29, ).

Figure 1. The data points and final overlapping clusters

We also found that the algorithm is far superior to c-means in handling outlier points (see Table 3). The new algorithm is capable of giving very low membership values to the outlier points. If we start with C1 at (.5,50), the algorithm takes four steps to converge the first cluster center to (.52,66.20). Similarly, the second cluster center C2 converges to (5.49,72.) from (8,50) in four steps. The algorithm finally ends with the final cluster centers at C1(.5,66.2) and C2(5.4,72.) and with two overlapping clusters as shown in Figure 1. The outlier point is given zero membership in both clusters.

From Table 3, it is clear that the new fuzzy clustering method we propose is better than the conventional c-means algorithm in handling outlier points and in the calculation of new cluster centers. Due to the constraint given in expression 5, the c-means algorithm gives membership values .37 and .63 to the outlier point (2,400), so that the sum of the memberships of this point is one. The new algorithm gives zero membership to this point.

TABLE 3. C-MEANS AND NEW METHOD
Methods       Final Cluster Centers            Outlier membership (Point - 2,400)
C-means       C1 2.79,47.65    C2 3.29,        .37 and .63
New Method    C1 .52,66.20     C2 5.49,72.     0

B. With natural data

Happiness and satisfaction are directly related to a community's ability to meet its basic needs, and these are important factors in safeguarding physical health. The unique concept of Bhutan's Gross National Happiness (GNH) depends on nine factors such as health, ecosystem, and emotional well-being [16]. The GNH regional chapter at Sherubtse College conducted a survey among 3 villagers, and the responses were converted into numeric values. For the analysis of the new method we took the attributes income and health index, as shown in Figure 2.

Figure 2. The data points

TABLE 4. THE FINAL CLUSTER CENTERS (columns: Centers; C-means X, Y; New Method X, Y; rows: C1, C2, C3)

As we can see from the data set, in Bhutan the low income group maintains better health than the high income group, since they are self-sufficient in many ways. Like any other natural data, this data set also contains many outlier points which do not belong to any of the groups. If we apply the c-means algorithm, these points tend to get higher membership values due to exp. 5. To start the data analysis, we first applied the k-means algorithm to find the initial three cluster centers. The algorithm ended with three cluster centers at C1(24243,6.7), C2(69794,5.) and C3(979.29,2.72). We applied these initial values in both the c-means algorithm and the new method to analyze the data, and the algorithms ended with the centroids given in Table 4.

From Figure 2 and Table 4, it can be seen that the final centroids of the c-means method do not represent the actual centers of the clusters, and this is due to the influence of outlier points. The memberships of all the points are also considered in the calculation of every cluster center, so the cluster centers tend to move towards the centre of all the points. The new method identifies the cluster centers in a better way. The efficiency of the new algorithm lies in its ability to give very low membership values to outlier points and to consider a point only for the calculation of one cluster centre.

VII. CONCLUSION

A good clustering algorithm produces high quality clusters, yielding low inter-cluster similarity and high intra-cluster similarity. Many conventional clustering algorithms such as k-means and fuzzy c-means achieve this on crisp and highly structured data, but they have difficulties in handling unstructured natural data, which often contains outlier data points. The proposed new fuzzy clustering algorithm combines the positive aspects of both crisp and fuzzy clustering algorithms.
It is more efficient in handling natural data with outlier points than both the k-means and fuzzy c-means algorithms. It achieves this by assigning very low membership values to the outlier points. However, the algorithm has limitations in exploring highly structured crisp data which is free from outlier points. The efficiency of the algorithm has to be further tested on a comparatively larger data set.

REFERENCES

[1] Sankar K. Pal, P. Mitra, Data Mining in Soft Computing Framework: A Survey, IEEE Transactions on Neural Networks, vol. 3, no., January 2002
[2] R. Kruse, C. Borgelt, Fuzzy Data Analysis Challenges and Perspective, kruse99fuzzy.html
[3] G. Raju, Th. Shanta Kumar, Binu Thomas, Integration of Fuzzy Logic in Data Mining: A Comparative Case Study, Proc. of International Conf. on Mathematics and Computer Science, Loyola College, Chennai, 28-36, 2008
[4] Maria Halkidi, Quality Assessment and Uncertainty Handling in Data Mining Process, halkidi00quality.html
[5] W. H. Inmon, The data warehouse and data mining, Commun. ACM, vol. 39, pp , 1996.
[6] U. Fayyad and R. Uthurusamy, Data mining and knowledge discovery in databases, Commun. ACM, vol. 39, pp , 1996.
[7] Pavel Berkhin, Survey of Clustering Data Mining Techniques,
[8] Chau, M., Cheng, R., and Kao, B., Uncertain Data Mining: A New Research Direction, /~mchau/papers/uncertaindatamining_wsa.pdf
[9] Keith C.C, C. Wai-Ho Au, B. Choi, Mining Fuzzy Rules in A Donor Database for Direct Marketing by A Charitable Organization, Proc. of First IEEE International Conference on Cognitive Informatics, pp: , 2002
[10] E. Cox, Fuzzy Modeling And Genetic Algorithms For Data Mining And Exploration, Elsevier, 2005
[11] G. J. Klir, T. A. Folger, Fuzzy Sets, Uncertainty and Information, Prentice Hall, 1988
[12] J. Han, M. Kamber, Data Mining Concepts and Techniques, Elsevier, 2003
[13] J. C. Bezdek, Fuzzy Mathematics in Pattern Classification, Ph.D. thesis, Center for Applied Mathematics, Cornell University, Ithaca, N.Y., 1973.
[14] Carl G. Looney, A Fuzzy Clustering and Fuzzy Merging Algorithm,
[15] Frank Klawonn, Anette Keller, Fuzzy Clustering Based on Modified Distance Measures,
[16] Sullen Donnelly, How Bhutan Can Develop and Measure GNH, gnh/gnh-papers-st_8-20.pdf
[17] Lei Jiang and Wenhui Yang, A Modified Fuzzy C-Means Algorithm for Segmentation of Magnetic Resonance Images, Proc. VIIth Digital Image Computing: Techniques and Applications, pp , 10-12 Dec. 2003, Sydney.