Efficient and Effective Clustering Methods for Spatial Data Mining


Raymond T. Ng
Department of Computer Science
University of British Columbia
Vancouver, B.C., V6T 1Z4, Canada

Jiawei Han
School of Computing Sciences
Simon Fraser University
Burnaby, B.C., V5A 1S6, Canada
han@cs.sfu.ca

Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Abstract

Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. In this paper, we explore whether clustering methods have a role to play in spatial data mining. To this end, we develop a new clustering method called CLARANS which is based on randomized search. We also develop two spatial data mining algorithms that use CLARANS. Our analysis and experiments show that with the assistance of CLARANS, these two algorithms are very effective and can lead to discoveries that are difficult to find with current spatial data mining algorithms. Furthermore, experiments conducted to compare the performance of CLARANS with that of existing clustering methods show that CLARANS is the most efficient.

1 Introduction

Data mining in general is the search for hidden patterns that may exist in large databases. Spatial data mining in particular is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. Because of the huge amounts (usually, terabytes) of spatial data that may be obtained from satellite images, medical equipment, video cameras, etc., it is costly and often unrealistic for users to examine spatial data in detail.
Spatial data mining aims to automate such a knowledge discovery process. Thus, it plays an important role in a) extracting interesting spatial patterns and features; b) capturing intrinsic relationships between spatial and non-spatial data; c) presenting data regularity concisely and at higher conceptual levels; and d) helping to reorganize spatial databases to accommodate data semantics, as well as to achieve better performance.

Many excellent studies on data mining have been conducted, such as those reported in [1, 2, 4, 7, 11, 13, 16]. [1] considers the problem of inferring classification functions from samples; [2] studies the problem of mining association rules between sets of data items; [7] proposes an attribute-oriented approach to knowledge discovery; [11] develops a visual feedback querying system to support data mining; and [16] includes many interesting studies on various issues in knowledge discovery such as finding functional dependencies between attributes. However, most of these studies are concerned with knowledge discovery on non-spatial data, and the study most relevant to our focus here is [13], which studies spatial data mining. More specifically, [13] proposes a spatial data-dominant knowledge-extraction algorithm and a non-spatial data-dominant one, both of which aim to extract high-level relationships between spatial and non-spatial data. However, both algorithms suffer from the following problems. First, the user or an expert must provide the algorithms with spatial concept hierarchies, which may not be available in many applications. Second, both algorithms conduct their spatial exploration primarily by merging regions at a certain level of the hierarchy to a larger region at a higher level. Thus, the quality of the results produced by both algorithms relies quite crucially on the appropriateness of the hierarchy to the given data. The problem for most applications is that it is very difficult to know a priori which hierarchy will be the most appropriate. Discovering this hierarchy may itself be one of the reasons to apply spatial data mining.

To deal with these problems, we explore whether cluster analysis techniques are applicable. Cluster analysis is a branch of statistics that in the past three decades has been intensely studied and successfully applied to many applications. For the spatial data mining task at hand, the attractiveness of cluster analysis is its ability to find structures or clusters directly from the given data, without relying on any hierarchies. However, cluster analysis has been applied rather unsuccessfully in the past to general data mining and machine learning. The complaints are that cluster analysis algorithms are ineffective and inefficient. Indeed, for cluster analysis algorithms to work effectively, there needs to be a natural notion of similarities among the objects to be clustered. And traditional cluster analysis algorithms are not designed for large data sets, say more than 2000 objects.

For spatial data mining, our approach here is to apply cluster analysis only to the spatial attributes, for which natural notions of similarities exist (e.g. Euclidean or Manhattan distances). As will be shown in this paper, in this way, cluster analysis techniques are effective for spatial data mining. As for the efficiency concern, we develop our own cluster analysis algorithm, called CLARANS, which is designed for large data sets. More specifically, we will report in this paper: the development of CLARANS, which is based on randomized search and is partly motivated by two existing algorithms well-known in cluster analysis, called PAM and CLARA; and the development of two spatial mining algorithms SD(CLARANS) and NSD(CLARANS).
Given the nature of spatial data mining, and the fact that CLARANS is based on randomized search, the methodology we have adopted here is one based on experimentation. In particular, we will present: experimental results showing that CLARANS is more efficient than the existing algorithms PAM and CLARA; and experimental evidence and analysis demonstrating the effectiveness of SD(CLARANS) and NSD(CLARANS) for spatial data mining.

The paper is organized as follows. Section 2 introduces PAM and CLARA. Section 3 presents our clustering algorithm CLARANS, as well as experimental results comparing the performance of CLARANS, PAM and CLARA. Section 4 studies spatial data mining and presents two spatial data mining algorithms, SD(CLARANS) and NSD(CLARANS). Section 5 gives an experimental evaluation on the effectiveness of SD(CLARANS) and NSD(CLARANS) for spatial data mining. Section 6 discusses how SD(CLARANS) and NSD(CLARANS) can assist in further spatial discoveries, and how they can contribute towards the building of a general-purpose and powerful spatial data mining package in the future.

2 Clustering Algorithms based on Partitioning

2.1 PAM

In the past 30 years, cluster analysis has been widely applied to many areas such as medicine (classification of diseases), chemistry (grouping of compounds), social studies (classification of statistical findings), and so on. Its main goal is to identify structures or clusters present in the data. While there is no general definition of a cluster, algorithms have been developed to find several kinds of clusters: spherical, linear, drawn-out, etc. See [10, 18] for more detailed discussions and analyses of these issues. Among all the existing clustering algorithms, we have chosen the k-medoid methods as the basis of our algorithm for the following reasons. First, unlike many other partitioning methods, the k-medoid methods are very robust to the existence of outliers (i.e. data points that are very far away from the rest of the data points).
Second, clusters found by k-medoid methods do not depend on the order in which the objects are examined. Furthermore, they are invariant with respect to translations and orthogonal transformations of data points. Last but not least, experiments have shown that the k-medoid methods described below can handle very large data sets quite efficiently. See [10] for a more detailed comparison of k-medoid methods with other partitioning methods. In this section, we present the two best-known k-medoid methods on which our algorithm is based.

PAM (Partitioning Around Medoids) was developed by Kaufman and Rousseeuw [10]. To find k clusters, PAM's approach is to determine a representative object for each cluster. This representative object, called a medoid, is meant to be the most centrally located object within the cluster. Once the medoids have been selected, each non-selected object is grouped with the medoid to which it is the most similar. More precisely, if $O_j$ is a non-selected object, and $O_i$ is a (selected) medoid, we say that $O_j$ belongs to the cluster represented by $O_i$ if $d(O_j, O_i) = \min_{O_e} d(O_j, O_e)$, where the notation $\min_{O_e}$ denotes the minimum over all medoids $O_e$, and the notation $d(O_a, O_b)$ denotes the dissimilarity or distance between objects $O_a$ and $O_b$. All the dissimilarity values are given as inputs to PAM. Finally, the quality of a clustering (i.e. the combined quality of the chosen medoids) is measured by the average dissimilarity between an object and the medoid of its cluster.

To find the k medoids, PAM begins with an arbitrary selection of k objects. Then in each step, a swap between a selected object $O_i$ and a non-selected object $O_h$ is made, as long as such a swap would result in an improvement of the quality of the clustering. In particular, to calculate the effect of such a swap between $O_i$ and $O_h$, PAM computes costs $C_{jih}$ for all non-selected objects $O_j$. Depending on which of the following cases $O_j$ is in, $C_{jih}$ is defined by one of the equations below.

First Case: suppose $O_j$ currently belongs to the cluster represented by $O_i$. Furthermore, let $O_j$ be more similar to $O_{j,2}$ than to $O_h$, i.e. $d(O_j, O_h) \ge d(O_j, O_{j,2})$, where $O_{j,2}$ is the second most similar medoid to $O_j$. Thus, if $O_i$ is replaced by $O_h$ as a medoid, $O_j$ would belong to the cluster represented by $O_{j,2}$. Hence, the cost of the swap as far as $O_j$ is concerned is:

$$C_{jih} = d(O_j, O_{j,2}) - d(O_j, O_i) \qquad (1)$$

This equation always gives a non-negative $C_{jih}$, indicating that there is a non-negative cost incurred in replacing $O_i$ with $O_h$.

Second Case: $O_j$ currently belongs to the cluster represented by $O_i$. But this time, $O_j$ is less similar to $O_{j,2}$ than to $O_h$, i.e. $d(O_j, O_h) < d(O_j, O_{j,2})$. Then, if $O_i$ is replaced by $O_h$, $O_j$ would belong to the cluster represented by $O_h$. Thus, the cost for $O_j$ is given by:

$$C_{jih} = d(O_j, O_h) - d(O_j, O_i) \qquad (2)$$

Unlike in Equation (1), $C_{jih}$ here can be positive or negative, depending on whether $O_j$ is more similar to $O_i$ or to $O_h$.

Third Case: suppose that $O_j$ currently belongs to a cluster other than the one represented by $O_i$. Let $O_{j,2}$ be the representative object of that cluster. Furthermore, let $O_j$ be more similar to $O_{j,2}$ than to $O_h$. Then even if $O_i$ is replaced by $O_h$, $O_j$ would stay in the cluster represented by $O_{j,2}$.
Thus, the cost is:

$$C_{jih} = 0 \qquad (3)$$

Fourth Case: $O_j$ currently belongs to the cluster represented by $O_{j,2}$. But $O_j$ is less similar to $O_{j,2}$ than to $O_h$. Then replacing $O_i$ with $O_h$ would cause $O_j$ to jump to the cluster of $O_h$ from that of $O_{j,2}$. Thus, the cost is:

$$C_{jih} = d(O_j, O_h) - d(O_j, O_{j,2}) \qquad (4)$$

and is always negative. Combining the four cases above, the total cost of replacing $O_i$ with $O_h$ is given by:

$$TC_{ih} = \sum_j C_{jih} \qquad (5)$$

We now present Algorithm PAM.

Algorithm PAM
1. Select k representative objects arbitrarily.
2. Compute $TC_{ih}$ for all pairs of objects $O_i$, $O_h$ where $O_i$ is currently selected, and $O_h$ is not.
3. Select the pair $O_i$, $O_h$ which corresponds to $\min_{O_i, O_h} TC_{ih}$. If the minimum $TC_{ih}$ is negative, replace $O_i$ with $O_h$, and go back to Step (2).
4. Otherwise, for each non-selected object, find the most similar representative object. Halt.

Experimental results show that PAM works satisfactorily for small data sets (e.g. 100 objects in 5 clusters [10]). But it is not efficient in dealing with medium and large data sets. This is not too surprising if we perform a complexity analysis on PAM. In Steps (2) and (3), there are altogether $k(n - k)$ pairs of $O_i$, $O_h$. For each pair, computing $TC_{ih}$ requires the examination of $(n - k)$ non-selected objects. Thus, Steps (2) and (3) combined is of $O(k(n - k)^2)$. And this is the complexity of only one iteration. Thus, it is obvious that PAM becomes too costly for large values of n and k. This analysis motivates the development of CLARA.

2.2 CLARA

Designed by Kaufman and Rousseeuw to handle large data sets, CLARA (Clustering LARge Applications) relies on sampling [10]. Instead of finding representative objects for the entire data set, CLARA draws a sample of the data set, applies PAM on the sample, and finds the medoids of the sample. The point is that if the sample is drawn in a sufficiently random way, the medoids of the sample would approximate the medoids of the entire data set. To come up with better approximations, CLARA draws multiple samples and gives the best clustering as the output.
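As an illustration of the swap cost used by Algorithm PAM in Section 2.1, the following is a minimal Python sketch of the four-case cost $C_{jih}$ and the total $TC_{ih}$ of Equations (1) to (5). The function name `swap_cost`, the representation of objects as a list indexed by position, and the caller-supplied `dist` function are assumptions made for this example; they do not come from the paper. Cases 1 and 2 collapse into one `min` expression, as do cases 3 and 4.

```python
from math import inf

def swap_cost(points, medoids, i, h, dist):
    """Total cost TC_ih of replacing medoid index i by non-medoid index h.

    points  : list of objects
    medoids : list of indices of the currently selected objects (contains i)
    dist    : dissimilarity function d(a, b) supplied by the caller
    """
    others = [m for m in medoids if m != i]
    total = 0.0
    for j in range(len(points)):
        if j in medoids:
            continue  # the sum runs over non-selected objects only
        d_i = dist(points[j], points[i])
        d_h = dist(points[j], points[h])
        # distance to the closest medoid other than O_i (O_{j,2} in the text)
        d_2 = min((dist(points[j], points[m]) for m in others), default=inf)
        if d_i <= d_2:
            # O_j belongs to O_i: Equation (1) if d_h >= d_2, else Equation (2)
            total += min(d_2, d_h) - d_i
        else:
            # O_j belongs to another medoid: Equation (3) if d_h >= d_2, else (4)
            total += min(d_h - d_2, 0.0)
    return total
```

With one-dimensional points and absolute difference as the dissimilarity, `swap_cost([0.0, 1.0, 10.0, 20.0, 14.0], [0, 2], 0, 3, lambda a, b: abs(a - b))` exercises the first, third and fourth cases. A negative result means the swap improves the clustering under the definition in Equation (5).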
Here, for accuracy, the quality of a clustering is measured based on the average dissimilarity of all objects in the entire data set, and not only of those objects in the samples. Experiments reported in [10] indicate that 5 samples of size 40 + 2k give satisfactory results.

Algorithm CLARA
1. For i = 1 to 5, repeat the following steps:
2. Draw a sample of 40 + 2k objects randomly from the entire data set (see footnote 1), and call Algorithm PAM to find k medoids of the sample.
3. For each object $O_j$ in the entire data set, determine which of the k medoids is the most similar to $O_j$.
4. Calculate the average dissimilarity of the clustering obtained in the previous step. If this value is less than the current minimum, use this value as the current minimum, and retain the k medoids found in Step (2) as the best set of medoids obtained so far.
5. Return to Step (1) to start the next iteration.

Footnote 1: [10] reports a useful heuristic to draw samples. Apart from the first sample, subsequent samples include the best set of medoids found so far. In other words, apart from the first iteration, subsequent iterations draw 40 + k objects to add on to the best k medoids.

Complementary to PAM, CLARA performs satisfactorily for large data sets (e.g. 1000 objects in 10 clusters). Recall from Section 2.1 that each iteration of PAM is of $O(k(n - k)^2)$. But for CLARA, by applying PAM just to the samples, each iteration is of $O(k(40 + k)^2 + k(n - k))$. This explains why CLARA is more efficient than PAM for large values of n.

3 A Clustering Algorithm based on Randomized Search

In this section, we will present our clustering algorithm, CLARANS (Clustering Large Applications based on RANdomized Search). We will first introduce CLARANS by giving a graph abstraction of it. Then after describing the details of the algorithm, we will present experimental results showing that CLARANS outperforms CLARA and PAM in terms of both efficiency and effectiveness. In the next section, we will show how CLARANS can be used to provide effective spatial data mining.

3.1 Motivation of CLARANS: a Graph Abstraction

Given n objects, the process described above of finding k medoids can be viewed abstractly as searching through a certain graph. In this graph, denoted by $G_{n,k}$, a node is represented by a set of k objects $\{O_{m_1}, \ldots, O_{m_k}\}$, intuitively indicating that $O_{m_1}, \ldots, O_{m_k}$ are the selected medoids. The set of nodes in the graph is the set $\{\{O_{m_1}, \ldots, O_{m_k}\} \mid O_{m_1}, \ldots, O_{m_k}$ are objects in the data set$\}$.

Two nodes are neighbors (i.e. connected by an arc) if their sets differ by only one object. More formally, two nodes $S_1 = \{O_{m_1}, \ldots, O_{m_k}\}$ and $S_2 = \{O_{w_1}, \ldots, O_{w_k}\}$ are neighbors if and only if the cardinality of the intersection of $S_1$ and $S_2$ is k - 1, i.e. $|S_1 \cap S_2| = k - 1$. It is easy to see that each node has $k(n - k)$ neighbors. Since a node represents a collection of k medoids, each node corresponds to a clustering. Thus, each node can be assigned a cost that is defined to be the total dissimilarity between every object and the medoid of its cluster. It is not difficult to see that if objects $O_i$, $O_h$ are the differences between neighbors $S_1$ and $S_2$ (i.e. $O_i, O_h \notin S_1 \cap S_2$, but $O_i \in S_1$ and $O_h \in S_2$), the cost differential between the two neighbors is exactly given by $TC_{ih}$ defined in Equation (5).

By now, it is obvious that PAM can be viewed as a search for a minimum on the graph $G_{n,k}$. At each step, all the neighbors of the current node are examined. The current node is then replaced by the neighbor with the deepest descent in costs. And the search continues until a minimum is obtained. For large values of n and k (like n = 1000 and k = 10), examining all $k(n - k)$ neighbors of a node is time consuming. This accounts for the inefficiency of PAM for large data sets.

On the other hand, CLARA tries to examine fewer neighbors and restricts the search on subgraphs that are much smaller in size than the original graph $G_{n,k}$. However, the problem is that the subgraphs examined are defined entirely by the objects in the samples. Let Sa be the set of objects in a sample. The subgraph $G_{Sa,k}$ consists of all the nodes that are subsets (of cardinalities k) of Sa. Even though CLARA thoroughly examines $G_{Sa,k}$ via PAM, the trouble is that the search is fully confined within $G_{Sa,k}$.
If M is the minimum node in the original graph $G_{n,k}$, and if M is not included in $G_{Sa,k}$, M will never be found in the search of $G_{Sa,k}$, regardless of how thorough the search is. To atone for this deficiency, many, many samples would need to be collected and processed.

Like CLARA, our algorithm CLARANS does not check every neighbor of a node. But unlike CLARA, it does not restrict its search to a particular subgraph. In fact, it searches the original graph $G_{n,k}$. One key difference between CLARANS and PAM is that the former only checks a sample of the neighbors of a node. But unlike CLARA, each sample is drawn dynamically in the sense that no nodes corresponding to particular objects are eliminated outright. In other words, while CLARA draws a sample of nodes at the beginning of a search, CLARANS draws a sample of neighbors in each step of a search. This has the benefit of not confining a search to a localized area. As will be shown later, a search by CLARANS gives higher quality clusterings than CLARA, and CLARANS requires a very small number of searches. We now present the details of Algorithm CLARANS.

3.2 CLARANS

Algorithm CLARANS
1. Input parameters numlocal and maxneighbor. Initialize i to 1, and mincost to a large number.
2. Set current to an arbitrary node in $G_{n,k}$.
3. Set j to 1.
4. Consider a random neighbor S of current, and based on Equation (5) calculate the cost differential of the two nodes.
5. If S has a lower cost, set current to S, and go to Step (3).
6. Otherwise, increment j by 1. If j <= maxneighbor, go to Step (4).
7. Otherwise, when j > maxneighbor, compare the cost of current with mincost. If the former is less than mincost, set mincost to the cost of current, and set bestnode to current.
8. Increment i by 1. If i > numlocal, output bestnode and halt. Otherwise, go to Step (2).

Steps (3) to (6) above search for nodes with progressively lower costs. But if the current node has already been compared with the maximum number of the neighbors of the node (specified by maxneighbor) and is still of the lowest cost, the current node is declared to be a "local" minimum. Then in Step (7), the cost of this local minimum is compared with the lowest cost obtained so far. The lower of the two costs above is stored in mincost. Algorithm CLARANS then repeats to search for other local minima, until numlocal of them have been found.

As shown above, CLARANS has two parameters: the maximum number of neighbors examined (maxneighbor), and the number of local minima obtained (numlocal). The higher the value of maxneighbor, the closer CLARANS is to PAM, and the longer each search for a local minimum takes. But the quality of such a local minimum is higher, and fewer local minima need to be obtained. Like many applications of randomized search [8, 9], we rely on experiments to determine the appropriate values of these parameters. All the performance results of CLARANS quoted in the remainder of this paper are based on the version of CLARANS that sets numlocal to 2 and maxneighbor to be the larger value between 1.25% of $k(n - k)$ and 250.
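The eight steps of Algorithm CLARANS can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the names `clarans` and `total_cost`, the `dist` callback and the optional `rng` argument are introduced here, and the sketch recomputes the full cost of each neighbor instead of evaluating the cost differential of Equation (5), which is simpler to read but slower.

```python
import random

def total_cost(points, medoids, dist):
    """Cost of a node: total dissimilarity between objects and their medoids."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def clarans(points, k, numlocal, maxneighbor, dist, rng=None):
    """Randomized search over the graph G_{n,k}; assumes 1 <= k < len(points)."""
    rng = rng or random.Random()
    n = len(points)
    bestnode, mincost = None, float("inf")            # Step 1
    for _ in range(numlocal):                          # Steps 2 and 8
        current = rng.sample(range(n), k)              # arbitrary starting node
        current_cost = total_cost(points, current, dist)
        j = 1                                          # Step 3
        while j <= maxneighbor:
            # Step 4: a random neighbor differs from current in one object
            pos = rng.randrange(k)
            h = rng.choice([x for x in range(n) if x not in current])
            neighbor = current[:pos] + [h] + current[pos + 1:]
            neighbor_cost = total_cost(points, neighbor, dist)
            if neighbor_cost < current_cost:           # Step 5: move, reset j
                current, current_cost = neighbor, neighbor_cost
                j = 1
            else:                                      # Step 6
                j += 1
        if current_cost < mincost:                     # Step 7: local minimum
            bestnode, mincost = current, current_cost
    return sorted(bestnode), mincost
```

A call such as `clarans(points, k, 2, max(250, round(0.0125 * k * (n - k))), dist)` would mirror the parameter setting quoted above, with `n = len(points)`.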
See [15] for more information on how and why these specific values of numlocal and maxneighbor are chosen.

[Figure 1: Efficiency: CLARANS vs PAM (horizontal axis: number of objects)]

3.3 Experimental Results: CLARANS vs PAM

In the following we present experimental results comparing CLARANS with PAM. As discussed before, for large and medium data sets, it is obvious that CLARANS, while producing clusterings of very comparable quality, is much more efficient than PAM. Thus, our focus here was to compare the two algorithms on small data sets. We applied both algorithms to data sets with 40, 60, 80 and 100 points in 5 clusters. Figure 1 shows the runtime taken by both algorithms. Note that for all those data sets, the clusterings produced by both algorithms are of the same quality (i.e. same average distance). Thus, the difference between the two algorithms is determined by their efficiency. It is evident from Figure 1 that even for small data sets, CLARANS outperforms PAM significantly. As expected, the performance gap between the two algorithms grows as the data set increases in size.

3.4 Experimental Results: CLARANS vs CLARA

In this series of experiments, we compared CLARANS with CLARA. As discussed in Section 2.2, CLARA is not designed for small data sets. Thus, we ran this set of experiments on data sets whose number of objects exceeds 100. And the objects were organized in different numbers of clusters, as well as in different types of clusters [15].

[Figure 2: Relative Quality: Same Time for CLARANS and CLARA (horizontal axis: number of objects)]

When we conducted this series of experiments running CLARA and CLARANS as presented earlier, CLARANS was always able to find clusterings of better quality than those found by CLARA. However, in some cases, CLARA may take much less time than CLARANS. Thus, we wondered whether CLARA would produce clusterings of the same quality, if it was given the same amount of time. This leads to the next series of experiments in which we gave both CLARANS and CLARA the same amount of time. Figure 2 shows the quality of the clusterings produced by CLARA, normalized by the corresponding value produced by CLARANS.

Given the same amount of time, CLARANS clearly outperforms CLARA in all cases. The gap between CLARANS and CLARA increases from 4% when k, the number of clusters, is 5 to 20% when k is 20. This widening of the gap as k increases can be best explained by looking at the complexity analyses of CLARA and CLARANS. Recall from Section 2.2 that each iteration of CLARA is of $O(k^3 + nk)$. On the other hand, the cost of CLARANS is basically linearly proportional to the number of objects (see footnote 2). Thus, an increase in k imposes a much larger cost on CLARA than on CLARANS.

Footnote 2: There is a random aspect and a non-random aspect to the execution of CLARANS. The non-random aspect corresponds to the part that finds the cost differential between the current node and its neighbor. This part, as defined in Equation (5), is linearly proportional to the number of objects in the data set. On the other hand, the random aspect corresponds to the part that searches for a local minimum. As the values to plot the graphs are average values of 10 runs, which have the effect of reducing the influence of the random aspect, the runtimes of CLARANS used in our graphs are largely dominated by the non-random aspect of CLARANS.

The above complexity comparison also explains why for a fixed number of clusters, the higher the number of objects, the narrower the gap between CLARANS and CLARA is. For example, when the number of objects is 1000, the gap is as high as 30%. The gap drops to around 20% as the number of objects increases. Since each iteration of CLARA is of $O(k^3 + nk)$, the first term $k^3$ dominates the second term. Thus, for a fixed k, CLARA is relatively less sensitive to an increase in n. On the other hand, since the cost of CLARANS is roughly linearly proportional to n, an increase in n imposes a larger cost on CLARANS than on CLARA. This explains why for a fixed k, the gap narrows as the number of objects increases. Nonetheless, the bottom line shown in Figure 2 is that CLARANS beats CLARA in all cases.

In sum, we have presented experimental evidence showing that CLARANS is more efficient than PAM and CLARA for small and large data sets. Our experimental results for medium data sets (not included here) lead to the same conclusion. In the next section, we will present two spatial data mining algorithms that use clustering methods. Later we will present experimental evidence on the effectiveness of these algorithms.

4 Spatial Data Mining based on Clustering Algorithms

4.1 Spatial Dominant Approach: SD(CLARANS)

There are different approaches to spatial data mining. The kind of spatial data mining considered in this paper assumes that a spatial database consists of both spatial and non-spatial attributes, and that non-spatial attributes are stored in relations [3, 12, 17]. The general approach here is to use clustering algorithms to deal with the spatial attributes, and use other learning tools to take care of the non-spatial counterparts. DBLEARN is the tool we have chosen for mining non-spatial attributes [7]. It takes as inputs relational data, generalization hierarchies for attributes, and a learning query specifying the focus of the mining task to be carried out. From a learning request, DBLEARN first extracts a set of relevant tuples via SQL queries. Then based on the generalization hierarchies of attributes, it iteratively generalizes the tuples. For example, suppose the tuples relevant to a certain learning query have attributes (major, ethnic group). Further assume that the generalization hierarchy for ethnic group has Indian and Chinese generalized to Asians. Then a generalization operation on the attribute ethnic group causes all tuples of the form (m, Indian) and (m, Chinese) to be merged to the tuple (m, Asians). This merging has the effect of reducing the number of remaining (generalized) tuples. As described in [7], each tuple has a system-defined attribute called count which keeps track of the number of original tuples (as stored in the relational database) that are represented by the current (generalized) tuple. This attribute enables DBLEARN to output such statistical statements as "8% of all students majoring in Sociology are Asians." In general, a generalization hierarchy may have multiple levels (e.g. Asians further generalized to non-Canadians), and a learning query may require more than one generalization operation before the final number of generalized tuples drops below a certain threshold (see footnote 3). At the end, statements such as "90% of all Arts students are Canadians" may be returned as the findings of the learning query.

Having outlined what DBLEARN does, the specific issue we address here is how to extend DBLEARN to deal with spatial attributes. In particular, we will present two ways to combine clustering algorithms with DBLEARN. The algorithm below, called SD(CLARANS), combines CLARANS and DBLEARN in a spatial dominant fashion. That is, spatial clustering is performed first, followed by non-spatial generalization of every cluster.

Algorithm SD(CLARANS)
1. Given a learning request, find the initial set of relevant tuples by the appropriate SQL queries.
2. Apply CLARANS to the spatial attributes and find the most natural number $k_{nat}$ of clusters.
3. For each of the $k_{nat}$ clusters obtained above, (a) collect the non-spatial components of the tuples included in the current cluster, and (b) apply DBLEARN to this collection of the non-spatial components.

Similarly, Algorithms SD(PAM) and SD(CLARA) can be obtained.
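The generalization operation applied in Step 3(b), including the count attribute, can be illustrated with a short Python sketch. This is not DBLEARN's actual interface: the function `generalize`, the dictionary-based tuple store, and the sample counts are all hypothetical and chosen only to reproduce the (m, Indian) and (m, Chinese) to (m, Asians) merge described in Section 4.1.

```python
from collections import Counter

def generalize(tuples, attr_index, hierarchy):
    """Apply one generalization operation to one attribute.

    tuples     : dict mapping an attribute tuple to its count
    attr_index : position of the attribute to generalize
    hierarchy  : dict mapping a value to its parent in the hierarchy;
                 values without an entry are left unchanged
    """
    merged = Counter()
    for row, count in tuples.items():
        values = list(row)
        values[attr_index] = hierarchy.get(values[attr_index], values[attr_index])
        # tuples that become identical are merged and their counts are added
        merged[tuple(values)] += count
    return dict(merged)

# Hypothetical data: (major, ethnic group) -> count of original tuples
students = {
    ("Sociology", "Indian"): 3,
    ("Sociology", "Chinese"): 5,
    ("Sociology", "Canadian"): 92,
}
generalized = generalize(students, 1, {"Indian": "Asians", "Chinese": "Asians"})
```

With these made-up counts, the merged tuple ("Sociology", "Asians") carries a count of 8 out of 100, which is the kind of information behind a statement such as "8% of all students majoring in Sociology are Asians."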
Since the last section showed that CLARANS is more efficient than PAM and CLARA, the experimental evaluation to be reported in Section 5 only considers SD(CLARANS).

Footnote 3: Apart from generalization operations (also known as "hierarchy ascension" operations), DBLEARN, in its full form, may sometimes choose to drop an attribute, if generalizing such an attribute would produce uninteresting results (e.g. generalizing names of students).

4.2 Determining $k_{nat}$ for CLARANS

Step (2) of Algorithm SD(CLARANS) tries to find $k_{nat}$ clusters, where $k_{nat}$ is the most natural number of clusters for the given data set. However, recall that CLARANS and all partitioning algorithms require the number k of clusters to be given as input. Thus, an immediate question to ask is whether SD(CLARANS) knows beforehand what $k_{nat}$ is and can then simply pass the value of $k_{nat}$ to CLARANS. The unfortunate answer is no. In fact, determining $k_{nat}$ is one of the most difficult problems in cluster analysis, for which no unique solution exists. For SD(CLARANS), we adopt the heuristics of computing the silhouette coefficients, first developed by Kaufman and Rousseeuw [10]. (For a survey of alternative criteria, see [14].) For space considerations, we do not include the formulas for computing silhouettes, and will only concentrate on how we use silhouettes in our algorithms.

Intuitively, the silhouette of an object $O_j$, a dimensionless quantity varying between -1 and 1, indicates how much $O_j$ truly belongs to the cluster to which $O_j$ is classified. The closer the value is to 1, the higher the degree to which $O_j$ belongs to its cluster. The silhouette width of a cluster is the average silhouette of all objects in the cluster. Based on extensive experimentation, [10] proposes an interpretation of the silhouette width of a cluster. For a given number $k \ge 2$ of clusters, the silhouette coefficient for k is the average of the silhouette widths of the k clusters. Notice that the silhouette coefficient does not necessarily decrease monotonically as k increases (see footnote 4).
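The text omits the silhouette formulas, so the following Python sketch fills them in with the standard definition attributed to Kaufman and Rousseeuw: for an object with average dissimilarity a to the other members of its own cluster and smallest average dissimilarity b to the members of any other cluster, the silhouette is (b - a) / max(a, b). That formula, the function names, and the convention of assigning silhouette 0 to singleton clusters are assumptions of this example rather than statements from the paper. The silhouette coefficient is averaged over clusters, following the definition given above.

```python
def silhouette_widths(points, labels, dist):
    """Silhouette width of each cluster; requires at least two clusters."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    widths = {}
    for label, members in clusters.items():
        scores = []
        for i in members:
            if len(members) == 1:
                scores.append(0.0)  # assumed convention for singletons
                continue
            # a: average dissimilarity to the other members of the same cluster
            a = sum(dist(points[i], points[j]) for j in members if j != i)
            a /= len(members) - 1
            # b: smallest average dissimilarity to the members of another cluster
            b = min(
                sum(dist(points[i], points[j]) for j in other) / len(other)
                for other_label, other in clusters.items()
                if other_label != label
            )
            # assumes max(a, b) > 0, i.e. the objects are not all identical
            scores.append((b - a) / max(a, b))
        widths[label] = sum(scores) / len(scores)
    return widths

def silhouette_coefficient(widths):
    """Average of the silhouette widths of the k clusters."""
    return sum(widths.values()) / len(widths)
```

For the points 0, 1, 10, 11 split into {0, 1} and {10, 11}, both widths come out near 0.9, whereas the split {0, 10} and {1, 11} gives negative widths, matching the intuition that values close to 1 indicate well-placed objects.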
If the value k is too small, some distinct clusters are incorrectly grouped together, leading to a small silhouette width. On the other hand, if k is too large, some natural clusters may be artificially split, again leading to a small silhouette width. Thus, the most natural k is the one whose silhouette coefficient is the highest. However, our experiments on spatial data mining show that just using the highest silhouette coefficient may not lead to intuitive results. For example, some clusters may not have reasonable structures, i.e. their silhouette widths may be low. Thus, we use the following heuristics to determine the value $k_{nat}$ for SD(CLARANS).

Footnote 4: However, this is not the case for the average dissimilarity of an object from its medoid. The larger the value of k, the smaller the average dissimilarity is. This explains why average dissimilarity is only suitable as a measurement criterion for fixed k, but is otherwise not suitable to be used to compare the quality of clusterings produced by different k values.

8 Heuristics for Determining knat Find the vaue k with the highest sihouette coefficient. If a the k custers have sihouette widths , k nat = k, and hat.. Otherwise, remove the objects in those custers whose sihouette widths are beow 0.5, provided that the tota number of objects removed so far is ess than a threshod (e.g. 25% of the tota number of objects). The objects removed are considered to be outiers or noises. Go back to Step (1) for the new data set without the outiers. If in Step (3), the number of outiers to,be removed exceeds the threshod, simpy set knot = 1, indicating in effect that no custering is reasonabe. 0 In Section 5, we wi see the usefuness of the heuristics. Having described SD(CLARANS), we are now in a position to compare SD(CLARANS) with an earier approach reported in [13].whose goa is to enhance DBLEARN with spatia earning capabiities. One of the two proposed approaches there is to first perform spatia generaizations, and then to use DBLEARN to conduct non-spatia generaizations. The fundamenta difference between SD(CLARANS) and that ago rithm in [13] is that a user of the atter must give a priori as input generaization hierarchies for spatia attributes. The probem is that without prior anaysis, it is amost impossibe to guarantee that the given hierarchies are suitabe for the given data set. (This may in fact be one of the discoveries to be found out by the spatia data mining task!) For exampe, sup pose a spatia data mining request is to be performed on a the expensive houses in Greater Vancouver. A defaut spatia hierarchy to use may be the one that generaizes streets to communities and then to cities. However, if some of the expensive houses are spatiay ocated aong something (such ss a river, the bottom of a range of mountains, etc.) 
that runs through many communities and cities, then the defaut spatia hierarchy woud be very ineffective, generating such genera statements as that the expensive houses are more or ess scattered in a the cities in Greater Vancouver. Far extending the capabiity of the agorithm in [13], SD(CLARANS) finds the c~ters dire&y from the given data. To a certain extent, the custering agorithm, CLARANS in this case, can be viewed as computing the spatia generaization hierarchy dynamicay. The resut of such computation, combined with the above heuristica to find k,,,,f, precisey finds the custers (if indeed exist in the data set) in terms of the x- and y coordinates of the points, and not confined by any hierarchies specified a priori. For the expensive houses exampe discussed above, SD(CLARANS) coud directy identify custers aong the river or the bottom of the mountain range, and coud ead to such statements as 80% of a mansions have either a mountain or a river view. In Section 5, we wi see how we our spatia data minii agorithms can hande a data set arguaby more compex than the exampe discussed here. 4.3 Non-Spatia Dominant Approach: NSD(CLARANS) To a arge extent, spatia dominant agorithms, such as SD(CLARANS), can be viewed as focusing asymmetricay on discovering non-spatia characterizations of spatia custers. Non-spatia dominant agorithms, on the other hand, focus on discovering spatia custers existing in groups of non-spatia data items. For exampe, these agorithms may find interesting diicoveries based on the spatia custering or diitribution of a certain type of houses. More specificay, unike spatia dominant agorithms, non-spatia dominant agorithms first appy non-spatia generaizations, foowed by spatia custering. The foowing agorithm, NSD(CLARANS), uses DBLEARN and CLARANS to perform data mining on non-spatia and spatia attributes respectivey. Agorithm 4. NSD(CLARANS) Given a earning request, find the initia set of reevant tupes by the appropriate SQL queries. 
(2) Apply DBLEARN to the non-spatial attributes, until the final number of generalized tuples falls below a certain threshold (cf. Section 4.1).
(3) For each generalized tuple obtained above, (a) collect the spatial components of the tuples represented by the current generalized tuple, and (b) apply CLARANS and the heuristics presented above to find the most natural number k_nat of clusters.
(4) For all the clusters obtained above, check if there are clusters that intersect or overlap. If such clusters exist, they can be merged. This in turn causes the corresponding generalized tuples to be combined.

Recall from the previous section on clustering algorithms that for a given data set, clusters do not overlap or intersect. This is why SD(CLARANS) does not include a step analogous to Step (4) above. However, for NSD(CLARANS) (and other non-spatial dominant algorithms such as NSD(PAM)), clusters obtained for different generalized tuples can overlap or intersect. In that case, opportunities arise for further generalization of spatial and non-spatial data. This is the purpose of Step (4) above. In the following, we present experimental results evaluating the effectiveness of NSD(CLARANS), as well as SD(CLARANS).

5 Evaluation of SD(CLARANS) and NSD(CLARANS)

5.1 A Real Estate Data Set

One way to evaluate the effectiveness of a data mining algorithm is to apply it to a real data set and see what it finds. But sometimes it may be difficult to judge the quality of the findings without knowing a priori what the algorithm is supposed to find. Thus, to evaluate our algorithms, we generated a data set that honors several rules applicable to the 2500 expensive housing units in Vancouver. These rules, very close to reality to the best of our knowledge, are as follows:

A. House type, price and size:
1. If the house type is mansion, the price falls within the range [1500K,3500K], and the size within the range [6000,10000] square feet.
2. If the house type is single-house, the price and size ranges are [800K,1500K] and [3000,7000].
3. If the house type is condo(minium), the price and size ranges are [300K,800K] and [1000,2500].
For simplicity, we assumed uniform distributions within all the ranges.

B. Distribution:
1. There are 1200 condos uniformly distributed in the Vancouver downtown area, the rectangular region at the top of Figure 3. From now on, this region will be referred to as Area B1.
2. Along Marine Drive, there are about 320 mansions and about 80 single-houses, in the stripe at the bottom left-hand corner of Figure 3. This area will be referred to as Area B2.
3. Around Queen Elizabeth Park, there are 800 single-houses, in the polygonal area at the bottom right-hand corner of Figure 3. This area will be referred to as Area B3.
4.
Finally, to complicate the situation, there are 100 single-houses uniformly distributed in the rest of Vancouver.

[Figure 3: Spatial Distribution of the 2500 Housing Units (horizontal axis: x-coordinates)]

5.2 Effectiveness of SD(CLARANS)

Based on the heuristics presented in Section 4.2, Step (2) of SD(CLARANS) appropriately sets the value of k_nat to 3. The silhouette coefficient for k_nat = 3 is 0.7, indicating that all 3 clusters are quite strong. Thus, Steps (3) and (4) of the heuristics are not needed in this case. After computing k_nat, it takes CLARANS about 25 seconds to identify the 3 clusters (in a time-sharing SPARC-LX workstation environment).

The first cluster contains 832 units, all single-houses, 800 of which are those in Area B3 defined in Section 5.1. For this cluster, DBLEARN in Step (3) of SD(CLARANS) correctly finds the price and size ranges to be [800K,1500K] and [3000,7000]. It also reveals that the prices and sizes are more or less uniformly distributed.

The second cluster contains 1235 units, 1200 of which are condos, and the remainder single-houses. It contains all the units in Area B1 introduced in Section 5.1. For this cluster, DBLEARN finds the condo prices and sizes uniformly distributed within the ranges [300K,800K] and [1000,2500] respectively. It also discovers that the single-house prices and sizes fall within [800K,1500K] and [3000,7000].

The third cluster contains 431 units, 320 of which are mansions, and the remainder single-houses. This cluster includes all the units along the stripe Area B2. For this cluster, DBLEARN finds the mansion prices and sizes uniformly distributed within the ranges [1500K,3500K] and [6000,10000]. As for the single-houses in the cluster, DBLEARN again finds the right ranges.

In sum, SD(CLARANS) is very effective. This is due primarily to the clusters found by CLARANS, even in the presence of outliers (cf. B.4 of Section 5.1). Once the appropriate clusters are found, DBLEARN easily identifies the non-spatial patterns. Thus, CLARANS and DBLEARN together enable SD(CLARANS) to successfully discover all the rules described in Section 5.1 that it is supposed to find.

5.3 Effectiveness of NSD(CLARANS)

In Step (2) of NSD(CLARANS), DBLEARN finds 12 generalized tuples, 4 for each type of housing unit. Let us first consider the 4 generalized tuples for mansions. The 4 tuples represent respectively mansions in the following categories: a) price in [1500K,2600K], size in [6000,8500]; b) price in [1500K,2600K], size in [8500,10000]; c) price in [2600K,3500K], size in [6000,8500]; and d) price in [2600K,3500K], size in [8500,10000]. The graph in Figure 4 shows the spatial distribution of the mansions in the first category. When CLARANS is applied to the points shown in the graph, 2 clusters are found (points in the two clusters are represented by either dots or +). The graphs for the other categories b), c) and d) are very similar, and again two clusters are found in each case. Now when Step (4) of NSD(CLARANS) is executed, overlapping clusters are merged, which in turn causes the 4 generalized tuples to be combined as well. As a result, NSD(CLARANS) finds out that all mansions are located in the stripe area, and have prices and sizes in the ranges [1500K,3500K] and [6000,10000].

[Figure 4: Clusters for the First Generalized Tuple for Mansions (horizontal axis: x-coordinates)]

The 4 tuples for condos correspond respectively to the following categories: a) price in [300K,600K], size in [1000,1800]; b) price in [300K,600K], size in [1800,2500]; c) price in [600K,800K], size in [1000,1800]; and d) price in [600K,800K], size in [1800,2500].
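The four-way split of each housing type above amounts to cutting the price and size ranges in two and then, as Step (3) of NSD(CLARANS) requires, collecting the locations that fall under each generalized tuple. The following is a minimal sketch of that bookkeeping only; it is not DBLEARN's attribute-oriented induction. The names `CUTS`, `generalized_tuple` and `group_locations` are hypothetical, the cut points are read off the mansion and condo categories listed above, the sample records are made up, and sending a boundary value (e.g. a price of exactly 600K) to the upper band is an assumption.

```python
from collections import defaultdict

# Cut points taken from the categories listed above: (price in $K, size in sq ft).
# Hypothetical table; DBLEARN derives its generalizations differently.
CUTS = {"mansion": (2600, 8500), "condo": (600, 1800)}


def generalized_tuple(house_type, price, size):
    """Return the generalized tuple (type, price band, size band) of one unit."""
    price_cut, size_cut = CUTS[house_type]
    # Assumption: a value equal to the cut point goes to the upper band.
    price_band = "low" if price < price_cut else "high"
    size_band = "small" if size < size_cut else "large"
    return (house_type, price_band, size_band)


def group_locations(units):
    """Collect the (x, y) locations of the units under each generalized tuple.

    `units` is an iterable of (house_type, price, size, x, y) records.
    Each resulting list is what a clustering method would then be run on.
    """
    groups = defaultdict(list)
    for house_type, price, size, x, y in units:
        groups[generalized_tuple(house_type, price, size)].append((x, y))
    return dict(groups)


# Made-up records for illustration only (not the paper's data set).
sample = [
    ("condo", 450, 1200, 10.0, 90.0),
    ("condo", 700, 2000, 12.0, 88.0),
    ("condo", 500, 1500, 11.0, 91.0),
    ("mansion", 3000, 9000, 2.0, 5.0),
]
groups = group_locations(sample)
```

Each list in `groups` plays the role of the spatial components of one generalized tuple; the clustering step would be applied to each list separately.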
The processing of these tuples is very similar to the processing of those for mansions above. The only difference is that for all 4 tuples, no cluster is found (see footnote 5), i.e. k_nat is set to 1 in Step (4) of the heuristics in Section 4.2. Thus, in the final step of NSD(CLARANS), all 4 regions/clusters, which overlap, are merged into an area that coincides precisely with Area B1 in Figure 3. Consequently, NSD(CLARANS) discovers that all (expensive) condos are located in the Vancouver downtown area, and have prices and sizes in the ranges [300K,800K] and [1000,2500].

The processing of single-houses is the most complicated. The 4 tuples correspond to the categories: a) price in [1200K,1500K], size in [3000,5500]; b) price in [1200K,1500K], size in [5500,7000]; c) price in [800K,1200K], size in [3000,5500]; and d) price in [800K,1200K], size in [5500,7000]. When CLARANS is applied to the houses in category a), the highest silhouette coefficient is found when the number of clusters is 4. However, even though the silhouette coefficient is above 0.5, the silhouette widths of two of the clusters are below 0.5. Thus, Step (3) of the heuristics in Section 4.2 is invoked. As a result, 15 out of the original 253 points are removed. For this new collection, two clusters are identified: i) along the stripe Area B2 in Figure 3, and ii) around Area B3 in Figure 3. The clusterings for categories b), c) and d) of single-houses are very similar to the ones described above. Again, outliers need to be removed. At the end, after merging has taken place in Step (4), 2 regions are found, which are identical to the ones listed in i) and ii) above. Furthermore, NSD(CLARANS) correctly identifies the price and size ranges for single-houses to be [800K,1500K] and [3000,7000].

5.4 Summary

With respect to the rules listed in Section 5.1, both SD(CLARANS) and NSD(CLARANS) find most of what they are supposed to find. In terms of performance and effectiveness, SD(CLARANS) has the edge.
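The outlier handling invoked for the single-house tuples in Section 5.3 follows Steps (2) to (4) of the heuristics in Section 4.2. The following is a minimal sketch of those three steps for one clustering that has already been chosen in Step (1); choosing k and producing the cluster labels (the job of CLARANS) is left to the caller. The function names, the Euclidean distance, the silhouette width of 0 for singleton clusters, and the toy points are assumptions made for illustration.

```python
from math import dist


def cluster_widths(points, labels):
    """Average silhouette width of each cluster, as {label: width}.

    Needs at least two clusters; a singleton cluster gets width 0.
    """
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    widths = {}
    for c, members in clusters.items():
        total = 0.0
        for p in members:
            if len(members) == 1:
                continue  # silhouette of a singleton is taken to be 0
            # a: mean distance from p to the other members of its own cluster
            a = sum(dist(p, q) for q in members) / (len(members) - 1)
            # b: mean distance from p to the nearest other cluster
            b = min(sum(dist(p, q) for q in other) / len(other)
                    for o, other in clusters.items() if o != c)
            m = max(a, b)
            total += (b - a) / m if m else 0.0
        widths[c] = total / len(members)
    return widths


def review(points, labels, removed_so_far, total,
           min_width=0.5, max_removed_frac=0.25):
    """Steps (2)-(4) for the clustering chosen in Step (1).

    Returns (decision, remaining_points), where decision is "accept"
    (k_nat = number of clusters), "recluster" (go back to Step (1) on the
    remaining points) or "no_clustering" (k_nat = 1).
    """
    widths = cluster_widths(points, labels)
    weak = {c for c, w in widths.items() if w < min_width}
    if not weak:                                   # Step (2)
        return "accept", list(points)
    kept = [p for p, c in zip(points, labels) if c not in weak]
    removed = removed_so_far + len(points) - len(kept)
    if removed > max_removed_frac * total:         # Step (4)
        return "no_clustering", list(points)
    return "recluster", kept                       # Step (3)


# Toy data: two tight groups plus two stray points labelled as a third cluster.
tight_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
tight_b = [(10, 10), (10, 11), (11, 10), (11, 11)]
strays = [(6, 0), (5, 11)]
points = tight_a + tight_b + strays
labels = [0] * 4 + [1] * 4 + [2] * 2
decision, remaining = review(points, labels, removed_so_far=0, total=len(points))
```

On the toy data the stray pair forms a weak cluster, so it is dropped and the remaining points go back to Step (1). In the single-house case of Section 5.3, removing 15 of 253 points stays well under the 25% threshold, which is why Step (3) rather than Step (4) applies there.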
As discussed earlier, this edge is due to CLARANS's success in identifying the clusters right away. On the other hand, in NSD(CLARANS), performing non-spatial generalizations divides the entire set of points into different groups/tuples. This may have the effect of breaking down the tightness of some clusters. Outlier removal may then be needed to extract reasonable clusters from each group. This procedure, as we have seen, may weaken the eventual findings and takes more time. Finally, merging overlapping and intersecting clusters can also be costly.

However, to be fair to NSD(CLARANS), the rules described in Section 5.1 are more favorable to SD(CLARANS). There is a strong emphasis on finding out non-spatial characterizations of spatial clusters, which is the focus of spatial dominant algorithms. In contrast, a non-spatial dominant algorithm focuses more on finding spatial clusters within groups of data items that have been generalized non-spatially. For example, if the spatial distribution of single-houses is primarily determined by their price and size categories, then NSD(CLARANS) could be more effective than SD(CLARANS).

Footnote 5: 25% is the threshold used in Step (3) of the heuristics.

6 Discussions

6.1 Exploring Spatial Relationships

Thus far, we have shown that clustering algorithms, such as CLARANS, are very promising and effective for spatial data mining. But we believe that there is an extra dimension a clustering algorithm can provide. As discussed in Section 4.2, a clustering algorithm does not require any spatial generalization hierarchy to be given, and directly discovers the groups/clusters that are the most appropriate to the given data. In other words, clustering can provide very tight spatial characterizations of the groups. The tightness and specificity of the characterizations provide opportunities for exploring spatial relationships that may exist between the clusters and other interesting objects. Consider again the real estate example discussed in the previous section. SD(CLARANS) finds 3 clusters of expensive housing units (cf. Figure 3). Those 3 clusters can then be overlaid with Vancouver maps of various kinds (e.g. parks, highways, lakes, etc.).
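One overlay question of this kind is how much of a cluster lies within a given distance of a map feature. The following is a minimal sketch, assuming the feature is given as a polyline of (x, y) vertices in the same planar units as the housing coordinates and that Euclidean distance is adequate. The function names and the sample coordinates are hypothetical, and a real system would rely on a spatial package for map overlays.

```python
from math import dist


def distance_to_segment(p, a, b):
    """Euclidean distance from point p to the line segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return dist(p, a)  # degenerate segment: a and b coincide
    # Project p onto the line through a and b, then clamp to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return dist(p, (ax + t * dx, ay + t * dy))


def distance_to_polyline(p, vertices):
    """Distance from p to a polyline; a single vertex is treated as a point."""
    if len(vertices) == 1:
        return dist(p, vertices[0])
    return min(distance_to_segment(p, a, b)
               for a, b in zip(vertices, vertices[1:]))


def fraction_within(points, vertices, radius):
    """Fraction of the points lying within `radius` of the feature."""
    if not points:
        return 0.0
    near = sum(1 for p in points if distance_to_polyline(p, vertices) <= radius)
    return near / len(points)


# Made-up coordinates in km: a straight stretch of "coast" and four houses.
coast = [(0.0, 0.0), (0.0, 10.0)]
houses = [(0.3, 5.0), (0.5, 2.0), (0.2, 11.0), (0.1, 1.0)]
share = fraction_within(houses, coast, 0.4)
```

Applied to the points of one cluster and the outline of a park or a coast line, `fraction_within` yields the kind of percentage a map overlay would report.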
The following findings can be obtained:

- About 96% of the houses in the first cluster (as described in Section 5.2) are within 0.6km from Queen Elizabeth Park.
- About 97% of the housing units in the second cluster are located in the Vancouver downtown area, which is adjacent to Stanley Park (see footnote 6).
- About 92% of the houses in the third cluster are within 0.4km from the western coast line of Vancouver.

Footnote 6: During the summit meeting between Russia and the US in 1993, Clinton dined in Queen Elizabeth Park and jogged in Stanley Park!

The point here is that while SD(CLARANS) or NSD(CLARANS) do not directly find the above features (which is the job of another package that can provide such spatial operations as map overlays), they do produce structures or clusters that can lead to further discoveries.

6.2 Towards Building a More General and Efficient Spatial Data Mining Framework

A natural extension to SD(CLARANS) and NSD(CLARANS) will be the integration of the two algorithms by performing neither spatial dominant nor non-spatial dominant generalizations, but interleaved or balanced generalizations between spatial and non-spatial components. At each step, the data mining algorithm may select either a spatial or a non-spatial component to generalize. For example, if a clustering method can detect some high quality clusters, clustering may be performed first. These clusters may trigger generalization on non-spatial components in the next step if such a generalization may group objects into interesting groups. It is an interesting research issue to study how to compare the quality of spatial and non-spatial generalizations.

A spatial database may be associated with several thematic maps, each of which may represent one kind of spatial data. For example, in a city geographic database, one thematic map may represent the layout of streets and highways, another may outline the emergency service network, and the third one may describe the distribution of educational and recreational services.
For many applications, it will be very useful if data mining on multiple thematic maps can be conducted simultaneously. This would involve not only clustering, but also other spatial operations such as spatial region growing, overlays and spatial joins. Thus, it is an interesting research issue to study how to provide an effective framework that integrates all these operations together for simultaneous mining of multiple maps.

There are many kinds of spatial data types, such as regions, points and lines, in spatial databases. Clustering methods, as presented here, are most suitable for points or small regions scattered in a relatively large background. However, it remains an open question as to how they can be effectively applied to deal with line-typed spatial data, such as to examine how highways are located in cities.

Furthermore, due to the nature of spatial data, noise or irrelevant information is prevalent in spatial databases. The development of a general framework for removing noise and filtering out irrelevant data is important to the effectiveness of spatial data mining. It is also interesting to find out what roles approximation and aggregation can play in the framework.

7 Conclusions

In this paper, we have presented a clustering algorithm called CLARANS which is based on randomized search. We have also developed two spatial data mining algorithms, SD(CLARANS) and NSD(CLARANS). Experimental results and analysis indicate that both algorithms are effective, and can lead to discoveries that are difficult to obtain with existing spatial data mining algorithms. Finally, we have presented experimental results showing that CLARANS itself is more efficient than existing clustering methods. Hence, CLARANS has established itself as a very promising tool for efficient and effective spatial data mining.

Acknowledgements

Research partially sponsored by NSERC Grants OGP , OGP03723 and STR , IRIS2 Grants HMI-5, IC-2, X-5, and the Centre for Systems Science of Simon Fraser University.

References

[1] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. (1992) An Interval Classifier for Database Mining Applications, Proc. 18th VLDB.
[2] R. Agrawal, T. Imielinski, and A. Swami. (1993) Mining Association Rules between Sets of Items in Large Databases, Proc. SIGMOD.
[3] W. G. Aref and H. Samet. (1991) Optimization Strategies for Spatial Query Processing, Proc. 17th VLDB.
[4] A. Borgida and R. J. Brachman. (1993) Loading Data into Description Reasoners, Proc. SIGMOD.
[5] T. Brinkhoff, H.-P. Kriegel, and B. Seeger. (1993) Efficient Processing of Spatial Joins Using R-trees, Proc. SIGMOD.
[6] O. Günther. (1993) Efficient Computation of Spatial Joins, Proc. 9th Data Engineering.
[7] J. Han, Y. Cai, and N. Cercone. (1992) Knowledge Discovery in Databases: an Attribute-Oriented Approach, Proc. 18th VLDB.
[8] Y. Ioannidis and Y. Kang. (1990) Randomized Algorithms for Optimizing Large Join Queries, Proc. SIGMOD.
[9] Y. Ioannidis and E. Wong. (1987) Query Optimization by Simulated Annealing, Proc. SIGMOD.
[10] L. Kaufman and P. J. Rousseeuw. (1990) Finding Groups in Data: an Introduction to Cluster Analysis, John Wiley & Sons.
[11] D. Keim, H. Kriegel, and T. Seidl. (1994) Supporting Data Mining of Large Databases by Visual Feedback Queries, Proc. 10th Data Engineering.
[12] R. Laurini and D. Thompson. (1992) Fundamentals of Spatial Information Systems, Academic Press.
[13] W. Lu, J. Han, and B. C. Ooi. (1993) Discovery of General Knowledge in Large Spatial Databases, Proc. Far East Workshop on Geographic Information Systems, Singapore.
[14] G. Milligan and M. Cooper. (1985) An Examination of Procedures for Determining the Number of Clusters in a Data Set, Psychometrika, 50.
[15] R. Ng and J. Han. (1994) Efficient and Effective Clustering Methods for Spatial Data Mining, Technical Report 9413, University of British Columbia.
[16] G. Piatetsky-Shapiro and W. J. Frawley. (1991) Knowledge Discovery in Databases, AAAI/MIT Press.
[17] H. Samet. (1990) The Design and Analysis of Spatial Data Structures, Addison-Wesley.
[18] H. Spath. (1985) Cluster Dissection and Analysis: Theory, FORTRAN Programs, Examples, Ellis Horwood Ltd.