Cluster analysis and Association analysis for the same data

Huaiguo Fu
Telecommunications Software & Systems Group
Waterford Institute of Technology
Waterford, Ireland

Abstract: Both cluster analysis and association analysis are important tasks of data mining. Some applications need both cluster analysis and association analysis for the same data, and each task is very time-consuming on large data. To reduce the cost of the two mining tasks on large sets of transactions, we propose a strategy that unifies cluster analysis and association analysis. This paper also presents a new core algorithm of the strategy for the analysis of large and high-dimensional data. The experimental results show the efficiency of this algorithm.

Key Words: Association analysis, Clustering, Closed set, Concept lattice, Algorithm

1 Introduction

Both cluster analysis and association analysis are important tasks of data mining. In recent years they have attracted a lot of attention in research and in applications. They play an important role in data mining applications such as text mining, Web mining, information retrieval, biomedical informatics, and many others. A variety of techniques and approaches for cluster analysis and association analysis have been developed and successfully applied to real-life data mining problems. However, as data continue to grow inexorably in size and complexity, these techniques and approaches face challenges such as very large data, high-dimensional data, distributed heterogeneous data, and complex data.

In some applications, we need both cluster analysis and association analysis for the same data. Each task is very time-consuming on large data.
Although cluster analysis and association analysis are separate tasks in research and applications, we propose to unify them for mining databases of transactions in order to reduce the cost of the data mining tasks. This is the key motivation for unifying cluster analysis and association analysis. Furthermore, the two tasks can be unified for databases of transactions for the following reasons:

1) Both of them analyze the relationship between the elements of a data set. In fact, the two tasks extract the same essential relationship: similarity. Only the description and the bounds of the relationship differ. A frequent pattern reveals one kind of similarity between elements of the data. Cluster analysis may reveal associations and relationships in the data that contribute to mining models or rules from the data. So the elements in a frequent pattern are similar, and similar elements may have the same frequency.

2) Mining closed sets can be an essential step for both cluster analysis and association analysis on transactional data. Existing work shows that clusters and frequent patterns can be extracted from closed sets [2, 15]. Cluster analysis and association analysis can therefore share the closed sets mined from the same data set, and there is no need to extract the closed sets separately for each task.

3) Closed set mining provides a way to interpret the clusters and frequent patterns. For most techniques and approaches of cluster analysis and association analysis, it is hard to interpret the mining results. For example, it is hard to interpret the clusters and frequent patterns produced with existing mining techniques. It is also hard to give the meaning of the distance measure in most clustering methods. Closed sets are derived from formal concept analysis (FCA), and the formal concept can help us to interpret the closed sets.
Closed set mining facilitates pattern interpretation. In human thinking and life, objects are clustered by concepts and attributes, and we can interpret attribute patterns and object patterns with concepts. So concept-based methods can be used to interpret the clusters and frequent patterns.

In this paper, the idea of unifying cluster analysis and association analysis focuses on databases of transactions. The main framework of the idea is:

- Generating the data context, with a description of the items or transactions, from the database of transactions
- Mining the closed sets and the lattice of closed sets of the database of transactions with FCA
- Adding extended information such as support, similarity and interpretation to each closed set. We propose a new structure for each node of the lattice: the node contains the attribute set, the object set, the number of objects, the support and a similarity description.
- Generating the clusters and closed frequent patterns together with their interpretation

The core of FCA is the concept lattice. The theoretical foundation of the concept lattice rests on mathematical lattice theory [1, 8]. The lattice is a popular mathematical structure for modeling conceptual hierarchies, and the concept lattice is a method for deriving conceptual structures out of data. It allows us to analyze and mine complex data for tasks such as classification [11, 13], association rule mining [6, 7] and clustering [10, 9, 4]. Because of the high dimension and large volume of data, we need more scalable and more efficient techniques and methods to analyze and represent large and high-dimensional data sets. In this paper we present a new algorithm to analyze large and high-dimensional data.

The rest of this paper is organized as follows. Basic definitions for unifying cluster analysis and association analysis are presented in the next section. The framework of unifying cluster analysis and association analysis is introduced in section 3. In section 4, we present a new algorithm. Section 5 shows the experimental results. The paper ends with a short conclusion in section 6.
2 Definitions

Definition 1 A data context is defined by a triple (O, A, R), where O and A are two sets and R is a relation between O and A. The elements of O are called transactions or objects, while the elements of A are called items or attributes.

For example, Figure 1 represents a data context (O, A, R). $O = \{1, 2, 3, 4, 5, 6, 7, 8\}$ is the set of objects, and $A = \{a_1, a_2, a_3, a_4, a_5, a_6, a_7, a_8\}$ is the set of items. The crosses in the table describe the relation R between O and A. In a data context we normally use a detailed description as the name of each item and object; in this example we simply use numbers and indexed symbols.

[Figure 1: An example of data context. Table with the objects 1 to 8 as rows and the items $a_1$ to $a_8$ as columns]

A data context is usually represented by binary data. In practice the attribute values are often not binary; a many-valued data context can then be transformed into a binary data context by conceptual scaling [8].

Definition 2 Two operators are defined, $O_1 \mapsto O_1'$ for subsets $O_1 \subseteq O$ and $A_1 \mapsto A_1'$ for subsets $A_1 \subseteq A$:

$$O_1' := \{a \in A \mid oRa \text{ for all } o \in O_1\}$$
$$A_1' := \{o \in O \mid oRa \text{ for all } a \in A_1\}$$

These two operators are called the Galois connection for (O, A, R). They are used to determine a formal concept.

Definition 3 A formal concept of (O, A, R) is a pair $(O_1, A_1)$ with $O_1 \subseteq O$, $A_1 \subseteq A$, $O_1' = A_1$ and $A_1' = O_1$. $O_1$ is called the extent and $A_1$ is called the intent.

For example, $(68, a_1a_3a_4a_6)$ is a formal concept of the data context of Figure 1: $a_1a_3a_4a_6$ is its intent and 68 is its extent.

Definition 4 We say that there is a hierarchical order between two formal concepts $(O_1, A_1)$ and $(O_2, A_2)$ if $O_1 \subseteq O_2$ (or, equivalently, $A_2 \subseteq A_1$). All formal concepts, together with this hierarchical order, form a complete lattice called the concept lattice.

Definition 5 An itemset $C \subseteq A$ is a closed itemset iff $C = C''$.
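Definitions 2, 3 and 5 can be checked on the running example with a few lines of code. The following is a minimal Python sketch, not the paper's implementation. The table `CONTEXT` is an assumption: it is reconstructed from the extents listed in Figure 2, because the cells of Figure 1 are not legible here. The function names `common_items`, `common_objects`, `closure` and `is_formal_concept` are illustrative.

```python
# Data context reconstructed from the extents listed in Figure 2
# (an assumption: the table of Figure 1 is not legible in this copy).
CONTEXT = {
    1: {"a1", "a2", "a7"},
    2: {"a1", "a2", "a7", "a8"},
    3: {"a1", "a2", "a3", "a7", "a8"},
    4: {"a1", "a3", "a7", "a8"},
    5: {"a1", "a2", "a4", "a6"},
    6: {"a1", "a2", "a3", "a4", "a6"},
    7: {"a1", "a3", "a4", "a5"},
    8: {"a1", "a3", "a4", "a6"},
}
ALL_ITEMS = frozenset().union(*CONTEXT.values())


def common_items(objects):
    """O1': the items shared by every object in `objects`."""
    shared = set(ALL_ITEMS)
    for o in objects:
        shared &= CONTEXT[o]
    return frozenset(shared)


def common_objects(items):
    """A1': the objects that contain every item in `items`."""
    wanted = set(items)
    return frozenset(o for o, row in CONTEXT.items() if wanted <= row)


def closure(items):
    """C'': the smallest closed itemset that contains `items`."""
    return common_items(common_objects(items))


def is_formal_concept(objects, items):
    """Definition 3: O1' = A1 and A1' = O1."""
    return (common_items(objects) == frozenset(items)
            and common_objects(items) == frozenset(objects))


if __name__ == "__main__":
    # The example concept (68, a1 a3 a4 a6) from Definition 3.
    print(is_formal_concept({6, 8}, {"a1", "a3", "a4", "a6"}))  # True
    # {a6} is not closed: its closure adds a1 and a4.
    print(sorted(closure({"a6"})))  # ['a1', 'a4', 'a6']
```

Under this reconstruction, the itemset $\{a_6\}$ is not closed because every object that contains $a_6$ also contains $a_1$ and $a_4$, whereas $a_1a_3a_4a_6$ equals its own closure.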

[Figure 2: An example of knowledge lattice. Each node is written (intent, extent) e(number of objects):]

(a1, 12345678) e(8)
(a1 a7, 1234) e(4)
(a1 a3, 34678) e(5)
(a1 a2, 12356) e(5)
(a1 a4, 5678) e(4)
(a1 a7 a8, 234) e(3)
(a1 a3 a7 a8, 34) e(2)
(a1 a3 a4, 678) e(3)
(a1 a2 a3, 36) e(2)
(a1 a2 a7, 123) e(3)
(a1 a4 a6, 568) e(3)
(a1 a2 a4 a6, 56) e(2)
(a1 a2 a7 a8, 23) e(2)
(a1 a3 a4 a6, 68) e(2)
(a1 a2 a3 a7 a8, 3) e(1)
(a1 a3 a4 a5, 7) e(1)
(a1 a2 a3 a4 a6, 6) e(1)
(a1 a2 a3 a4 a5 a6 a7 a8, ∅) e(0)

Definition 6 If $C_1$ and $C_2$ are closed itemsets with $C_1 \subseteq C_2$, then we say that there is a hierarchical order between $C_1$ and $C_2$. All closed itemsets, together with this hierarchical order, form a complete lattice called the closed itemset lattice.

Definition 7 A formal concept is called an extended concept if it is annotated with information that describes the formal concept in the data context. We write $(O_1, A_1)$ e(described information) or $(A_1, O_1)$ e(described information) for the extended concept of $(O_1, A_1)$. A concept lattice is called a knowledge lattice if all of its formal concepts are replaced by their extended concepts.

Figure 2 presents an example of a knowledge lattice. Each node contains the intent, the extent and the number of objects in the extent.

3 Framework of unifying cluster analysis and association analysis

In this section, we propose a framework for unifying cluster analysis and association analysis (see Figure 3). From the database of transactions, we generate a data context that is described by the items and transactions. An efficient algorithm is then applied to generate the formal concepts. Once the formal concepts are produced, extended information is extracted with them, according to the needs of the mining task, to form extended concepts. Extended concepts can contain the intent, extent, support and a similarity description. The knowledge lattice can be generated from the extended concepts.
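The step from formal concepts to extended concepts can be illustrated with a small Python sketch. It is not the algorithm of section 4: the intents are enumerated with a plain intersection-based procedure that is only suitable for small contexts. The context table is an assumption reconstructed from the extents of Figure 2, and the names `ExtendedConcept`, `all_intents`, `knowledge_lattice` and `label` are hypothetical.

```python
from dataclasses import dataclass

# Data context reconstructed from the extents listed in Figure 2
# (an assumption: the table of Figure 1 is not legible in this copy).
CONTEXT = {
    1: {"a1", "a2", "a7"},
    2: {"a1", "a2", "a7", "a8"},
    3: {"a1", "a2", "a3", "a7", "a8"},
    4: {"a1", "a3", "a7", "a8"},
    5: {"a1", "a2", "a4", "a6"},
    6: {"a1", "a2", "a3", "a4", "a6"},
    7: {"a1", "a3", "a4", "a5"},
    8: {"a1", "a3", "a4", "a6"},
}


@dataclass(frozen=True)
class ExtendedConcept:
    intent: frozenset
    extent: frozenset
    size: int       # number of objects, the e(n) label of Figure 2
    support: float  # size divided by the number of transactions


def all_intents(context):
    """All closed itemsets: the full item set plus every intersection of rows."""
    items = frozenset().union(*context.values())
    intents = {items}
    for row in context.values():
        intents |= {i & row for i in intents}
    return intents


def knowledge_lattice(context):
    """One extended concept per closed itemset, largest extent first."""
    total = len(context)
    nodes = []
    for intent in all_intents(context):
        extent = frozenset(o for o, row in context.items() if intent <= row)
        nodes.append(ExtendedConcept(intent, extent, len(extent), len(extent) / total))
    nodes.sort(key=lambda c: (-c.size, sorted(c.intent)))
    return nodes


def label(c):
    """Render a node in the (intent, extent) e(n) style of Figure 2."""
    intent = " ".join(sorted(c.intent))
    extent = "".join(str(o) for o in sorted(c.extent))
    return "({}, {}) e({})".format(intent, extent, c.size)


if __name__ == "__main__":
    for node in knowledge_lattice(CONTEXT):
        print(label(node), "support={:.3f}".format(node.support))
```

On the reconstructed context this yields 18 nodes, the same intents and extents as the node list of Figure 2, with a relative support added to each node as one possible piece of extended information.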
Finally, closed frequent patterns and clusters can be produced from the same knowledge lattice or extended concepts.

[Figure 3: Framework of unifying cluster analysis and association analysis. Database, data context, formal concepts, knowledge lattice (extended concepts: concepts, support, description, ...), closed frequent patterns and clusters]

The data context is the base of the mining task. It needs an understandable description of each item and transaction. Sometimes we need to reduce, transpose or order the data context. For example, when the data are high-dimensional, especially when the object set is smaller than the item set, we can transpose the data context before generating the formal concepts. Analyzing most lattice algorithms, we find that an algorithm focuses on either the items or the transactions of the data context, so its performance can differ according to the number of items and transactions.

In this framework, the generation of the formal concepts and the knowledge lattice is the essential step. The key to applications is the performance of the algorithm that generates the formal concepts or closed itemsets. So we focus on lattice algorithms and, in the next section, propose a new algorithm based on the lattice structure to generate frequent patterns.

4 New algorithm

In this section, we analyze the search space of the closed itemsets of a data context, and then present a new algorithm to analyze and represent large data.

4.1 Analysis of the search space

Using the example of a data context with 4 attributes ($a_m$, $a_{m-1}$, $a_{m-2}$, $a_{m-3}$), we analyze the search space of closed itemsets (see Figure 4).

[Figure 4: An example of the search space of closed itemsets. The nodes are the non-empty subsets of $\{a_{m-3}, a_{m-2}, a_{m-1}, a_m\}$, from the single items up to $a_{m-3}a_{m-2}a_{m-1}a_m$]

Figure 4 shows that every node may be a closed itemset for a data context with 4 attributes. The search space of closed itemsets becomes very large when there are many attributes, and it is hard for a concept lattice structure to cope with the complexity of very large data. So we propose a new method that decomposes the search space and then deals with each partition separately. In order to discuss the decomposition of the search space, we give the following definition.

Definition 8 Given an attribute $a_i \in A$ of the context (O, A, R) and a set E with $a_i \notin E$, we define $a_i \otimes E = \{\{a_i\} \cup X \text{ for all } X \in E\}$.

$$A_k = \{a_k\} \cup \{\{a_k\} \cup X_i\} = a_k \otimes \{a_{k+1}, a_{k+2}, \ldots, a_m\}, \quad X_i \in A_j,\ k+1 \le j \le m$$

We can decompose the search space into many partitions such as $A_m$, $A_{m-1}$, $A_{m-2}$, $A_{m-3}$, or combinations of some of them. In each partition we can look for the closed itemsets independently. But the problem is how to balance the number of closed itemsets across the partitions, and whether each partition contains closed itemsets at all. For example, for the data context of Figure 1, we can decompose the search space into the following 4 partitions:

[Figure 5: Decomposition of the search space of the data context of Figure 1 into 4 partitions formed from $A_8$, $A_7$, $A_6$, $A_5$, $A_4$, $A_3$, $A_2$ and $A_1$]

The result is that there are no closed itemsets in partition 4, partition 3 and partition 2, but 17 closed itemsets in partition 1. So this strategy for decomposing the search space has a problem, and we need to improve it. One solution is to order the data context.

Definition 9 A data context is called an ordered data context if the items of the data context are ordered by their number of objects, from the smallest to the biggest, and the items with the same objects are merged into one item. The ordered data context of (O, A, R) is again written (O, A, R), with the items of A taken in this order.

Figure 6 shows the ordered data context of the data context of Figure 1. From the ordered data context, using the same method as above to decompose the search space into 4 partitions, we get closed itemsets in each partition. We can prove that there exist closed itemsets in each $A_i$ of an ordered data context. For example, there are respectively 6, 6, 4 and 1 closed itemsets in the 4 partitions of the ordered data context (see Figure 6).

Definition 10 For an item $a_i$ of a data context (O, A, R), all subsets of $\{a_i, a_{i+1}, \ldots, a_{m-1}, a_m\}$ that include $a_i$ form a search sub-space (for closed itemsets) that is called the folding search sub-space (F3S) of $a_i$, denoted $F3S_i$.
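The effect of ordering the data context on the folding search sub-spaces can be checked numerically. The sketch below is in Python and rests on several assumptions that are not stated in this form in the text: the context table is reconstructed from the extents of Figure 2; the column order `ORDERED` is read from the header of Figure 6, whose last label is truncated and is taken to be $a_1$; a closed itemset is assigned to $F3S_i$ when $a_i$ is its first item in the chosen order; and the grouping of positions into four blocks, `BLOCKS`, is a guess that happens to reproduce the reported counts 6, 6, 4, 1. Only closed itemsets with a non-empty extent are counted.

```python
# Data context reconstructed from the extents listed in Figure 2
# (an assumption: the table of Figure 1 is not legible in this copy).
CONTEXT = {
    1: {"a1", "a2", "a7"},
    2: {"a1", "a2", "a7", "a8"},
    3: {"a1", "a2", "a3", "a7", "a8"},
    4: {"a1", "a3", "a7", "a8"},
    5: {"a1", "a2", "a4", "a6"},
    6: {"a1", "a2", "a3", "a4", "a6"},
    7: {"a1", "a3", "a4", "a5"},
    8: {"a1", "a3", "a4", "a6"},
}
ORIGINAL = ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8"]
# Column order of Figure 6 (assumption: the truncated last label is a1).
ORDERED = ["a5", "a8", "a6", "a7", "a4", "a3", "a2", "a1"]
# Hypothetical grouping of positions (1-based, inclusive) into 4 partitions.
BLOCKS = [(5, 8), (3, 4), (2, 2), (1, 1)]


def item_support(item):
    """Number of objects that contain `item`."""
    return sum(item in row for row in CONTEXT.values())


def closed_itemsets():
    """Closed itemsets with a non-empty extent (brute force, small data only)."""
    items = frozenset().union(*CONTEXT.values())
    intents = {items}
    for row in CONTEXT.values():
        intents |= {i & row for i in intents}
    return [i for i in intents if any(i <= row for row in CONTEXT.values())]


def f3s_counts(order):
    """counts[i-1] = number of closed itemsets whose first item in `order` is the i-th."""
    position = {item: k for k, item in enumerate(order)}
    counts = [0] * len(order)
    for itemset in closed_itemsets():
        counts[min(position[a] for a in itemset)] += 1
    return counts


def group(counts, blocks):
    """Sum the per-sub-space counts over each block of positions."""
    return [sum(counts[lo - 1:hi]) for lo, hi in blocks]


if __name__ == "__main__":
    print(f3s_counts(ORIGINAL))                # [17, 0, 0, 0, 0, 0, 0, 0]
    print(f3s_counts(ORDERED))                 # [1, 4, 4, 2, 2, 2, 1, 1]
    print(group(f3s_counts(ORDERED), BLOCKS))  # [6, 6, 4, 1]
```

With the original item order all 17 closed itemsets fall into a single sub-space, because $a_1$ occurs in every object. After ordering by support every sub-space holds at least one closed itemset, which is the balancing effect that Definition 9 is meant to achieve.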

[Figure 6: An example of ordered data context, with the items in the column order $a_5$, $a_8$, $a_6$, $a_7$, $a_4$, $a_3$, $a_2$, $a_1$]

Summing up the analysis of the search space of closed itemsets: we can order the data context into an ordered data context, for which the search space of closed itemsets is

$$F3S_m \cup F3S_{m-1} \cup F3S_{m-2} \cup F3S_{m-3} \cup \cdots \cup F3S_i \cup \cdots \cup F3S_1,$$

and then decompose the search space into some partitions. We can generate closed itemsets in each partition.

4.2 The new algorithm

Definition 11 Let $A_1 \subseteq A$ be an itemset, $A_1 = \{b_1, b_2, \ldots, b_i, \ldots, b_k\}$, $b_i \in A$, and let $A_1$ be an infrequent itemset. The candidate of the next closed itemset of $A_1$, denoted $CA_1$, is $A_1 \oplus a_i = (A_1 \cap \{a_1, a_2, \ldots, a_{i-1}\}) \cup \{a_i\}$, where $a_i < b_k$, $a_i \notin A_1$, and $a_i$ is the biggest item of A with $A_1 < A_1 \oplus a_i$, following the order $a_1 < \ldots < a_i < \ldots < a_m$.

We propose a new algorithm that can be used to generate closed itemsets or frequent closed itemsets. The principle of the algorithm is presented by the following steps:

- Decompose the search space into some partitions
  - Convert (O, A, R) to its ordered data context (O, A, R), where $A = \{a_1, a_2, \ldots, a_i, \ldots, a_m\}$
  - In order to balance the number of closed itemsets of the partitions, some items of A are chosen to form an ordered set P
    1) $P = \{a_{P_T}, a_{P_{T-1}}, \ldots, a_{P_k}, \ldots, a_{P_1}\}$, $|P| = T$, $a_{P_k} \in A$
    2) $a_{P_T} < a_{P_{T-1}} < \ldots < a_{P_k} < \ldots < a_{P_2} < a_{P_1} = a_m$
    3) A parameter DP ($0 < DP < 1$) is used to choose $a_{P_k}$, where $DP = |\{a_1, \ldots, a_{P_k}\}| / |\{a_1, \ldots, a_{P_{k-1}}\}|$
  - Get the partitions $[a_{P_k}, a_{P_{k+1}})$ and $[a_{P_T})$
    1) The interval $[a_{P_k}, a_{P_{k+1}})$ is the search space from item $a_{P_k}$ to $a_{P_{k+1}}$
    2) $[a_{P_k}, a_{P_{k+1}}) = \bigcup_{P_k \le i < P_{k+1}} (F3S_i)$ and $[a_{P_T}) = F3S_{P_T}$
- Generate the next frequent closed itemset from an itemset $A_1$ for each partition
  - If the support of $A_1$ is at least minsupport, we search the next closure of $A_1$
  - If the support of $A_1$ is below minsupport, we search $CA_1$. The closed itemsets between $A_1$ and $CA_1$ are ignored

Conceptual clustering [5, 12] seeks clusters by means of concept structures. One approach to conceptual clustering is based on the concept lattice [3].
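How one set of closed itemsets can serve both tasks may be sketched as follows. This Python example does not implement the partitioned search or the candidate skipping of Definition 11; it enumerates the closed itemsets by brute force and then derives both outputs from that single result. The context table is an assumption reconstructed from the extents of Figure 2, minsupport is taken as an absolute object count, and the function names and the `min_size` threshold are illustrative choices.

```python
# Data context reconstructed from the extents listed in Figure 2
# (an assumption: the table of Figure 1 is not legible in this copy).
CONTEXT = {
    1: {"a1", "a2", "a7"},
    2: {"a1", "a2", "a7", "a8"},
    3: {"a1", "a2", "a3", "a7", "a8"},
    4: {"a1", "a3", "a7", "a8"},
    5: {"a1", "a2", "a4", "a6"},
    6: {"a1", "a2", "a3", "a4", "a6"},
    7: {"a1", "a3", "a4", "a5"},
    8: {"a1", "a3", "a4", "a6"},
}


def closed_itemsets_with_extents(context):
    """Map each closed itemset with a non-empty extent to that extent."""
    items = frozenset().union(*context.values())
    intents = {items}
    for row in context.values():
        intents |= {i & row for i in intents}
    closed = {}
    for intent in intents:
        extent = frozenset(o for o, row in context.items() if intent <= row)
        if extent:
            closed[intent] = extent
    return closed


def frequent_closed_itemsets(closed, minsupport):
    """Association side: closed itemsets supported by at least `minsupport` objects."""
    return {intent: len(extent) for intent, extent in closed.items()
            if len(extent) >= minsupport}


def conceptual_clusters(closed, min_size=2):
    """Clustering side: extents as overlapping clusters, described by their intents."""
    return {extent: intent for intent, extent in closed.items()
            if len(extent) >= min_size}


if __name__ == "__main__":
    closed = closed_itemsets_with_extents(CONTEXT)
    for intent, count in sorted(frequent_closed_itemsets(closed, 3).items(),
                                key=lambda kv: (-kv[1], sorted(kv[0]))):
        print(sorted(intent), count)
    for extent, intent in conceptual_clusters(closed).items():
        print(sorted(extent), "described by", sorted(intent))
```

The closed itemsets are computed once and then only filtered, which is the cost saving the strategy aims at. The clusters overlap: object 3, for instance, belongs both to the cluster {3, 4} and to the cluster {3, 6}.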
When minsupport = 1, this algorithm can be used to generate all closed itemsets and then overlapping conceptual clusters based on the algorithm of [3].

5 Experimental results

We test our algorithm for generating frequent closed itemsets and clusters on some data sets of UCI [14] (see Table 1).

Table 1: The datasets for experiments (columns: DataSet, Objects, Items, Closed itemsets)
1) breast-cancer-wisconsin
2) house-votes
3) audiology
4) lung-cancer
5) agaricus-lepiota
6) promoters
7) soybean-large
8) dermatology

The algorithm is implemented in Java and tested on all of the above contexts in two cases, in order to compare and analyze its performance:

Case 1: generating frequent itemsets and clusters separately from the context;
Case 2: generating frequent itemsets and clusters from closed itemsets based on the new strategy.

The experimental results (see Figure 7) show that the total time cost of Case 1 is much higher than that of Case 2. So integrating cluster analysis and association analysis on the basis of closed itemset mining can reduce the cost of the two mining tasks for large sets of transactions.

[Figure 7: The time cost (milliseconds) for two cases on test datasets]

6 Conclusion and further work

In this paper, we propose a strategy that unifies cluster analysis and association analysis for transactional databases in order to reduce the cost of the data mining tasks. From the data context, a knowledge lattice can be generated with extended concepts. Extended concepts can contain the intent, extent, support and a similarity description. So closed frequent patterns and clusters can be produced from the same knowledge lattice or extended concepts. Furthermore, we present a new algorithm for the analysis of large and high-dimensional data. In future work, we will develop the algorithm to analyze huge and distributed data, and improve it for mining non-transactional databases.

Acknowledgements: This work is supported by Science Foundation Ireland via the Autonomic Management of Communications Networks and Services programme (grant no. 04/IN3/I4040C) and the project of EU IST Network of Excellence OPAALS.

References:

[1] G. Birkhoff. Lattice Theory. American Mathematical Society, Providence, RI, 3rd edition.
[2] C. Carpineto and G. Romano. Galois: An order-theoretic approach to conceptual clustering. In Proc. of the Machine Learning conf., pages 33-40.
[3] C. Carpineto and G. Romano. Galois: An order-theoretic approach to conceptual clustering. In Proceedings of ICML 93, pages 33-40, Amherst, July.
[4] C. Carpineto and G. Romano. Concept Data Analysis: Theory and Applications. John Wiley and Sons.
[5] D. H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, (2).
[6] H. Fu and E. Mephu Nguifo. Partitioning large data to scale up lattice-based algorithm. In Proceedings of ICTAI03, Sacramento, CA, November. IEEE Computer Press.
[7] H. Fu and E. Mephu Nguifo. Mining frequent closed itemsets for large data. In Proceedings of The 2004 International Conference on Machine Learning and Applications (ICMLA04), Louisville, USA, December.
[8] B. Ganter and R. Wille. Formal Concept Analysis. Mathematical Foundations. Springer.
[9] R. Godin, G. Mineau, R. Missaoui, and H. Mili. Méthodes de classification conceptuelle basées sur les treillis de Galois et applications. Revue d'intelligence artificielle, 9(2).
[10] R. Godin, R. Missaoui, and A. April. Experimental comparison of Galois lattice browsing with conventional information retrieval methods. Internat. J. Man-Machine Studies, (38).
[11] D. Kourie and G. Oosthuizen. Lattices in Machine Learning: Complexity Issues. Acta Informatica, 35(4).
[12] M. Lebowitz. Experiments with incremental concept formation: Unimem. Machine Learning, (2).
[13] E. Mephu Nguifo and P. Njiwoua. Treillis de concepts et classification supervisée. Technique et Science Informatiques, 24. Hermes-Lavoisier.
[14] C. Merz and P. Murphy. UCI Repository of Machine Learning databases.
[15] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemsets lattices. Journal of Information Systems, 24(1):25-46.