Mining Multi Level Association Rules Using Fuzzy Logic


Usha Rani 1, R. Vijaya Prakash 2, Dr. A. Govardhan 3
1 Research Scholar, JNTU, Hyderabad
2 Dept. of Computer Science & Engineering, SR Engineering College, Warangal
3 School of Information Technology, JNTU, Hyderabad

Abstract
Extracting multilevel association rules from transaction databases is one of the most common tasks in data mining. This paper proposes a multilevel association rule mining approach based on fuzzy concepts. It uses different fuzzy membership functions to retrieve association rules efficiently from the multilevel hierarchies that exist in a transaction dataset. In general, data can be spread over many hierarchies or levels, and retrieving association rules from such datasets is a tedious task. For this reason, we use fuzzy-set concepts to retrieve multilevel association rules. The approach proceeds top-down and uses fuzzy boundaries instead of sharp boundary intervals to derive large itemsets.

Keywords
Association Rules, Data Mining, Fuzzy Logic.

I. INTRODUCTION
Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years, due to the wide availability of huge amounts of data and the imminent need to turn such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from market analysis, fraud detection and customer retention to production control and science exploration.

Association rule mining is an important process in data mining that determines the correlations between items belonging to a transaction database [1][2][3]. Association rules can be used for marketing, planning and similar purposes. For example, in a market-basket analysis they can identify customer buying habits, such as: customers who buy a suit are more likely to buy a tie as well.
In general, every association rule must satisfy two user-specified constraints called support and confidence. The support of a rule X → Y is defined as the percentage of transactions that contain X ∪ Y, where X and Y are disjoint sets of items from the given dataset [4][5]. The confidence is defined as the ratio support(X ∪ Y)/support(X). The aim is to find all rules that satisfy the user-specified minimum support and minimum confidence values. The Apriori algorithm is widely used to generate association rules. It generates the rules step by step, which can cause high computational cost and repeated database scans.

The majority of algorithms for association rule mining deal with single-concept-level datasets, but most data is spread over different levels. Such datasets are called multilevel datasets or taxonomy datasets. Mining association rules from them gives the user more exact, accurate and useful information and therefore more knowledge about the data. Relevant item taxonomies are normally predefined and can be represented as hierarchy trees. In multilevel datasets, the data is available at different levels of abstraction, and these levels are represented by concept hierarchies. For example, a user may be interested not only in the association between "computer" and "printer", but also in the association between desktop PC price and laser printer price.

This paper proposes a fuzzy multiple-level association rule mining algorithm for extracting implicit knowledge from multilevel datasets. It integrates fuzzy set concepts, data-mining technologies and multiple-level taxonomy to find fuzzy association rules in transaction datasets.

II. APRIORI ALGORITHM AND ITS PROPERTY
Apriori employs an iterative approach known as level-wise search, where $k$-itemsets are used to explore $(k+1)$-itemsets.
Apriori exploits the following property: if an itemset is frequent, then all its subsets are also frequent [8]. In other words, a frequent itemset can only be built from frequent subsets. A $k$-itemset is an itemset containing $k$ items. Let $L_k$ denote the set of frequent $k$-itemsets and $C_k$ the set of candidate $k$-itemsets. The algorithm to generate the frequent itemsets is as follows:
i. $C_k$, the set of candidate $k$-itemsets, is generated by joining $L_{k-1}$ with itself.
ii. $C_k$ is a superset of $L_k$; that is, its members may or may not be frequent, but all of the frequent $k$-itemsets are included in $C_k$.
iii. All candidates with a count greater than or equal to the minimum support count are frequent and belong to $L_k$.
iv. Itemsets in $C_k$ that have a $(k-1)$-subset not in $L_{k-1}$ are deleted.
v. This process is repeated until no more frequent $k$-itemsets can be found.
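The level-wise search described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `apriori`, the `min_count` parameter and the suit/tie/shirt transactions are all assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count step: one pass over the transactions per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical market-basket data, for illustration only.
baskets = [
    {"suit", "tie"},
    {"suit", "tie", "shirt"},
    {"suit", "shirt"},
    {"tie"},
]
frequent = apriori(baskets, min_count=2)
```

With these four baskets and a minimum count of 2, the pair {tie, shirt} occurs only once, so the candidate {suit, tie, shirt} is pruned before it is ever counted.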

III. MULTILEVEL ASSOCIATION RULES CONCEPT
Mining association rules at multiple concept levels may lead to the discovery of more general and important knowledge from data. Relevant item taxonomies are usually predefined in real-world applications and can be represented as hierarchy trees. Terminal nodes on the trees represent actual items appearing in transactions; internal nodes represent classes or concepts formed from lower-level nodes [10].

Fig. I: Example taxonomy

In Fig. I, the root node is at level 0; the internal nodes representing categories such as Science and Fantasy are at level 1; internal nodes such as Futuristic are at level 2; and the terminal nodes representing books such as Lord of the Rings are at level 3. Only terminal nodes appear in transactions [3]. These hierarchies are encoded using sequences of numbers and the symbol * according to their positions in the hierarchy tree. For example, the internal node Science in Fig. I would be represented by 1-*-*, the internal node Epic by 2-1-*, and a terminal node such as The Chronicles of Narnia by a code with a digit in every position [3].

IV. THE PROPOSED MODEL
The proposed algorithm combines data mining, a multilevel taxonomy and a set of membership functions to discover fuzzy association rules in a given transaction dataset. For this purpose every item is assigned a sequence number. Below, $k$ denotes the current level of the taxonomy and $r$ the number of items in the itemsets currently being processed. The proposed algorithm is:

1. Use a sequence of numbers and the symbol * to encode the predefined taxonomy. The encoding starts at the root with the value zero and continues to the next level from left to right, incrementing by one.

2. For each transaction $D_i$, $1 \le i \le n$ ($n$ is the number of transactions), group the items that share the same first $k$ digits, compute the item count of each group in the transaction, and eliminate the groups whose count is less than $\alpha$, where $\alpha$ is the predefined minimum support value at the current level.

3. Use a different membership function for different data items, since each data item has its own characteristics. Each transaction $D_i$ contains items $I_j$, where $I_j$ is the $j$-th item at level $k$, with a quantitative value $Q_{ij}$. $Q_{ij}$ is converted into a fuzzy set $f_{ij}$ with $h_j$ fuzzy regions, where $R_{jl}$ ($1 \le l \le h_j$) is the $l$-th fuzzy region of $I_j$ and $f_{ijl}$ is the membership value of $Q_{ij}$ in $R_{jl}$. The fuzzy set is written as

$$f_{ij} = \frac{f_{ij1}}{R_{j1}} + \frac{f_{ij2}}{R_{j2}} + \cdots + \frac{f_{ijh_j}}{R_{jh_j}}$$

4. Compute the count of each fuzzy region $R_{jl}$ over the dataset as

$$Count_{jl} = \sum_{i=1}^{n} f_{ijl}$$

5. Find the maximum count, $MaxCount_j$, among the $Count_{jl}$ values ($1 \le l \le h_j$):

$$MaxCount_j = \max_{1 \le l \le h_j} Count_{jl}$$

If $MaxCount_j$ is greater than or equal to the minimum support threshold $\alpha$, place the corresponding fuzzy region in the set of frequent 1-itemsets $L_1^k$.

6. If $L_1^k$ is empty, increase $k$ by one and, with $r = 1$, go to step 2; otherwise go to the next step.

7. Generate the candidate sets for the current value of $r$:
i) If $r = 2$, produce the candidate set $C_2^k$, the set of candidate itemsets with 2 items at level $k$.
ii) If $r > 2$, generate the candidate set $C_r^k$, the set of candidate itemsets with $r$ items at level $k$, from $L_{r-1}^k$.

8. For each candidate $r$-itemset $S$ with items $(S_1, S_2, \ldots, S_r)$ in $C_r^k$:
i) Compute the fuzzy value of $S$ in each transaction using the minimum operator of fuzzy logic, $f_{iS} = \min(f_{iS_1}, f_{iS_2}, \ldots, f_{iS_r})$.
ii) Compute $Count_S = \sum_{i=1}^{n} f_{iS}$.
iii) If $Count_S$ is greater than or equal to the minimum support value, insert $S$ into $L_r^k$.

9. If $L_r^k$ is empty, increase $k$ by one and go to the next step; otherwise increase $r$ by one and go to step 7.

10. If $k > p$, where $p$ is the number of levels in the taxonomy, go to step 11; otherwise set $r = 1$ and go to step 2.

11. Form the fuzzy association rules for every frequent $r$-itemset $S = (S_1, S_2, \ldots, S_r)$, $r \ge 2$: find all rules $X \rightarrow Y$ where $X \subset S$, $Y \subset S$, $X \cap Y = \emptyset$ and $X \cup Y = S$. Compute the confidence value of each rule by

$$conf(X \rightarrow Y) = \frac{\sum_{i=1}^{n} f_{iS}}{\sum_{i=1}^{n} f_{iX}}$$

where $f_{iX}$ is the minimum membership value among the items of $X$ in transaction $D_i$.

12. Select the rules whose confidence values are greater than or equal to the predefined confidence threshold.

V. AN EXAMPLE
To illustrate the algorithm, we consider dairy sales transactions from a sales dataset, shown in Table I. The taxonomy is shown in Fig. II.

Fig. II: Dairy Sales Taxonomy

This taxonomy is encoded as specified in step 1 of the algorithm; the result is shown in Fig. III.

Fig. III: Dairy Sales Encoded Taxonomy

In Fig. III, the dairy sales taxonomy is divided into three classes: cheese, milk and curd. Each class has sub-items that specify the type of dairy product and the producing company. For each class we use its own membership function with three fuzzy regions: low, middle and high. The membership functions for cheese, milk and curd are shown in Fig. IV, Fig. V and Fig. VI respectively.
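As a rough illustration of how a purchased quantity could be mapped onto the low, middle and high regions, the following Python sketch uses trapezoidal membership functions. The breakpoints in `REGIONS` are hypothetical: the actual curves are defined only in the figures, so the numbers here are assumptions chosen for the example, as are the helper names `trapezoid` and `fuzzify`.

```python
import math


def trapezoid(x, a, b, c, d):
    """Membership of x in a trapezoid with feet a, d and shoulders b, c.

    Assumes a <= b <= c <= d. Setting a == b or c == d gives a vertical edge.
    """
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)  # rising edge
    return (d - x) / (d - c)      # falling edge


# Hypothetical breakpoints for one item class (not taken from the figures).
REGIONS = {
    "low": (0, 0, 1, 6),
    "middle": (1, 6, 6, 11),
    "high": (6, 11, math.inf, math.inf),
}


def fuzzify(quantity, regions=REGIONS):
    """Return the membership value of a quantity in every fuzzy region."""
    return {name: trapezoid(quantity, *params) for name, params in regions.items()}


example = fuzzify(5)    # partly low, mostly middle under these breakpoints
saturated = fuzzify(12) # fully high under these breakpoints
```

Giving each item class its own `REGIONS` table is one simple way to realise the idea of a separate membership function per class.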
Table I. SIX EXAMPLE TRANSACTIONS

TID   Items
D1    (organic feta cheese, 1) (silk feta cheese, 4) (organic low fat milk, 4) (eagle low fat milk, 6) (organic bean curd, 7)
D2    (organic feta cheese, 3) (silk feta cheese, 3) (horizon cheddar cheese, 1) (horizon high fat milk, 5) (eagle high fat milk, 3) (horizon fruit curd, 4) (eagle fruit curd, 4)
D3    (organic low fat milk, 7) (horizon high fat milk, 8) (horizon bean curd, 5) (horizon fruit curd, 7)
D4    (organic feta cheese, 2) (organic low fat milk, 5) (horizon bean curd, 5)
D5    (organic low fat milk, 5) (eagle high fat milk, 4)
D6    (organic feta cheese, 3) (silk feta cheese, 10)

In Table I, every transaction is identified by a unique id called TID. Each item is represented as a pair: the item description, written from the lowest level to the highest, followed by the purchased quantity. For example, (organic feta cheese, 1) in D1 means that one unit of organic feta cheese was bought in that transaction; similarly, organic bean curd appears seven times in D1. Table I is encoded with respect to Fig. III; for example, the item organic feta cheese is encoded as 111. All the items in Table I are encoded and shown in Table II.

Table II. ENCODED TRANSACTION DATA

TID   Items
D1    (111, 1) (112, 4) (211, 4) (212, 6) (311, 7)
D2    (111, 3) (112, 3) (121, 1) (221, 5) (222, 3) (322, 4) (321, 4)
D3    (211, 7) (221, 8) (312, 5) (322, 7)
D4    (111, 2) (211, 5) (312, 5)
D5    (211, 5) (222, 4)
D6    (111, 3) (112, 10)

In Table II, some items of a transaction belong to the same level-1 concept. For example, (111, 1) and (112, 4) in D1 are the first and second items of the same hierarchy. Such items are represented together as 1** and their counts are summed, as shown in Table III.
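The grouping from encoded items to level-1 groups can be reproduced with a short Python sketch. The data is copied from Table II; the function name `group_by_level` and its `depth` parameter are assumptions made for this illustration.

```python
from collections import defaultdict


def group_by_level(items, k, depth=3):
    """Sum the quantities of encoded items that share their first k digits."""
    totals = defaultdict(int)
    for code, quantity in items:
        prefix = code[:k] + "*" * (depth - k)  # e.g. "111" -> "1**" for k = 1
        totals[prefix] += quantity
    return dict(totals)


# Encoded transactions, as listed in Table II.
table_ii = {
    "D1": [("111", 1), ("112", 4), ("211", 4), ("212", 6), ("311", 7)],
    "D2": [("111", 3), ("112", 3), ("121", 1), ("221", 5), ("222", 3),
           ("322", 4), ("321", 4)],
    "D3": [("211", 7), ("221", 8), ("312", 5), ("322", 7)],
    "D4": [("111", 2), ("211", 5), ("312", 5)],
    "D5": [("211", 5), ("222", 4)],
    "D6": [("111", 3), ("112", 10)],
}

level1 = {tid: group_by_level(items, k=1) for tid, items in table_ii.items()}
level2_d2 = group_by_level(table_ii["D2"], k=2)
```

Calling the same function with `k=2` yields the level-2 groups (11*, 12*, 21* and so on) that are needed when the algorithm moves down the taxonomy.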

Table III. LEVEL REPRESENTATION

TID   Items
D1    (1**, 5) (2**, 10) (3**, 7)
D2    (1**, 7) (2**, 8) (3**, 8)
D3    (2**, 15) (3**, 12)
D4    (1**, 2) (2**, 5) (3**, 5)
D5    (2**, 9)
D6    (1**, 13)

The itemsets in Table III are now converted into fuzzy sets using trapezoidal membership functions. For example, consider the itemset (1**, 5). According to Fig. III, this group belongs to cheese. Similarly, the itemset (2**, 10) belongs to milk and (3**, 8) to curd. Using the trapezoidal membership functions, the sales of each class are converted into three fuzzy regions named low, middle and high, as shown in Fig. IV, Fig. V and Fig. VI. In the resulting fuzzy sets, a value such as 0.2 is the degree of membership in the low region, 0.8 in the middle region and 1 in the high region. Table IV contains the fuzzy regions for all transactions.

Fig. IV: The membership functions for cheese sales

Fig. V: The membership functions for milk sales

After obtaining the fuzzy regions of the three groups, we sum the fuzzy values of each region. For example, consider the fuzzy region 1**.low: the sum of its fuzzy values over all transactions is 1.0. The sum for each individual region is shown in Table V.

Next, the fuzzy region with the highest count is selected for each group. For example, for group 1** the count of the low region is 1.0, that of the middle region is 1.8 and that of the high region is 1.2. Since the middle region has the highest count, 1.8, it is chosen as the representative of group 1** in the following steps. The same is done for the other groups. Each of these values is compared with the minimum support, and if it is greater than or equal to the predefined minimum support, the region is added to $L_1^1$. For example, if the minimum support value is 1, then the counts of 1**.middle, 2**.middle and 3**.high are all greater than the minimum support.
Thus these regions are added to $L_1^1$. From $L_1^1$, the two-member candidate set $C_2^1$ is generated. The fuzzy membership value of each two-member set in $C_2^1$ is calculated for every transaction from the predefined membership function of each individual item. For example, consider the two-member set {1**.middle, 3**.high}. Its fuzzy membership value for transaction D1 is min(0.8, 1) = 0.8. This operation must be carried out for all transactions. The candidate and frequent itemsets generated, and the final result for this transaction dataset, are shown in Tables VI, VII, VIII, IX and X.

We then derive the fuzzy association rules from the frequent itemsets obtained in the previous steps. All possible rules from the frequent itemsets found at the different levels have the following form:

{1**=middle} → {3**=high}
{3**=high} → {1**=middle}
{3**=high} → {11*=middle}
{11*=middle} → {3**=high}
{111=low} → {3**=high}
{3**=high} → {111=low}

The confidence value of every rule is compared with the predefined minimum confidence threshold, and the rules whose confidence is greater than or equal to the threshold are chosen as the final rules. The confidence values are shown in Table XI. For example, with a minimum confidence of 1, only the rules in Table XI whose confidence is at least 1 are retained.

Fig. VI: The membership functions for curd sales
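The minimum-operator step and the confidence check can be sketched in Python as follows. The per-transaction membership values in `rows` are hypothetical (only the D1 pair 0.8 and 1 is quoted in the text), and the confidence is computed as the fuzzy count of the whole itemset divided by the fuzzy count of the antecedent, which is an assumption based on the ratio support(X ∪ Y)/support(X) given in the introduction.

```python
def fuzzy_count(rows, itemset):
    """Sum over transactions of the minimum membership among the itemset's regions."""
    return sum(min(row.get(region, 0.0) for region in itemset) for row in rows)


def confidence(rows, antecedent, consequent):
    """Fuzzy count of (antecedent + consequent) divided by that of the antecedent."""
    base = fuzzy_count(rows, antecedent)
    if base == 0:
        return 0.0
    return fuzzy_count(rows, antecedent + consequent) / base


# Hypothetical membership values per transaction; a missing region counts as 0.
rows = [
    {"1**.middle": 0.8, "3**.high": 1.0},  # min = 0.8, as in the D1 example
    {"1**.middle": 0.6, "3**.high": 0.4},  # min = 0.4
    {"3**.high": 1.0},                     # 1**.middle absent, min = 0.0
]

pair_count = fuzzy_count(rows, ["1**.middle", "3**.high"])
forward = confidence(rows, ["1**.middle"], ["3**.high"])
backward = confidence(rows, ["3**.high"], ["1**.middle"])

# Keep only rules that reach a chosen (hypothetical) confidence threshold.
min_confidence = 0.8
kept = [name for name, value in [("1**.middle -> 3**.high", forward),
                                 ("3**.high -> 1**.middle", backward)]
        if value >= min_confidence]
```

Because the two directions of a rule share a numerator but have different denominators, one direction can pass the threshold while the other fails, as happens with these sample values.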

Final rules for a minimum confidence of 1:

{1**=middle} → {3**=high}
{3**=high} → {1**=middle}
{11*=middle} → {3**=high}

Table IV. THE LEVEL-1 FUZZY SETS TRANSFORMED FROM THE DATA IN TABLE III

TID   Level-1 fuzzy set
D1
D2
D3
D4
D5
D6

Table V. THE COUNTS OF THE LEVEL-1 FUZZY REGIONS

Items          Count
(1**.low)      1.0
(1**.middle)   1.8
(1**.high)     1.2
(2**.low)      0.4
(2**.middle)   2.4
(2**.high)     2.2
(3**.low)      0.0
(3**.middle)   0.5
(3**.high)     3.5

Table VI. THE SET OF FREQUENT 1-ITEMSETS FOR LEVEL ONE

Itemset        Count
(1**.middle)   1.8
(2**.middle)   2.4
(3**.high)     3.5

Table VII. THE SET OF CANDIDATE 2-ITEMSETS FOR LEVEL ONE

Itemset
(1**.middle, 2**.middle)
(1**.middle, 3**.high)
(2**.middle, 3**.high)

Table VIII. THE MEMBERSHIP VALUES FOR 1**.MIDDLE, 3**.HIGH

TID   1**.middle   3**.high   Min(1**.middle, 3**.high)
D1    0.8          1.0        0.8
D2
D3
D4
D5
D6

Table IX. THE COUNTS OF THE 2-ITEMSETS AT LEVEL 1

Itemset                    Count
(1**.middle, 2**.middle)   1.4
(1**.middle, 3**.high)     1.8
(2**.middle, 3**.high)     1.7

Table X. ALL FREQUENT ITEMSETS FOR LEVEL 1, LEVEL 2 AND LEVEL 3

Itemset                    Count
(1**.middle)               1.8
(2**.middle)               2.4
(3**.high)                 3.5
(1**.middle, 3**.high)     1.8
(11*.middle)               2.0
(21*.middle)               2.6
(31*.high)                 2.0
(22*.middle)               2.0
(32*.high)                 2.0
(11*.middle, 3**.high)     2.0
(111.low)                  3.0
(211.middle)               2.6
(111.middle, 3**.high)
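The step from Table V to Table VI, picking the region with the largest count in each group and keeping it only if it reaches the minimum support, can be sketched in Python. The counts are those of Table V, reading the three 3** rows as low, middle and high in that order (an assumption); the function name `frequent_regions` is also hypothetical.

```python
# Region counts per level-1 group, taken from Table V.
table_v = {
    "1**": {"low": 1.0, "middle": 1.8, "high": 1.2},
    "2**": {"low": 0.4, "middle": 2.4, "high": 2.2},
    "3**": {"low": 0.0, "middle": 0.5, "high": 3.5},
}


def frequent_regions(counts, min_support):
    """Keep, for each group, its highest-count region if it meets min_support."""
    result = {}
    for group, regions in counts.items():
        region, count = max(regions.items(), key=lambda pair: pair[1])
        if count >= min_support:
            result[f"{group}.{region}"] = count
    return result


level1_frequent = frequent_regions(table_v, min_support=1.0)
strict = frequent_regions(table_v, min_support=2.0)
```

With a minimum support of 1 this reproduces the three entries of Table VI; raising the threshold to 2.0 would drop 1**.middle, since its count is only 1.8.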

Table XI. CONFIDENCE VALUE FOR ALL RULES

Association rule                 Confidence
{1** = middle} → {3** = high}    1.0
{3** = high} → {1** = middle}    1.0
{3** = high} → {11* = middle}    0.5
{11* = middle} → {3** = high}    1.0
{111 = low} → {3** = high}       0.7
{3** = high} → {111 = low}       1.4

VI. EXPERIMENTAL RESULTS
The proposed algorithm was evaluated on 100 sales invoices of a food store covering 7 items. Association rules were mined using the predefined taxonomy over these 7 items and a predefined membership function for each item. The first level of the taxonomy contains 7 nodes that represent the items used in the test, the second level contains 14 nodes that represent the flavour or type of a specific product, and the third level contains 48 nodes that represent the manufacturing companies and factories. Each database transaction records the name of the product and the quantity purchased. An item may not appear twice in one transaction.

We first ran the proposed algorithm with different numbers of transactions. The number of rules produced for different predefined minimum support values, with the minimum confidence set to 0.5, is shown in Fig. VII.

Fig. VII. Rules generated with different min support

The number of rules produced for different user-defined minimum confidence values, based on the 100 customer transactions and a minimum support of 3, is shown in Fig. VIII.

As Fig. VII shows, the number of mined association rules grows as the number of transactions under study increases. This is expected: with more transactions, the number of frequent itemsets also increases, and as a result a greater number of rules are mined. Fig. VIII shows that as the predefined minimum confidence value increases, the number of mined association rules decreases.

Fig. VIII. Rules generated with different min confidence

VII. CONCLUSIONS
In this paper, we have employed fuzzy set concepts, a multiple-level taxonomy and a different membership function for each item to find fuzzy multilevel association rules in a given transaction dataset. The rules mined by this algorithm are valid for a specific time interval, but the conditions under which items are sold change over time. For example, the sales of a series of products may vary with the season. Therefore, in our next work we will present a method to generate the membership functions dynamically, in order to cope with changing conditions. Moreover, the minimum support value can be defined not only for each level of the predefined taxonomy but also for each individual item, which brings the output rules closer to the rules the user is looking for.

REFERENCES
[1] Agrawal, R., T. Imielinski and A. Swami, Mining associations between sets of items in massive databases. In The 1993 ACM SIGMOD Conference on Management of Data, Washington DC, USA.
[2] Han, J., Y. Cai and N. Cercone, Data-driven discovery of quantitative rules in relational databases. IEEE Trans. Knowledge and Data Eng., 5.
[3] Ying Lin, K., B. Chian Chien and T. Pei Hong, Mining Fuzzy Multiple-Level Association Rules from Quantitative Data. Applied Intelligence, 18.
[4] Han, J. and M. Kamber, Data Mining: Concepts and Techniques. The Morgan Kaufmann Series.
[5] Agrawal, R. and R. Srikant, Fast algorithms for mining association rules. 20th VLDB Conference.
[6] Intan, R., Mining Multidimensional Fuzzy Association Rules from a Normalized Database. International Conference on Convergence and Hybrid Information Technology.
[7] Ping Huang, Y. and L. Kao, Using Fuzzy Support and Confidence Setting to Mine Interesting Association Rules. IEEE Annual Meeting, 2.
[8] Khare, N., N. Adlakha and K. R. Pardasani, An Algorithm for Mining Multidimensional Fuzzy Association Rules. International Journal of Computer Science and Information Security, 5.
[9] Watanabe, T., A Fast Fuzzy Association Rules Mining Algorithm Utilizing Output Field Specification. Biomedical Soft Computing and Human Sciences, 16 (2).
[10] Liu, B., W. Hsu and Y. Ma, Mining association rules with multiple minimum supports. Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[11] Pei Hong, T., T. Jung Huang and Ch. Sheng Chang, Mining Multiple-level Association Rules Based on Pre-large Concepts. I-Tech, Vienna, Austria, pp: 438.
[12] Han, J. and Y. Fu, Discovery of Multiple-Level Association Rules from Large Databases. 21st Very Large Data Bases Conference, Morgan Kaufmann.