Project Report


In this report, we briefly introduce the application scenario of association rule mining, give details of our implementation of the apriori algorithm, and comment on the mined rules. Instructions for using the program are also given.

1. Application Scenario

Association rule mining finds interesting association relationships among a large set of data items. With massive amounts of data continuously being collected and stored in databases, many industries are becoming interested in mining association rules from their databases. [1]

1.1 Market Basket Analysis

A typical example of association rule mining is market basket analysis. This process analyzes customer buying habits by finding associations among the different items that customers place in their shopping baskets. The discovery of such associations can help retailers develop marketing strategies by showing which items are frequently purchased together. For instance, market basket analysis may help managers optimize store layouts. If customers who purchase milk also tend to buy bread at the same time, then placing milk close to or opposite bread may help to increase the sales of both items. [1]

1.2 Database

The program apriori is designed to generate strong association rules from a Boolean-valued database that looks like this:

    Item1 Item2 Item3 Item4 Item5
    y     y     y     n     n
    y     n     y     y     y

That is, the first line lists all the distinct item names, and each of the remaining lines is a Boolean-valued vector in which y indicates that the corresponding item appears in that transaction and n indicates that it does not.

However, if the database looks like this:

    Item1 Item2 Item3
    Item1 Item3 Item4 Item5

we should first use the pre-processing program convert to turn it into a Boolean-valued file. The program convert works as follows:

First, find all the distinct items by scanning the whole database and save them as the item name line, which becomes the first line of a new text file newdata.txt. Next, turn each line of the source file into a Boolean-valued vector of y and n values, depending on whether each item of the item name line appears in that line. Each vector therefore has the same length, which is exactly the total number of distinct items. Then save this vector into newdata.txt.

In this project, we use a supermarket transaction database transaction.txt to mine association rules. This database comes from the software package CBA2.0 of the National University of Singapore. [2] It looks like this:

    newspaper, cd, battery, sweets, soya_sauce, rice
    rice, sugar, tomato_sauce, apple, pamper, pacifier

First we use the pre-processing program convert to turn transaction.txt into the Boolean-valued file supmart.txt. Then we run the program apriori on supmart.txt to get all the association rules we might be interested in.

To test the robustness of our program, we also use a much larger database votes.txt (Congressional Voting Records of the United States in 1984, from the UCI Machine Learning Repository). In this database all attributes are already Boolean-valued. We delete the first column because it holds the Republican/Democrat class label, which the file was originally intended to classify. Then we run the program apriori on votes.txt to get the association rules about the voting records. [3]

2. Implementation of Algorithms

In this project, we use a number of C functions to implement the apriori algorithm and generate association rules.

2.1 Data Structure

The key point in implementing this project is setting up good data structures to represent each itemset and to store all the frequent itemsets:

First, we use struct MATRIX to store the number of distinct items, the number of transaction records, all the item names, and all the Boolean values of the database. The size of the data matrix is determined dynamically.

Second, to represent a given itemset, we use struct VECTOR, which holds the itemset frequency and an itemset vector whose length equals the number of distinct items.

Third, to link all the frequent k-itemsets into a list, we use struct ITEMSETS, which holds a struct VECTOR and a pointer to the next frequent k-itemset in the list. All operations on the frequent k-itemsets then go through the head pointer Lk of the list.

For all the other supplementary data structures, please see the source code for details.

2.2 Algorithm

Market basket analysis can be divided into two sub-problems:

1. Find all frequent itemsets, that is, those whose support is above the minimum support threshold.
2. Generate from the frequent itemsets the strong association rules that satisfy the minimum confidence threshold. [1]

2.2.1 Data Processing

First, we use the function file_size to scan the database and determine the number of distinct items and the number of transaction records.

Second, we use the function init_struct to initialize the data matrix, all the head pointers L1 to Lk, and some other supplementary data structures.

Third, we use the function read_data to store all the item names and Boolean values into the data matrix.

2.2.2 Apriori Algorithm

Apriori is an influential algorithm for finding frequent itemsets. The first pass of the algorithm simply uses the function getl1 to count item occurrences and determine the frequent 1-itemsets. A subsequent pass, say pass k, consists of two phases. First, the frequent itemsets Lk-1 found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the function getck described below. Next, the data matrix is scanned and the support of each candidate in Ck is counted. For fast counting, we use the function be_subset to determine whether a candidate in Ck is contained in a given transaction. [4]

There are two steps in the function getck. First, in the join step, we join Lk-1 with itself to generate potential candidates. Next, in the prune step, we use the function infqn_subset to remove all candidates that have a subset that is not frequent. The pruning is based on the apriori property that all non-empty subsets of a frequent itemset must be frequent as well. [1] The function getck therefore returns a superset of the set of all frequent k-itemsets.

We also use the function display_itemsets to save all the frequent itemsets into a new text file itemsets.txt.
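The support counting and candidate generation described above can be sketched on plain 0/1 vectors as follows. This is only a minimal illustration and not the code of be_subset, getck or infqn_subset: the function names, the flat row-major arrays and the union-based join are all assumptions made for the example. In particular, this simplified join can produce the same candidate more than once, so a caller would still have to discard duplicates.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if every item of `set` also occurs in `row` (both 0/1 vectors). */
int is_subset(const int *set, const int *row, size_t n_items)
{
    for (size_t i = 0; i < n_items; i++)
        if (set[i] && !row[i])
            return 0;
    return 1;
}

/* Counts the rows of the row-major 0/1 matrix `data` that contain `set`. */
size_t count_support(const int *set, const int *data,
                     size_t n_rows, size_t n_items)
{
    size_t hits = 0;
    for (size_t r = 0; r < n_rows; r++)
        hits += (size_t)is_subset(set, data + r * n_items, n_items);
    return hits;
}

/* Join step (simplified): writes the union of the (k-1)-itemsets a and b
   into out and returns 1 if that union has exactly k items. */
int join(const int *a, const int *b, int *out, size_t n_items, size_t k)
{
    size_t count = 0;
    assert(k >= 2);                  /* candidates exist from pass 2 onward */
    for (size_t i = 0; i < n_items; i++) {
        out[i] = a[i] || b[i];       /* normalizes every entry to 0 or 1 */
        count += (size_t)out[i];
    }
    return count == k;
}

/* Prune step: returns 1 if some (k-1)-subset of the candidate `cand` is
   missing from `freq`, a row-major list of n_freq frequent (k-1)-itemsets.
   All vectors must hold exactly 0 or 1, because rows are compared bytewise. */
int has_infrequent_subset(int *cand, const int *freq,
                          size_t n_freq, size_t n_items)
{
    for (size_t i = 0; i < n_items; i++) {
        if (!cand[i])
            continue;
        cand[i] = 0;                 /* drop item i temporarily */
        int found = 0;
        for (size_t f = 0; f < n_freq && !found; f++)
            found = memcmp(cand, freq + f * n_items,
                           n_items * sizeof *cand) == 0;
        cand[i] = 1;                 /* restore the candidate */
        if (!found)
            return 1;
    }
    return 0;
}
```

For instance, if the two transactions {Item1, Item2, Item3} and {Item1, Item3, Item4, Item5} from Section 1.2 are stored as the rows of data, count_support returns 2 for the itemset {Item1, Item3} and 1 for {Item2}. A candidate survives pruning only when has_infrequent_subset returns 0.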

2.2.3 Generate Strong Association Rules

Once all the frequent itemsets have been found, it is straightforward to generate strong association rules from them as follows: for each frequent itemset l in Lk (k >= 2), generate all non-empty proper subsets of l. For every such subset s, output the rule s ==> (l-s) if support(l)/support(s) >= min_conf. [1]

In the function get_rules, we modify the algorithm to further prune the search space based on the apriori property as follows: since all the subsets of l must themselves be frequent itemsets of size 1 to k-1, we only need to visit each of the frequent 1- to (k-1)-itemset lists and, for each itemset in a list, check whether it is a subset of l (which is easy with the vector representation). If it is, and support(l)/support(s) >= min_conf, we output the rule s ==> (l-s). It is also very easy to generate l-s, since both l and s are represented by vectors of 1s and 0s. All the generated rules are saved into a new text file rules.txt by the function display_rule.

2.2.4 Free Memory

After all the strong association rules are generated, we use the function free_struct to free all the memory dynamically allocated for the frequent itemset lists and the data matrix.

3. Comments and Discussion

In our project, if we set the minimum support to 0.3 and the minimum confidence to 0.5, then 18 frequent itemsets and 17 strong association rules are generated from supmart.txt. One strong association rule looks like this:

    cd ==> soya_sauce (Support: 39.06%, Confidence: 66.67%)

This rule means that 39.06% of all the transaction records contain both cd and soya_sauce, and that 66.67% of the customers who purchased cd also bought soya_sauce. It is great fun to find many interesting patterns of this kind.

If we apply the same minimum support and confidence thresholds to the second database votes.txt (Congressional Voting Records of the United States in 1984), we get 91 frequent itemsets and 354 strong association rules. One strong association rule looks like this:

    education-spending ==> crime (Support: 36.32%, Confidence: 92.40%)

This rule means that 36.32% of all the voters supported both the education-spending policy and the crime policy, and that 92.40% of the voters who supported the education-spending policy also supported the crime policy.
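The support and confidence figures attached to each rule follow directly from itemset counts. The sketch below shows one way to compute them, together with the consequent l-s, on 0/1 vectors. It is an illustration under assumptions, not the code of get_rules: the function names are hypothetical, and the counts are assumed to be the itemset frequencies already collected during the apriori passes.

```c
#include <assert.h>
#include <stddef.h>

/* Consequent of the rule s ==> (l-s): the items that are in l but not in s.
   l, s and out are 0/1 vectors of length n_items. */
void difference(const int *l, const int *s, int *out, size_t n_items)
{
    for (size_t i = 0; i < n_items; i++)
        out[i] = l[i] && !s[i];
}

/* Support of an itemset as a fraction of all transaction records. */
double support(size_t count, size_t n_rows)
{
    assert(n_rows > 0);
    return (double)count / (double)n_rows;
}

/* Confidence of s ==> (l-s). The ratio support(l)/support(s) reduces to
   count_l / count_s because both supports share the same denominator. */
double confidence(size_t count_l, size_t count_s)
{
    assert(count_s > 0);
    return (double)count_l / (double)count_s;
}

/* Returns 1 if the rule s ==> (l-s) meets the minimum confidence. */
int is_strong(size_t count_l, size_t count_s, double min_conf)
{
    return confidence(count_l, count_s) >= min_conf;
}
```

As a made-up example, if l occurs in 2 of 5 transactions and s occurs in 3 of them, support(2, 5) gives 0.4 and confidence(2, 3) gives about 0.667, so the rule passes a minimum confidence of 0.5.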

If we set a lower minimum support and confidence, many more frequent itemsets and strong association rules may be generated, and the run time will be somewhat longer. In other words, raising the minimum support and confidence has the secondary effect of reducing computation time, which may be desirable for large data sets. [4]

4. Instructions to Use the Tool

4.1 Installation

Type:

    gcc convert.c -o convert.exe

to get the pre-processing program convert.exe. Type:

    gcc apriori.c -o apriori

to get the executable program apriori. Please see the README file in the proj directory for details.

4.2 How to use it?

If the database is not Boolean-valued, we should first use the pre-processing program convert.exe to turn it into a Boolean-valued one. Then just type: apriori. At the prompt, enter the name of the data file (supmart.txt or votes.txt). Then enter the minimum support (say 0.3, not 30%) and the minimum confidence (say 0.5, not 50%). The program will analyze the file and display all the frequent itemsets and association rules on the screen. You can then check the two result files itemsets.txt and rules.txt for details.

References:

[1]: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2000, ISBN
[2]: Data Mining Interestingness and Interaction. Available: Dec. 5, 2000
[3]: UCI Machine Learning Repository, ftp://ftp.ics.uci.edu/pub/machine-learning-databases/voting-records Available: Dec. 5, 2000
[4]: Introduction to Data Mining. Available: Dec. 5,