New Matrix Approach to Improve Apriori Algorithm

A. Rehab H. Alwa, B. Anasuya V Patil

Associate Prof., IT Faculty, Majan College-University College, Muscat, Oman, rehab.alwan@majancolleg.edu.om
Associate Lecturer, EC Department, Waljat College of Applied Sciences, Muscat, Oman, anasuya.patil@gmail.com

Abstract

In this paper a novel approach is proposed to improve the Apriori algorithm through the creation of a matrix file in MATLAB in which the database transactions are saved. Repeated scanning of the database is thus avoided: particular rows and columns are extracted and operated on, rather than scanning the entire database. Results can be easily visualized and interpreted in graphical form. The novel approach performed very well in comparison with the traditional Apriori algorithm because columns whose item count is less than the minimum support are pruned. The size of the matrix is therefore reduced drastically, which saves a lot of time and gives a noticeable improvement in speed by reducing redundant scanning of the database.

Keywords: Apriori Algorithm, Association Rule, MATLAB, Matrix

I. Introduction

Data Mining, or Knowledge Discovery in Databases (KDD), is the process of discovering knowledge from huge amounts of data. The rapid growth in electronic information has led to large stores of data in the form of databases, data warehouses and information repositories. With such growth, enterprises accumulate large amounts of data every day, and potentially valuable business information is hidden in these databases. It is therefore necessary to discover knowledge from them that might assist in decision making. Data mining uses various techniques to discover knowledge. The most popular data mining method is association rule mining, and a typical and widely used example of it is Market Basket Analysis [1].
The main intention is to determine correlations among a large set of items in a database. The Apriori algorithm was the first algorithm proposed for extracting association rules from large databases [2]. It consists of two procedures: first, finding the frequent itemsets in the database using a minimum support; second, constructing the association rules from the frequent itemsets with a specified confidence. The limitations of the algorithm are that it generates a large number of candidate itemsets and scans the database every time. In other words, if the database contains a huge number of transactions, scanning it to find the frequent itemsets is costly in time [1]. These limitations give researchers the opportunity to find more efficient algorithms that minimize the time and the number of database scans needed for knowledge discovery.

II. Related Works

Sheila A. Abaya [3] proposed a modified algorithm that introduces factors such as the set size and the set size frequency, which are used to eliminate non-significant candidate keys. With these factors, the modified algorithm minimizes candidate keys more efficiently and effectively, and generates possible associations of frequent items more rapidly. In terms of database passes, the modified Apriori accesses the database less often than the original, which makes its execution faster. Further research on faster pruning of candidate keys is under way, aimed at finding the ideal starting combination size.

Another approach to improving the performance of the Apriori algorithm was introduced by Sunil Kumar et al. [4], based on a bottom-up approach using probability and a matrix to identify frequent itemsets. The probability of each item's occurrence relative to the total number of transactions is used along with the bottom-up approach to find the frequent itemsets, from the largest frequent itemset to the smallest. A significant reduction in computation time was achieved [4].

The approach of Sanober Shaikh et al. [1] was to scan the database only once, at the start, and then build an undirected itemset graph. From this graph the frequent itemsets are found using the minimum support, and the association rules are generated using the minimum confidence without generating candidate itemsets. Execution efficiency improved distinctly compared with the traditional algorithm.

Mamta Dhanda, Sonali Guglani, et al. [5] used attributes such as profit and quantity to improve the efficiency of the Apriori algorithm, which gives valuable information to the customer as well as to the business. This approach extracts novel, interesting association patterns with emphasis on significance, quantity, profit and confidence [5], [6].

Libing Wu, Kui Gong, et al. [7] suggested a new algorithm based on an interested table, which constructs an ordered table of the interested items and traverses it to find frequent itemsets quickly.

Based on a study of the limitations of the Apriori algorithm and of the different approaches taken to improve it, this paper proposes a new approach to improve the functioning of the algorithm, explained in the following sections.

III. Suggested Apriori Algorithm using Matrix

Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence. Association rule generation is usually split into two separate steps: first, the minimum support is applied to find all frequent itemsets in a database; second, these frequent itemsets and the minimum confidence constraint are used to form rules.
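The two constraints can be illustrated with a minimal Python sketch. It assumes the usual definitions (support as the fraction of transactions containing an itemset, confidence as the support of the whole rule divided by the support of its antecedent). The function names and the basket data are hypothetical and are not taken from the paper.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent).

    Assumes the antecedent occurs in at least one transaction.
    """
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Illustrative baskets only (hypothetical data, not from the paper).
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

min_support = 0.5      # user-specified minimum support
min_confidence = 0.6   # user-specified minimum confidence

# Step 1: is {bread, milk} a frequent itemset?
pair_support = support({"bread", "milk"}, baskets)

# Step 2: does the rule bread -> milk meet the confidence constraint?
rule_confidence = confidence({"bread"}, {"milk"}, baskets)

rule_accepted = pair_support >= min_support and rule_confidence >= min_confidence
```

Here the rule is accepted only if both thresholds are met, which mirrors the two-step split described above.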
The general structure of the new approach is shown in Figure 1.

Figure 1: General Structure of New Apriori Algorithm

As Figure 1 shows, the suggested approach consists of two parts. The first part finds the frequent itemsets in the database, which is achieved in two steps:

1. Find the total number of times each itemset occurs.
2. Among these itemsets, find those whose count is greater than or equal to the %Minimum Support.

Figure 2: Generation of Frequent 1-Itemsets

The second part prunes the columns of the Matrix whose frequency count is less than the %Minimum Support. A new Matrix is formed from the itemsets that satisfy the minimum support, so it consists of frequent itemsets only. Hence the size of the Matrix is reduced drastically.

Figure 3: New Matrix Generation

The new Matrix approach is an enhancement of the Apriori algorithm in terms of reducing computation time and memory space. It is explained in detail in the following steps.

Frequent 1-Itemsets

1. Matrix A contains the transaction database, where each column represents an item number and each row represents a customer transaction. If the customer has purchased a particular item it is represented by 1; if not, it is represented by 0.
a. The frequency of every itemset is found; these itemsets are the candidates for frequent itemsets.
b. Matrix B contains the sum of each individual column; in other words, it holds the frequency of every item. The item frequencies are thus found without scanning the database again, because the matrix already exists.
c. From Matrix B, the frequencies greater than or equal to the %Minimum Support are selected, and the columns that are not frequent are pruned.
d. A new Matrix C is generated, which is the Frequent 1-Itemsets Matrix. At the same time, the item numbers of the frequent itemsets are stored in another Matrix D.

Consider the following example, with %Minimum Support = 50%:

Transaction ID    Items
T1                I1, I2, I3, I4
T2                I1, I2, I4
T3                I1, I5, I6
T4                I1, I4, I5
T5                I2, I4, I5
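The example transactions above can be encoded and filtered as follows. This is a minimal Python sketch of steps 1a to 1d, not the authors' MATLAB script. The variable names mirror the matrix names in the text, and the integer comparison used for the threshold is a choice made here to avoid floating-point rounding.

```python
# The five example transactions and the six items, as listed in the table.
transactions = {
    "T1": ["I1", "I2", "I3", "I4"],
    "T2": ["I1", "I2", "I4"],
    "T3": ["I1", "I5", "I6"],
    "T4": ["I1", "I4", "I5"],
    "T5": ["I2", "I4", "I5"],
}
items = ["I1", "I2", "I3", "I4", "I5", "I6"]
min_support_pct = 50

# Matrix A: one row per transaction, one column per item, 1 = purchased.
matrix_a = [
    [1 if item in bought else 0 for item in items]
    for bought in transactions.values()
]

# Matrix B: sum of each column, i.e. the frequency of each item.
matrix_b = [sum(column) for column in zip(*matrix_a)]

# A column is frequent when count / n_transactions >= min_support_pct / 100.
# Written with integers: count * 100 >= min_support_pct * n_transactions.
n_transactions = len(matrix_a)
frequent_columns = [
    j for j, count in enumerate(matrix_b)
    if count * 100 >= min_support_pct * n_transactions
]

# Matrix C: frequencies of the frequent items; Matrix D: their item numbers.
matrix_c = [matrix_b[j] for j in frequent_columns]
matrix_d = [items[j] for j in frequent_columns]

print(matrix_b)  # [4, 3, 1, 4, 3, 1]
print(matrix_d)  # ['I1', 'I2', 'I4', 'I5']
```

The database is read once to build `matrix_a`; every later count is a column sum on that matrix.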

TRANSACTION DATABASE
MATRIX A

Transaction   I1   I2   I3   I4   I5   I6
T1            1    1    1    1    0    0
T2            1    1    0    1    0    0
T3            1    0    0    0    1    1
T4            1    0    0    1    1    0
T5            0    1    0    1    1    0

MATRIX B

              I1   I2   I3   I4   I5   I6
Frequency     4    3    1    4    3    1

With %Minimum Support = 50%, the frequencies selected from Matrix B are those greater than or equal to (50/100)*5 = 2.5, i.e. an item must occur in at least 3 transactions. Items I3 and I6 are not frequent.

FREQUENCY 1-ITEMSETS MATRIX
MATRIX C

Frequency     4    3    4    3

FREQUENT 1-ITEMSETS MATRIX
MATRIX D

Frequent      I1   I2   I4   I5

2. The item numbers present in Matrix D are the frequent itemsets. From the transaction database Matrix A, only the columns specified in Matrix D are selected, creating a new Matrix Z that contains only frequent itemsets. Hence the new Matrix Z is much smaller than the transaction database Matrix A.

Frequent 2-Itemsets

1. Matrix Z is the new transaction database Matrix, which is used to find the frequent 2-itemsets.

MATRIX Z

Transaction   I1   I2   I4   I5
T1            1    1    1    0
T2            1    1    1    0
T3            1    0    0    1
T4            1    0    1    1
T5            0    1    1    1

The major advantage of using MATLAB is the availability of built-in functions that save a lot of time and memory space. A built-in function is used to generate the potential frequent item pairs from Matrix Z, i.e. (I1, I2), (I1, I4), (I1, I5), (I2, I4), (I2, I5) and (I4, I5). Each pair of columns is then added, and the number of rows in which the sum equals 2 is counted. Finding the frequent 2-itemsets is the step that consumes the most computation time. Finally, the count for that pair of items is checked: if the count is >= %Minimum Support, the item pair is frequent and is stored in a new Matrix E. The process continues until all pairs have been checked.

FREQUENT 2-ITEMSETS MATRIX
MATRIX E

Frequency_2 Itemsets Matrix =

Once all the column pairs have been processed, the frequent 2-itemsets are obtained. If Matrix E is empty the process stops; otherwise it proceeds to find the frequent 3-itemsets.

Frequent 3-Itemsets

The item numbers present in Matrix E are the frequent 2-itemsets, say (2, 3) and (2, 4). Since the first item number of the first pair is the same as the first item number of the second pair, the potential frequent 3-itemset is (2, 3, 4). Once two such matching pairs are found, the three columns are taken from the transaction database and added, and the sum of each row is compared with 3; if it equals 3, the count is incremented by one. When all transactions have been processed, the count is checked: if count >= %Minimum Support, the itemset is a frequent 3-itemset. The process continues with the same computation steps as for the frequent 2-itemsets, until no frequent itemsets remain.

IV. Experimental Results

The results presented in this section were obtained by running the proposed Apriori algorithm as a MATLAB script; the software used was MATLAB Version 6 Release 12, Sept. They are compared with results obtained from the traditional Apriori algorithm implemented in Java; the software used was JCreator version 2.5, build

Figure 4: MATLAB script for the new Apriori algorithm

Figure 5: Steps for frequency n-item computation
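The level-wise steps of Section III can be sketched in Python as below. This is not the authors' MATLAB script, and all function names are hypothetical. The join for itemsets larger than pairs assumes the standard Apriori rule (two k-itemsets are joined when their first k-1 items agree), which generalizes the "same first item number" test that the text gives for pairs.

```python
from itertools import combinations


def is_frequent(count, n_rows, min_support_pct):
    """True when count / n_rows reaches the minimum support percentage."""
    return count * 100 >= min_support_pct * n_rows


def prune_columns(matrix, labels, min_support_pct):
    """Keep only the columns whose sum meets the minimum support (Matrix Z)."""
    n_rows = len(matrix)
    sums = [sum(column) for column in zip(*matrix)]
    keep = [j for j, s in enumerate(sums)
            if is_frequent(s, n_rows, min_support_pct)]
    pruned = [[row[j] for j in keep] for row in matrix]
    return pruned, [labels[j] for j in keep]


def count_rows_with_all(matrix, columns):
    """Add the chosen columns row by row; count rows whose sum is len(columns)."""
    k = len(columns)
    return sum(1 for row in matrix if sum(row[j] for j in columns) == k)


def frequent_itemsets(matrix, min_support_pct):
    """Return [level 1, level 2, ...]; each level lists column-index tuples."""
    if not matrix:
        return []
    n_rows = len(matrix)
    current = [
        (j,) for j in range(len(matrix[0]))
        if is_frequent(count_rows_with_all(matrix, (j,)), n_rows, min_support_pct)
    ]
    levels = []
    while current:  # stop as soon as a level comes out empty
        levels.append(current)
        # Join step: merge two itemsets that share all but their last item.
        candidates = [
            a + (b[-1],)
            for a, b in combinations(current, 2)
            if a[:-1] == b[:-1]
        ]
        current = [
            c for c in candidates
            if is_frequent(count_rows_with_all(matrix, c), n_rows, min_support_pct)
        ]
    return levels


# Matrix A of the five-transaction example in Section III.
items = ["I1", "I2", "I3", "I4", "I5", "I6"]
matrix_a = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 1, 0],
]

matrix_z, frequent_items = prune_columns(matrix_a, items, 50)
levels = frequent_itemsets(matrix_z, 50)
named_levels = [
    [tuple(frequent_items[j] for j in itemset) for itemset in level]
    for level in levels
]
```

On this example the search ends after the pairs, because the two frequent pairs do not share a first item and so yield no 3-itemset candidate.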

Table 1 shows the results of the new approach compared with the traditional Apriori algorithm. Number of transactions = 10,000; number of items = 16.

Figure 6: Traditional Apriori result

TABLE 1: NEW APRIORI ALGORITHM EXPERIMENTAL RESULTS

%Minimum Support    New Apriori Using Matrix (time in msec)    Traditional Apriori Algorithm (time in msec)

Figure 7: Performance evaluation

The performance evaluation shows that the new algorithm takes substantially less time than the traditional one.

V. Conclusion

In this paper, a study was carried out to improve the performance of the Apriori algorithm, and a novel approach was explained that uses MATLAB tools to create a Matrix file. The results show that the main transaction database matrix is reduced after the first scan by creating a new matrix that contains only the frequent itemsets. A comparative study of the traditional Apriori algorithm and the new approach found the proposed matrix-based algorithm to be faster. We conclude that reducing redundant scanning of the database gives a noticeable improvement in speed as well as in memory space.

References

[1] Sanober Shaikh, Madhuri Rao and S. S. Mantha, David Bracewell, A new association rule mining based on frequent item set, CS & IT, Vol. 03, pp.
[2] Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases, Proc. ACM on Management of Data, Washington, D.C., pp., May
[3] Sheila A. Abaya, Association Rule Mining based on Apriori Algorithm in Minimizing Candidate Generation, International Journal of Scientific & Engineering Research, Volume 3, Issue 7, July 2012
[4] Sunil Kumar S, Improved Aprori Algorithm Based on bottom up approach using Probability and Matrix, International Journal of Computer Science Issues, Vol. 9, Issue 2, No 3, March 2012, pp.
[5] Mamta Dhanda, Sonali Guglani, Gaurav Gupta, Mining Efficient Association Rules Through Apriori Algorithm Using Attributes, IJCS, Vol. 2, Issue 3, September 2011
[6] Mamta Dhanda, An approach to extract efficient frequent patterns from transactional database, International Journal of Engineering Science and Technology, Vol. 3, No. 7, July 2011, pp.
[7] Libing Wu, Kui Gong, Fuliang Guo, Xiaohua Ge, Yilei Shan, Research on Improving Apriori Algorithm Based on Interested Table, IEEE Conf., pp., July