New Matrix Approach to Improve Apriori Algorithm




Rehab H. Alwa, Associate Prof., IT Faculty, Majan College-University College, Muscat, Oman, rehab.alwan@majancolleg.edu.om
Anasuya V Patil, Associate Lecturer, EC Department, Waljat College of Applied Sciences, Muscat, Oman, anasuya.patil@gmail.com

Abstract

This paper proposes a novel approach to improving the Apriori algorithm by creating a Matrix file in MATLAB in which the database transactions are saved, so that repeated scanning of the database is avoided. Particular rows and columns are extracted and operated on rather than scanning the entire database, and the results can be easily visualized and interpreted in graphical form. The approach performed considerably better than the traditional Apriori algorithm because columns whose item count is less than the minimum support are pruned. The size of the Matrix is thereby greatly reduced, which saves time and gives a noticeable improvement in speed by eliminating redundant scans of the database.

Keywords: Apriori Algorithm, Association Rule, MATLAB, Matrix

I. Introduction

Data Mining, or Knowledge Discovery in Databases (KDD), is the process of discovering knowledge from huge amounts of data. The rapid growth of electronic information has led to large databases, data warehouses and other information repositories. Enterprises accumulate large amounts of data every day, and potentially useful business information is hidden in these databases. It is therefore necessary to discover knowledge from them that might assist in decision making. Data mining uses various techniques to discover knowledge. The most popular is association rule mining, of which Market Basket Analysis is a typical and widely used example [1].

The main intention is to determine correlations among a large set of items in a database. The Apriori algorithm is the first algorithm proposed for extracting association rules from large databases [2]. It consists of two procedures: first, finding the frequent itemsets in the database using a minimum support; second, constructing association rules from the frequent itemsets with a specified confidence. The limitations of the algorithm are that it generates a large number of candidate itemsets and scans the database on every pass. In other words, if the database contains a huge number of transactions, scanning it to find the frequent itemsets is costly in time [1]. These limitations give researchers the opportunity to find more efficient algorithms that minimize the time and the number of database scans needed for knowledge discovery.

II. Related Works

Sheila A. Abaya [3] proposed a modified algorithm that introduces factors such as the set size and the set size frequency, which are used to eliminate non-significant candidate keys. With these factors, the modified algorithm minimizes candidate keys more efficiently and generates possible associations of frequent items more rapidly. The modified Apriori also requires fewer database passes than the original, which makes its execution faster. Further research is under way on faster pruning of candidate keys by finding the ideal starting combination size.

Another approach to improving the performance of the Apriori algorithm was introduced by Sunil Kumar et al. [4]. It is a bottom-up approach that uses probability and a matrix to identify frequent itemsets: the probability of each item's occurrence relative to the total number of transactions is used, together with the bottom-up approach, to find the frequent itemsets from the largest frequent itemset to the smallest. A significant reduction in computation time was achieved [4].

The approach of Ms. Sanober Shaikh et al. [1] scans the database only once, at the start, and then builds an undirected itemsets graph. The frequent itemsets are found from this graph using the minimum support, and the association rules are generated using the minimum confidence without generating candidate items. Execution efficiency improved distinctly compared with the traditional algorithm.

Mamta Dhanda, Sonali Guglani, et al. [5] used attributes such as profit and quantity to improve the Apriori algorithm's efficiency. These attributes give valuable information to the customer as well as to the business, and the approach extracts novel, interesting association patterns with emphasis on significance, quantity, profit and confidence [5], [6].

Libing Wu, Kui Gong, et al. [7] suggested a new algorithm based on an interested table of interested items: it constructs an ordered interested table and traverses it to find frequent itemsets quickly.

Based on a study of the limitations of the Apriori algorithm and of the different approaches taken to improve it, this paper proposes a new approach to improve the functioning of the algorithm, explained in the following sections.

III. Suggested Apriori Algorithm using Matrix

Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence. Association rule generation is usually split into two separate steps: first, the minimum support is applied to find all frequent itemsets in a database; second, these frequent itemsets and the minimum confidence constraint are used to form rules.
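For reference, the two thresholds can be stated with the standard definitions below. The symbols D (the set of transactions), t (a single transaction) and X, Y (disjoint itemsets) are notation introduced here for illustration and do not appear in the paper.

```latex
\mathrm{supp}(X) = \frac{\left|\{\, t \in D : X \subseteq t \,\}\right|}{|D|},
\qquad
\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}
```

Under these definitions an itemset X is frequent when supp(X) is at least the minimum support, and a rule X => Y is kept when conf(X => Y) is at least the minimum confidence.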
The general structure of the new approach is shown in Figure 1.

Figure 1 General Structure of New Apriori Algorithm

As Figure 1 shows, the suggested approach consists of two parts. The first part finds the frequent itemsets in the database, in two steps:
1. Find the total number of times each itemset occurs.
2. Select from these itemsets those whose count is greater than or equal to the %Minimum Support.

Figure 2 Generation of Frequent 1-Itemsets

The second part prunes the columns of the Matrix whose frequency count is less than the %Minimum Support and forms a new Matrix from the itemsets that satisfy the minimum support condition. The new Matrix consists of frequent itemsets only, so its size is greatly reduced.

Figure 3 New Matrix Generation

The new Matrix approach enhances the Apriori algorithm by reducing computation time and memory space. The detailed steps are as follows.

Frequent 1-Itemsets
1. Matrix A contains the transaction database: each column represents an item number and each row represents a customer transaction. An entry is 1 if the customer purchased the item and 0 otherwise.
   a. The frequency of every item is found; these items are the candidates for frequent itemsets.
   b. Matrix B contains the sums of the individual columns, that is, the frequency of each item. The frequencies are obtained without scanning the database again because the matrix already exists.
   c. From Matrix B, the frequencies greater than or equal to the %Minimum Support are selected, and the columns that are not frequent are pruned.
   d. A new Matrix C, the Frequent 1-Itemsets Matrix, is generated. At the same time, the item numbers of the frequent itemsets are stored in another Matrix D.

Consider the following example, with %Minimum Support = 50%:

Transaction ID   Items
T1               I1, I2, I3, I4
T2               I1, I2, I4
T3               I1, I5, I6
T4               I1, I4, I5
T5               I2, I4, I5
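Steps a to d above can be sketched in Python for the five example transactions. This is a minimal illustration, not the authors' MATLAB script: the variable names A, B, C, D and Z follow the Matrix names used in the text, while `transactions`, `num_items` and `min_count` are names introduced here, and the minimum support is assumed to be an integer percentage.

```python
# Example transactions from the text (item numbers 1..6 stand for I1..I6).
transactions = [
    [1, 2, 3, 4],  # T1
    [1, 2, 4],     # T2
    [1, 5, 6],     # T3
    [1, 4, 5],     # T4
    [2, 4, 5],     # T5
]
num_items = 6
min_support_pct = 50

# Matrix A: one row per transaction, one column per item (1 = purchased).
A = [[1 if item in t else 0 for item in range(1, num_items + 1)]
     for t in transactions]

# Matrix B: column sums, i.e. the frequency of each item.
B = [sum(col) for col in zip(*A)]

# Smallest whole number of transactions that meets the percentage
# (integer ceiling of min_support_pct / 100 * number of rows).
min_count = (min_support_pct * len(A) + 99) // 100

# Matrix D: item numbers of the frequent items; Matrix C: their frequencies.
D = [j + 1 for j, freq in enumerate(B) if freq >= min_count]
C = [B[j - 1] for j in D]

# Matrix Z: Matrix A restricted to the frequent columns.
Z = [[row[j - 1] for j in D] for row in A]

print("B =", B)                  # B = [4, 3, 1, 4, 3, 1]
print("min_count =", min_count)  # min_count = 3
print("C =", C, "D =", D)        # C = [4, 3, 4, 3] D = [1, 2, 4, 5]
```

Because Matrix A is built once, every later count is a column operation on data already in memory, which is the point of step b.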

TRANSACTION DATABASE MATRIX A

Transaction   I1  I2  I3  I4  I5  I6
T1             1   1   1   1   0   0
T2             1   1   0   1   0   0
T3             1   0   0   0   1   1
T4             1   0   0   1   1   0
T5             0   1   0   1   1   0

MATRIX B

Frequency      4   3   1   4   3   1

With %Minimum Support = 50%, select from Matrix B the frequencies greater than or equal to (50/100)*5 = 2.5; that is, an item must occur in at least 3 transactions. Items I3 and I6 are not frequent.

FREQUENCY 1-ITEMSETS MATRIX: MATRIX C

Frequency      4   3   4   3

FREQUENT 1-ITEMSETS MATRIX: MATRIX D

Frequent      I1  I2  I4  I5

2. The item numbers in Matrix D are the frequent items. From transaction database Matrix A, only the columns specified in Matrix D are selected to create a new Matrix Z, which contains only frequent items. Matrix Z is therefore smaller than transaction database Matrix A.

Frequent 2-Itemsets
1. Matrix Z is the new transaction database Matrix used to find the frequent 2-itemsets.

MATRIX Z

Transaction   I1  I2  I4  I5
T1             1   1   1   0
T2             1   1   1   0
T3             1   0   0   1
T4             1   0   1   1
T5             0   1   1   1

The major advantage of using MATLAB is the availability of built-in functions that save considerable time and memory space. A built-in function is used to generate the potential frequent 2-item pairs from Matrix Z, i.e. (I1, I2), (I1, I4), (I1, I5), (I2, I4), (I2, I5) and (I4, I5). For each pair, the two columns are added and the number of times 2 appears in the sum is counted. Finding the frequent 2-itemsets is the step that consumes the most computation time. Finally, the count for the pair is checked: if the count is >= the %Minimum Support threshold, the item pair is frequent and is stored in a new Matrix E. The process continues until all pairs have been checked.

FREQUENT 2-ITEMSETS MATRIX: MATRIX E

1  4
2  4

Frequency_2 Itemsets Matrix = 3  3

2. Once all column pairs have been processed, the frequent 2-itemsets are obtained. If Matrix E is empty, the process stops; otherwise it proceeds to find the frequent 3-itemsets.

Frequent 3-Itemsets
The item numbers in Matrix E are the frequent 2-itemsets. Suppose, for illustration, that Matrix E contains (2, 3) and (2, 4). The first item number of the first pair is the same as the first item number of the second pair, so the potential frequent 3-itemset is (2, 3, 4). Once two such matching pairs are found, the three columns are taken from the transaction database and added. The sum of each row is compared with 3, and if it equals 3 the count is incremented by one. When all transactions have been processed, the count is checked: if count >= %Minimum Support, the 3-itemset is frequent. The process continues with the same computation steps as for frequent 2-itemsets, until no frequent itemsets remain.

IV. Experimental Results

The results presented in this section were obtained by running the proposed Apriori algorithm as a MATLAB script, using MATLAB Version 6 Release 12, Sept 2000. They are compared with results obtained from the traditional Apriori algorithm implemented in Java, using JCreator version 2.5, build 9.

Figure 4 MATLAB Script for the new Apriori Algorithm

Figure 5 Steps for Frequency n item computation
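The level-wise procedure described in Section III can also be sketched in Python. This is an illustrative sketch under stated assumptions, not the authors' code. The function name `matrix_apriori` and its helper `count` are hypothetical, the minimum support is assumed to be an integer percentage, and candidates are formed with the standard Apriori join (two k-itemsets that share their first k-1 items), which generalizes the "same first item number" rule given for 3-itemsets.

```python
from itertools import combinations


def matrix_apriori(A, min_support_pct):
    """Return {itemset: count} for every frequent itemset in the 0/1 matrix A.

    Itemsets are tuples of 1-based column (item) numbers.
    """
    n_rows = len(A)
    # Integer ceiling of min_support_pct / 100 * n_rows.
    min_count = (min_support_pct * n_rows + 99) // 100

    def count(itemset):
        # Add the selected columns row by row; a row supports the itemset
        # when the sum equals the number of items in the itemset.
        k = len(itemset)
        return sum(1 for row in A if sum(row[j - 1] for j in itemset) == k)

    # Frequent 1-itemsets come straight from the column sums.
    col_sums = [sum(col) for col in zip(*A)]
    current = {(j + 1,): s for j, s in enumerate(col_sums) if s >= min_count}
    frequent = dict(current)

    while current:
        keys = sorted(current)
        # Join two k-itemsets that agree on everything except the last item.
        candidates = [a + (b[-1],) for a, b in combinations(keys, 2)
                      if a[:-1] == b[:-1]]
        current = {}
        for cand in candidates:
            c = count(cand)
            if c >= min_count:
                current[cand] = c
        frequent.update(current)
    return frequent


# Matrix A from the worked example in Section III.
A = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 1, 0],
]
result = matrix_apriori(A, 50)
print(result)
# {(1,): 4, (2,): 3, (4,): 4, (5,): 3, (1, 4): 3, (2, 4): 3}
```

For the worked example the sketch returns the four frequent items and the pairs (1, 4) and (2, 4), each with frequency 3, and stops because those two pairs do not share a first item.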

Table 1 shows the results of the new approach compared with the traditional Apriori algorithm. Number of transactions = 10,000; number of items = 16.

Figure 6 Traditional Apriori result

TABLE 1 NEW APRIORI ALGORITHM EXPERIMENTAL RESULTS

%Minimum Support   New Apriori Using Matrix   Traditional Apriori Algorithm
                   (time in msec)             (time in msec)
20                 120                        912
25                 100                        792
30                 100                        741
35                  90                        691
40                  90                        530
45                  80                        420
50                  80                        390
55                  70                        300
60                  70                        300
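The relative speed-up implied by Table 1 can be computed directly from the reported timings. The snippet below is a small illustration only; the names `table1` and `speedups` are introduced here and the figures are copied from the table.

```python
# (minimum support %, matrix approach msec, traditional Apriori msec) from Table 1
table1 = [
    (20, 120, 912), (25, 100, 792), (30, 100, 741),
    (35, 90, 691), (40, 90, 530), (45, 80, 420),
    (50, 80, 390), (55, 70, 300), (60, 70, 300),
]

# Ratio of traditional time to matrix-approach time at each support level.
speedups = {pct: round(old / new, 2) for pct, new, old in table1}

for pct, ratio in speedups.items():
    print(f"{pct}%: {ratio}x")
```

On the reported timings the ratio lies between roughly 4.3 and 7.9, and it is larger at the lower minimum support levels.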

Figure 7 Performance evaluation

The performance evaluation shows that the new algorithm takes substantially less time than the traditional one at every minimum support level tested, for example 120 msec against 912 msec at 20% minimum support and 70 msec against 300 msec at 60% (Table 1).

V. Conclusion

In this paper, a study was carried out to improve the performance of the Apriori algorithm, and a novel approach was explained that uses MATLAB tools to create a Matrix file. The results showed that the main transaction database matrix is reduced after the first scan by creating a new matrix that contains only the frequent itemsets. A comparative study of the traditional Apriori algorithm and the new approach found that the proposed algorithm using the Matrix is faster. We therefore conclude that there is a noticeable improvement in speed, as well as in memory space, from reducing the redundant scanning of the database.

References

[1] Ms. Sanober Shaikh, Ms. Madhuri Rao and Dr. S. S. Mantha, David Bracewell, A new association rule mining based on frequent item set, CS & IT Vol. 03, pp. 81-95, 2011.
[2] Agrawal R, Imielinski T, Swami A, Mining association rules between sets of items in large databases, Proc. ACM on Management of Data, Washington, D.C., pp. 207-216, May 1993.
[3] Sheila A. Abaya, Association Rule Mining based on Apriori Algorithm in Minimizing Candidate Generation, International Journal of Scientific & Engineering Research, Volume 3, Issue 7, July 2012.
[4] Sunil Kumar S, Improved Aprori Algorithm Based on bottom up approach using Probability and Matrix, International Journal of Computer Science Issues, Vol. 9, Issue 2, No 3, pp. 242-246, March 2012.
[5] Mamta Dhanda, Sonali Guglani, Gaurav Gupta, Mining Efficient Association Rules Through Apriori Algorithm Using Attributes, IJCS Vol. 2, Issue 3, September 2011.
[6] Mamta Dhanda, An approach to extract efficient frequent patterns from transactional database, International Journal of Engineering Science and Technology, Vol. 3, no. 7, pp. 5652-5658, July 2011.
[7] Libing Wu, Kui Gong, Fuliang Guo, Xiaohua Ge, Yilei Shan, Research on Improving Apriori Algorithm Based on Interested Table, IEEE Conf., pp. 422-426, July 2010.