Project Report
In this report, we briefly introduce the application scenario of association rule mining, give the details of our apriori algorithm implementation, and comment on the mined rules. Instructions for using the program are also given.

1. Application Scenario

Association rule mining finds interesting association relationships among a large set of data items. With massive amounts of data continuously being collected and stored in databases, many industries are becoming interested in mining association rules from their databases. [1]

1.1 Market Basket Analysis

A typical example of association rule mining is market basket analysis. This process analyzes customer buying habits by finding associations among the different items that customers place in their shopping baskets. The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together. For instance, market basket analysis may help managers optimize store layouts: if customers who purchase milk also tend to buy bread at the same time, then placing the milk close to (or opposite) the bread may help to increase the sales of both items. [1]

1.2 Database

The program apriori is designed to generate strong association rules from a Boolean-valued database that looks like this:

Item1 Item2 Item3 Item4 Item5
y y y n n
y n y y y

That is, the first line consists of all the different item names, and each remaining line is a Boolean-valued vector in which y indicates that the corresponding item appears in that transaction and n indicates that it does not. However, if the database looks like this:

Item1 Item2 Item3
Item1 Item3 Item4 Item5

we should first use the pre-processing program convert to turn it into a Boolean-valued file. The program convert works as follows:

First, it finds all the different items by scanning the whole database and saves them as an item-name line in the first line of a new text file newdata.txt. Next, for each line of the source file, it builds a Boolean-valued vector consisting of y and n entries, depending on whether each item of the item-name line appears in that transaction. Thus every vector has the same length, namely the total number of different items. Each vector is then saved into newdata.txt.

In this project, we use a supermarket transaction database transaction.txt to mine association rules. This database comes from the software package CBA2.0 of the National University of Singapore. [2] It looks like this:

newspaper, cd, battery, sweets, soya_sauce, rice
rice, sugar, tomato_sauce, apple, pamper, pacifier

First we use the pre-processing program convert to turn transaction.txt into the Boolean-valued file supmart.txt. Then we run the program apriori on supmart.txt to get all the association rules we might be interested in.

To test the robustness of our program, we also use a much larger database votes.txt (Congressional Voting Records of the United States in 1984, from the UCI Machine Learning Repository). In this database all attributes are already Boolean-valued. We delete the first column because the file was originally intended for classifying Republicans and Democrats. We then run the program apriori on votes.txt to get association rules about the voting records. [3]

2. Implementation of Algorithms

In this project, we use a number of C functions to implement the apriori algorithm and generate association rules.

2.1 Data Structure

The key point of the implementation is setting up good data structures to represent each itemset and store all the frequent itemsets.

First, we use struct MATRIX to store the number of different items, the number of transaction records, all the different item names, and all the Boolean values of the database. The size of the data matrix is determined dynamically.
Second, to represent a particular itemset, we use struct VECTOR, which holds the itemset's frequency and an itemset vector whose length equals the number of different items.

Third, to link all the frequent k-itemsets into a list, we use struct ITEMSETS, which contains a struct VECTOR and a pointer to the next frequent k-itemset in the list. By referring to the head pointer Lk of the list of frequent k-itemsets, we can perform all the necessary operations. For the other supplementary data structures, please see the source code for details.

2.2 Algorithm

Market basket analysis can be divided into two sub-problems:

1. Find all frequent itemsets whose support is above the minimum support threshold.
2. Generate, from the frequent itemsets, the strong association rules that satisfy the minimum confidence threshold. [1]

2.2.1 Data Processing

First, we use the function file_size to scan the database and determine the number of different items and the number of transaction records. Second, we use the function init_struct to initialize the data matrix, all the head pointers L1 to Lk, and some other supplementary data structures. Third, we use the function read_data to store all the different item names and Boolean values into the data matrix.

2.2.2 Apriori Algorithm

Apriori is an influential algorithm for finding frequent itemsets. The first pass of the algorithm simply uses the function getl1 to count item occurrences and determine the frequent 1-itemsets. Each subsequent pass, say pass k, consists of two phases. First, the frequent itemsets Lk-1 found in the (k-1)th pass are used to generate the candidate itemsets Ck with the function getck, described below. Next, the data matrix is scanned and the support of each candidate in Ck is counted. For fast counting, we use the function be_subset to determine efficiently whether a candidate in Ck is contained in a given transaction. [4]

There are two steps in the function getck. First, in the join step, we join Lk-1 with Lk-1 to generate potential candidates. Next, in the prune step, we use the function infqn_subset to remove all candidates that have a subset that is not frequent.
The pruning is based on the apriori property that all non-empty subsets of a frequent itemset must be frequent as well. [1] The function getck therefore returns a superset of the set of all frequent k-itemsets. We also use the function display_itemsets to save all the frequent itemsets into a new text file, itemsets.txt.
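As an illustration, the join and prune phases of getck can be sketched over the 0/1 vector representation as follows. This is a minimal sketch, not the project's actual code: NITEMS, MAXSETS, gen_candidates and has_infrequent_subset are hypothetical names, and is_subset only mirrors the idea behind be_subset and infqn_subset.

```c
/* Sketch of the getck-style join and prune steps over 0/1 item vectors.
 * All identifiers are illustrative, not the project's real ones. */
#include <string.h>

#define NITEMS  5    /* illustrative number of distinct items */
#define MAXSETS 64   /* illustrative capacity of a candidate list */

/* every item of a also appears in b (the be_subset idea) */
int is_subset(const char *a, const char *b)
{
    for (int i = 0; i < NITEMS; i++)
        if (a[i] && !b[i])
            return 0;
    return 1;
}

static int set_size(const char *v)
{
    int n = 0;
    for (int i = 0; i < NITEMS; i++)
        n += (v[i] != 0);
    return n;
}

/* prune step: keep a k-candidate only if every (k-1)-subset is in Lk-1 */
int has_infrequent_subset(const char *cand, char lk1[][NITEMS], int nlk1)
{
    char sub[NITEMS];
    for (int i = 0; i < NITEMS; i++) {
        if (!cand[i])
            continue;
        memcpy(sub, cand, NITEMS);
        sub[i] = 0;                       /* drop one item: a (k-1)-subset */
        int found = 0;
        for (int j = 0; j < nlk1; j++)
            if (memcmp(sub, lk1[j], NITEMS) == 0) { found = 1; break; }
        if (!found)
            return 1;                     /* apriori property violated */
    }
    return 0;
}

/* join step: two frequent (k-1)-itemsets whose union has exactly k items
 * yield one k-candidate; duplicates and prunable candidates are skipped */
int gen_candidates(char lk1[][NITEMS], int nlk1, int k, char ck[][NITEMS])
{
    int nck = 0;
    for (int i = 0; i < nlk1; i++)
        for (int j = i + 1; j < nlk1; j++) {
            char u[NITEMS];
            for (int t = 0; t < NITEMS; t++)
                u[t] = lk1[i][t] || lk1[j][t];
            if (set_size(u) != k)
                continue;
            int dup = 0;
            for (int t = 0; t < nck; t++)
                if (memcmp(u, ck[t], NITEMS) == 0) { dup = 1; break; }
            if (!dup && !has_infrequent_subset(u, lk1, nlk1))
                memcpy(ck[nck++], u, NITEMS);
        }
    return nck;
}
```

Because every candidate with an infrequent (k-1)-subset is dropped at generation time, the surviving candidates form exactly the superset of the frequent k-itemsets described above, and only these need to be counted against the data matrix.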

2.2.3 Generate Strong Association Rules

Once all the frequent itemsets have been found, it is straightforward to generate strong association rules from them as follows: for each frequent itemset l in Lk (k >= 2), generate all non-empty subsets of l; for every non-empty subset s of l, output the rule s ==> (l-s) if support(l)/support(s) >= min_conf. [1]

In the function get_rules, we modify this algorithm to prune the search space further using the apriori knowledge: since every subset of l must be a frequent 1- to (k-1)-itemset, we only need to visit each of the frequent 1- to (k-1)-itemset lists and, for each itemset in a list, check whether it is a subset of l (which is easy with the vector representation). If so, and support(l)/support(s) >= min_conf, we output the rule s ==> (l-s). It is also very easy to compute l-s when both l and s are represented as vectors of 1s and 0s. All the generated rules are saved into a new text file rules.txt by the function display_rule.

2.2.4 Free Memory

After all the strong association rules have been generated, we use the function free_struct to free all the memory dynamically allocated for the frequent itemset lists and the data matrix.

3. Comments and Discussion

In our project, if we set the minimum support to 0.3 and the minimum confidence to 0.5, then 18 frequent itemsets and 17 strong association rules are generated from supmart.txt. One strong association rule looks like this:

cd ==> soya_sauce (Support: 39.06%, Confidence: 66.67%)

This rule means that 39.06% of all the transaction records contain both cd and soya_sauce, and that 66.67% of the customers who purchased cd also bought soya_sauce. It is great fun to find many interesting patterns like this.

If we apply the same minimum support and confidence thresholds to the second database votes.txt (Congressional Voting Records of the United States in 1984), we get 91 frequent itemsets and 354 strong association rules.
One strong association rule looks like this:

education-spending ==> crime (Support: 36.32%, Confidence: 92.40%)

This rule means that 36.32% of all the voters supported both the education-spending policy and the crime policy, and that 92.40% of the voters who supported the education-spending policy also supported the crime policy.
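The pruned rule-generation step of 2.2.3 can be sketched as follows. This is a minimal illustration assuming the 0/1 vector representation; struct vec, contains and gen_rules are hypothetical names standing in for struct VECTOR and the project's actual get_rules machinery, and the item names are made up for the example.

```c
/* Sketch of pruned rule generation over 0/1 item vectors.
 * Identifiers are illustrative, not the project's real ones. */
#include <stdio.h>

#define NITEMS 5

struct vec {
    char items[NITEMS];   /* 0/1 membership vector, as in struct VECTOR */
    int  freq;            /* number of transactions containing the set  */
};

/* every item of s also appears in l */
static int contains(const struct vec *s, const struct vec *l)
{
    for (int i = 0; i < NITEMS; i++)
        if (s->items[i] && !l->items[i])
            return 0;
    return 1;
}

/* Emit every rule s ==> (l - s) whose confidence freq(l)/freq(s)
 * reaches min_conf; freqsets holds the frequent 1- to (k-1)-itemsets,
 * so any subset s found here is automatically a proper subset of l. */
int gen_rules(const struct vec *l, const struct vec *freqsets, int n,
              double min_conf, const char *names[])
{
    int nrules = 0;
    for (int i = 0; i < n; i++) {
        const struct vec *s = &freqsets[i];
        if (!contains(s, l))
            continue;
        double conf = (double)l->freq / (double)s->freq;
        if (conf < min_conf)
            continue;
        nrules++;
        for (int t = 0; t < NITEMS; t++)
            if (s->items[t])
                printf("%s ", names[t]);
        printf("==> ");
        for (int t = 0; t < NITEMS; t++)      /* l - s, by the vectors */
            if (l->items[t] && !s->items[t])
                printf("%s ", names[t]);
        printf("(Confidence: %.2f%%)\n", conf * 100.0);
    }
    return nrules;
}
```

Because only frequent itemsets are ever visited as candidate antecedents, the expensive enumeration of all non-empty subsets of l is avoided, and l-s falls out of a single pass over the two vectors.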

If we set a lower minimum support and confidence, many more frequent itemsets and strong association rules may be generated, and the run time will be somewhat longer. In other words, raising the minimum support and confidence has the secondary effect of reducing computation time, which may be desirable for large data sets. [4]

4. Instructions to Use the Tool

4.1 Installation

Type: gcc convert.c -o convert.exe to build the pre-processing program convert.exe. Type: gcc apriori.c -o apriori to build the executable program apriori. Please see the README file in the proj directory for details.

4.2 How to Use It

If the database is not Boolean-valued, first use the pre-processing program convert.exe to turn it into a Boolean-valued one. Then just type: apriori. At the prompt, enter the name of the data file (supmart.txt or votes.txt). Then input the minimum support (e.g. 0.3, not 30%) and the minimum confidence (e.g. 0.5, not 50%). The program will analyze the file and display all the frequent itemsets and association rules on the screen. You can then check the two result files itemsets.txt and rules.txt for details.

References:

[1] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2000, ISBN 1-55860-489-8.
[2] http://www.comp.nus.edu.sg/~dm2/, Data Mining: Interestingness and Interaction. Accessed: Dec. 5, 2000.
[3] ftp://ftp.ics.uci.edu/pub/machine-learning-databases/voting-records, UCI Machine Learning Repository. Accessed: Dec. 5, 2000.
[4] http://siva.bpa.arizona.edu/data_mining/data_mining.htm, Introduction to Data Mining. Accessed: Dec. 5, 2000.