Data Mining Apriori Algorithm



Similar documents
Data Mining: Partially from: Introduction to Data Mining by Tan, Steinbach, Kumar

Data Mining Association Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 6. Introduction to Data Mining

Association Analysis: Basic Concepts and Algorithms

Mining Association Rules. Mining Association Rules. What Is Association Rule Mining? What Is Association Rule Mining? What is Association rule mining

Market Basket Analysis and Mining Association Rules

Unique column combinations

Association Rule Mining

Data Mining: Foundation, Techniques and Applications

MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM

Finding Frequent Itemsets using Apriori Algorihm to Detect Intrusions in Large Dataset

Data Mining2 Advanced Aspects and Applications

MAXIMAL FREQUENT ITEMSET GENERATION USING SEGMENTATION APPROACH

Finding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm

Mining Association Rules to Evade Network Intrusion in Network Audit Data

Frequent item set mining

Laboratory Module 8 Mining Frequent Itemsets Apriori Algorithm

Project Report. 1. Application Scenario

DEVELOPMENT OF HASH TABLE BASED WEB-READY DATA MINING ENGINE

Boolean Algebra (cont d) UNIT 3 BOOLEAN ALGEBRA (CONT D) Guidelines for Multiplying Out and Factoring. Objectives. Iris Hui-Ru Jiang Spring 2010

Unit 3 Boolean Algebra (Continued)

Knowledge Discovery and Data Mining

Benchmark Databases for Testing Big-Data Analytics In Cloud Environments

A Survey on Association Rule Mining in Market Basket Analysis

Lecture Notes on Database Normalization

Building A Smart Academic Advising System Using Association Rule Mining

Fuzzy Association Rules

Foundations of Artificial Intelligence. Introduction to Data Mining

An Efficient Market Basket Analysis based on Adaptive Association Rule Mining with Faster Rule Generation Algorithm

Chapter 4 Data Mining A Short Introduction. 2006/7, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Data Mining - 1

Association Rule Mining: A Survey

Big Data Analysis Technology

How to bet using different NairaBet Bet Combinations (Combo)

Mining an Online Auctions Data Warehouse

Implementing Improved Algorithm Over APRIORI Data Mining Association Rule Algorithm

Fuzzy Logic -based Pre-processing for Fuzzy Association Rule Mining

Discovering Data Quality Rules

Die Welt Multimedia-Reichweite

Static Data Mining Algorithm with Progressive Approach for Mining Knowledge

Data Mining: An Overview. David Madigan

PREDICTIVE MODELING OF INTER-TRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE

Market Basket Analysis for a Supermarket based on Frequent Itemset Mining

Distributed Data Mining Algorithm Parallelization

Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach

An Efficient Frequent Item Mining using Various Hybrid Data Mining Techniques in Super Market Dataset

Introduction. The Quine-McCluskey Method Handout 5 January 21, CSEE E6861y Prof. Steven Nowick

Enhancement of Security in Distributed Data Mining

Theory behind Normalization & DB Design. Satisfiability: Does an FD hold? Lecture 12

MASTER'S THESIS. Mining Changes in Customer Purchasing Behavior

HOW TO USE MINITAB: DESIGN OF EXPERIMENTS. Noelle M. Richard 08/27/14

RDB-MINER: A SQL-Based Algorithm for Mining True Relational Databases

Data Mining Classification: Decision Trees

Data Mining: Introduction

Chapter 7: Association Rule Mining in Learning Management Systems

Online EFFECTIVE AS OF JANUARY 2013

Analytics on Big Data

United States Naval Academy Electrical and Computer Engineering Department. EC262 Exam 1

Association Rule Mining using Apriori Algorithm for Distributed System: a Survey

Data Mining for Retail Website Design and Enhanced Marketing

Part 2: Community Detection

Chapter 6: Episode discovery process

Distributed Apriori in Hadoop MapReduce Framework

Association Rules Mining: A Recent Overview

Database Design and Normalization

Performance Evaluation of some Online Association Rule Mining Algorithms for sorted and unsorted Data sets

Improving Apriori Algorithm to get better performance with Cloud Computing

Data Mining Applications in Manufacturing

Discovery of Maximal Frequent Item Sets using Subset Creation

Parallel Data Mining Algorithms for Association Rules and Clustering

IJRFM Volume 2, Issue 1 (January 2012) (ISSN )

Selection of Optimal Discount of Retail Assortments with Data Mining Approach


Data Mining Approach in Security Information and Event Management

CH3 Boolean Algebra (cont d)

Analysis of Customer Behavior using Clustering and Association Rules

Mining Online GIS for Crime Rate and Models based on Frequent Pattern Analysis

Data Mining to Recognize Fail Parts in Manufacturing Process

Limitations of E-R Designs. Relational Normalization Theory. Redundancy and Other Problems. Redundancy. Anomalies. Example

New Matrix Approach to Improve Apriori Algorithm

Application of Data Mining Techniques For Diabetic DataSet

Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster

Scoring the Data Using Association Rules

Introduction to Artificial Intelligence G51IAI. An Introduction to Data Mining

Mining Association Rules on Grid Platforms

Data Mining/Knowledge Discovery/ Intelligent Data Analysis. Data Mining. Key Issue: Feedback. Applications

The basic data mining algorithms introduced may be enhanced in a number of ways.

Association Rule Mining: Exercises and Answers

Keywords: Mobility Prediction, Location Prediction, Data Mining etc

Introduction to Data Mining

Using Data Mining Methods to Predict Personally Identifiable Information in s

澳 門 彩 票 有 限 公 司 SLOT Sociedade de Lotarias e Apostas Mútuas de Macau, Lda. Soccer Bet Types

Transcription:

Data Mining: Apriori Algorithm

Topics: Apriori principle; frequent itemset generation; association rule generation. Section 6 of the course book. TNM033: Introduction to Data Mining

Association Rule Mining (ARM)

ARM is not applied only to market basket data; there are algorithms that can find any association rules. Criteria for selecting rules: confidence, and the number of tests on the left/right-hand side of the rule. Example rules:

(Refund = yes) ∧ (Marital Status = single) → (Cheat = no)
(Taxable Income > 100K) ∧ (Refund = yes) → (Cheat = no)

What is the difference between classification and ARM? ARM can be used for obtaining classification rules.

Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single          70K             No
 4   Yes     Married         120K            No
 5   No      Divorced        95K             Yes
 6   No      Married         60K             No
 7   Yes     Divorced        220K            No
 8   No      Single          85K             Yes
 9   No      Married         75K             No
10   No      Single          90K             Yes

Mining Association Rules

TID  Items
 1   Bread, Milk
 2   Bread, Diaper, Beer, Eggs
 3   Milk, Diaper, Beer, Coke
 4   Bread, Milk, Diaper, Beer
 5   Bread, Milk, Diaper, Coke

Example rules:
{Milk, Diaper} → {Beer}    (s = 0.4, c = 0.67)
{Milk, Beer} → {Diaper}    (s = 0.4, c = 1.0)
{Diaper, Beer} → {Milk}    (s = 0.4, c = 0.67)
{Beer} → {Milk, Diaper}    (s = 0.4, c = 0.67)
{Diaper} → {Milk, Beer}    (s = 0.4, c = 0.5)
{Milk} → {Diaper, Beer}    (s = 0.4, c = 0.5)

Observations: all the rules above are binary partitions of the same itemset, {Milk, Diaper, Beer}. Rules originating from the same itemset have identical support but can have different confidence. Thus, we may decouple the support and confidence requirements.

Mining Association Rules: Two-Step Approach
1. Frequent itemset generation: generate all itemsets whose support ≥ minsup.
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.
Frequent itemset generation is computationally expensive.
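The support and confidence values above can be recomputed with a short Python sketch (not part of the original slides; the `support` and `confidence` helpers are illustrative names):

```python
# The five example transactions from the slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    s = set(itemset)
    return sum(s <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """c(lhs -> rhs) = s(lhs ∪ rhs) / s(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"Milk", "Diaper", "Beer"}))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}))  # 0.666...
print(confidence({"Milk", "Beer"}, {"Diaper"}))  # 1.0
```

Note how the three rules built from {Milk, Diaper, Beer} all reuse the same support (0.4) while their confidences differ.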

Frequent Itemset Generation

[Figure: itemset lattice over items A–E, from the null set down to ABCDE.]

Given d items, there are 2^d possible candidate itemsets.

Brute-force approach: each itemset in the lattice is a candidate frequent itemset. Count the support of each candidate by scanning the database, matching each transaction against every candidate (same five-transaction table as above). Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates, and w the maximum transaction width. Expensive, since M = 2^d!

Computational Complexity

Given d unique items, the total number of itemsets is 2^d, and the total number of possible association rules is

R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

If d = 6, R = 602 rules.

Frequent Itemset Generation Strategies

Reduce the number of candidate itemsets (M): complete search has M = 2^d; use pruning techniques to reduce M (used in the Apriori algorithm).
Reduce the number of transactions (N): reduce the size of N as the size of the itemsets increases.
Reduce the number of comparisons (NM): use efficient data structures to store the candidates or transactions, so there is no need to match every candidate against every transaction (used in the Apriori algorithm).
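The closed form can be checked against the double sum by brute force; a small Python verification (not from the slides):

```python
from math import comb

def rule_count(d):
    """Number of association rules over d items: choose a non-empty LHS of
    size k, then a non-empty RHS from the remaining d - k items."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

print(rule_count(6))  # 602
# The double sum matches the closed form 3^d - 2^(d+1) + 1 for every small d.
assert all(rule_count(d) == 3**d - 2**(d + 1) + 1 for d in range(2, 12))
```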

Apriori Algorithm

Proposed by Agrawal R., Imieliński T., and Swami A. N., "Mining Association Rules between Sets of Items in Large Databases", SIGMOD, June 1993. Available in Weka. Other algorithms: Direct Hashing and Pruning (DHP), 1995; FP-Growth, 2000; H-Mine, 2001.

Reducing the Number of Candidate Itemsets

Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent; equivalently, if an itemset is infrequent, then all of its supersets must also be infrequent. The Apriori principle holds due to the following property of the support measure:

∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

The support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.
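The anti-monotone property can be illustrated on the five-transaction example from earlier (a Python sketch, not part of the slides): along a chain X ⊆ Y ⊆ Z, the support count never increases.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(set(itemset) <= t for t in transactions)

# A subset chain: {Milk} ⊆ {Milk, Diaper} ⊆ {Milk, Diaper, Beer}
chain = [{"Milk"}, {"Milk", "Diaper"}, {"Milk", "Diaper", "Beer"}]
counts = [support_count(s) for s in chain]
print(counts)  # [4, 3, 2] -- non-increasing, as the Apriori principle requires
```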

Illustrating the Apriori Principle

[Figure: itemset lattice; once an itemset is found to be infrequent, all of its supersets are pruned.]

Items (1-itemsets), minimum support count = 3:

Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:

Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):

Itemset                Count
{Bread, Milk, Diaper}  2

(From the transaction table, the count is 2, which falls below the minimum support.)

If every subset is considered, C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates; with support-based pruning, only 6 + 6 + 1 = 13. What about {Beer, Diaper, Milk}? And {Bread, Milk, Beer}? And {Bread, Diaper, Beer}? (Each contains an infrequent pair, so they are never generated as candidates.)

Frequent Itemset Generation

[Figure: itemset lattice over items A–E with the frequent itemsets highlighted.]

Apriori Algorithm (level-wise):
1. Let k = 1.
2. Generate frequent itemsets of length 1.
3. Repeat until no new frequent itemsets are identified:
   3.1. Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
   3.2. Prune candidate itemsets containing any length-k subset that is infrequent. (How many k-itemsets are contained in a (k+1)-itemset? Answer: k+1.)
   3.3. Count the support of each candidate by scanning the DB.
   3.4. Eliminate candidates that are infrequent, leaving only those that are frequent.
Note: steps 3.2 and 3.4 prune itemsets that are infrequent.
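The level-wise loop above can be sketched in a few lines of Python (illustrative code, not from the slides; function and variable names are my own):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent itemset mining, following steps 1-3 above."""
    frequent = {}
    # Steps 1-2: start with the 1-itemset candidates.
    level = [(i,) for i in sorted({x for t in transactions for x in t})]
    k = 1
    while level:
        # Step 3.3: count the support of each candidate with one DB scan.
        counts = {c: sum(set(c) <= t for t in transactions) for c in level}
        # Step 3.4: eliminate infrequent candidates.
        current = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(current)
        # Step 3.1: merge frequent k-itemsets that share a (k-1)-prefix.
        keys = sorted(current)
        level = []
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                if keys[i][:-1] == keys[j][:-1]:
                    cand = keys[i] + (keys[j][-1],)
                    # Step 3.2: prune if any k-subset is infrequent.
                    if all(sub in current for sub in combinations(cand, k)):
                        level.append(cand)
        k += 1
    return frequent

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
result = apriori(transactions, minsup=3)
print(result[("Bread", "Milk")])   # 3
print(result[("Beer", "Diaper")])  # 3
```

On the example data with minsup = 3 this yields four frequent items and four frequent pairs; the lone 3-itemset candidate {Bread, Diaper, Milk} is generated but eliminated in step 3.4.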

Generating Itemsets Efficiently

How can we efficiently generate all (frequent) itemsets at each iteration, avoiding the generation of repeated itemsets and of infrequent itemsets? Finding one-item sets is easy. Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, and so on. If (A, B) is a frequent itemset, then (A) and (B) have to be frequent itemsets as well! In general: if X is a frequent k-itemset, then all (k−1)-item subsets of X are also frequent. Compute each k-itemset by merging two (k−1)-itemsets. Which ones? E.g., merge {Bread, Milk} with {Bread, Diaper} to get {Bread, Diaper, Milk}.

Example: generating frequent itemsets. Given five frequent 3-itemsets (A B C), (A B D), (A C D), (A C E), (B C D):
1. Keep the itemsets lexicographically ordered!
2. Merge (x1, x2, …, x_{k−1}) with (y1, y2, …, y_{k−1}) if x1 = y1, x2 = y2, …, x_{k−2} = y_{k−2}.
   Candidate 4-itemsets: (A B C D) is OK because (A B C), (A B D), (A C D), (B C D) are all frequent; (A C D E) is not OK because (C D E) is not frequent.
3. Final check by counting instances in the dataset!
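The merge-and-prune step of this example can be written out as a Python sketch (illustrative, not from the slides):

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Merge two sorted k-itemsets that agree on their first k-1 items,
    then prune any candidate that has an infrequent k-subset."""
    fk = sorted(tuple(sorted(s)) for s in frequent_k)  # lexicographic order
    fset = set(fk)
    candidates = []
    for i in range(len(fk)):
        for j in range(i + 1, len(fk)):
            a, b = fk[i], fk[j]
            if a[:-1] == b[:-1]:                 # x1 = y1, ..., x_{k-1} = y_{k-1}
                cand = a + (b[-1],)
                if all(sub in fset
                       for sub in combinations(cand, len(cand) - 1)):
                    candidates.append(cand)
    return candidates

f3 = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"),
      ("A", "C", "E"), ("B", "C", "D")]
print(generate_candidates(f3))
# (A B C D) survives; (A C D E) is pruned because (C D E) is not frequent
```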

Reducing the Number of Comparisons

Candidate counting: scan the database of transactions to determine the support of each generated candidate itemset. To reduce the number of comparisons, store the candidates in a hash structure: instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets (same five-transaction table as above).

Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement. If {A, B, C, D} is a frequent itemset, the candidate rules are:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB

If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L).
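Enumerating the 2^k − 2 candidate rules of a frequent itemset is straightforward (Python sketch, not from the slides):

```python
from itertools import combinations

def candidate_rules(L):
    """Every rule f -> L - f for a non-empty proper subset f of L."""
    L = sorted(L)
    rules = []
    for r in range(1, len(L)):           # LHS sizes 1 .. k-1
        for lhs in combinations(L, r):
            rhs = tuple(x for x in L if x not in lhs)
            rules.append((lhs, rhs))
    return rules

rules = candidate_rules({"A", "B", "C", "D"})
print(len(rules))  # 14 == 2**4 - 2
```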

Rule Generation

How can we efficiently generate rules from frequent itemsets? In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D). But the confidence of rules generated from the same itemset does have an anti-monotone property. E.g., for L = {A, B, C, D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD). Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.

Rule Generation for the Apriori Algorithm

[Figure: lattice of rules; a low-confidence rule is found and the rules below it are pruned.]

Rule Generation for the Apriori Algorithm

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent: join(CD ⇒ AB, BD ⇒ AC) produces the candidate rule D ⇒ ABC. Prune the rule D ⇒ ABC if it does not have high confidence. The support counts needed have already been obtained during the frequent itemset generation step.
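Putting the last two slides together, rule generation from one frequent itemset can be sketched as follows (Python, illustrative, not from the slides): consequents grow one item per level, consequents sharing a prefix are merged, and a rule is dropped, together with its descendants, as soon as its confidence falls below the threshold.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def supp(itemset):
    """Support count; in Apriori these come from the mining phase."""
    return sum(set(itemset) <= t for t in transactions)

def rules_from_itemset(L, minconf):
    """Level-wise rule generation from one frequent itemset L."""
    L = tuple(sorted(L))
    rules, consequents = [], [(c,) for c in L]
    while consequents:
        kept = []
        for rhs in consequents:
            lhs = tuple(x for x in L if x not in rhs)
            if not lhs:
                continue
            conf = supp(L) / supp(lhs)
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
                kept.append(rhs)          # only high-confidence RHSs grow
        # Merge consequents sharing a prefix (the join shown in the slide).
        consequents = [a + (b[-1],) for i, a in enumerate(kept)
                       for b in kept[i + 1:] if a[:-1] == b[:-1]]
    return rules

for lhs, rhs, conf in rules_from_itemset({"Milk", "Diaper", "Beer"}, 0.9):
    print(lhs, "->", rhs, conf)  # only {Beer, Milk} -> {Diaper} reaches c = 1.0
```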