Data Mining: Association Analysis. Partially from: Introduction to Data Mining by Tan, Steinbach, Kumar



Similar documents
Data Mining Association Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 6. Introduction to Data Mining

Data Mining Apriori Algorithm

Association Analysis: Basic Concepts and Algorithms

Mining Association Rules. What Is Association Rule Mining?

Market Basket Analysis and Mining Association Rules

Unique column combinations

Data Mining2 Advanced Aspects and Applications

Association Rule Mining

MAXIMAL FREQUENT ITEMSET GENERATION USING SEGMENTATION APPROACH

MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM

Laboratory Module 8 Mining Frequent Itemsets Apriori Algorithm

Project Report. 1. Application Scenario

Frequent item set mining

HOW TO USE MINITAB: DESIGN OF EXPERIMENTS. Noelle M. Richard 08/27/14

Boolean Algebra (cont'd) UNIT 3 BOOLEAN ALGEBRA (CONT'D) Guidelines for Multiplying Out and Factoring. Objectives. Iris Hui-Ru Jiang Spring 2010

Lecture Notes on Database Normalization

Mining Association Rules to Evade Network Intrusion in Network Audit Data

Finding Frequent Itemsets using Apriori Algorithm to Detect Intrusions in Large Dataset

Building A Smart Academic Advising System Using Association Rule Mining

Fuzzy Association Rules

Chapter 6: Episode discovery process

Data Mining: Foundation, Techniques and Applications

Finding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm

Die Welt Multimedia Reach

Association Rule Mining: A Survey

Data Mining: An Overview. David Madigan

Knowledge Discovery and Data Mining

Jordan University of Science & Technology Computer Science Department CS 728: Advanced Database Systems Midterm Exam First 2009/2010

Selection of Optimal Discount of Retail Assortments with Data Mining Approach

RDB-MINER: A SQL-Based Algorithm for Mining True Relational Databases

An Efficient Market Basket Analysis based on Adaptive Association Rule Mining with Faster Rule Generation Algorithm

Gerry Hobbs, Department of Statistics, West Virginia University

PREDICTIVE MODELING OF INTER-TRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE

Web Users Session Analysis Using DBSCAN and Two Phase Utility Mining Algorithms

Functional Dependencies and Normalization

Part 2: Community Detection

Discovery of Maximal Frequent Item Sets using Subset Creation

Mining Association Rules on Grid Platforms

Data Mining Applications in Manufacturing

Big Data Analysis Technology

Distributed Apriori in Hadoop MapReduce Framework

Discovering Data Quality Rules

New Matrix Approach to Improve Apriori Algorithm

MASTER'S THESIS. Mining Changes in Customer Purchasing Behavior

Introduction to Data Mining

Static Data Mining Algorithm with Progressive Approach for Mining Knowledge

Chapter 4 Data Mining A Short Introduction. 2006/7, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Data Mining - 1

Data Preprocessing. Week 2

Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach

Distributed Data Mining Algorithm Parallelization

Benchmark Databases for Testing Big-Data Analytics In Cloud Environments

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Theory behind Normalization & DB Design. Satisfiability: Does an FD hold? Lecture 12

Week 11: Normal Forms. Logical Database Design. Normal Forms and Normalization. Examples of Redundancy

Unit 3 Boolean Algebra (Continued)

Databases -Normalization III. (N Spadaccini 2010 and W Liu 2012) Databases - Normalization III 1 / 31

Chapter 20: Data Analysis

Databases - Data Mining. (GF Royle, N Spadaccini ) Databases - Data Mining 1 / 25

Index support for regular expression search. Alexander Korotkov PGCon 2012, Ottawa

Parallel Data Mining Algorithms for Association Rules and Clustering

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fraser University. MORGAN KAUFMANN PUBLISHERS, AN IMPRINT OF Elsevier

Association Rule Mining: Exercises and Answers

A Survey on Association Rule Mining in Market Basket Analysis

A Data Mining Tutorial

Limitations of E-R Designs. Relational Normalization Theory. Redundancy and Other Problems. Redundancy. Anomalies. Example

Enhancement of Security in Distributed Data Mining

Chapter 7: Association Rule Mining in Learning Management Systems

Mining Association Rules: A Database Perspective

DEVELOPMENT OF HASH TABLE BASED WEB-READY DATA MINING ENGINE

Multi-table Association Rules Hiding

How to bet using different NairaBet Bet Combinations (Combo)

Database Management Systems. Redundancy and Other Problems. Redundancy

A Data Mining Framework for Automatic Online Customer Lead Generation

Introduction to Market Basket Analysis Bill Qualls, First Analytics, Raleigh, NC

Practical Applications of DATA MINING. Sang C Suh Texas A&M University Commerce JONES & BARTLETT LEARNING

How To Find Out What A Key Is In A Database Engine

Soccer Bet Types Content

Design of Relational Database Schemas

Mining association rules with multiple minimum supports: a new mining algorithm and a support tuning mechanism

Data Mining: Introduction. Lecture Notes for Chapter 1. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Improving Apriori Algorithm to get better performance with Cloud Computing

Association Rules Mining: A Recent Overview

Database Design and Normalization

CH3 Boolean Algebra (cont'd)

EFFECTIVE USE OF THE KDD PROCESS AND DATA MINING FOR COMPUTER PERFORMANCE PROFESSIONALS

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Data Mining Classification: Decision Trees

Data Mining to Recognize Fail Parts in Manufacturing Process

Macau Slot Co., Ltd. (SLOT Sociedade de Lotarias e Apostas Mútuas de Macau, Lda.) Soccer Bet Types

Data Mining/Knowledge Discovery/ Intelligent Data Analysis. Data Mining. Key Issue: Feedback. Applications

Transcription:

Data Mining: Association Analysis. Partially from: Introduction to Data Mining by Tan, Steinbach, Kumar

Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of association rules: {Bread} → {Milk}, {Diaper} → {Beer}

Definition: Frequent Itemset

- Itemset: a collection of one or more items. Example: {Milk, Bread, Diaper}
- k-itemset: an itemset that contains k items
- Support count (σ): frequency of occurrence of an itemset. E.g. σ({Milk, Bread, Diaper}) = 2
- Support (s): fraction of transactions that contain an itemset. E.g. s({Milk, Bread, Diaper}) = 2/5
- Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Definition: Association Rule

- Association rule: an implication expression of the form X → Y, where X and Y are itemsets. Example: {Milk, Diaper} → {Beer}
- Rule evaluation metrics:
  - Support (s): fraction of transactions that contain both X and Y
  - Confidence (c): measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67
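
As a concrete illustration, here is a minimal Python sketch (not from the slides; the set-based transaction encoding is an assumption) that computes both metrics for the example rule:

```python
# Minimal sketch: support and confidence of {Milk, Diaper} -> {Beer}
# on the five market-basket transactions above.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)               # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3 = 0.67
print(f"s = {s:.2f}, c = {c:.2f}")
```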

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
- support ≥ minsup threshold
- confidence ≥ minconf threshold

Brute-force approach:
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds
=> Computationally prohibitive!

Mining Association Rules

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements

Association Rules Mining (cont.)

Problem decomposition:
1. Find all itemsets with support larger than s% (large itemsets L_k)
2. Use the large itemsets to generate the rules. Example: if ABCD and AB are large itemsets, then the rule AB → CD holds if support(ABCD)/support(AB) >= c%

Existing mining algorithms have focused on sub-problem 1; they read the database in several passes and focus on reducing the computation cost. A sketch of sub-problem 2 is given below.
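
A minimal sketch of the rule-generation step, assuming sub-problem 1 has already produced a `support` map from itemsets to their support values (the function and parameter names are illustrative):

```python
from itertools import combinations

def generate_rules(itemset, support, minconf):
    """Emit every rule X -> (itemset - X) whose confidence
    support(itemset) / support(X) meets the minconf threshold."""
    items = frozenset(itemset)
    rules = []
    for r in range(1, len(items)):
        for lhs in map(frozenset, combinations(items, r)):
            conf = support[items] / support[lhs]
            if conf >= minconf:
                rules.append((set(lhs), set(items - lhs), conf))
    return rules

# e.g. with support({A,B,C,D}) and support({A,B}) known,
# AB -> CD is emitted when the ratio of the two supports reaches minconf.
```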

Frequent Itemset Generation

[Itemset lattice over items A, B, C, D, E, from the null set through all 1-, 2-, 3-, and 4-itemsets up to ABCDE.]

Given d items, there are 2^d possible candidate itemsets.

Frequent Itemset Generation

Brute-force approach:
- Each itemset in the lattice is a candidate frequent itemset
- Count the support of each candidate by scanning the database

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

- Match each transaction against every candidate
- Complexity ~ O(NMw) => expensive since M = 2^d!
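
To make that cost concrete, a hypothetical brute-force counter: it enumerates all M = 2^d − 1 candidates and scans all N transactions for each, with a subset test proportional to the transaction width w:

```python
from itertools import combinations

def brute_force_supports(transactions, items):
    """O(N*M*w): count every one of the 2^d - 1 candidate itemsets
    against every transaction."""
    counts = {}
    for k in range(1, len(items) + 1):                      # all candidate sizes
        for cand in map(frozenset, combinations(sorted(items), k)):
            counts[cand] = sum(1 for t in transactions if cand <= t)
    return counts
```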

Frequent Itemset Generation Strategies

- Reduce the number of candidates (M)
  - Complete search: M = 2^d
  - Use pruning techniques to reduce M
- Reduce the number of transactions (N)
  - Reduce the size of N as the size of the itemset increases
  - Used by DHP and vertical-based mining algorithms
- Reduce the number of comparisons (NM)
  - Use efficient data structures to store the candidates or transactions
  - No need to match every candidate against every transaction

Reducing the Number of Candidates

Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.

The Apriori principle holds due to the following property of the support measure:

∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

The support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.

Illustrating the Apriori Principle

[Lattice diagram: an itemset found to be infrequent has all of its supersets pruned.]

Illustrating the Apriori Principle

Minimum support count = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                Count
{Bread, Milk, Diaper}  3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13.

Apriori Algorithm

Method:
- Let k = 1
- Generate frequent itemsets of length 1
- Repeat until no new frequent itemsets are identified:
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  - Prune candidate itemsets containing subsets of length k that are infrequent
  - Count the support of each candidate by scanning the DB
  - Eliminate candidates that are infrequent, leaving only those that are frequent

A compact sketch of this loop follows.
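
A minimal Python sketch of the loop above, assuming transactions are sets and minsup is given as an absolute count (the function name is illustrative):

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    # k = 1: count individual items and keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    Lk = {s for s, c in counts.items() if c >= minsup_count}
    frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:
        # generate length-(k+1) candidates from length-k frequent itemsets,
        # pruning any candidate with an infrequent length-k subset
        candidates = {
            a | b for a in Lk for b in Lk
            if len(a | b) == k + 1
            and all(frozenset(s) in Lk for s in combinations(a | b, k))
        }
        # count each candidate's support by scanning the DB
        cand_counts = {c: sum(1 for t in transactions if c <= t)
                       for c in candidates}
        # eliminate infrequent candidates
        Lk = {c for c, n in cand_counts.items() if n >= minsup_count}
        frequent.update({c: cand_counts[c] for c in Lk})
        k += 1
    return frequent
```

On the market-basket data above with minsup_count = 3, this yields the same frequent itemsets as the preceding illustration, ending with {Bread, Milk, Diaper}.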

Association Rules Mining: Generation of Candidate Itemsets C_k

1) Join: L_{k-1} ⋈ L_{k-1}
2) Prune: delete all itemsets c ∈ C_k such that some (k-1)-subset of c is not in L_{k-1}. (E.g., if {A, B} is large then both {A} and {B} are large.)

Example: let L_3 = {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}
1) Join: C_4 = L_3 ⋈ L_3 = {{1 2 3 4}, {1 3 4 5}}
2) Prune: {1 3 4 5} is deleted because {1 4 5} is not in L_3 => C_4 = {{1 2 3 4}}
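
A sketch of this step in Python. Note that the classic join merges two (k-1)-itemsets that agree on their first k-2 items; for brevity this version joins by set union, which produces a superset of that join ({1 2 3 5} also appears here), but the prune step leaves the same C_k:

```python
from itertools import combinations

def gen_candidates(L_prev, k):
    """Join L(k-1) with itself, then prune any candidate that has a
    (k-1)-subset outside L(k-1)."""
    prev = {frozenset(s) for s in L_prev}
    joined = {a | b for a in prev for b in prev if len(a | b) == k}
    return [c for c in joined
            if all(frozenset(s) in prev for s in combinations(c, k - 1))]

L3 = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}]
print(gen_candidates(L3, 4))   # [frozenset({1, 2, 3, 4})]
```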

Example of Apriori Mining

Trans #  Items
1        A B C D E
2        A B D
3        A C F
4        D G
5        A B H

Minimum support = 2/5

Step 1: Find L_1.
L_1 = {A, B, C, D}

Step 2: Find L_2.
i) Generate candidates: C_2 = L_1 x L_1 = {(A B), (A C), (A D), (B C), (B D), (C D)}
ii) Count: S((A B)) = 3/5; S((A C)) = S((A D)) = S((B D)) = 2/5; S((B C)) = S((C D)) = 1/5
L_2 = {(A B), (A C), (A D), (B D)}

Step 3: Find L_3.
i) C_3 = L_2 x L_2 = {(A B C), (A B D), (A C D)}
Prune: delete (A B C) and (A C D), since (B C) and (C D) are not in L_2
ii) Count: S((A B D)) = 2/5
L_3 = {(A B D)}

Factors Affecting Complexity

- Choice of minimum support threshold: lowering the support threshold results in more frequent itemsets; this may increase the number of candidates and the max length of frequent itemsets
- Dimensionality (number of items) of the data set: more space is needed to store the support count of each item; if the number of frequent items also increases, both computation and I/O costs may increase
- Size of database: since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions
- Average transaction width: transaction width increases with denser data sets; this may increase the max length of frequent itemsets and the traversals of the hash tree (the number of subsets in a transaction increases with its width)

Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is frequent.

[Lattice diagram: the border separating frequent from infrequent itemsets, with the maximal itemsets lying just inside it.]

Closed Itemset

An itemset is closed if none of its immediate supersets has the same support as the itemset.

TID  Items
1    {A, B}
2    {B, C, D}
3    {A, B, C, D}
4    {A, B, D}
5    {A, B, C, D}

Itemset  Support
{A}      4
{B}      5
{C}      3
{D}      4
{A,B}    4
{A,C}    2
{A,D}    3
{B,C}    3
{B,D}    4
{C,D}    3

Itemset    Support
{A,B,C}    2
{A,B,D}    3
{A,C,D}    2
{B,C,D}    3
{A,B,C,D}  2
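
A small sketch that classifies frequent itemsets as closed and/or maximal, assuming a dict of support counts like the table above (keys are frozensets; missing entries are treated as support 0):

```python
def closed_and_maximal(supports, minsup):
    """Classify each frequent itemset: closed if no immediate superset
    has the same support; maximal if no immediate superset is frequent."""
    items = set().union(*supports)
    frequent = {s for s, c in supports.items() if c >= minsup}
    closed, maximal = set(), set()
    for s in frequent:
        supersets = [s | {i} for i in items - s]
        if all(supports.get(t, 0) != supports[s] for t in supersets):
            closed.add(s)
        if all(t not in frequent for t in supersets):
            maximal.add(s)
    return closed, maximal
```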

Maximal vs Closed Itemsets

TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Lattice diagram annotating each itemset with the IDs of the transactions that contain it, e.g. A: 1,2,4; B: 1,2,3; C: 1,2,3,4; D: 2,4,5; E: 3,4,5. Itemsets not supported by any transaction, such as ABCDE, are marked.]

Maximal vs Closed Frequent Itemsets

Minimum support = 2

[Lattice diagram for the same transactions, distinguishing itemsets that are closed but not maximal from those that are closed and maximal.]

# Closed = 9
# Maximal = 4

Association Rules Discovery: Research Issues

- Efficient mining methods, e.g. no candidate generation; I/O reduction
- Variations of association rules: multiple support, multi-level rules
- Rule interestingness and pruning
- Incremental mining
- Accuracy adjustment
- New constraints and measures
- Integration considerations with RDBMS
- Fuzzy sets, rough sets, etc.

Discussion: What other applications are there for association rule mining? Which fields and industry problems?

Variations of Association Rules

- Multiple support
- Multi-level association rules

Effect of Support Distribution

Many real data sets have a skewed support distribution.

[Figure: support distribution of a retail data set.]

Effect of Support Distribution

How to set the appropriate minsup threshold?
- If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
- If minsup is set too low, mining becomes computationally expensive and the number of itemsets is very large

Using a single minimum support threshold may not be effective.

Multiple Minimum Support

How to apply multiple minimum supports?
- MS(i): minimum support for item i
- e.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
- MS({Milk, Broccoli}) = min(MS(Milk), MS(Broccoli)) = 0.1%

Challenge: support is no longer anti-monotone.
- Suppose Support(Milk, Coke) = 1.5% and Support(Milk, Coke, Broccoli) = 0.5%
- Then {Milk, Coke} is infrequent but {Milk, Coke, Broccoli} is frequent!

Multiple Minimum Support (Liu 1999)

Order the items according to their minimum support (in ascending order).
- e.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
- Ordering: Broccoli, Salmon, Coke, Milk

Need to modify Apriori such that:
- L_1: set of frequent items
- F_1: set of items whose support is ≥ MS(1) (where MS(1) = min_i MS(i))
- C_2: candidate itemsets of size 2 are generated from F_1 instead of L_1

Multiple Minimum Support (Liu 1999)

Modifications to Apriori:
- In traditional Apriori, a candidate (k+1)-itemset is generated by merging two frequent itemsets of size k, and the candidate is pruned if it contains any infrequent subset of size k
- The pruning step has to be modified: prune only if the subset contains the first item
- e.g.: Candidate = {Broccoli, Coke, Milk} (ordered according to minimum support). {Broccoli, Coke} and {Broccoli, Milk} are frequent but {Coke, Milk} is infrequent. The candidate is not pruned, because {Coke, Milk} does not contain the first item, i.e., Broccoli. A sketch of this modified prune test follows.
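
A sketch of the modified prune test (hypothetical helper names; `ms` maps each item to its minimum support, and `L_prev` holds the frequent k-itemsets):

```python
from itertools import combinations

def ms_prune(candidate, L_prev, ms):
    """Prune a (k+1)-candidate only if an infrequent k-subset contains
    the first item, i.e. the item with the lowest minimum support."""
    first = min(candidate, key=lambda i: ms[i])
    k = len(candidate) - 1
    return any(first in sub and frozenset(sub) not in L_prev
               for sub in combinations(candidate, k))

ms = {"Broccoli": 0.001, "Coke": 0.03, "Milk": 0.05}
L2 = {frozenset({"Broccoli", "Coke"}), frozenset({"Broccoli", "Milk"})}
print(ms_prune({"Broccoli", "Coke", "Milk"}, L2, ms))  # False: kept, as above
```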

Multi-level Association Rules

[Concept hierarchy, from top level to leaves:]

Food
  Bread: Wheat, White
  Milk: Skim, 2% (brands: Foremost, Kemps)
Electronics
  Computers: Desktop, Laptop, Accessory (Printer, Scanner)
  Home: TV, DVD

Multi-level Association Rules

Why should we incorporate a concept hierarchy?
- Rules at lower levels may not have enough support to appear in any frequent itemsets
- Rules at lower levels of the hierarchy are overly specific: e.g., skim milk → white bread, 2% milk → wheat bread, skim milk → wheat bread, etc. are all indicative of an association between milk and bread

Multi-level Association Rules

How do support and confidence vary as we traverse the concept hierarchy?
- If X is the parent item for both X1 and X2, then σ(X) ≥ σ(X1) + σ(X2)
- If σ(X1 ∪ Y1) ≥ minsup, and X is the parent of X1 and Y is the parent of Y1, then σ(X ∪ Y1) ≥ minsup, σ(X1 ∪ Y) ≥ minsup, and σ(X ∪ Y) ≥ minsup
- If conf(X1 → Y1) ≥ minconf, then conf(X1 → Y) ≥ minconf

Multi-level Association Rules

Approach 1: extend the current association rule formulation by augmenting each transaction with higher-level items.
- Original transaction: {skim milk, wheat bread}
- Augmented transaction: {skim milk, wheat bread, milk, bread, food}

Issues:
- Items that reside at higher levels have much higher support counts; if the support threshold is low, too many frequent patterns involve items from the higher levels
- Increased dimensionality of the data

A sketch of the augmentation step is shown below.
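
A minimal sketch of the augmentation, using a hypothetical child-to-parent map for part of the hierarchy shown earlier:

```python
# Hypothetical parent map (child -> parent) for part of the hierarchy.
parent = {
    "skim milk": "milk", "2% milk": "milk",
    "wheat bread": "bread", "white bread": "bread",
    "milk": "food", "bread": "food",
}

def augment(transaction, parent):
    """Extend a transaction with every ancestor of its items."""
    extended = set(transaction)
    frontier = list(transaction)
    while frontier:
        item = frontier.pop()
        if item in parent:
            extended.add(parent[item])
            frontier.append(parent[item])
    return extended

print(augment({"skim milk", "wheat bread"}, parent))
# -> {'skim milk', 'wheat bread', 'milk', 'bread', 'food'}
```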

Multi-level Association Rules

Approach 2: generate frequent patterns at the highest level first; then generate frequent patterns at the next highest level, and so on.

Issues:
- I/O requirements will increase dramatically because we need to perform more passes over the data
- May miss some potentially interesting cross-level association patterns

Other approaches: [Paper Study]

Pattern Evaluation

- Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant (redundant if {A,B,C} → {D} and {A,B} → {D} have the same support and confidence)
- Interestingness measures can be used to prune/rank the derived patterns
- In the original formulation of association rules, support and confidence are the only measures used

Application of Interestingness Measures

[Diagram of the KDD pipeline: Data → Selection → Selected Data → Preprocessing → Preprocessed Data (feature × product matrix) → Mining → Patterns → Postprocessing → Knowledge, with interestingness measures applied to the patterns in the postprocessing step.]

Computing an Interestingness Measure

Given a rule X → Y, the information needed to compute the rule's interestingness can be obtained from a contingency table.

Contingency table for X → Y:

        Y     ¬Y
X       f11   f10   f1+
¬X      f01   f00   f0+
        f+1   f+0   |T|

f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y

The table is used to define various measures: support, confidence, lift, Gini, J-measure, etc.

Drawback of Confidence

        Coffee  ¬Coffee
Tea     15      5        20
¬Tea    75      5        80
        90      10       100

Association rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9.
Although the confidence is high, the rule is misleading: P(Coffee | ¬Tea) = 75/80 = 0.9375.

Statistical Independence

Population of 1000 students:
- 600 students know how to swim (S)
- 700 students know how to bike (B)
- 420 students know how to swim and bike (S, B)

P(S ∧ B) = 420/1000 = 0.42
P(S) × P(B) = 0.6 × 0.7 = 0.42

- P(S ∧ B) = P(S) × P(B) => statistical independence
- P(S ∧ B) > P(S) × P(B) => positively correlated
- P(S ∧ B) < P(S) × P(B) => negatively correlated

Statistical-based Measures

Measures that take into account statistical dependence:

Lift = P(Y | X) / P(Y)
Interest = P(X, Y) / (P(X) P(Y))
PS = P(X, Y) − P(X) P(Y)
φ-coefficient = (P(X, Y) − P(X) P(Y)) / sqrt(P(X)[1 − P(X)] P(Y)[1 − P(Y)])
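
A sketch computing all four measures from a 2×2 contingency table of absolute counts (function and variable names are illustrative):

```python
import math

def measures(f11, f10, f01, f00):
    """Lift, interest, PS and phi-coefficient from a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    pxy = f11 / n
    px, py = (f11 + f10) / n, (f11 + f01) / n
    lift = (pxy / px) / py              # P(Y|X) / P(Y)
    interest = pxy / (px * py)          # equals lift for a 2x2 table
    ps = pxy - px * py
    phi = ps / math.sqrt(px * (1 - px) * py * (1 - py))
    return lift, interest, ps, phi

# Tea/Coffee table from the confidence example:
print(measures(15, 5, 75, 5)[0])        # lift = 0.8333..., see the next slide
```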

Example: Lift

        Coffee  ¬Coffee
Tea     15      5        20
¬Tea    75      5        80
        90      10       100

Association rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9.
Lift = 0.75/0.9 = 0.8333 (< 1, therefore negatively associated).

Example: Lift (cont.)

        Y   ¬Y
X       10  0    10
¬X      0   90   90
        10  90   100

Lift = 0.1 / (0.1 × 0.1) = 10

        Y   ¬Y
X       90  0    90
¬X      0   10   10
        90  10   100

Lift = 0.9 / (0.9 × 0.9) = 1.11

Statistical independence: if P(X,Y) = P(X) P(Y), then Lift = 1.

Some Issues

- There are many measures proposed in the literature
- Some measures are good for certain applications, but not for others
- What criteria should we use to determine whether a measure is good or bad?
- What about Apriori-style support-based pruning? How does it affect these measures?