Data Mining2 Advanced Aspects and Applications



Similar documents
Data Mining: Partially from: Introduction to Data Mining by Tan, Steinbach, Kumar

Data Mining Apriori Algorithm

Data Mining Association Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 6. Introduction to Data Mining

Laboratory Module 8 Mining Frequent Itemsets Apriori Algorithm

Mining Association Rules. Mining Association Rules. What Is Association Rule Mining? What Is Association Rule Mining? What is Association rule mining

Association Analysis: Basic Concepts and Algorithms

PREDICTIVE MODELING OF INTER-TRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE

Association Rule Mining

Selection of Optimal Discount of Retail Assortments with Data Mining Approach

Market Basket Analysis and Mining Association Rules

Project Report. 1. Application Scenario

Mining various patterns in sequential data in an SQL-like manner *

Databases - Data Mining. (GF Royle, N Spadaccini ) Databases - Data Mining 1 / 25

Market Basket Analysis for a Supermarket based on Frequent Itemset Mining

Data Mining: An Overview. David Madigan

MAXIMAL FREQUENT ITEMSET GENERATION USING SEGMENTATION APPROACH

Chapter 20: Data Analysis

Introduction to Data Mining

Data Mining: Introduction. Lecture Notes for Chapter 1. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

DEVELOPMENT OF HASH TABLE BASED WEB-READY DATA MINING ENGINE

Finding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm

MBA Data Mining & Knowledge Discovery

RDB-MINER: A SQL-Based Algorithm for Mining True Relational Databases

Chapter 4 Data Mining A Short Introduction. 2006/7, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Data Mining - 1

Association Rule Mining: A Survey

A Survey on Association Rule Mining in Market Basket Analysis

Data Mining Applications in Manufacturing

Building A Smart Academic Advising System Using Association Rule Mining

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: X DATA MINING TECHNIQUES AND STOCK MARKET

A COGNITIVE APPROACH IN PATTERN ANALYSIS TOOLS AND TECHNIQUES USING WEB USAGE MINING

Chapter 6: Episode discovery process

Fuzzy Association Rules

Data Mining: Foundation, Techniques and Applications

Data Warehousing and Data Mining. A.A Datawarehousing & Datamining 1

MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM

Data Mining & Knowledge Discovery: Personalization and Profiling Technologies

Enhancement of Security in Distributed Data Mining

New Matrix Approach to Improve Apriori Algorithm

IJRFM Volume 2, Issue 1 (January 2012) (ISSN )

Gerry Hobbs, Department of Statistics, West Virginia University

An application for clickstream analysis

A Way to Understand Various Patterns of Data Mining Techniques for Selected Domains

IT and CRM A basic CRM model Data source & gathering system Database system Data warehouse Information delivery system Information users

Protein Protein Interaction Networks

Business Lead Generation for Online Real Estate Services: A Case Study

Information Management course

Unique column combinations

Distributed Apriori in Hadoop MapReduce Framework

Mining an Online Auctions Data Warehouse

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Implementation of Data Mining Techniques to Perform Market Analysis

Future Trend Prediction of Indian IT Stock Market using Association Rule Mining of Transaction data

Distributed Data Mining Algorithm Parallelization

1. What are the uses of statistics in data mining? Statistics is used to Estimate the complexity of a data mining problem. Suggest which data mining

Data Mining for Retail Website Design and Enhanced Marketing

Lecture 6 - Data Mining Processes

Analytics on Big Data

A Fraud Detection Approach in Telecommunication using Cluster GA

ASSOCIATION RULE MINING ON WEB LOGS FOR EXTRACTING INTERESTING PATTERNS THROUGH WEKA TOOL

A Data Mining Framework for Optimal Product Selection in Retail Supermarket Data: The Generalized PROFSET Model

Introduction to Data Mining

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

How To Write An Association Rules Mining For Business Intelligence

ONTOLOGY IN ASSOCIATION RULES PRE-PROCESSING AND POST-PROCESSING

Improving Apriori Algorithm to get better performance with Cloud Computing

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

An Efficient GA-Based Algorithm for Mining Negative Sequential Patterns

Association rules for improving website effectiveness: case analysis

DATA MINING TECHNIQUES: A SOURCE FOR CONSUMER BEHAVIOR ANALYSIS

A Data Mining Tutorial

Web Users Session Analysis Using DBSCAN and Two Phase Utility Mining Algorithms

Introduction to Artificial Intelligence G51IAI. An Introduction to Data Mining

Data Mining Introduction

IBM SPSS Direct Marketing 22

Data Mining to Recognize Fail Parts in Manufacturing Process

Data Preprocessing. Week 2

An Overview of Knowledge Discovery Database and Data mining Techniques

Frequent item set mining

A Novel Technique of Privacy Protection. Mining of Association Rules from Outsourced. Transaction Databases

Implementing Improved Algorithm Over APRIORI Data Mining Association Rule Algorithm

Data Warehousing and Data Mining

Supply chain management by means of FLM-rules

IBM SPSS Direct Marketing 23

DATA MINING TECHNIQUES: A SOURCE FOR CONSUMER BEHAVIOR ANALYSIS

Ontology-Based Filtering Mechanisms for Web Usage Patterns Retrieval

Static Data Mining Algorithm with Progressive Approach for Mining Knowledge

Web Mining Patterns Discovery and Analysis Using Custom-Built Apriori Algorithm

EFFECTIVE USE OF THE KDD PROCESS AND DATA MINING FOR COMPUTER PERFORMANCE PROFESSIONALS

Discovery of Maximal Frequent Item Sets using Subset Creation

Multi-table Association Rules Hiding

Transcription:

Data Mining2 Advanced Aspects and Applications Fosca Giannotti and Mirco Nanni Pisa KDD Lab, ISTI-CNR & Univ. Pisa http://www-kdd.isti.cnr.it/ DIPARTIMENTO DI INFORMATICA - Università di Pisa anno accademico 2013/2014

Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2

Association rules - module outline What are association rules (AR) and what are they used for: The paradigmatic application: Market Basket Analysis The single dimensional AR (intra-attribute) How to compute AR Basic Apriori Algorithm and its optimizations Multi-Dimension AR (inter-attribute) Quantitative AR Constrained AR How to reason on AR and how to evaluate their quality Multiple-level AR Interestingness Correlation vs. Association 3

Association Rule Mining Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction Market-Basket transactions TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Example of Association Rules {Diaper} {Beer}, {Milk, Bread} {Eggs,Coke}, {Beer, Bread} {Milk}, Implication means co-occurrence, not causality!

Definition: Frequent Itemset Itemset A collection of one or more items u Example: {Milk, Bread, Diaper} k-itemset u An itemset that contains k items Support count (σ) Frequency of occurrence of an itemset E.g. σ({milk, Bread,Diaper}) = 2 σ(x) = {t i X contained in t i and ti is a trasaction} Support Fraction of transactions that contain an itemset E.g. s({milk, Bread, Diaper}) = 2/5 Frequent Itemset An itemset whose support is greater than or equal to a minsup threshold TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

Definition: Association Rule Association Rule An implication expression of the form X Y, where X and Y are itemsets Example: {Milk, Diaper} {Beer} Rule Evaluation Metrics Support (s) u Fraction of transactions that contain both X and Y Confidence (c) u Measures how often items in Y appear in transactions that contain X TID Example: Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke { Milk, Diaper} (Milk, Diaper,Beer) s = σ = T σ (Milk,Diaper,Beer) c = = σ (Milk,Diaper) 2 3 2 5 = Beer = 0.4 0.67

The Apriori Algorithm The classical Apriori algorithm [1994] exploits a nice property of frequency in order to prune the exponential search space of the problem: if an itemset is infrequent all its supersets will be infrequent as well This property is known as the antimonotonicity of frequency (aka the Apriori trick ). This property suggests a breadth-first level-wise computation. a,b,c,d a, b, c a, b, d a, c, d b, c, d a, b a, c a, d b, c b, d c, d a b c d 7

Apriori Execution Example (min_sup = 2) Database TDB TID Items 100 1 3 4 Scan TDB 200 2 3 5 300 1 2 3 5 400 2 5 itemset sup. {1} 2 {2} 3 {3} 3 {4} 1 {5} 3 C 1 L 1 itemset sup. {1} 2 {2} 3 {3} 3 {5} 3 itemset sup {1 3} 2 {2 3} 2 {2 5} 3 {3 5} 2 itemset sup {1 2} 1 {1 3} 2 {1 5} 1 {2 3} 2 {2 5} 3 {3 5} 2 L 2 C 2 C 2 itemset {2 3 5} Scan TDB C 3 L 3 Scan TDB itemset sup {2 3 5} 2 itemset {1 2} {1 3} {1 5} {2 3} {2 5} {3 5} 8

The Apriori Algorithm Join Step: C k is generated by joining L k-1 with itself Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset Pseudo-code: C k : Candidate itemset of size k L k : frequent itemset of size k L 1 = {frequent items}; for (k = 1; L k!= ; k++) do begin C k+1 = candidates generated from L k ; for each transaction t in database do increment the count of all candidates in C k+1 that are contained in t L k+1 = candidates in C k+1 with min_support end return k L k ; 9

Generating Association Rules from Frequent Itemsets Only strong association rules are generated Frequent itemsets satisfy minimum support threshold Strong rules are those that satisfy minimun confidence threshold support( A B) support( A) For each frequent itemset, f, generate all non-empty subsets of f For every non-empty subset s of f do if support(f)/support(s) min_confidence then output rule s ==> (f-s) end 10

Rule Generation Given a frequent itemset L, find all non-empty subsets f L such that f L f satisfies the minimum confidence requirement If {A,B,C,D} is a frequent itemset, candidate rules: ABC D, ABD C, ACD B, BCD A, A BCD, B ACD, C ABD, D ABC AB CD, AC BD, AD BC, BC AD, BD AC, CD AB, If L = k, then there are 2 k 2 candidate association rules (ignoring L and L)

Multidimensional AR Associations between values of different attributes : CID nationality age income 1 Italian 50 low 2 French 40 high 3 French 30 high 4 Italian 50 medium 5 Italian 45 high 6 French 35 high RULES: nationality = French income = high [50%, 100%] income = high nationality = French [50%, 75%] age = 50 nationality = Italian [33%, 100%] 12

Discretization of quantitative attributes Solution: each value is replaced by the interval to which it belongs. height: 0-150cm, 151-170cm, 171-180cm, >180cm weight: 0-40kg, 41-60kg, 60-80kg, >80kg income: 0-10ML, 11-20ML, 20-25ML, 25-30ML, >30ML CID height weight income 1 151-171 60-80 >30 2 171-180 60-80 20-25 3 171-180 60-80 25-30 4 151-170 60-80 25-30 Problem: the discretization may be useless (see weight). 13

Multi-level Association Rules Food Electronics Bread Milk Computers Home Wheat White Skim 2% Desktop Laptop Accessory TV DVD Foremost Kemps Printer Scanner 14

Multilevel AR Is difficult to find interesting patterns at a too primitive level high support = too few rules low support = too many rules, most uninteresting Approach: reason at suitable level of abstraction A common form of background knowledge is that an attribute may be generalized or specialized according to a hierarchy of concepts Dimensions and levels can be efficiently encoded in transactions Multilevel Association Rules : rules which combine associations with hierarchy of concepts 15

Pattern Evaluation Association rule algorithms tend to produce too many rules many of them are uninteresting or redundant Redundant if {A,B,C} {D} and {A,B} {D} have same support & confidence Interestingness measures can be used to prune/ rank the derived patterns In the original formulation of association rules, support & confidence are the only measures used

Application of Interestingness Measure Interestingness Measures

Computing Interestingness Measure Given a rule X Y, information needed to compute rule interestingness can be obtained from a contingency table Contingency table for X Y Y Y X f 11 f 10 f 1+ X f 01 f 00 f o+ f +1 f +0 T f 11 : support of X and Y f 10 : support of X and Y f 01 : support of X and Y f 00 : support of X and Y Used to define various measures support, confidence, lift, Gini, J-measure, etc.

Statistical-based Measures Measures that take into account statistical dependence )] ( )[1 ( )] ( )[1 ( ) ( ) ( ), ( ) ( ) ( ), ( ) ( ) ( ), ( ) ( ) ( Y P Y P X P X P Y P X P Y X P coefficient Y P X P Y X P PS Y P X P Y X P Interest Y P X Y P Lift = = = = φ

Conclusion (Market basket Analysis) MBA is a key factor of success in the competition of supermarket retailers. Knowledge of customers and their purchasing behavior brings potentially huge added value. 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% 81% 20% 13% 30% 6% Light Medium Top how many customers how much they spend 50% 20

Which tools for market basket analysis? Association rule are needed but insufficient Market analysts ask for business rules: Is supermarket assortment adequate for the company s target class of customers? Is a promotional campaign effective in establishing a desired purchasing habit? 21

Business rules: temporal reasoning on AR Which rules are established by a promotion? How do rules change along time? 35 30 25 20 Support Pasta => Fresh Cheese 14 15 Bread Subsidiaries => Fresh Cheese 28 Biscuits => Fresh Cheese 14 10 Fresh Fruit => Fresh Cheese 14 5 Frozen Food => Fresh Cheese 14 0 25/11/97 26/11/97 27/11/97 28/11/97 29/11/97 30/11/97 01/12/97 02/12/97 03/12/97 04/12/97 05/12/97 22

Sequential Pattern Mining 23

Sequential Pattern Mining Lecture Notes for Chapter 7 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steibach, Tan,Steinbach, Kumar & Integration by (Giannott&Nanni) Introduction to Data DM2 Mining 2013-2014 # 4/18/2004 24

Sequential Patterns- module outline What are Sequential Patterns(SP) and what are they used for From Itemset to sequences Formal Definiton Computing Sequential Patterns Timing Constraints 25

Sequential / Navigational Patterns Sequential patterns add an extra dimension to frequent itemsets and association rules - time. Items can appear before, after, or at the same time as each other. General form: x% of the time, when A appears in a transaction, B appears within z transactions. u note that other items may appear between A and B, so sequential patterns do not necessarily imply consecutive appearances of items (in terms of time) Examples Renting Star Wars, then Empire Strikes Back, then Return of the Jedi in that order Collection of ordered events within an interval Most sequential pattern discovery algorithms are based on extensions of the Apriori algorithm for discovering itemsets Navigational Patterns they can be viewed as a special form of sequential patterns which capture navigational patterns among users of a site in this case a session is a consecutive sequence of pageview references for a user over a specified period of time 26 Tan,Steibach, Kumar & Integration by (Giannott&Nanni) Giannoti DM2 & Pedreschi 2013-2014 #

Examples of Sequence Data Sequence Database Customer Web Data Event data Genome sequences Element (Transaction ) Sequence Sequence Purchase history of a given customer Browsing activity of a particular Web visitor History of events generated by a given sensor DNA sequence of a particular species E1 E2 E1 E3 Element (Transaction) A set of items bought by a customer at time t A collection of files viewed by a Web visitor after a single mouse click Events triggered by a sensor at time t An element of the DNA sequence E2 E2 E3 E4 Event (Item) Books, diary products, CDs, etc Home page, index page, contact info, etc Types of alarms generated by sensors Bases A,T,G,C Event (Item ) 27

From Itemset to sequences Goal: customize, personalize the offerts according the personal history of any client Analysis: to study the temporal buying behaviour 5% of clients first has bought X, then Y then Z Requirements: to keep trace of the history for the clients (nome, fidelity cards, carte di credito, bancomat, e-mail, codice fiscale) Domanins: vendite al dettaglio, vendite per corrispondenza, vendite su internet, vendite di prodotti finanziari/bancari, analisi mediche 28

Transaction with Client Identifier (Pseudo) Intra-Transaction (Association Rules) Inter-Transaction (Sequential Patterns) items { i 1,, i k } Clients { c 1,, c m } Transaztion t { i 1,, i k } Client trasactions T = { (c 1, date 1, t 1 ),, (c n, date n, t n ) } Date may be replaced with a progressive number 29

Conceptual Model CRM & SP Logic Model Cliente Data Trans 3 10/09/1999 {10} 2 10/09/1999 {10, 20} 5 12/09/1999 {90} 2 15/09/1999 {30} 2 20/09/1999 {40,60,70} 1 25/09/1999 {30} 3 25/09/1999 {30,50,70} 4 25/09/1999 {30} 4 30/09/1999 {40,70} 1 30/09/1999 {90} 4 25/10/1999 {90} Data Cliente Articolo 10/09/1999 3 10 10/09/1999 2 10 10/09/1999 2 20 12/09/1999 5 90 15/09/1999 2 30 20/09/1999 2 40 20/09/1999 2 60 20/09/1999 2 70 25/09/1999 1 30 25/09/1999 3 30 25/09/1999 3 30 25/09/1999 3 70 25/09/1999 4 30 30/09/1999 4 40 30/09/1999 4 70 30/09/1999 1 90 25/10/1999 4 90 30

Sequence data from MB Insieme di transazioni cliente T = { (data 1, c 1, t 1 ),, (data n, c n, t n ) } Sequenza di transazioni per cliente c seq(c) = <t 1,, t i, t n > ordinate per data Cliente Sequenza 1 < {30},{90} > 2 < {10, 20}, {30}, {40,60,70}> 3 <{10}, {30,50,70}> 4 < {30}, {40,70}, {90} > 5 <{90}> Libro Titolo 10 Star Wars Episode I 20 La fondazione e l'impero 30 La seconda fondazione 40 Database systems 50 Algoritmi + Strutture Dati = 60 L'insostenibile leggerezza 70 Immortalita' 90 I buchi neri 31

Sequence Data Sequence Database: Timeline 10 15 20 25 30 35 Object Timestamp Events A 10 2, 3, 5 A 20 6, 1 A 23 1 B 11 4, 5, 6 B 17 2 B 21 7, 8, 1, 2 B 28 1, 6 C 14 1, 8, 7 Object A: Object B: 2 3 5 4 5 6 6 1 2 7 8 1 2 1 1 6 Object C: 1 7 8 Master MAINS, Marzo 2012 Reg. Ass. Tan,Steibach, Kumar & Integration by (Giannott&Nanni) Giannotti DM2 & Pedreschi 2013-2014 # 32

Sequences & Supports (intuition) <I 1, I 2,, I n > is contained in<j 1, J 2,, J m > If there exist h 1 < < h n such that I 1 J h1,, I n J hn < {30}, {90} > is contained in < {30}, {40,70}, {90} > < {30}, {40,70} > is contained in < {10,20}, {30}, {40,50,60,70} > and in < {30}, {40,70}, {90} > Support(s) = { c s contained in seq(c) } number of clients Support(< {20}, {70} > ) = 40% Supporto(< {90} > ) = 60% 33

Formal Definition of a Sequence A sequence is an ordered list of elements (transactions) s = < e 1 e 2 e 3 > Each element contains a collection of events (items) e i = {i 1, i 2,, i k } Each element is attributed to a specific time or location Length of a sequence, s, is given by the number of elements of the sequence A k-sequence is a sequence that contains k events (items) Tan,Steibach, Kumar & Integration by (Giannott&Nanni) Giannotti DM2 & Pedreschi 2013-2014 # 34

Examples of Sequence Web sequence: < {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} > Sequence of initiating events causing the nuclear accident at 3-mile Island: (http://stellar-one.com/nuclear/staff_reports/summary_soe_the_initiating_event.htm) < {clogged resin} {outlet valve closure} {loss of feedwater} {condenser polisher outlet valve shut} {booster pumps trip} {main waterpump trips} {main turbine trips} {reactor pressure increases}> Sequence of books checked out at a library: <{Fellowship of the Ring} {The Two Towers} {Return of the King}> 35

Formal Definition of a Subsequence A sequence <a 1 a 2 a n > is contained in another sequence <b 1 b 2 b m > (m n) if there exist integers i 1 < i 2 < < i n such that a 1 b i1, a 2 b i1,, a n b in Data sequence Subsequence Contain? < {2,4} {3,5,6} {8} > < {2} {3,5} > Yes < {1,2} {3,4} > < {1} {2} > No < {2,4} {2,4} {2,5} > < {2} {4} > Yes The support of a subsequence w is defined as the fraction of data sequences that contain w A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is minsup) 36

Sequential Pattern Mining: Definition Given: a database of sequences a user-specified minimum support threshold, minsup Task: Find all subsequences with support minsup 37

Sequential Pattern Mining: Challenge Given a sequence: <{a b} {c d e} {f} {g h i}> Examples of subsequences: <{a} {c d} {f} {g} >, < {c d e} >, < {b} {g} >, etc. How many k-subsequences can be extracted from a given n-sequence? <{a b} {c d e} {f} {g h i}> n = 9 k=4: Y Y Y _ Y Answer : n k 9 4 <{a} {d e} {i}> = = 126 Master MAINS, Marzo 2012 Reg. Ass. Tan,Steibach, Kumar & Integration by (Giannott&Nanni) Giannotti DM2 & Pedreschi 2013-2014 # 38

Sequential Pattern Mining: Example Object Timestamp Events A 1 1,2,4 A 2 2,3 A 3 5 B 1 1,2 B 2 2,3,4 C 1 1, 2 C 2 2,3,4 C 3 2,4,5 D 1 2 D 2 3, 4 D 3 4, 5 E 1 1, 3 E 2 2, 4, 5 Minsup = 50% Examples of Frequent Subsequences: < {1,2} > s=60% < {2,3} > s=60% < {2,4}> s=80% < {3} {5}> s=80% < {1} {2} > s=80% < {2} {2} > s=60% < {1} {2,3} > s=60% < {2} {2,3} > s=60% < {1,2} {2,3} > s=60% Master MAINS, Marzo 2012 Reg. Ass. Tan,Steibach, Kumar & Integration by (Giannott&Nanni) Giannotti DM2 & Pedreschi 2013-2014 # 39

Extracting Sequential Patterns Given n events: i 1, i 2, i 3,, i n Candidate 1-subsequences: <{i 1 }>, <{i 2 }>, <{i 3 }>,, <{i n }> Candidate 2-subsequences: <{i 1, i 2 }>, <{i 1, i 3 }>,, <{i 1 } {i 1 }>, <{i 1 } {i 2 }>,, <{i n-1 } {i n }> Candidate 3-subsequences: <{i 1, i 2, i 3 }>, <{i 1, i 2, i 4 }>,, <{i 1, i 2 } {i 1 }>, <{i 1, i 2 } {i 2 }>,, <{i 1 } {i 1, i 2 }>, <{i 1 } {i 1, i 3 }>,, <{i 1 } {i 1 } {i 1 }>, <{i 1 } {i 1 } {i 2 }>, Master MAINS, Marzo 2012 Reg. Ass. Tan,Steibach, Kumar & Integration by (Giannott&Nanni) Giannotti DM2 & Pedreschi 2013-2014 # 40

Generalized Sequential Pattern (GSP) Step 1: Make the first pass over the sequence database D to yield all the 1-element frequent sequences Step 2: Repeat until no new frequent sequences are found Candidate Generation: u Merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items Candidate Pruning: u Prune candidate k-sequences that contain infrequent (k-1)-subsequences Support Counting: u Make a new pass over the sequence database D to find the support for these candidate sequences Candidate Elimination: u Eliminate candidate k-sequences whose actual support is less than minsup 41

Timing Constraints (I) {A B} {C} {D E} <= x g >n g x g : max-gap n g : min-gap <= m s m s : maximum span x g = 2, n g = 0, m Data sequence s = 4 Subsequence Contain? < {2,4} {3,5,6} {4,7} {4,5} {8} > < {6} {5} > Yes < {1} {2} {3} {4} {5}> < {1} {4} > No < {1} {2,3} {3,4} {4,5}> < {2} {3} {5} > Yes < {1,2} {3} {2,3} {3,4} {2,4} {4,5}> < {1,2} {5} > No 42

Time constraints (2) Sliding Windows (transazione contenuta in più transazioni) <I 1, I 2,, I n > è contenuta in <J 1, J 2,, J m > se esistono h 1 < u 1 < < h n < u n per cui I 1 U k = h1..u1 J k,, I n U k = hn..un J k transaction-time(j ui ) - transaction-time(j hi ) < window-size per i = 1..n < {30}, {40,70} > è contenuta in < {30}, {40}, {70} > se transaction-time({70}) - transaction-time({40}) < window-size Time Constraints (limite di tempo tra due transazioni) <I 1, I 2,, I n > è contenuta in <J 1, J 2,, J m > se esistono h 1 < < h n per cui I 1 J h1,, I n J hn mingap < transaction-time(j hi ) - transaction-time(j hi-1 ) < maxgap Tan,Steibach, Kumar & Integration by (Giannott&Nanni) Giannotti per i DM2 = 2..n & 2013-2014 Pedreschi # 43.

Sequences & Supports <I 1, I 2,, I n > is contained in<j 1, J 2,, J m > If there exist h 1 < < h n such that I 1 J h1,, I n J hn < {30}, {90} > is contained in < {30}, {40,70}, {90} > < {30}, {40,70} > is contained in < {10,20}, {30}, {40,50,60,70} > and in < {30}, {40,70}, {90} > Support(s) = { c s contained in seq(c) } number of clients Support(< {20}, {70} > ) = 40% Supporto(< {90} > ) = 60% 44

Sequential Patterns Given MinSupport and a set of sequences S = { s Support(s) >= MinSupport } A sequence in S is a Sequential Pattern if is not contained in any other sequence of S MinSupport = 40% < {30}, {90} > is a sequantial pattern Supporto< {30} >) = 80% is not a sequantial pattern as it is contained in < {30}, {90} > MinSupporto = 50% < {30}, {90} > non è in S < {30} > è un pattern sequenziale 45

Altre Generalizzazioni Sliding Windows (transazione contenuta in più transazioni) <I 1, I 2,, I n > è contenuta in <J 1, J 2,, J m > se esistono h 1 < u 1 < < h n < u n per cui I 1 U k = h1..u1 J k,, I n U k = hn..un J k transaction-time(j ui ) - transaction-time(j hi ) < window-size per i = 1..n < {30}, {40,70} > è contenuta in < {30}, {40}, {70} > se transaction-time({70}) - transaction-time({40}) < window-size Time Constraints (limite di tempo tra due transazioni) <I 1, I 2,, I n > è contenuta in <J 1, J 2,, J m > se esistono h 1 < < h n per cui I 1 J h1,, I n J hn mingap < transaction-time(j hi ) - transaction-time(j hi-1 ) < maxgap Tan,Steibach, Kumar & Integration by (Giannott&Nanni) per i DM2 = 2..n 2013-2014 # 46

Sequential Pattern Mining: Cases and Parameters Duration of a time sequence T Sequential pattern mining can then be confined to the data within a specified duration Ex. Subsequence corresponding to the year of 1999 Ex. Partitioned sequences, such as every year, or every week after stock crashes, or every two weeks before and after a volcano eruption Event folding window w If w = T, time-insensitive frequent patterns are found If w = 0 (no event sequence folding), sequential patterns are found where each event occurs at a distinct time instant If 0 < w < T, sequences occurring within the same period w are folded in the analysis 47

Sequential Pattern Mining: Cases and Parameters Time interval, int, between events in the discovered pattern int = 0: no interval gap is allowed, i.e., only strictly consecutive sequences are found u Ex. Find frequent patterns occurring in consecutive weeks min_int int max_int: find patterns that are separated by at least min_int but at most max_int u Ex. If a person rents movie A, it is likely she will rent movie B within 30 days (int 30) int = c 0: find patterns carrying an exact interval u Ex. Every time when Dow Jones drops more than 5%, what will happen exactly two days later? (int = 2) 48

Aspetti Computazionali Mail Order: Clothes 16.000 items 2.900.000 transazioni 214.000 clienti 10 anni Algoritmo GSP (Shrikant e Agrawal) su IBM RS/6000 250 60 50 40 30 20 10 0 1 0,5 0,25 0,2 0,15 0,1 49