Lift-based search for significant dependencies in dense data sets

W. Hämäläinen
Department of Computer Science, University of Helsinki, Finland
whamalai@cs.helsinki.fi

StReBio '09 (KDD '09)
1 Problem

Find a good set of rules X → A which express positive dependence also in the future data!

R = {A_1, ..., A_k} = the set of all attributes, where each A_i ∈ R is binary (binarized); X ⊆ R and A ∈ R \ X.

1. P(XA) > P(X)P(A) (positive dependence)
2. the dependence is genuine (holds in the future data): statistical significance tests, cross-validation
3. redundant rules are pruned
1.1 Positive dependence

Lift: $\gamma(X,A) = \frac{P(XA)}{P(X)P(A)} = \frac{P(A \mid X)}{P(A)} > 1$

If the rule also has high confidence, cf = P(A|X) > P(A) (in the future data), it suits prediction.

Independence rules, where P(A|X) = P(A), are trivial (useless for predicting A). Negative dependencies, P(A|X) < P(A), are harmful for predicting A.
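A minimal sketch of these two quantities computed from absolute frequencies (the function names and toy counts are illustrative, not from the talk):

```python
def lift(n, m_x, m_a, m_xa):
    """gamma(X,A) = P(XA) / (P(X)P(A)), probabilities estimated from counts."""
    return (m_xa / n) / ((m_x / n) * (m_a / n))

def confidence(m_x, m_xa):
    """cf = P(A|X) = m(XA) / m(X)."""
    return m_xa / m_x

# Toy example: n = 100 rows, m(X) = m(A) = m(XA) = 20
print(lift(100, 20, 20, 20))   # 5.0 -> strong positive dependence
print(confidence(20, 20))      # 1.0
```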
If cf is low, the rule can still be important for predictive models, e.g. it reveals (undesired) dependencies between variables, and it is always useful for descriptive purposes.

Traditional frequency-based methods often find independence rules or even negative dependency rules in dense data sets!
Example: Most general significant rules in Chess

[figure: enumeration tree of the most general significant rules in the Chess data, with node frequencies and markings for minimal (MIN) and redundant (R) sets; omitted]
1.2 Pruning rules

The number of rules can be too large!
- computational burden (time & space requirements)
- the user cannot scan through all rules
- simple rules avoid over-fitting (Occam's Razor principle)

Search only non-redundant rules!
Redundancy (classically)

Depends on the goodness measure M; there are several definitions! A rule or set is redundant if it contains useless attributes (which at most decrease the goodness).

If M is increasing:
- Set X is redundant if there is Y ⊊ X such that M(Y) ≥ M(X).
- Rule X → A is redundant if there is Y ⊊ X such that M(Y → A) ≥ M(X → A).
Redundancy (here)

Definition 1. Set X is redundant if there is Y ⊊ X such that M(bestRule(X)) ≤ M(bestRule(Y)). Rule X \ {A} → A is redundant if there is Y ⊊ X such that M(X \ {A} → A) ≤ M(Y \ {A} → A).

bestRule(X) = argmax_{A ∈ X} M(X \ {A} → A) (the best rule which can be constructed from X).

E.g. BC → A can be redundant with respect to B → A or C → A.
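A minimal sketch of bestRule(X) under this definition; the count and measure callbacks and all names are illustrative assumptions, not StatApriori's actual interface:

```python
def best_rule(x, n, count, measure):
    """argmax over A in X of M(X \\ {A} -> A).

    x: tuple of attributes; count(S) returns the absolute frequency m(S);
    measure(n, m_x, m_a, m_xa) is any goodness measure M (e.g. a z-score).
    """
    best, best_value = None, float("-inf")
    for a in x:
        antecedent = tuple(sorted(set(x) - {a}))
        value = measure(n, count(antecedent), count((a,)), count(tuple(sorted(x))))
        if value > best_value:
            best, best_value = (antecedent, a), value
    return best, best_value
```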
Why this definition?

- The best rules under this definition are also among the best classically non-redundant rules!
- computationally fast & memory friendly
- significant rules are often permutations of each other
- the algorithm can be applied to the classical definition, but that is computationally more difficult (not tested yet)
1.3 Statistical significance

Idea: if X → A expresses positive dependence in the sample data, what is the probability that it has occurred by chance (i.e. that X and A were actually independent)?

Let m(XA) = n·P(XA) (the absolute frequency).

p-value = the probability that XA occurs at least m(XA) times in a data set r, |r| = n, if P(XA) = P(X)P(A) (independence).

If p is very low, X → A is likely to be genuine.
How to estimate p?

Binomial probability:

$$p = \sum_{i=m(XA)}^{n} \binom{n}{i} \big(P(X)P(A)\big)^{i} \big(1 - P(X)P(A)\big)^{n-i}$$

= the probability that XA occurs at least m(XA) times in the whole data of size n.
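A sketch of this exact binomial tail probability via scipy's survival function (assuming scipy is available; the toy counts are the same as before):

```python
from scipy.stats import binom

def binomial_p_value(n, m_x, m_a, m_xa):
    p0 = (m_x / n) * (m_a / n)        # P(X)P(A): expected P(XA) under independence
    return binom.sf(m_xa - 1, n, p0)  # sf(k) = P(K > k), so P(K >= m(XA)) = sf(m(XA) - 1)

print(binomial_p_value(100, 20, 20, 20))  # very small (~1e-8) -> hardly a chance artefact
```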
Alternatively (not suitable):

$$p_2 = \sum_{i=m(XA)}^{m(X)} \binom{m(X)}{i} P(A)^{i} \big(1 - P(A)\big)^{m(X)-i}$$

= the probability that A occurs at least m(XA) times on the rows where X is true. The problem: rules with different X cannot be compared!
z-score

The exact p is computationally difficult! It can be estimated by the z-score:

$$z(X,A) = \frac{m(XA) - nP(X)P(A)}{\sqrt{nP(X)P(A)\big(1 - P(X)P(A)\big)}} = \frac{\sqrt{nP(XA)}\,\big(\gamma(X,A) - 1\big)}{\sqrt{\gamma(X,A) - P(XA)}}$$

Now p ≈ 1 − Φ(z(X,A)), where Φ is the standard normal cumulative distribution function.
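A sketch of the z-score and its normal approximation of p (assuming scipy for Φ; names are illustrative). With the toy counts used above it reproduces the z = 8.2 seen later in the simulation:

```python
from math import sqrt
from scipy.stats import norm

def z_score(n, m_x, m_a, m_xa):
    expected = n * (m_x / n) * (m_a / n)              # n * P(X) * P(A)
    variance = expected * (1 - (m_x / n) * (m_a / n)) # binomial variance
    return (m_xa - expected) / sqrt(variance)

z = z_score(100, 20, 20, 20)
print(z)            # 8.16... ~ 8.2
print(norm.sf(z))   # p ~ 1 - Phi(z)
```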
Using the z-score

- z can be used as a ranking function as such!
- z is a monotonically increasing function of m(XA) and γ: it suits branch-and-bound search
- works well when the expected counts m(X)P(A) are sufficiently large (e.g. ≥ 5); when m(X)P(A) is small, z is over-optimistic
- other functions might work better for search purposes; the measure function should be a monotonically increasing or decreasing function of m(XA) and γ(X,A)
2 Searching significant rules

All possible attribute sets can be listed by an enumeration tree:

[figure: enumeration tree over the example attributes with node frequencies; omitted]
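A minimal sketch of such an enumeration tree: each node extends its parent set only with attributes that come later in a fixed order, so every attribute set is listed exactly once (the generator below is illustrative; the real algorithm materializes and prunes the tree):

```python
def enumerate_sets(attributes, prefix=()):
    """Yield all non-empty attribute sets in enumeration-tree (depth-first) order."""
    for i, a in enumerate(attributes):
        node = prefix + (a,)
        yield node
        # children may only add attributes that follow a in the fixed order
        yield from enumerate_sets(attributes[i + 1:], node)

print(list(enumerate_sets(("A", "B", "C"))))
# [('A',), ('A','B'), ('A','B','C'), ('A','C'), ('B',), ('B','C'), ('C',)]
```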
2.1 How to traverse the tree?

Given set X, we want an upper bound for M(bestRule(XQ)):

- m(XQ) ≤ m(X) always
- γ(bestRule(XQ)) ≤ 1/P(A_min), where P(A_min) = min{P(A_i) | A_i ∈ XQ}, because

$$\gamma(XQ \setminus A_i \rightarrow A_i) = \frac{P(XQ)}{P(XQ \setminus A_i)\,P(A_i)} \le \frac{1}{P(A_i)} \le \frac{1}{P(A_{min})}$$
Using the upper bound U(M(bestRule(XQ))): when min{P(A_i) | A_i ∈ XQ} = min{P(A_j) | A_j ∈ X}, we have U(M(bestRule(XQ))) ≤ U(M(bestRule(X))).

1. If the upper bound U(M(bestRule(XQ))) < min_M, all rules of XQ are insignificant.
2. If U(M(bestRule(XQ))) ≤ max{M(bestRule(Y)) | Y ⊆ X}, all rules of XQ are redundant.
3. If bestRule(X) has the maximal lift P(A_min)^{-1}, it is minimal, and all more specific rules will be redundant.
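A branch-and-bound sketch of conditions 1 and 2 with the z-score as M: the most optimistic rule under X has m(XA) = m(X) and γ = 1/P(A_min). All names are assumptions for illustration; with the toy values n = 100, m(X) = 20, P(A_min) = 0.2 the bound equals the best achievable z = 8.2.

```python
from math import sqrt

def upper_bound_z(n, m_x, p_a_min):
    """U(M(bestRule(X))) for M = z: assume m(XA) = m(X) and gamma = 1/P(A_min)."""
    fr = m_x / n
    gamma = 1.0 / p_a_min
    # z = sqrt(n * P(XA)) * (gamma - 1) / sqrt(gamma - P(XA))
    return sqrt(n * fr) * (gamma - 1) / sqrt(gamma - fr)

def can_prune(n, m_x, p_a_min, min_m, best_found):
    u = upper_bound_z(n, m_x, p_a_min)
    return u < min_m or u <= best_found  # condition 1 or condition 2

print(upper_bound_z(100, 20, 0.2))  # 8.16... ~ 8.2
```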
Property PS = potentially significant: PS(X) holds iff U(M(bestRule(X))) ≥ min_M.

The property is monotonic if we traverse the tree in a certain order! Meaning: if even one of Y's parents is not PS, or is minimal, then Y (and its children) cannot be both non-redundant and PS, so Y can be pruned.
Traversal order

- attributes are in descending order
- search top-down, from right to left
- both the frequencies and the maximum lifts can only decrease
- parent sets X always have better upper bounds than their children XQ have!
Relations of PS sets

t_i = sets under A_i; t_ij = sets under A_iA_j; P(A_i) ≤ ... ≤ P(A_{j-1}) ≤ P(A_j)

[figure: containment relations among the collections t_i, t_ij, ..., t_j; omitted]
Frequency counting

- the data itself can be used to initialize the tree
- later, frequencies can be counted from the tree (no need to check the original data anymore)
Frequency tree for the data

[figure: frequency tree built from the example data, with attribute-labelled nodes and their counts; omitted]
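A sketch of this idea with a plain prefix tree: rows are inserted in a canonical attribute order, and m(X) for any set X can afterwards be read from the tree alone. The dict-based node layout is an assumption for illustration, not the paper's data structure.

```python
def build_tree(rows):
    """Insert each row (an iterable of attribute names) as a sorted path."""
    root = {"count": 0, "children": {}}
    for row in rows:
        root["count"] += 1
        node = root
        for a in sorted(row):
            node = node["children"].setdefault(a, {"count": 0, "children": {}})
            node["count"] += 1
    return root

def count(node, itemset):
    """m(X): rows whose attribute set contains itemset (a sorted tuple)."""
    if not itemset:
        return node["count"]
    total = 0
    for a, child in node["children"].items():
        if a == itemset[0]:
            total += count(child, itemset[1:])
        elif a < itemset[0]:          # itemset[0] may still occur deeper
            total += count(child, itemset)
    return total

tree = build_tree([("A", "B"), ("A", "B", "C"), ("B",), ("C",)])
print(count(tree, ("A", "B")))   # 2
print(count(tree, ("B",)))       # 3
```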
Pruning attributes

- checking all 2-sets can prune out low-frequency attributes
- the maximal U(γ) values are decreased
- attribute A can be pruned if for all A_i ≠ A: M(m(AA_i), min{P(A), P(A_i)}^{-1}) < min_M
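A sketch of this initial pass, reusing the upper_bound_z bound from the earlier sketch (pair_counts and the other names are illustrative):

```python
def attribute_prunable(a, attributes, n, counts, pair_counts, min_m):
    """True if no pair {a, a_i} can reach min_m even in the best case."""
    for a_i in attributes:
        if a_i == a:
            continue
        p_min = min(counts[a], counts[a_i]) / n
        m_pair = pair_counts.get(frozenset((a, a_i)), 0)
        if upper_bound_z(n, m_pair, p_min) >= min_m:
            return False    # this pair may still produce a significant rule
    return True
```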
3 Simulation

The search is simulated step by step on the example data, starting from its frequency tree (as shown earlier). Only the recoverable notes of the step figures are kept below; the tree snapshots themselves are omitted.

Simulation step 1: m(FA_i) = 1 for all A_i ≠ F.
Simulation step 2: a set is added with z = 8.2 and marked MIN (minimal).
Simulation step 3: a set is added with z = 2.1.
Simulation step 4: one candidate set is not created!
Simulation step 5: a set is added with z = 2.9.
Simulation step 6: a set is removed.
Simulation step 7: one set is removed and one is added (z = 3.8, z = 1.2); two rules with z = 3.8 are found.
Simulation step 8: two sets are added; both have z < 0.
Simulation step 9: both new sets have z = 2.7.
Simulation step 10: one set is added and then removed (fr = 0); another is added with z = 2.3.
Simulation step 11: a set is removed; a rule with antecedent A is found (z = 4.4).
Simulation step 12: z = 2.3.
Simulation: final result

The best rules found (the attribute labels were only in the omitted figure):

z = 8.2, cf = 1.0, fr = 0.20, γ = 5.0
z = 4.4, cf = 1.0, fr = 0.20, γ = 2.5 (the rule with antecedent A)
z = 3.8, cf = 0.5, fr = 0.15, γ = 2.5
z = 3.8, cf = 0.5, fr = 0.15, γ = 2.5
4 Experiments: Goals

- Quality of rules compared to traditional methods: what can we gain when min_fr is not used?
- Performance: how fast is it? How complex data sets can we handle?
Proportions of useful and harmful rules

A rule is
- at least slightly useful, if it expresses a positive dependency in the test data
- useful, if it expresses a clear positive dependency (requirement: z ≥ 1)
- at least slightly harmful, if it expresses a negative dependency in the test data
- harmful, if it expresses a clear negative dependency (requirement: z ≤ −1)
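A sketch of this labelling as code, assuming z is computed on the test data (the thresholds are the slide's |z| ≥ 1; the label strings are illustrative):

```python
def rule_quality(z_test):
    """Label a rule by the sign and strength of its dependence in the test data."""
    if z_test >= 1:
        return "useful"
    if z_test > 0:
        return "slightly useful"
    if z_test <= -1:
        return "harmful"
    if z_test < 0:
        return "slightly harmful"
    return "independent"    # z == 0: no dependence either way
```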
Data sets

Biological + medical + Chess as a pathological case.

Set        n      k
Heart      17     23
HeartNeg   17     46
(Garden    1340   2372)
Plants     1088   70
Mushroom   416    120
Chess      2130   76
Results

[figure: bar chart of the proportions of slightly useful, useful, slightly harmful, and harmful rules on Heart, HeartNeg, Plants, Mushroom, and Chess, for confidence thresholds cf = 0.9, 0.6, and 0.0 (each with a variant b); omitted]
Observations

- Selecting rules with U(ln p) improves the results (vs. z).
- Using min_cf in the search can distort the results: if a parent has a higher z but too low cf, the set is not pruned as redundant.
- If min_cf is not used in the search (only in the end), the number of rules can be too small, but this is often still the better approach (smaller prediction error in the test sets).
Comparison to traditional frequency-based search

Search with as low a min_fr as possible + pruning with different measure functions.

Set        min_fr
Heart      0.0
HeartNeg   0.32
Plants     0.12
Mushroom   0.22
Chess      0.7
Results

[figure: proportions of useful and harmful rules when cf = 0.9, comparing the measure functions χ², J, z, and fr on Heart, HeartNeg, Plants, Mushroom, and Chess; omitted]
Results

[figure: proportions of useful and harmful rules when cf = 0.6, comparing the measure functions χ², J, z, and fr on Heart, HeartNeg, Plants, Mushroom, and Chess; omitted]
5 Conclusions

- Both DeepBlue and StatApriori are useful when nothing else works (dense data)!
- They find genuine dependencies without minimum frequencies or other restrictions → interesting new information.
- DeepBlue can solve problems which are infeasible with traditional approaches... but the newest version of StatApriori is even faster.
- Useful theoretical properties; these may apply to searching general association rules.
6 Future research

- non-redundant rules when the consequent is taken into account + comparison
- negative dependencies X → ¬A
- rules between sets, X → Y with |Y| > 1 (general association rules)
- new application areas (do you have interesting data?)
Are you interested in collecting biodiversity data?

- the goal is to collect a large database of naturally occurring plant combinations
- location information can be interesting for geographical data mining
- just reading and extracting data (plant communities and associations) from texts helps
- technical support (a collecting system) is also welcome!

Contact Wilhelmiina!