Lift-based search for significant dependencies in dense data sets


W. Hämäläinen, Department of Computer Science, University of Helsinki, Finland
StReBio 09 (KDD 09) p.1/48

1 Problem
Find a good set of rules $X \to A$ which express positive dependence also in future data!
$R = \{A_1, \ldots, A_k\}$ is the set of all attributes, where each $A_i \in R$ is binary (binarized), $X \subseteq R$ and $A \in R$.
1. $P(XA) > P(X)P(A)$ (positive dependence)
2. the dependence is genuine (holds in future data): statistical significance tests, cross-validation
3. redundant rules are pruned

1.1 Positive dependence
Lift: $\gamma(X,A) = \frac{P(XA)}{P(X)P(A)} = \frac{P(A \mid X)}{P(A)} > 1$
- if the rule has high confidence $cf = P(A \mid X) > P(A)$ (in the future data), it is suitable for prediction
- independence rules, where $P(A \mid X) = P(A)$, are trivial (useless for predicting $A$)
- negative dependencies, $P(A \mid X) < P(A)$, are harmful for predicting $A$
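The definitions of frequency, confidence and lift can be checked on a small example. The following is a minimal sketch, assuming each data row is represented as the set of attributes that are true on it; the function name `rule_stats` and the toy data are hypothetical and not part of the slides.

```python
from typing import Sequence, Set, Tuple


def rule_stats(rows: Sequence[Set[str]], x: Set[str], a: str) -> Tuple[float, float, float]:
    """Return (frequency P(XA), confidence P(A|X), lift) of the rule X -> A."""
    n = len(rows)
    m_x = sum(1 for r in rows if x <= r)             # rows where every attribute of X is true
    m_a = sum(1 for r in rows if a in r)             # rows where A is true
    m_xa = sum(1 for r in rows if x <= r and a in r)  # rows where X and A are both true
    if m_x == 0 or m_a == 0:
        raise ValueError("lift is undefined when P(X) = 0 or P(A) = 0")
    cf = m_xa / m_x                                  # P(A | X)
    lift = cf / (m_a / n)                            # P(A | X) / P(A)
    return m_xa / n, cf, lift


# Hypothetical toy data: 10 rows over the attributes A, B and C.
rows = [{"A", "B"}] * 4 + [{"A"}] * 1 + [{"B"}] * 1 + [{"C"}] * 4
fr, cf, lift = rule_stats(rows, {"A"}, "B")
print(fr, cf, lift)
```

A lift above 1 marks positive dependence, a lift of exactly 1 marks an independence rule, and a lift below 1 marks negative dependence.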

- if $cf$ is low, the rule can still be important for predictive models, e.g. it reveals (undesired) dependencies between variables
- always useful for descriptive purposes
Traditional frequency-based methods often find independence rules or even negative dependency rules in dense data sets!

Example: Most general significant rules in Chess
[Figure: diagram of the rules; node labels are not recoverable from the transcript]

1.2 Pruning rules
The number of rules can be too large!
- computational burden (time and space requirements)
- the user cannot scan through all rules
- simple rules avoid over-fitting (Occam's Razor principle)
Search only for non-redundant rules!

Redundancy (classically)
Depends on the goodness measure $M$. Several definitions!
A rule or set is redundant if it contains useless attributes (which at most decrease the goodness).
If $M$ is increasing:
- Set $X$ is redundant if $\exists Y \subsetneq X$ such that $M(Y) \geq M(X)$.
- Rule $X \to A$ is redundant if $\exists Y \subsetneq X$ such that $M(Y \to A) \geq M(X \to A)$.

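Redundancy (here)
Definition 1.
- Set $X$ is redundant if $\exists Y \subsetneq X$ such that $M(\mathrm{BestRule}(X)) \leq M(\mathrm{BestRule}(Y))$.
- Rule $X \setminus A \to A$ is redundant if $\exists Y \subsetneq X$ such that $M(X \setminus A \to A) \leq M(Y \setminus B \to B)$ for some $B \in Y$.
$\mathrm{BestRule}(X) = \arg\max_{A \in X} \{M(X \setminus A \to A)\}$ (the best rule which can be constructed from $X$)
e.g. a rule can be redundant with respect to a more general rule that has a different consequent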
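The set-based definition can be sketched in a few lines. This is only an illustration under stated assumptions: the measure $M$ is passed in as a function of (antecedent, consequent), larger values are better, proper subsets with fewer than two attributes are skipped because they cannot form a rule, and the score table is invented for the example.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Tuple

Measure = Callable[[FrozenSet[str], str], float]  # M(antecedent, consequent)


def best_rule(x: FrozenSet[str], m: Measure) -> Tuple[float, FrozenSet[str], str]:
    """BestRule(X): the rule X minus {A} -> A with the highest M over all A in X."""
    candidates = ((m(x - {a}, a), x - {a}, a) for a in sorted(x))
    return max(candidates, key=lambda t: t[0])


def is_redundant(x: FrozenSet[str], m: Measure) -> bool:
    """X is redundant if some proper subset Y has an equally good or better best rule."""
    score = best_rule(x, m)[0]
    for size in range(2, len(x)):
        for y in combinations(sorted(x), size):
            if best_rule(frozenset(y), m)[0] >= score:
                return True
    return False


# Hypothetical scores, keyed by (sorted antecedent, consequent).
scores = {
    ("A", "B"): 2.0, ("B", "A"): 2.0, ("A", "C"): 1.0,
    ("C", "A"): 1.0, ("B", "C"): 1.5, ("C", "B"): 1.5,
    ("AB", "C"): 1.8, ("AC", "B"): 2.5, ("BC", "A"): 1.2,
}


def toy_measure(antecedent: FrozenSet[str], consequent: str) -> float:
    return scores[("".join(sorted(antecedent)), consequent)]


abc = frozenset("ABC")
print(best_rule(abc, toy_measure), is_redundant(abc, toy_measure))
```

With these invented scores the best rule of ABC is AC -> B, which beats every rule built from a two-attribute subset, so ABC is not redundant.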

Why this definition?
- the resulting rules are the best among the classically non-redundant rules!
- computationally fast and memory friendly
- significant rules are often permutations of each other
- the algorithm can be applied to the classical definition, but that is computationally more difficult (not tested yet)

1.3 Statistical significance
Idea: If $X \to A$ expresses positive dependence in the sample data, what is the probability that it has occurred by chance (i.e. that $X$ and $A$ were actually independent)?
Let $m(XA) = n \cdot P(XA)$ (absolute frequency).
p-value = probability that $XA$ occurs at least $m(XA)$ times in a data set $r$, $|r| = n$, if $P(XA) = P(X)P(A)$ (independence)
If $p$ is very low, $X \to A$ is likely to be genuine.

How to estimate p?
Binomial probability:
$p = \sum_{i=m(XA)}^{n} \binom{n}{i} (P(X)P(A))^{i} (1 - P(X)P(A))^{n-i}$
the probability that $XA$ occurs at least $m(XA)$ times in the whole data of size $n$
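The binomial tail can be evaluated directly for small $n$. The sketch below assumes the probabilities are estimated from absolute counts, so that $P(X) = m(X)/n$ and $P(A) = m(A)/n$; the function name `binomial_p` and the example counts are hypothetical.

```python
from math import comb


def binomial_p(n: int, m_x: int, m_a: int, m_xa: int) -> float:
    """P(XA occurs at least m_xa times in n rows) when X and A are independent."""
    q = (m_x / n) * (m_a / n)  # P(X)P(A): probability of XA on one row under independence
    return sum(comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(m_xa, n + 1))


# Hypothetical counts: n = 100 rows, m(X) = 30, m(A) = 40, m(XA) = 20.
print(binomial_p(100, 30, 40, 20))
```

The sum has up to $n$ terms with large binomial coefficients, which is one reason an approximation is attractive for large data sets.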

Alternatively (not suitable)
$p_2 = \sum_{i=m(XA)}^{m(X)} \binom{m(X)}{i} P(A)^{i} (1 - P(A))^{m(X)-i}$
the probability that $A$ occurs at least $m(XA)$ times on rows where $X$ is true
Rules with different $X$ cannot be compared!

z-score
$p$ is computationally difficult! It can be estimated by the z-score:
$z(X,A) = \frac{m(XA) - nP(X)P(A)}{\sqrt{nP(X)P(A)(1 - P(X)P(A))}} = \frac{\sqrt{nP(XA)}\,(\gamma(X,A) - 1)}{\sqrt{\gamma(X,A) - P(XA)}}$
Now $p \approx 1 - \Phi(z(X,A))$, where $\Phi$ is the standard normal cumulative distribution function.
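The two forms of the z-score can be compared numerically. This is a minimal sketch with hypothetical function names and counts; it computes $z$ once from the counts and once from $P(XA)$ and the lift, and approximates $p$ with the normal tail via `math.erf`.

```python
from math import erf, sqrt


def z_counts(n: int, m_x: int, m_a: int, m_xa: int) -> float:
    """z from the observed count and its expectation under independence."""
    q = (m_x / n) * (m_a / n)
    return (m_xa - n * q) / sqrt(n * q * (1 - q))


def z_lift(n: int, m_x: int, m_a: int, m_xa: int) -> float:
    """The same z written in terms of P(XA) and the lift gamma."""
    p_xa = m_xa / n
    gamma = p_xa / ((m_x / n) * (m_a / n))
    return sqrt(n * p_xa) * (gamma - 1) / sqrt(gamma - p_xa)


def normal_tail(z: float) -> float:
    """1 - Phi(z) for the standard normal distribution."""
    return 0.5 * (1 - erf(z / sqrt(2)))


# Hypothetical counts: n = 100 rows, m(X) = 30, m(A) = 40, m(XA) = 20.
z1 = z_counts(100, 30, 40, 20)
z2 = z_lift(100, 30, 40, 20)
print(z1, z2, normal_tail(z1))
```

Both functions return the same value up to rounding, which follows from substituting $P(X)P(A) = P(XA)/\gamma$ into the first form.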

Using z-score
- $z$ can be used as a ranking function as such!
- $z$ is a monotonically increasing function of $m(XA)$ and $\gamma$, so it is suitable for branch-and-bound search
- works well when the expected counts $m(X)P(A)$ are sufficiently large
- when $m(X)P(A)$ is small, $z$ is over-optimistic
- other functions might work better for search purposes
- the measure function should be a monotonically increasing or decreasing function of $m(XA)$ and $\gamma(X,A)$

2 Searching significant rules
All possible attribute sets can be listed by an enumeration tree:
[Figure: enumeration tree; node labels are not recoverable from the transcript]
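Since the figure is lost, the idea can be illustrated with a standard set-enumeration tree, which may differ in layout from the tree on the slide. In this sketch each node extends its parent only with attributes that come later in a fixed order, so every non-empty attribute set appears exactly once; the function name `enumerate_sets` is hypothetical.

```python
from typing import Iterator, List, Tuple


def enumerate_sets(
    attrs: List[str], prefix: Tuple[str, ...] = (), start: int = 0
) -> Iterator[Tuple[str, ...]]:
    """Depth-first walk of a set-enumeration tree over the ordered attributes."""
    for i in range(start, len(attrs)):
        node = prefix + (attrs[i],)
        yield node                                    # visit the node itself
        yield from enumerate_sets(attrs, node, i + 1)  # then its children


sets = list(enumerate_sets(["A", "B", "C"]))
print(sets)
# [('A',), ('A', 'B'), ('A', 'B', 'C'), ('A', 'C'), ('B',), ('B', 'C'), ('C',)]
```

For $k$ attributes the tree has $2^k - 1$ nodes, which is why the search depends on pruning whole subtrees.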

2.1 How to traverse the tree?
Given set $X$ we want to know an upper bound for $M(\mathrm{BestRule}(XQ))$.
- $m(XQ) \leq m(X)$ always
- $\gamma(\mathrm{BestRule}(XQ)) \leq \frac{1}{P(A_{min})}$, where $P(A_{min}) = \min\{P(A_i) \mid A_i \in XQ\}$, because
$\gamma(XQ \setminus A_i \to A_i) = \frac{P(XQ)}{P(XQ \setminus A_i)P(A_i)} \leq \frac{1}{P(A_i)} \leq \frac{1}{P(A_{min})}$
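The lift bound can be checked on toy data. The sketch below assumes rows are sets of true attributes and uses invented data; it computes the lift of every rule that can be built from a set $XQ$ and compares it with $1/P(A_{min})$. All function names are hypothetical.

```python
from typing import Sequence, Set


def prob(rows: Sequence[Set[str]], attrs: Set[str]) -> float:
    """Relative frequency of rows on which all the given attributes are true."""
    return sum(1 for r in rows if attrs <= r) / len(rows)


def lift(rows: Sequence[Set[str]], antecedent: Set[str], consequent: str) -> float:
    joint = prob(rows, antecedent | {consequent})
    return joint / (prob(rows, antecedent) * prob(rows, {consequent}))


def lift_upper_bound(rows: Sequence[Set[str]], xq: Set[str]) -> float:
    """1 / P(A_min), where A_min is the rarest attribute of XQ."""
    return 1.0 / min(prob(rows, {a}) for a in xq)


# Hypothetical data: P(A) = 0.7, P(B) = 0.4, P(C) = 0.5.
rows = [{"A", "B", "C"}] * 2 + [{"A", "B"}] * 2 + [{"A"}] * 3 + [{"C"}] * 3
xq = {"A", "B", "C"}
bound = lift_upper_bound(rows, xq)
lifts = {a: lift(rows, xq - {a}, a) for a in sorted(xq)}
print(bound, lifts)
```

In this example the rule AC -> B reaches the bound, because B is the rarest attribute and the rule has confidence 1.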

$U(M(\mathrm{BestRule}(XQ)))$
$\min\{P(A_i) \mid A_i \in XQ\} \geq \min\{P(A_j) \mid A_j \in X\} \Rightarrow U(M(\mathrm{BestRule}(XQ))) \leq U(M(\mathrm{BestRule}(X)))$
1. if the upper bound $U(M(\mathrm{BestRule}(XQ))) < \min_M$, the rules of $XQ$ are insignificant
2. if $U(M(\mathrm{BestRule}(XQ))) \leq \max\{M(\mathrm{BestRule}(Y)) \mid Y \subseteq X\}$, the rules of $XQ$ are redundant
3. if $\mathrm{BestRule}(X)$ has the maximal lift $P(A_{min})^{-1}$, it is minimal and all more specific rules will be redundant
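The three conditions can be combined into one decision. The following is a sketch under stated assumptions, not the algorithm from the talk: it takes the z-score as the measure $M$, obtains the upper bound by evaluating $z$ at the largest possible frequency $m(X)$ and the largest possible lift $1/P(A_{min})$, and uses hypothetical names and thresholds.

```python
from math import sqrt


def z_upper_bound(n: int, m_x: int, p_min: float) -> float:
    """z evaluated at frequency m(X) and lift 1 / p_min (z increases in both)."""
    p = m_x / n
    gamma = 1.0 / p_min
    return sqrt(n * p) * (gamma - 1) / sqrt(gamma - p)


def prune_decision(upper: float, min_m: float, best_parent_m: float,
                   parent_is_minimal: bool) -> str:
    """Apply conditions 1 to 3 in order and report why a set can be skipped."""
    if upper < min_m:
        return "insignificant"   # condition 1: cannot reach the threshold
    if upper <= best_parent_m:
        return "redundant"       # condition 2: cannot beat a more general rule
    if parent_is_minimal:
        return "redundant"       # condition 3: the parent rule already has maximal lift
    return "keep"


# Hypothetical node: n = 100 rows, m(X) = 20, rarest attribute has P = 0.2.
u = z_upper_bound(100, 20, 0.2)
print(u, prune_decision(u, min_m=2.0, best_parent_m=3.8, parent_is_minimal=False))
```

Here `p_min` must be the minimum over the attributes of the extended set $XQ$, while `m_x` is the frequency of the parent $X$.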

Property PS
PS = potentially significant: $PS(X) \Leftrightarrow U(M(\mathrm{BestRule}(X))) \geq \min_M$
The property is monotonic if we traverse the tree in a certain order!
Meaning: if even one of $Y$'s parents is not PS or is minimal, $Y$ (or its children) cannot be non-redundant PS.
$\Rightarrow$ $Y$ can be pruned

Traversal order
- attributes are in descending order of frequency
- search top down, from right to left
- both frequencies and maximum lifts can only decrease
- parent sets $X$ always have better upper bounds than their children $XQ$!

Relations of PS sets
$t_i$ = sets under $A_i$
$t_i t_j$ = sets under $A_i A_j$
$P(A_i) \geq \ldots \geq P(A_{j-1}) \geq P(A_j)$
[Figure: diagram relating the subtrees $t_i$ and $t_j$; not recoverable from the transcript]

Frequency counting
- the data itself can be used to initialize the tree
- later, frequencies can be counted from the tree (no need to check the original data anymore)

Frequency tree for data
[Figure: frequency tree of the example data set; node labels and counts are not recoverable from the transcript]

Pruning attributes
- checking all 2-sets can prune out low-frequency attributes
- maximal $U(\gamma)$s are decreased
- $A$ can be pruned if for all $A_i \neq A$: $M(m(AA_i), \min\{P(A), P(A_i)\}^{-1}) < \min_M$
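The 2-set check can be sketched as follows. This illustration assumes the z-score as the measure $M$, evaluated at the pair frequency $m(AA_i)$ and the largest lift the pair allows; the counts, the threshold and the function names are hypothetical.

```python
from math import sqrt
from typing import Dict, FrozenSet, Set


def z_from(n: int, m_xa: int, gamma: float) -> float:
    """z-score as a function of the joint count and the lift."""
    p = m_xa / n
    if gamma - p <= 0:  # degenerate case: both attributes true on every row
        return 0.0
    return sqrt(n * p) * (gamma - 1) / sqrt(gamma - p)


def prunable_attributes(n: int, m_single: Dict[str, int],
                        m_pair: Dict[FrozenSet[str], int], min_z: float) -> Set[str]:
    """Attributes whose bound stays below min_z with every other attribute."""
    pruned = set()
    for a in m_single:
        bounds = []
        for b in m_single:
            if b == a:
                continue
            m_ab = m_pair.get(frozenset((a, b)), 0)
            gamma_max = n / min(m_single[a], m_single[b])  # 1 / min{P(A), P(A_i)}
            bounds.append(z_from(n, m_ab, gamma_max))
        if all(u < min_z for u in bounds):
            pruned.add(a)
    return pruned


# Hypothetical counts over n = 100 rows (all single counts are positive).
m_single = {"A": 20, "B": 20, "G": 50}
m_pair = {frozenset("AB"): 20, frozenset("AG"): 1, frozenset("BG"): 1}
print(prunable_attributes(100, m_single, m_pair, min_z=2.0))
```

In this toy data G co-occurs with A and B only once each, so its bound stays below the threshold with both partners and it can be dropped before the search starts.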

3. Simulation
[Figures: the search simulated step by step on the example frequency tree, steps 1 to 12 and the final result. Node and attribute labels are not recoverable from the transcript; only the annotations below survive.]
- Step 1: $m(FA_i) = 1$ for all $A_i \neq F$
- Step 2: a set with z = 8.2 is added (MIN)
- Steps 3 and 4: further sets are added; one node is not created
- Step 5: a set with z = 2.9 is added
- Step 6: a set is removed
- Step 7: sets are removed and added (z = 3.8 and z = 1.2); two rules are found
- Step 8: two sets are added (z = 3.8 each); for two rules z < 0
- Step 9: a set is added; two rules both have z = 2.7
- Step 10: a set is added and removed (fr = 0); a set with z = 2.3 is added
- Step 11: a set is removed; a rule with z = 4.4 is found
- Step 12: no new annotations
- Final result: four rules with z = 8.2, 4.4, 3.8 and 3.8; the first two have cf = 1.0, and the remaining cf, fr and $\gamma$ values are incomplete in the transcript

4. Experiments: Goals
- Quality of rules compared to traditional methods: what can we gain when min_fr is not used?
- Performance: how fast is it? How complex data sets can we handle?

Proportions of useful and harmful rules
A rule is
- at least slightly useful, if it expresses positive dependency in the test data
- useful, if it expresses clear positive dependency (requirement: $z \geq 1$)
- at least slightly harmful, if it expresses negative dependency in the test data
- harmful, if it expresses clear negative dependency (requirement: $z \leq -1$)
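These categories can be written as a small function of the z-score measured on the test data. This is a sketch with a hypothetical function name; it assumes that positive dependency corresponds to $z > 0$ and negative dependency to $z < 0$, that clear negative dependency means $z \leq -1$, and it returns only the strongest label that applies.

```python
def classify_rule(z_test: float) -> str:
    """Label a rule by the sign and size of its z-score on the test data."""
    if z_test >= 1:
        return "useful"            # clear positive dependency
    if z_test > 0:
        return "slightly useful"   # positive dependency, but weak
    if z_test <= -1:
        return "harmful"           # clear negative dependency
    if z_test < 0:
        return "slightly harmful"  # negative dependency, but weak
    return "independent"           # z = 0: neither useful nor harmful


for z in (2.3, 0.4, 0.0, -0.4, -2.3):
    print(z, classify_rule(z))
```

Because the categories on the slide are cumulative, a rule labelled "useful" here also counts as at least slightly useful, and likewise for the harmful labels.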

Data sets
Biological + medical + Chess as a pathological case
Sets (the $n$ and $k$ columns of the table are missing from the transcript): Heart, HeartNeg, (Garden), Plants, Mushroom, Chess

Results: Proportions of useful and harmful rules
[Figure: proportions of slightly useful, useful, slightly harmful and harmful rules for the settings cf=0.9, cf=0.9b, cf=0.6, cf=0.6b, cf=0.0 and cf=0.0b on Heart, HeartNeg, Plants, Mushroom and Chess]

Observations
- Selecting rules with $U(\ln(p))$ improves results (vs. $z$)
- Using min_cf in the search can distort results: if a parent has higher $z$ but too low cf, the set is not pruned as redundant
- if min_cf is not used in the search (only in the end), the number of rules can be too small
- often still a better approach (smaller prediction error in the test sets)

Comparison to traditional frequency-based search
with as low min_fr as possible + pruning with different measure functions
Set        min_fr
Heart      0.0
HeartNeg   0.32
Plants     0.12
Mushroom   0.22
Chess      0.7

Results: Proportions of useful and harmful rules when cf=
[Figure: proportions of slightly useful, useful, slightly harmful and harmful rules for the measures $\chi^2$, $J$, $z$ and fr on Heart, HeartNeg, Plants, Mushroom and Chess]

Results: Proportions of useful and harmful rules when cf=
[Figure: the same comparison for a second cf setting]

5. Conclusions
- both DeepBlue and StatApriori are useful when nothing else works (dense data)!
- they find genuine dependencies without minimum frequencies or other restrictions $\Rightarrow$ interesting new information
- DeepBlue can solve problems which are infeasible with traditional approaches
- ... but the newest version of StatApriori is even faster
- useful theoretical properties $\Rightarrow$ may apply to searching general association rules

6. Future research
- non-redundant rules when the consequent is taken into account + comparison
- negative dependencies $X \to \neg A$
- rules between sets $X \to Y$, $|Y| > 1$
- general association rules
- new application areas (do you have interesting data?)

Are you interested in collecting biodiversity data?
- the goal is to collect a large database of naturally occurring plant combinations
- location information can be interesting for geographical DM
- just reading and extracting data (plant communities and associations) from texts
- technical support (collecting system) is also welcome!
Contact Wilhelmiina!


More information

Distributed Computing over Communication Networks: Topology. (with an excursion to P2P)

Distributed Computing over Communication Networks: Topology. (with an excursion to P2P) Distributed Computing over Communication Networks: Topology (with an excursion to P2P) Some administrative comments... There will be a Skript for this part of the lecture. (Same as slides, except for today...

More information

Level Set Framework, Signed Distance Function, and Various Tools

Level Set Framework, Signed Distance Function, and Various Tools Level Set Framework Geometry and Calculus Tools Level Set Framework,, and Various Tools Spencer Department of Mathematics Brigham Young University Image Processing Seminar (Week 3), 2010 Level Set Framework

More information

Performance Evaluation of some Online Association Rule Mining Algorithms for sorted and unsorted Data sets

Performance Evaluation of some Online Association Rule Mining Algorithms for sorted and unsorted Data sets Performance Evaluation of some Online Association Rule Mining Algorithms for sorted and unsorted Data sets Pramod S. Reader, Information Technology, M.P.Christian College of Engineering, Bhilai,C.G. INDIA.

More information

Outline. Dispersion Bush lupine survival Quasi-Binomial family

Outline. Dispersion Bush lupine survival Quasi-Binomial family Outline 1 Three-way interactions 2 Overdispersion in logistic regression Dispersion Bush lupine survival Quasi-Binomial family 3 Simulation for inference Why simulations Testing model fit: simulating the

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

Finite Automata. Reading: Chapter 2

Finite Automata. Reading: Chapter 2 Finite Automata Reading: Chapter 2 1 Finite Automaton (FA) Informally, a state diagram that comprehensively captures all possible states and transitions that a machine can take while responding to a stream

More information

Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007

Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007 Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007 Naïve Bayes Components ML vs. MAP Benefits Feature Preparation Filtering Decay Extended Examples

More information

Data Structures. Algorithm Performance and Big O Analysis

Data Structures. Algorithm Performance and Big O Analysis Data Structures Algorithm Performance and Big O Analysis What s an Algorithm? a clearly specified set of instructions to be followed to solve a problem. In essence: A computer program. In detail: Defined

More information

Binary search algorithm

Binary search algorithm Binary search algorithm Definition Search a sorted array by repeatedly dividing the search interval in half. Begin with an interval covering the whole array. If the value of the search key is less than

More information

Verifying Extreme Rainfall Alerts for surface water flooding potential. Marion Mittermaier, Nigel Roberts and Clive Pierce

Verifying Extreme Rainfall Alerts for surface water flooding potential. Marion Mittermaier, Nigel Roberts and Clive Pierce Verifying Extreme Rainfall Alerts for surface water flooding potential Marion Mittermaier, Nigel Roberts and Clive Pierce Pluvial (surface water) flooding Why Extreme Rainfall Alerts? Much of the damage

More information

Algorithms Chapter 12 Binary Search Trees

Algorithms Chapter 12 Binary Search Trees Algorithms Chapter 1 Binary Search Trees Outline Assistant Professor: Ching Chi Lin 林 清 池 助 理 教 授 chingchi.lin@gmail.com Department of Computer Science and Engineering National Taiwan Ocean University

More information

OPTIMAL MULTI SERVER CONFIGURATION FOR PROFIT MAXIMIZATION IN CLOUD COMPUTING

OPTIMAL MULTI SERVER CONFIGURATION FOR PROFIT MAXIMIZATION IN CLOUD COMPUTING OPTIMAL MULTI SERVER CONFIGURATION FOR PROFIT MAXIMIZATION IN CLOUD COMPUTING Abstract: As cloud computing becomes more and more popular, understanding the economics of cloud computing becomes critically

More information

Ensuring Collective Availability in Volatile Resource Pools via Forecasting

Ensuring Collective Availability in Volatile Resource Pools via Forecasting Ensuring Collective Availability in Volatile Resource Pools via Forecasting Artur Andrzejak andrzejak[at]zib.de Derrick Kondo David P. Anderson Zuse Institute Berlin (ZIB) INRIA UC Berkeley Motivation

More information

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

More information

Detection of changes in variance using binary segmentation and optimal partitioning

Detection of changes in variance using binary segmentation and optimal partitioning Detection of changes in variance using binary segmentation and optimal partitioning Christian Rohrbeck Abstract This work explores the performance of binary segmentation and optimal partitioning in the

More information

A Study of Efficacy of Apex Learning Cherokee County School District

A Study of Efficacy of Apex Learning Cherokee County School District A Study of Efficacy of Apex Learning Cherokee County School District May 2015 Copyright 2015 Apex Learning Inc. Apex Learning, the Apex Learning logo, ClassTools, ClassTools Achieve, ClassTools Virtual,

More information

Lecture 3. Linear Programming. 3B1B Optimization Michaelmas 2015 A. Zisserman. Extreme solutions. Simplex method. Interior point method

Lecture 3. Linear Programming. 3B1B Optimization Michaelmas 2015 A. Zisserman. Extreme solutions. Simplex method. Interior point method Lecture 3 3B1B Optimization Michaelmas 2015 A. Zisserman Linear Programming Extreme solutions Simplex method Interior point method Integer programming and relaxation The Optimization Tree Linear Programming

More information

7 Gaussian Elimination and LU Factorization

7 Gaussian Elimination and LU Factorization 7 Gaussian Elimination and LU Factorization In this final section on matrix factorization methods for solving Ax = b we want to take a closer look at Gaussian elimination (probably the best known method

More information

Approaches to Qualitative Evaluation of the Software Quality Attributes: Overview

Approaches to Qualitative Evaluation of the Software Quality Attributes: Overview 4th International Conference on Software Methodologies, Tools and Techniques Approaches to Qualitative Evaluation of the Software Quality Attributes: Overview Presented by: Denis Kozlov Department of Computer

More information

Line and Polygon Clipping. Foley & Van Dam, Chapter 3

Line and Polygon Clipping. Foley & Van Dam, Chapter 3 Line and Polygon Clipping Foley & Van Dam, Chapter 3 Topics Viewing Transformation Pipeline in 2D Line and polygon clipping Brute force analytic solution Cohen-Sutherland Line Clipping Algorithm Cyrus-Beck

More information

MATH 140 Lab 4: Probability and the Standard Normal Distribution

MATH 140 Lab 4: Probability and the Standard Normal Distribution MATH 140 Lab 4: Probability and the Standard Normal Distribution Problem 1. Flipping a Coin Problem In this problem, we want to simualte the process of flipping a fair coin 1000 times. Note that the outcomes

More information

Calculating P-Values. Parkland College. Isela Guerra Parkland College. Recommended Citation

Calculating P-Values. Parkland College. Isela Guerra Parkland College. Recommended Citation Parkland College A with Honors Projects Honors Program 2014 Calculating P-Values Isela Guerra Parkland College Recommended Citation Guerra, Isela, "Calculating P-Values" (2014). A with Honors Projects.

More information

Dynamic Adaptive Feedback of Load Balancing Strategy

Dynamic Adaptive Feedback of Load Balancing Strategy Journal of Information & Computational Science 8: 10 (2011) 1901 1908 Available at http://www.joics.com Dynamic Adaptive Feedback of Load Balancing Strategy Hongbin Wang a,b, Zhiyi Fang a,, Shuang Cui

More information

Knowledge Acquisition Approach Based on Rough Set in Online Aided Decision System for Food Processing Quality and Safety

Knowledge Acquisition Approach Based on Rough Set in Online Aided Decision System for Food Processing Quality and Safety , pp. 381-388 http://dx.doi.org/10.14257/ijunesst.2014.7.6.33 Knowledge Acquisition Approach Based on Rough Set in Online Aided ecision System for Food Processing Quality and Safety Liu Peng, Liu Wen,

More information

Assessment of robust capacity utilisation in railway networks

Assessment of robust capacity utilisation in railway networks Assessment of robust capacity utilisation in railway networks Lars Wittrup Jensen 2015 Agenda 1) Introduction to WP 3.1 and PhD project 2) Model for measuring capacity consumption in railway networks a)

More information

CONTINGENCY (CROSS- TABULATION) TABLES

CONTINGENCY (CROSS- TABULATION) TABLES CONTINGENCY (CROSS- TABULATION) TABLES Presents counts of two or more variables A 1 A 2 Total B 1 a b a+b B 2 c d c+d Total a+c b+d n = a+b+c+d 1 Joint, Marginal, and Conditional Probability We study methods

More information

1.204 Lecture 10. Greedy algorithms: Job scheduling. Greedy method

1.204 Lecture 10. Greedy algorithms: Job scheduling. Greedy method 1.204 Lecture 10 Greedy algorithms: Knapsack (capital budgeting) Job scheduling Greedy method Local improvement method Does not look at problem globally Takes best immediate step to find a solution Useful

More information

Dynamic Trust Management for the Internet of Things Applications

Dynamic Trust Management for the Internet of Things Applications Dynamic Trust Management for the Internet of Things Applications Fenye Bao and Ing-Ray Chen Department of Computer Science, Virginia Tech Self-IoT 2012 1 Sept. 17, 2012, San Jose, CA, USA Contents Introduction

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution

More information

8.1 Min Degree Spanning Tree

8.1 Min Degree Spanning Tree CS880: Approximations Algorithms Scribe: Siddharth Barman Lecturer: Shuchi Chawla Topic: Min Degree Spanning Tree Date: 02/15/07 In this lecture we give a local search based algorithm for the Min Degree

More information

PREDICTIVE MODELING OF INTER-TRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE

PREDICTIVE MODELING OF INTER-TRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE International Journal of Computer Science and Applications, Vol. 5, No. 4, pp 57-69, 2008 Technomathematics Research Foundation PREDICTIVE MODELING OF INTER-TRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE

More information

Loop Invariants and Binary Search

Loop Invariants and Binary Search Loop Invariants and Binary Search Chapter 4.3.3 and 9.3.1-1 - Outline Ø Iterative Algorithms, Assertions and Proofs of Correctness Ø Binary Search: A Case Study - 2 - Outline Ø Iterative Algorithms, Assertions

More information

Multinomial and Ordinal Logistic Regression

Multinomial and Ordinal Logistic Regression Multinomial and Ordinal Logistic Regression ME104: Linear Regression Analysis Kenneth Benoit August 22, 2012 Regression with categorical dependent variables When the dependent variable is categorical,

More information

Discovering Local Subgroups, with an Application to Fraud Detection

Discovering Local Subgroups, with an Application to Fraud Detection Discovering Local Subgroups, with an Application to Fraud Detection Abstract. In Subgroup Discovery, one is interested in finding subgroups that behave differently from the average behavior of the entire

More information

Gamma Distribution Fitting

Gamma Distribution Fitting Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics

More information

Statistical Impact of Slip Simulator Training at Los Alamos National Laboratory

Statistical Impact of Slip Simulator Training at Los Alamos National Laboratory LA-UR-12-24572 Approved for public release; distribution is unlimited Statistical Impact of Slip Simulator Training at Los Alamos National Laboratory Alicia Garcia-Lopez Steven R. Booth September 2012

More information