Lift-based search for significant dependencies in dense data sets
1 Lift-based search for significant dependencies in dense data sets
W. Hämäläinen
Department of Computer Science, University of Helsinki, Finland
StReBio 09 (KDD 09) p.1/48

2 1 Problem
Find a good set of rules $X \rightarrow A$ which express positive dependence also in future data!
$R = \{A_1, \ldots, A_k\}$ is the set of all attributes, where each $A_i \in R$ is binary (binarized), $X \subseteq R$ and $A \in R$.
1. $P(XA) > P(X)P(A)$ (positive dependence)
2. the dependence is genuine (holds in future data)
   - statistical significance tests
   - cross-validation
3. redundant rules are pruned

3 1.1 Positive dependence
Lift: $\gamma(X,A) = \frac{P(XA)}{P(X)P(A)} = \frac{P(A|X)}{P(A)} > 1$
- if the rule has high confidence $cf = P(A|X) > P(A)$ (in future data), it is suitable for prediction
- independence rules, where $P(A|X) = P(A)$, are trivial (useless for predicting $A$)
- negative dependencies, where $P(A|X) < P(A)$, are harmful for predicting $A$

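The lift and confidence definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the count-based interface (`n` rows, `m_x`, `m_a`, `m_xa` absolute frequencies) are assumptions, not part of the slides.

```python
def lift_and_confidence(n, m_x, m_a, m_xa):
    """Return (lift, confidence) of the rule X -> A from absolute counts.

    n    : number of rows (hypothetical parameter names)
    m_x  : rows where X holds
    m_a  : rows where A holds
    m_xa : rows where both X and A hold
    """
    if n <= 0 or m_x <= 0 or m_a <= 0:
        raise ValueError("n, m_x and m_a must be positive")
    p_a = m_a / n
    confidence = m_xa / m_x      # cf = P(A|X)
    lift = confidence / p_a      # gamma = P(A|X) / P(A)
    return lift, confidence


def dependence_type(lift, tol=1e-12):
    """Classify a rule by its lift: positive, negative or independent."""
    if lift > 1 + tol:
        return "positive"
    if lift < 1 - tol:
        return "negative"
    return "independent"
```

For example, `lift_and_confidence(100, 50, 40, 10)` gives a confidence of 0.2 against $P(A) = 0.4$, so the lift is below 1 and the rule expresses negative dependence.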
4 If $cf$ is low, the rule can still be
- important for predictive models, e.g. it reveals (undesired) dependencies between variables
- always useful for descriptive purposes
Traditional frequency-based methods often find independence rules or even negative dependency rules in dense data sets!

5 Example: Most general significant rules in Chess
[Figure: tree of the most general significant rules found in Chess; node labels are not recoverable from the transcription.]

6 1.2 Pruning rules
The number of rules can be too large!
- computational burden (time and space requirements)
- the user cannot scan through all rules
- simple rules avoid over-fitting (Occam's Razor principle)
Search only non-redundant rules!

7 Redundancy (classically)
Depends on the goodness measure $M$. Several definitions!
A rule or set is redundant if it contains useless attributes (which at most decrease the goodness).
If $M$ is increasing:
- set $X$ is redundant if $\exists Y \subsetneq X$ such that $M(Y) \geq M(X)$
- rule $X \rightarrow A$ is redundant if $\exists Y \subsetneq X$ such that $M(Y \rightarrow A) \geq M(X \rightarrow A)$

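8 Redundancy (here)
Definition 1.
- Set $X$ is redundant if $\exists Y \subsetneq X$ such that $M(\mathrm{BestRule}(X)) \leq M(\mathrm{BestRule}(Y))$.
- Rule $X \setminus A \rightarrow A$ is redundant if $\exists Y \subsetneq X$ such that $M(X \setminus A \rightarrow A) \leq M(Y \setminus B \rightarrow B)$.
$\mathrm{BestRule}(X) = \arg\max\{M(X \setminus A \rightarrow A)\}$ (the best rule which can be constructed from $X$)
e.g. a rule can be redundant with respect to a more general rule that has a different consequent
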
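A brute-force reading of Definition 1 can be sketched as follows. The helper names, the `measure(antecedent, consequent)` callback and the exhaustive subset loop are illustrative assumptions; the sketch only mirrors the definition and is not the search algorithm of the slides. It assumes every set has at least two attributes, so that at least one rule can be formed.

```python
from itertools import combinations


def best_rule(itemset, measure):
    """Return (score, antecedent, consequent) of the best rule X\\A -> A.

    `measure` is a hypothetical callback: measure(antecedent, consequent).
    """
    items = frozenset(itemset)
    best = None
    for a in sorted(items):
        antecedent = items - {a}
        if not antecedent:
            continue
        score = measure(antecedent, a)
        if best is None or score > best[0]:
            best = (score, antecedent, a)
    return best


def is_redundant(itemset, measure):
    """Set X is redundant if some proper subset Y has an equally good
    or better best rule: M(BestRule(X)) <= M(BestRule(Y))."""
    items = frozenset(itemset)
    score = best_rule(items, measure)[0]
    # Proper subsets need at least two attributes to form a rule.
    for size in range(2, len(items)):
        for subset in combinations(sorted(items), size):
            if best_rule(subset, measure)[0] >= score:
                return True
    return False
```

Because the comparison is made between best rules, the subset's rule may have a different consequent than the rule of the larger set.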
9 Why this definition?
- the rules found are the best among classically non-redundant rules!
- computationally fast and memory friendly
- significant rules are often permutations of each other
- the algorithm can be applied to the classical definition, but that is computationally more difficult (not tested yet)

10 1.3 Statistical significance
Idea: if $X \rightarrow A$ expresses positive dependence in the sample data, what is the probability that it has occurred by chance (i.e. that $X$ and $A$ were actually independent)?
Let $m(XA) = n \cdot P(XA)$ (absolute frequency).
p-value = probability that $XA$ occurs at least $m(XA)$ times in data set $r$, $|r| = n$, if $P(XA) = P(X)P(A)$ (independence)
If $p$ is very low, $X \rightarrow A$ is likely to be genuine.

11 How to estimate p?
Binomial probability:
$$p = \sum_{i=m(XA)}^{n} \binom{n}{i} (P(X)P(A))^i (1 - P(X)P(A))^{n-i}$$
i.e. the probability that $XA$ occurs at least $m(XA)$ times in the whole data of size $n$

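The binomial tail above can be evaluated directly for small data sets. The following is a minimal sketch using only the standard library; the function name and count-based parameters are assumptions, and plain floating point is used, so very large `n` may underflow.

```python
from math import comb


def binomial_p_value(n, m_x, m_a, m_xa):
    """Probability that XA occurs at least m_xa times in n rows
    when X and A are independent, i.e. P(XA) = P(X)P(A)."""
    q = (m_x / n) * (m_a / n)  # P(X)P(A), estimated from the data
    return sum(
        comb(n, i) * q**i * (1 - q) ** (n - i)
        for i in range(m_xa, n + 1)
    )
```

With hypothetical counts `n=100, m_x=20, m_a=20`, raising `m_xa` from 5 to 6 lowers the p-value, since fewer outcomes remain in the tail.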
12 Alternatively (not suitable)
$$p_2 = \sum_{i=m(XA)}^{m(X)} \binom{m(X)}{i} P(A)^i (1 - P(A))^{m(X)-i}$$
i.e. the probability that $A$ occurs at least $m(XA)$ times on the rows where $X$ is true
$\Rightarrow$ rules with different $X$ cannot be compared!

13 z-score
$p$ is computationally difficult! It can be estimated by the z-score:
$$z(X,A) = \frac{m(XA) - nP(X)P(A)}{\sqrt{nP(X)P(A)(1 - P(X)P(A))}} = \frac{\sqrt{nP(XA)}\,(\gamma(X,A) - 1)}{\sqrt{\gamma(X,A) - P(XA)}}$$
Now $p \approx 1 - \Phi(z(X,A))$, where $\Phi$ is the standard normal cumulative distribution function.

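The two forms of the z-score and the normal approximation of $p$ can be checked numerically. This is a small sketch with assumed function names; `erfc` from the standard library is used to evaluate $1 - \Phi(z)$.

```python
from math import erfc, sqrt


def z_score(n, m_x, m_a, m_xa):
    """z-score from absolute counts (first form of the formula)."""
    q = (m_x / n) * (m_a / n)          # P(X)P(A)
    return (m_xa - n * q) / sqrt(n * q * (1 - q))


def z_score_from_lift(n, p_xa, lift):
    """z-score from P(XA) and the lift (second form of the formula)."""
    return sqrt(n * p_xa) * (lift - 1) / sqrt(lift - p_xa)


def approx_p_value(z):
    """Normal approximation p ~ 1 - Phi(z)."""
    return 0.5 * erfc(z / sqrt(2))
```

For hypothetical counts `n=100, m_x=20, m_a=20, m_xa=20` (so $P(XA) = 0.2$ and lift 5) both functions agree and give a z-score of about 8.2.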
14 Using z-score
- $z$ can be used as a ranking function as such!
- $z$ is a monotonically increasing function of $m(XA)$ and $\gamma$, so it suits branch-and-bound search
- works well when the expected counts $m(X)P(A)$ are sufficiently large (e.g. …)
- when $m(X)P(A)$ is small, $z$ is over-optimistic
- other functions might work better for search purposes
- the measure function should be a monotonically increasing or decreasing function of $m(XA)$ and $\gamma(X,A)$

15 2 Searching significant rules
All possible attribute sets can be listed by an enumeration tree.
[Figure: enumeration tree of attribute sets; node labels are not recoverable from the transcription.]

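A generic set-enumeration tree can be walked with a short recursive generator. This sketch only shows how such a tree lists every attribute set exactly once; the function name is an assumption and the visiting order here is plain depth-first, not the specific traversal order used later in the slides.

```python
def enumerate_sets(attributes):
    """Yield every non-empty attribute set once, parents before children.

    Each node extends its parent with an attribute that comes later in
    the given order, so no set is generated twice.
    """
    def expand(prefix, start):
        for i in range(start, len(attributes)):
            node = prefix + (attributes[i],)
            yield node
            yield from expand(node, i + 1)

    yield from expand((), 0)
```

For three attributes the generator yields $2^3 - 1 = 7$ sets; a search algorithm would cut off whole subtrees instead of expanding them.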
16 2.1 How to traverse the tree?
Given a set $X$, we want to know an upper bound for $M(\mathrm{BestRule}(XQ))$.
- $m(XQ) \leq m(X)$ always
- $\gamma(\mathrm{BestRule}(XQ)) \leq \frac{1}{P(A_{min})}$, where $P(A_{min}) = \min\{P(A_i) \mid A_i \in XQ\}$, because
$$\gamma(XQ \setminus A_i \rightarrow A_i) = \frac{P(XQ)}{P(XQ \setminus A_i)P(A_i)} \leq \frac{1}{P(A_i)} \leq \frac{1}{P(A_{min})}$$

17 Upper bound $U(M(\mathrm{BestRule}(XQ)))$
If $\min\{P(A_i) \mid A_i \in XQ\} = \min\{P(A_j) \mid A_j \in X\}$, then $U(M(\mathrm{BestRule}(XQ))) \leq U(M(\mathrm{BestRule}(X)))$.
1. if the upper bound $U(M(\mathrm{BestRule}(XQ))) < min_M$, the rules of $XQ$ are insignificant
2. if $U(M(\mathrm{BestRule}(XQ))) \leq \max\{M(\mathrm{BestRule}(Y)) \mid Y \subseteq X\}$, the rules of $XQ$ are redundant
3. if $\mathrm{BestRule}(X)$ has the maximal lift $P(A_{min})^{-1}$, it is minimal and all more specific rules will be redundant

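The first two pruning tests above can be sketched with the z-score as the measure $M$. The upper bound is obtained by plugging the largest possible frequency, $m(X)$, and the largest possible lift, $1/P(A_{min})$, into $z$. The function names, the return labels and the choice of $z$ as the measure are assumptions made for illustration.

```python
from math import sqrt


def z_from_count_and_lift(n, m_xa, lift):
    """z-score written as a function of frequency m(XA) and lift."""
    p_xa = m_xa / n
    return sqrt(n * p_xa) * (lift - 1) / sqrt(lift - p_xa)


def prune_decision(n, m_x, p_min, min_m, best_parent_score):
    """Decide what to do with the extensions XQ of a set X.

    n                 : number of rows
    m_x               : frequency of X (upper bound for m(XQ))
    p_min             : smallest attribute probability in XQ
    min_m             : significance threshold min_M
    best_parent_score : max M(BestRule(Y)) over the subsets Y of X
    """
    upper = z_from_count_and_lift(n, m_x, 1.0 / p_min)
    if upper < min_m:
        return "insignificant"
    if upper <= best_parent_score:
        return "redundant"
    return "expand"
```

Since $z$ increases with both the frequency and the lift, evaluating it at these two maxima can never underestimate the best rule of any extension.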
18 Property $PS$
$PS$ = potentially significant: $PS(X) \Leftrightarrow U(M(\mathrm{BestRule}(X))) \geq min_M$
The property is monotonic, if we traverse the tree in a certain order!
Meaning: if even one of $Y$'s parents is not $PS$ or is minimal, $Y$ (or its children) cannot be non-redundant $PS$.
$\Rightarrow$ $Y$ can be pruned

19 Traversal order
- attributes are in descending order
- search top down, from right to left
- both frequencies and maximum lifts can only decrease
- parent sets $X$ always have better upper bounds than their children $XQ$!

20 Relations of $PS$ sets
$t_i$ = sets under $A_i$
$t_{ij}$ = sets under $A_iA_j$
[Figure: diagram of the relations between the $PS$ sets in the subtrees $t_i$, $t_j$ and $t_{ij}$, with the attributes ordered by $P(A_i), \ldots, P(A_{j-1}), P(A_j)$; the diagram is not recoverable from the transcription.]

21 Frequency counting
- the data itself can be used to initialize the tree
- later, frequencies can be counted from the tree (no need to check the original data anymore)

22 Frequency tree for data
[Figure: example data set and the corresponding frequency tree; labels are not recoverable from the transcription.]

23 Pruning attributes
- checking all 2-sets can prune out low frequency attributes
- $\Rightarrow$ the maximal $U(\gamma)$s are decreased
- $A$ can be pruned, if for all $A_i \neq A$: $M(m(AA_i), \min\{P(A), P(A_i)\}^{-1}) < min_M$

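The attribute-pruning condition above can be sketched as a loop over all 2-sets. As before, the z-score stands in for the measure $M$; the function names and the dictionary-based inputs (`counts` per attribute, `pair_counts` keyed by a `frozenset` of two attributes) are assumptions for illustration.

```python
from math import sqrt


def z_bound(n, m, lift):
    """z-score as a function of frequency m and lift (0 if undefined)."""
    p = m / n
    if m == 0 or lift - p <= 0:
        return 0.0
    return sqrt(m) * (lift - 1) / sqrt(lift - p)


def prunable_attributes(n, counts, pair_counts, min_m):
    """Return attributes A for which every 2-set AA_i stays below min_M.

    For each pair the bound uses the pair frequency m(AA_i) and the
    largest possible lift 1 / min{P(A), P(A_i)}.
    """
    prunable = []
    for a in counts:
        keep = False
        for b in counts:
            if a == b:
                continue
            m_ab = pair_counts.get(frozenset((a, b)), 0)
            max_lift = n / min(counts[a], counts[b])
            if z_bound(n, m_ab, max_lift) >= min_m:
                keep = True
                break
        if not keep:
            prunable.append(a)
    return prunable
```

In a hypothetical data set of 100 rows where `A` and `B` always co-occur (20 rows) and `G` (50 rows) is independent of both, a threshold of `min_m=6` prunes only `G`.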
24 3. Simulation
[Figure: the example data set and its frequency tree; labels are not recoverable from the transcription.]

25 Simulation step 1
$m(FA_i) = 1$ for all $A_i \neq F$

26 Simulation step 2
[Figure: a node is added; its rule has z = 8.2 and is marked MIN.]

27 Simulation step 3
[Figure: a node is added; the tree shows z = 8.2 (MIN) and one z value that is not recoverable.]

28 Simulation step 4
[Figure: a node is added and another node is not created; the tree shows z = 8.2 (MIN).]

29 Simulation step 5
[Figure: a node is added; the tree shows z = 8.2 (MIN) and z = 2.9.]

30 Simulation step 6
[Figure: a node is removed; the tree shows z = 8.2 (MIN) and z = 2.9.]

31 Simulation step 7
[Figure: one node is removed and one is added; the tree shows z = 8.2 (MIN), z = 3.8 and z = 1.2. Two rules are found; their labels are not recoverable.]

32 Simulation step 8
[Figure: two nodes are added, both with z < 0; the tree shows z = 8.2 (MIN) and two rules with z = 3.8.]

33 Simulation step 9
[Figure: nodes are added, both with z = 2.7; the tree shows z = 8.2 (MIN) and two rules with z = 3.8.]

34 Simulation step 10
[Figure: one node is added and removed (fr = 0); another node is added with z = 2.3.]

35 Simulation step 11
[Figure: a node is removed; a rule with z = 4.4 is found.]

36 Simulation step 12
[Figure: the tree shows z = 8.2 (MIN), two rules with z = 3.8 and two with z = 2.3.]

37 Simulation final result
[Figure: final tree; node labels are not recoverable from the transcription.]
Rules found (rule labels and some digits are missing in the transcription):
- z = 8.2, cf = 1.0, fr = 0.20, γ = .0
- z = 4.4, cf = 1.0, fr = 0.2, γ = .0
- z = 3.8, cf = 0., fr = 0.1, γ = 2.
- z = 3.8, cf = 0., fr = 0.1, γ = 2.

38 4. Experiments: Goals
- Quality of rules compared to traditional methods: what can we gain when min_fr is not used?
- Performance: how fast is it? How complex data sets can we handle?

39 Proportions of useful and harmful rules
A rule is
- at least slightly useful, if it expresses positive dependency in the test data
- useful, if it expresses clear positive dependency (requirement: $z \geq 1$)
- at least slightly harmful, if it expresses negative dependency in the test data
- harmful, if it expresses clear negative dependency (requirement: $z \leq -1$)

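The four categories above can be expressed as a small classifier on the z-score measured in the test data. The function name, the single-label return value and the default `threshold=1.0` are assumptions; the threshold is kept as a parameter because the exact value on the slide may have been truncated in the transcription.

```python
def classify_rule(z_test, threshold=1.0):
    """Label a rule by its z-score in the test data.

    The sign of z matches the sign of the dependence: z > 0 exactly
    when the lift in the test data is above 1.
    """
    if z_test >= threshold:
        return "useful"
    if z_test > 0:
        return "slightly useful"
    if z_test <= -threshold:
        return "harmful"
    if z_test < 0:
        return "slightly harmful"
    return "independent"
```

Note that on the slide the categories are nested ("at least slightly useful" includes the useful rules), whereas this sketch returns only the strongest label that applies.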
40 Data sets
Biological + medical + Chess as a pathological case
Sets: Heart, HeartNeg, (Garden) Plants, Mushroom, Chess
[Table: n and k for each data set; the values are not recoverable from the transcription.]

41 Results
Proportions of useful and harmful rules
[Figure: four panels (slightly useful rules, useful rules, slightly harmful rules, harmful rules) with bars for cf=0.9, cf=0.9b, cf=0.6, cf=0.6b, cf=0.0 and cf=0.0b on each of the data sets Heart, HeartNeg, Plants, Mushroom and Chess; the values are not recoverable.]

42 Observations
- Selecting rules with $U(\ln(p))$ improves results (vs. $z$)
- Using min_cf in the search can distort results: if a parent has higher $z$ but too low $cf$, the set is not pruned as redundant
- If min_cf is not used in the search (only in the end), the number of rules can be too small
- This is often still the better approach (smaller prediction error in the test sets)

43 Comparison to traditional frequency-based search
with as low min_fr as possible + pruning with different measure functions
Set        min_fr
Heart      0.0
HeartNeg   0.32
Plants     0.12
Mushroom   0.22
Chess      0.7

44 Results
Proportions of useful and harmful rules when cf=
[Figure: four panels (slightly useful rules, useful rules, slightly harmful rules, harmful rules) with bars for the measures χ², J, z and fr on each of the data sets Heart, HeartNeg, Plants, Mushroom and Chess; the values are not recoverable.]

45 Results
Proportions of useful and harmful rules when cf=
[Figure: same layout as on the previous slide; the values are not recoverable.]

46 5. Conclusions
- both DeepBlue and StatApriori are useful when nothing else works (dense data)!
- they find genuine dependencies without minimum frequencies or other restrictions $\Rightarrow$ interesting new information
- DeepBlue can solve problems which are infeasible with traditional approaches... but the newest version of StatApriori is even faster
- useful theoretical properties $\Rightarrow$ may apply to searching general association rules

47 6. Future research
- non-redundant rules when the consequent is taken into account + comparison
- negative dependencies $X \rightarrow \neg A$
- rules between sets $X \rightarrow Y$, $|Y| > 1$
- general association rules
- new application areas (do you have interesting data?)

48 Are you interested in collecting biodiversity data?
- the goal is to collect a large database of naturally occurring plant combinations
- location information can be interesting for geographical DM
- just reading and extracting data (plant communities and associations) from texts
- also technical support (collecting system) is welcome!
Contact Wilhelmiina!
Lecture 3 3B1B Optimization Michaelmas 2015 A. Zisserman Linear Programming Extreme solutions Simplex method Interior point method Integer programming and relaxation The Optimization Tree Linear Programming
More information7 Gaussian Elimination and LU Factorization
7 Gaussian Elimination and LU Factorization In this final section on matrix factorization methods for solving Ax = b we want to take a closer look at Gaussian elimination (probably the best known method
More informationApproaches to Qualitative Evaluation of the Software Quality Attributes: Overview
4th International Conference on Software Methodologies, Tools and Techniques Approaches to Qualitative Evaluation of the Software Quality Attributes: Overview Presented by: Denis Kozlov Department of Computer
More informationLine and Polygon Clipping. Foley & Van Dam, Chapter 3
Line and Polygon Clipping Foley & Van Dam, Chapter 3 Topics Viewing Transformation Pipeline in 2D Line and polygon clipping Brute force analytic solution Cohen-Sutherland Line Clipping Algorithm Cyrus-Beck
More informationMATH 140 Lab 4: Probability and the Standard Normal Distribution
MATH 140 Lab 4: Probability and the Standard Normal Distribution Problem 1. Flipping a Coin Problem In this problem, we want to simualte the process of flipping a fair coin 1000 times. Note that the outcomes
More informationCalculating P-Values. Parkland College. Isela Guerra Parkland College. Recommended Citation
Parkland College A with Honors Projects Honors Program 2014 Calculating P-Values Isela Guerra Parkland College Recommended Citation Guerra, Isela, "Calculating P-Values" (2014). A with Honors Projects.
More informationDynamic Adaptive Feedback of Load Balancing Strategy
Journal of Information & Computational Science 8: 10 (2011) 1901 1908 Available at http://www.joics.com Dynamic Adaptive Feedback of Load Balancing Strategy Hongbin Wang a,b, Zhiyi Fang a,, Shuang Cui
More informationKnowledge Acquisition Approach Based on Rough Set in Online Aided Decision System for Food Processing Quality and Safety
, pp. 381-388 http://dx.doi.org/10.14257/ijunesst.2014.7.6.33 Knowledge Acquisition Approach Based on Rough Set in Online Aided ecision System for Food Processing Quality and Safety Liu Peng, Liu Wen,
More informationAssessment of robust capacity utilisation in railway networks
Assessment of robust capacity utilisation in railway networks Lars Wittrup Jensen 2015 Agenda 1) Introduction to WP 3.1 and PhD project 2) Model for measuring capacity consumption in railway networks a)
More informationCONTINGENCY (CROSS- TABULATION) TABLES
CONTINGENCY (CROSS- TABULATION) TABLES Presents counts of two or more variables A 1 A 2 Total B 1 a b a+b B 2 c d c+d Total a+c b+d n = a+b+c+d 1 Joint, Marginal, and Conditional Probability We study methods
More information1.204 Lecture 10. Greedy algorithms: Job scheduling. Greedy method
1.204 Lecture 10 Greedy algorithms: Knapsack (capital budgeting) Job scheduling Greedy method Local improvement method Does not look at problem globally Takes best immediate step to find a solution Useful
More informationDynamic Trust Management for the Internet of Things Applications
Dynamic Trust Management for the Internet of Things Applications Fenye Bao and Ing-Ray Chen Department of Computer Science, Virginia Tech Self-IoT 2012 1 Sept. 17, 2012, San Jose, CA, USA Contents Introduction
More informationThis article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and
This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution
More information8.1 Min Degree Spanning Tree
CS880: Approximations Algorithms Scribe: Siddharth Barman Lecturer: Shuchi Chawla Topic: Min Degree Spanning Tree Date: 02/15/07 In this lecture we give a local search based algorithm for the Min Degree
More informationPREDICTIVE MODELING OF INTER-TRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE
International Journal of Computer Science and Applications, Vol. 5, No. 4, pp 57-69, 2008 Technomathematics Research Foundation PREDICTIVE MODELING OF INTER-TRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE
More informationLoop Invariants and Binary Search
Loop Invariants and Binary Search Chapter 4.3.3 and 9.3.1-1 - Outline Ø Iterative Algorithms, Assertions and Proofs of Correctness Ø Binary Search: A Case Study - 2 - Outline Ø Iterative Algorithms, Assertions
More informationMultinomial and Ordinal Logistic Regression
Multinomial and Ordinal Logistic Regression ME104: Linear Regression Analysis Kenneth Benoit August 22, 2012 Regression with categorical dependent variables When the dependent variable is categorical,
More informationDiscovering Local Subgroups, with an Application to Fraud Detection
Discovering Local Subgroups, with an Application to Fraud Detection Abstract. In Subgroup Discovery, one is interested in finding subgroups that behave differently from the average behavior of the entire
More informationGamma Distribution Fitting
Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics
More informationStatistical Impact of Slip Simulator Training at Los Alamos National Laboratory
LA-UR-12-24572 Approved for public release; distribution is unlimited Statistical Impact of Slip Simulator Training at Los Alamos National Laboratory Alicia Garcia-Lopez Steven R. Booth September 2012
More information