FINDING SUBGROUPS OF ENHANCED TREATMENT EFFECT. Jeremy M G Taylor Jared Foster University of Michigan Steve Ruberg Eli Lilly

Size: px
Start display at page:

Download "FINDING SUBGROUPS OF ENHANCED TREATMENT EFFECT. Jeremy M G Taylor Jared Foster University of Michigan Steve Ruberg Eli Lilly"

Transcription

1 FINDING SUBGROUPS OF ENHANCED TREATMENT EFFECT Jeremy M G Taylor Jared Foster University of Michigan Steve Ruberg Eli Lilly 1

2 1. INTRODUCTION and MOTIVATION 2. PROPOSED METHOD Random Forests Classification and Regression Trees 3. SIMULATED DATA 2

3 SETTING: RANDOMIZED CLINICAL TRIAL Two treatment groups Binary outcome Efficacy Toxicity Lots of baseline covariates Range 5 to 100 Trial is already completed, small or marginal overall effect 3

4 GOAL Find subgroup of patients with enhanced treatment effect, if it exists Issues What do you mean by enhanced? Desire subgroup to be based on a small number of covariates What is the strategy for finding the subgroup Can you provide honest estimates of how good the subgroup is. 4

5 SEARCHING FOR SUBGROUPS IN RANDOMIZED CLINICAL TRIAL DATA IS A STATISTICAL NO-NO Data dredging Mining the data Overfitting the data Look hard enough you will find something Sample sizes tend not to be large enough to find subgroups 5

6 Large literature on dangers of subgroup analysis Pocock et al 2002 Rothwell et al 2005 Lagakos 2009 Brookes et al 2001, 2004 Cui et al 2002 Yusuf et al 1991 Assman et al 2000 Examples of people finding sign of the zodaic being important (Peto et al 1995) Message: use extreme caution in interpreting subgroups 6

7 Consensus opinion: Need a predefined plan for subgroup analysis Interpretation 1. Predefine the subgroups you are going to look at Interpretation 2. Predefine the strategy for searching for subgroups 7

8 A Clinicostatistical Tragedy (Feinstein 1998) Believes there is a patho-physiology reason for existence of categories Believes statistical doctrines have become too dominant Potential tragedy now is what may seem to be good statistics will be bad science 8

9 Need for validation External validation: Ideally find subgroup in one trial, validate in other trials Internal validation: Try to give honest estimate of quality of subgroup using the same dataset 9

10 Notation Y = outcome, binary T = treatment group, binary X 1,...,X p baseline covariates P(Y = 1 T,X) A subgroup (A) is a region of the design space eg A = {X 1 > 3} eg A = {X 1 > 3 and X 7 < 6)} eg A = {2X 1 + 3X 4 < 2} 10

11 Artificial data 1000 observations 15 X s Generated from logit(p(y = 1)) = X X X T X2 X TI(X A) A = {X 1 > 0,X 2 < 0}, 25% of observations Treatment group response rate = Control group response rate =

12 Scatterplot for controls X Responders Non responders X1 12

13 Scatterplot for treated individuals X Responders Non responders X1 13

14 Response rate by X1 Response rate Treated Controls ( Inf, 1.5] ( 1.5, 0.5] ( 0.5,0.5] (0.5,1.5] (1.5, Inf] X1 14

15 Response rate by X2 Response rate Treated Controls ( Inf, 1.5] ( 1.5, 0.5] ( 0.5,0.5] (0.5,1.5] (1.5, Inf] X2 15

16 Enhanced treatment effect: no unique definition Difference in absolute risk, P(Y = 1 T = 1,X) P(Y = 1 T = 0,X) Relative risk, P(Y = 1 T = 1,X)/P(Y = 1 T = 0,X) Difference in log-odds, logit(p(y = 1 T = 1,X)) logit(p(y = 1 T = 0,X)) 16

17 Simple model Treatment effect X 1 is prognostic No interaction logit(p(y = 1 T,X 1 )) = 1 + T + X 1 17

18 Sample response probabilities with no region A advantage Response probability Treated Controls X1 18

19 Interactions in statistical models Main Effects Model logit(p(y = 1 T,X)) = α + βt + γ 1 X 1 + γ 2 X 2 Note, no interaction on logit scale may have interaction on absolute risk scale Main Effects + Interaction logit(p(y = 1 T,X)) = α+βt +γ 1 X 1 +γ 2 X 2 +δtx 1 logit(p(y = 1 T,X)) = α + βt + γ 1 X 1 + γ 2 X 2 + δti(x A) Need large sample sizes to find interactions 19

20 Naive method: Forward stepwise logistic regression Include Main effects for T and X s Interactions X j X k Interactions T X j Interactions T X j X k Estimate ˆP 1i = P(Y i = 1 T i = 1,X i ) and ˆP 0i = P(Y i = 1 T i = 0,X i ) for each person i. New variable Z i = ˆP 1i ˆP 0i is then created, Subjects i in group A if Z i > c (c=cutoff, say 0.15) 20

21 Table 1: Logistic Regression Results Coefficients Estimate SE p-value X X < X < X X 2 : X < T X 1 : T X 7 : T

22 X-by-T interaction found Region  estimated as Ẑi >

23 Histogram of difference in estimated response probabilities from logit model Frequency Hat P_1i Hat P_0i 23

24 Table 2: Number of subjects in 4 cells (Logit)  not  Treatment Control Overall

25 Table 3: Response rate in 4 cells (Logit) Treatment Control  not  Overall

26 Table 4: How close is  to A (Logit)  not  A not A Overall Sensitivity = 0.24 Specificity = 0.80 Positive Predictive Value = 0.27 Negative Predictive Value =

27 Virtual Twins method For each person think about outcome if they got treatment and outcome if they got placebo Two steps Step 1. Use Random Forests (RF) on all the data Step 2. Run output from RF down a regression tree to find region A 27

28 Random Forests: A type of non-parametric regression A statistical learning algorithm for estimating function f(.) in the model P(Y = 1) = f(t,x 1,..,X p ) An ensemble method combining 250 trees Combine many simple trees Uses Bootstrap samples Uses randon subsets of covariates at each split Combine 250 predictions A black box Input, values of T and X s Output, estimate of P(Y = 1 T, X) 28

29 Step 1. Apply Random Forests to all the data Covariates X, T, X*I(T=0), X*I(T=1) Produces black box predictor For each subject apply predictor twice Once to (T = 1,X 1i,...,X pi ) Once to (T = 0,X 1i,...,X pi ) Gives ˆP 1i = P(Y i = 1 T i = 1,X i ) and ˆP 0i = P(Y i = 1 T i = 0,X i ) Form Z i = ˆP 1i ˆP 0i A measure of the treatment effect for subject i 29

30 Estimated P_1i vs. Estimated P_0i Hat P_1i Hat P_0i 30

31 Estimated P_0i vs. True P_0i Hat P_0i True P_0i 31

32 Estimated P_1i vs. True P_1i Hat P_1i True P_1i 32

33 Estimated P_1i vs. True P_1i Estimated P_1i Treated Control True P_1i 33

34 Estimated P_1i vs. True P_1i for Treated Estimated P_1i Responders Non Responders True P_1i 34

35 Histogram of difference in estimated response probabilities Frequency Hat P_1i Hat P_0i 35

36 Step 2. Regression Trees Find small number of variables that are most associated with Z i Estimate regression tree for Z i with covariates X 1i,...,X pi The result, a small number of X s with cutpoints 36

37 x1< Virtual twins tree x7< x12>= n= n= n= n=28 x7< n=233 x7>= x2>= n= n=101 37

38  classification Table 5: Number of subjects in 4 cells (VT)  not  Treatment Control Overall

39 Scatterplot for control individuals by estimated A membership X A hat Not A hat X1 39

40 Scatterplot for treated individuals by estimated A membership X A hat Not A hat X1 40

41 Scatterplot for control individuals in A hat X Responders Non responders X1 41

42 Scatterplot for treated individuals in A hat X Responders Non responders X1 42

43 Scatterplot for control individuals not in A hat X Responders Non responders X1 43

44 Scatterplot for treated individuals not in A hat X Responders Non responders X1 44

45 Table 6: Response rate in 4 cells (VT) Treatment Control  not  Overall

46 Properties of subgroup How big is � What X s does it depend on Quantify magnitude of enhanced treatment effect If know true A Are we finding the correct X s How close is  to true A Sensitivity, Specificity Positive Predictive Value, Negative Predictive Value 46

47 Table 7: How close is  to A (VT)  not  A not A Overall Sensitivity = 0.68 Specificity = 0.77 Positive Predictive Value = 0.48 Negative Predictive Value =

48 Metrics for enhanced treatment effects Q(Â) = (P(Y = 1 T = 1,X Â) P(Y = 1 T = 0,X Â)) (P(Y = 1 T = 1) P(Y = 1 T = 0)) Table 8: Response rate in 4 cells (VT) Treatment Control  not  Overall ˆQ(Â) V T = ( )-( )=

49 Table 9: Response rate in 4 cells (Logit) Treatment Control  not  Overall ˆQ(Â) Logit = ( ) - ( ) =

50 Notation Q(Â) = true value of Q for  ˆQ(Â) = estimate of Q for  Want estimates to have low bias and small variability (small SE) 50

51 Resubstitution estimates ˆQ(Â) V T = ˆQ(Â) Logit = Almost certainly optimistically biased estimates Need honest estimate of Q(Â) What would large trial Q(Â) be with this  in the next very 51

52 Methods of estimating Q(Â) Resubstitution method Simulate new data Cross-validation of ˆP 1i and ˆP 0i. Full Cross-validation 52

53 Simulate new data Simulate new data, that looks like original data, but is independent Generate binary Y i using either ˆP 1i or ˆP 0i. if T i = 1 then Y i Bernoulli( ˆP 1i ) if T i = 0 then Y i Bernoulli( ˆP 0i ) Calculate ˆQ(Â) from these new data Repeat many times and average ˆQ(Â) V T = 0.095, ˆQ( Â) Logit =

54 Cross-validation of ˆP 1i and ˆP 0i. Same as Simulate New Data, except ˆP 1i and ˆP 0i are derived after cross-validation Take 9/10 of data, run Random Forest (or Logit model with forward selection), predict for left out 1/10 Repeat 10 times ˆQ(Â) V T = 0.124, ˆQ( Â) Logit =

55 Full Cross-validation Take 9/10 of data Find region  k Find ˆQ( k ) for left out 1/10 Repeat 10 times Combine 10 separate ˆQ(Âk) to final ˆQ( Â) ˆQ(Â) V T = 0.089, ˆQ( Â) Logit =

56 Table 10: Summary of estimates of Q(Â) Estimation Method Virtual Twins Logit Resubstitution Simulate new data Cross-validation of ˆP 1i and ˆP 0i Full Cross-validation True value of Q(Â) True value of Q(A)

57 Sampling variability of ˆQ(Â) Also desirable to attach standard errors to ˆQ(Â) We propose the following: 1. Simulate many datasets using ˆP 1i and ˆP 0i from random forest/logistic method 2. For each data set, estimate a new  and calculate ˆQ(Â) 3. Estimated standard error equals the standard deviation of these ˆQ(Â) 57

58 Table 11: Standard errors for ˆQ(Â) Virtual Twins Logit Method Est. SE Est. SE Resubstitution Sim. new data

59 Null distribution of ˆQ(Â) and p-values Null distribution no region of enhanced treatment effect Null should allow possibility of main effects for X and T, but with no interaction We propose the following: 1. Define V i = logit( ˆP 1i ) logit( ˆP 0i ) and V = 1 n Vi 2. Define ˆP 1i N = expit(logit( ˆP 1i )+logit( ˆP 0i ) 2 + V 2 ˆP 0i N = expit(logit( ˆP 1i )+logit( ˆP 0i ) 2 V 2 ) 3. Simulate many datasets using ˆP N 1i and ˆP N 0i ), and 59

60 4. For each data set, first estimate  and calculate ˆQ(Â) 5. P-value is the fraction of these than the observed ˆQ(Â) ˆQ(Â) that are larger 6. If original  is empty, take p-value to be

61 Table 12: P-values for ˆQ(Â) Method Virtual Twins Logit Resubstitution Simulate new data

62 Simulation study Generate multiple datasets from logit(p(y = 1 T,X) = α + βt + γh(x) + θti(x A) A is a known region in the design space defined by a small number of X s Possible factors to consider sample size, number of X s, correlation between X s, size of true A, number of X s that determine true A, strength of enhanced treatment effect 62

63 Properties of subgroup Simulations (know true A) Are we finding the correct X s How big is Â? How close is  to true A Sensitivity, Specificity Positive Predictive Value, Negative Predictive Value Accuracy of estimates of Q(Â) 63

64 Used model: logit(p(y = 1)) = 1+0.5X X 2 0.5X T+0.5X 2 X 7 +θti{x A} 100 datasets generated For each, n =

65 Table 13: Selected X s, θ = 0.75, A = {X 1 > 0,X 2 < 0} Logit Virtual Twins Mean (# Unique X s) SD (# Unique X s) Pct. Found X 1 (int) Pct. Found X 2 (int) Pct. Found X 7 (main) Pct. Found X 3 (null) 0 13 Pct. Found X 1 at top of tree 71 Pct. Found X 1 top 2 of tree 62 Pct. Found no int/tree

66 Table 14: Compare A to Â, θ = 0.75, A = {X 1 > 0,X 2 < 0} Logit Virtual Twins percent  empty 39 7 size of  (median): Sensitivity Specificity PPV NPV AUC

67 Table 15: Q Estimates, θ = 0.75, A = {X 1 > 0,X 2 < 0} Logit Virtual Twins Mean SD Mean SD Q(A) Q(Â) ˆQ(Â) : Resub SimNewDat Cr.Val FullCr.Val

68 Table 16: Selected X s, Null case Logit Virtual Twins Mean (# Unique X s) SD (# Unique X s) Pct. Found X 1 (int) 6 72 Pct. Found X 2 (int) 6 62 Pct. Found X 7 (main) 2 70 Pct. Found X 3 (null) 0 9 Pct. Found no tree/int

69 Table 17: Properties of Â, Null case Logit Virtual Twins percent  empty Specificity

70 Table 18: Q Estimates, Null case Logit Virtual Twins Mean SD Mean SD Q(A) Q(Â) ˆQ(Â) : Resub SimNewDat Cr.Val FullCr.Val

71 Lots of possibilities for adaptation and extension Replace Random Forest with other predictor Generalize to high dimensional data (genomic/genetic) Build in to the study design Vary thresholds to make smaller or larger  Use different metrics for Q(A) Use different definitions of Z i 71

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

Better credit models benefit us all

Better credit models benefit us all Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Cross Validation. Dr. Thomas Jensen Expedia.com

Cross Validation. Dr. Thomas Jensen Expedia.com Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract

More information

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Model Combination. 24 Novembre 2009

Model Combination. 24 Novembre 2009 Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

More information

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center

More information

VI. Introduction to Logistic Regression

VI. Introduction to Logistic Regression VI. Introduction to Logistic Regression We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

Logistic Regression for Spam Filtering

Logistic Regression for Spam Filtering Logistic Regression for Spam Filtering Nikhila Arkalgud February 14, 28 Abstract The goal of the spam filtering problem is to identify an email as a spam or not spam. One of the classic techniques used

More information

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4. Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection

Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection Directions in Statistical Methodology for Multivariable Predictive Modeling Frank E Harrell Jr University of Virginia Seattle WA 19May98 Overview of Modeling Process Model selection Regression shape Diagnostics

More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

Some Essential Statistics The Lure of Statistics

Some Essential Statistics The Lure of Statistics Some Essential Statistics The Lure of Statistics Data Mining Techniques, by M.J.A. Berry and G.S Linoff, 2004 Statistics vs. Data Mining..lie, damn lie, and statistics mining data to support preconceived

More information

Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

More information

Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel

Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Copyright 2008 All rights reserved. Random Forests Forest of decision

More information

Section 6: Model Selection, Logistic Regression and more...

Section 6: Model Selection, Logistic Regression and more... Section 6: Model Selection, Logistic Regression and more... Carlos M. Carvalho The University of Texas McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Model Building

More information

Cross-Validation. Synonyms Rotation estimation

Cross-Validation. Synonyms Rotation estimation Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical

More information

Predictive Modeling Techniques in Insurance

Predictive Modeling Techniques in Insurance Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics

More information

Chapter 7: Simple linear regression Learning Objectives

Chapter 7: Simple linear regression Learning Objectives Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -

More information

Organizing Your Approach to a Data Analysis

Organizing Your Approach to a Data Analysis Biost/Stat 578 B: Data Analysis Emerson, September 29, 2003 Handout #1 Organizing Your Approach to a Data Analysis The general theme should be to maximize thinking about the data analysis and to minimize

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

More information

Distances, Clustering, and Classification. Heatmaps

Distances, Clustering, and Classification. Heatmaps Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be

More information

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

On the Effectiveness of Obfuscation Techniques in Online Social Networks

On the Effectiveness of Obfuscation Techniques in Online Social Networks On the Effectiveness of Obfuscation Techniques in Online Social Networks Terence Chen 1,2, Roksana Boreli 1,2, Mohamed-Ali Kaafar 1,3, and Arik Friedman 1,2 1 NICTA, Australia 2 UNSW, Australia 3 INRIA,

More information

Study Design and Statistical Analysis

Study Design and Statistical Analysis Study Design and Statistical Analysis Anny H Xiang, PhD Department of Preventive Medicine University of Southern California Outline Designing Clinical Research Studies Statistical Data Analysis Designing

More information

Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL

Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL Paper SA01-2012 Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL ABSTRACT Analysts typically consider combinations

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Xavier Conort xavier.conort@gear-analytics.com Motivation Location matters! Observed value at one location is

More information

Leveraging Ensemble Models in SAS Enterprise Miner

Leveraging Ensemble Models in SAS Enterprise Miner ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

More information

Studying Auto Insurance Data

Studying Auto Insurance Data Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.

More information

Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

More information

Robust Tree-based Causal Inference for Complex Ad Effectiveness Analysis

Robust Tree-based Causal Inference for Complex Ad Effectiveness Analysis Robust Tree-based Causal Inference for Complex Ad Effectiveness Analysis Pengyuan Wang Wei Sun Dawei Yin Jian Yang Yi Chang Yahoo Labs, Sunnyvale, CA, USA Purdue University, West Lafayette, IN,USA {pengyuan,

More information

APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING. Anatoli Nachev

APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING. Anatoli Nachev 86 ITHEA APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING Anatoli Nachev Abstract: This paper presents a case study of data mining modeling techniques for direct marketing. It focuses to three

More information

Personalized Predictive Medicine and Genomic Clinical Trials

Personalized Predictive Medicine and Genomic Clinical Trials Personalized Predictive Medicine and Genomic Clinical Trials Richard Simon, D.Sc. Chief, Biometric Research Branch National Cancer Institute http://brb.nci.nih.gov brb.nci.nih.gov Powerpoint presentations

More information

Why do statisticians "hate" us?

Why do statisticians hate us? Why do statisticians "hate" us? David Hand, Heikki Mannila, Padhraic Smyth "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data

More information

Model Validation Techniques

Model Validation Techniques Model Validation Techniques Kevin Mahoney, FCAS kmahoney@ travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost

More information

Calculating Effect-Sizes

Calculating Effect-Sizes Calculating Effect-Sizes David B. Wilson, PhD George Mason University August 2011 The Heart and Soul of Meta-analysis: The Effect Size Meta-analysis shifts focus from statistical significance to the direction

More information

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

More information

Car Insurance. Prvák, Tomi, Havri

Car Insurance. Prvák, Tomi, Havri Car Insurance Prvák, Tomi, Havri Sumo report - expectations Sumo report - reality Bc. Jan Tomášek Deeper look into data set Column approach Reminder What the hell is this competition about??? Attributes

More information

On Cross-Validation and Stacking: Building seemingly predictive models on random data

On Cross-Validation and Stacking: Building seemingly predictive models on random data On Cross-Validation and Stacking: Building seemingly predictive models on random data ABSTRACT Claudia Perlich Media6 New York, NY 10012 claudia@media6degrees.com A number of times when using cross-validation

More information

Statistical issues in the analysis of microarray data

Statistical issues in the analysis of microarray data Statistical issues in the analysis of microarray data Daniel Gerhard Institute of Biostatistics Leibniz University of Hannover ESNATS Summerschool, Zermatt D. Gerhard (LUH) Analysis of microarray data

More information

MS 2007-0081-RR Booil Jo. Supplemental Materials (to be posted on the Web)

MS 2007-0081-RR Booil Jo. Supplemental Materials (to be posted on the Web) MS 2007-0081-RR Booil Jo Supplemental Materials (to be posted on the Web) Table 2 Mplus Input title: Table 2 Monte Carlo simulation using externally generated data. One level CACE analysis based on eqs.

More information

Applied Multivariate Analysis - Big data analytics

Applied Multivariate Analysis - Big data analytics Applied Multivariate Analysis - Big data analytics Nathalie Villa-Vialaneix nathalie.villa@toulouse.inra.fr http://www.nathalievilla.org M1 in Economics and Economics and Statistics Toulouse School of

More information

Building risk prediction models - with a focus on Genome-Wide Association Studies. Charles Kooperberg

Building risk prediction models - with a focus on Genome-Wide Association Studies. Charles Kooperberg Building risk prediction models - with a focus on Genome-Wide Association Studies Risk prediction models Based on data: (D i, X i1,..., X ip ) i = 1,..., n we like to fit a model P(D = 1 X 1,..., X p )

More information

Multinomial Logistic Regression Ensembles

Multinomial Logistic Regression Ensembles Multinomial Logistic Regression Ensembles Abstract This article proposes a method for multiclass classification problems using ensembles of multinomial logistic regression models. A multinomial logit model

More information

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in

More information

Latent Class Regression Part II

Latent Class Regression Part II This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

Beating the MLB Moneyline

Beating the MLB Moneyline Beating the MLB Moneyline Leland Chen llxchen@stanford.edu Andrew He andu@stanford.edu 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series

More information

Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

More information

1/27/2013. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2

1/27/2013. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 Introduce moderated multiple regression Continuous predictor continuous predictor Continuous predictor categorical predictor Understand

More information

Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan

Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:

More information

arxiv:1206.6666v1 [stat.ap] 28 Jun 2012

arxiv:1206.6666v1 [stat.ap] 28 Jun 2012 The Annals of Applied Statistics 2012, Vol. 6, No. 2, 772 794 DOI: 10.1214/11-AOAS521 In the Public Domain arxiv:1206.6666v1 [stat.ap] 28 Jun 2012 ANALYZING ESTABLISHMENT NONRESPONSE USING AN INTERPRETABLE

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

MTH 140 Statistics Videos

MTH 140 Statistics Videos MTH 140 Statistics Videos Chapter 1 Picturing Distributions with Graphs Individuals and Variables Categorical Variables: Pie Charts and Bar Graphs Categorical Variables: Pie Charts and Bar Graphs Quantitative

More information

Quantitative Methods for Finance

Quantitative Methods for Finance Quantitative Methods for Finance Module 1: The Time Value of Money 1 Learning how to interpret interest rates as required rates of return, discount rates, or opportunity costs. 2 Learning how to explain

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Machine Learning Methods for Causal Effects. Susan Athey, Stanford University Guido Imbens, Stanford University

Machine Learning Methods for Causal Effects. Susan Athey, Stanford University Guido Imbens, Stanford University Machine Learning Methods for Causal Effects Susan Athey, Stanford University Guido Imbens, Stanford University Introduction Supervised Machine Learning v. Econometrics/Statistics Lit. on Causality Supervised

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for

More information

Paper 45-2010 Evaluation of methods to determine optimal cutpoints for predicting mortgage default Abstract Introduction

Paper 45-2010 Evaluation of methods to determine optimal cutpoints for predicting mortgage default Abstract Introduction Paper 45-2010 Evaluation of methods to determine optimal cutpoints for predicting mortgage default Valentin Todorov, Assurant Specialty Property, Atlanta, GA Doug Thompson, Assurant Health, Milwaukee,

More information

Data Mining Introduction

Data Mining Introduction Data Mining Introduction Bob Stine Dept of Statistics, School University of Pennsylvania www-stat.wharton.upenn.edu/~stine What is data mining? An insult? Predictive modeling Large, wide data sets, often

More information

Data Mining Techniques for Prognosis in Pancreatic Cancer

Data Mining Techniques for Prognosis in Pancreatic Cancer Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree

More information

Getting Even More Out of Ensemble Selection

Getting Even More Out of Ensemble Selection Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand qs12@cs.waikato.ac.nz ABSTRACT Ensemble Selection uses forward stepwise

More information

Lecture 13: Validation

Lecture 13: Validation Lecture 3: Validation g Motivation g The Holdout g Re-sampling techniques g Three-way data splits Motivation g Validation techniques are motivated by two fundamental problems in pattern recognition: model

More information

Customer and Business Analytic

Customer and Business Analytic Customer and Business Analytic Applied Data Mining for Business Decision Making Using R Daniel S. Putler Robert E. Krider CRC Press Taylor &. Francis Group Boca Raton London New York CRC Press is an imprint

More information

Binary Logistic Regression

Binary Logistic Regression Binary Logistic Regression Main Effects Model Logistic regression will accept quantitative, binary or categorical predictors and will code the latter two in various ways. Here s a simple model including

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

Presentation by: Ahmad Alsahaf. Research collaborator at the Hydroinformatics lab - Politecnico di Milano MSc in Automation and Control Engineering

Presentation by: Ahmad Alsahaf. Research collaborator at the Hydroinformatics lab - Politecnico di Milano MSc in Automation and Control Engineering Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen 9-October 2015 Presentation by: Ahmad Alsahaf Research collaborator at the Hydroinformatics lab - Politecnico di

More information

Data Mining in CRM & Direct Marketing. Jun Du The University of Western Ontario jdu43@uwo.ca

Data Mining in CRM & Direct Marketing. Jun Du The University of Western Ontario jdu43@uwo.ca Data Mining in CRM & Direct Marketing Jun Du The University of Western Ontario jdu43@uwo.ca Outline Why CRM & Marketing Goals in CRM & Marketing Models and Methodologies Case Study: Response Model Case

More information

MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

Imputation Methods to Deal with Missing Values when Data Mining Trauma Injury Data

Imputation Methods to Deal with Missing Values when Data Mining Trauma Injury Data Imputation Methods to Deal with Missing Values when Data Mining Trauma Injury Data Kay I Penny Centre for Mathematics and Statistics, Napier University, Craiglockhart Campus, Edinburgh, EH14 1DJ k.penny@napier.ac.uk

More information

Examining a Fitted Logistic Model

Examining a Fitted Logistic Model STAT 536 Lecture 16 1 Examining a Fitted Logistic Model Deviance Test for Lack of Fit The data below describes the male birth fraction male births/total births over the years 1931 to 1990. A simple logistic

More information

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

More information

Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign

Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign Arun K Mandapaka, Amit Singh Kushwah, Dr.Goutam Chakraborty Oklahoma State University, OK, USA ABSTRACT Direct

More information

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set Overview Evaluation Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification

More information

Machine Learning Logistic Regression

Machine Learning Logistic Regression Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.

More information

Solving Regression Problems Using Competitive Ensemble Models

Solving Regression Problems Using Competitive Ensemble Models Solving Regression Problems Using Competitive Ensemble Models Yakov Frayman, Bernard F. Rolfe, and Geoffrey I. Webb School of Information Technology Deakin University Geelong, VIC, Australia {yfraym,brolfe,webb}@deakin.edu.au

More information

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma siersma@sund.ku.dk The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association

More information

2 Decision tree + Cross-validation with R (package rpart)

2 Decision tree + Cross-validation with R (package rpart) 1 Subject Using cross-validation for the performance evaluation of decision trees with R, KNIME and RAPIDMINER. This paper takes one of our old study on the implementation of cross-validation for assessing

More information

Regularized Logistic Regression for Mind Reading with Parallel Validation

Regularized Logistic Regression for Mind Reading with Parallel Validation Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland

More information

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

More information

We discuss 2 resampling methods in this chapter - cross-validation - the bootstrap

We discuss 2 resampling methods in this chapter - cross-validation - the bootstrap Statistical Learning: Chapter 5 Resampling methods (Cross-validation and bootstrap) (Note: prior to these notes, we'll discuss a modification of an earlier train/test experiment from Ch 2) We discuss 2

More information

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues Data Mining with Regression Teaching an old dog some new tricks Acknowledgments Colleagues Dean Foster in Statistics Lyle Ungar in Computer Science Bob Stine Department of Statistics The School of the

More information

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.

More information

How To Choose A Churn Prediction

How To Choose A Churn Prediction Assessing classification methods for churn prediction by composite indicators M. Clemente*, V. Giner-Bosch, S. San Matías Department of Applied Statistics, Operations Research and Quality, Universitat

More information

Weight of Evidence Module

Weight of Evidence Module Formula Guide The purpose of the Weight of Evidence (WoE) module is to provide flexible tools to recode the values in continuous and categorical predictor variables into discrete categories automatically,

More information

II. Methods - 2 - X (i.e. if the system is convective or not). Y = 1 X ). Usually, given these estimates, an

II. Methods - 2 - X (i.e. if the system is convective or not). Y = 1 X ). Usually, given these estimates, an STORMS PREDICTION: LOGISTIC REGRESSION VS RANDOM FOREST FOR UNBALANCED DATA Anne Ruiz-Gazen Institut de Mathématiques de Toulouse and Gremaq, Université Toulouse I, France Nathalie Villa Institut de Mathématiques

More information

The Artificial Prediction Market

The Artificial Prediction Market The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

More information