FINDING SUBGROUPS OF ENHANCED TREATMENT EFFECT
Jeremy M G Taylor, Jared Foster (University of Michigan); Steve Ruberg (Eli Lilly)

2 OUTLINE
1. INTRODUCTION and MOTIVATION
2. PROPOSED METHOD
   - Random Forests
   - Classification and Regression Trees
3. SIMULATED DATA

3 SETTING: RANDOMIZED CLINICAL TRIAL
- Two treatment groups
- Binary outcome (efficacy, toxicity)
- Lots of baseline covariates (range 5 to 100)
- Trial is already completed, with a small or marginal overall effect

4 GOAL
Find a subgroup of patients with enhanced treatment effect, if it exists.
Issues:
- What do you mean by enhanced?
- Desire the subgroup to be based on a small number of covariates
- What is the strategy for finding the subgroup?
- Can you provide honest estimates of how good the subgroup is?

5 SEARCHING FOR SUBGROUPS IN RANDOMIZED CLINICAL TRIAL DATA IS A STATISTICAL NO-NO
- Data dredging
- Mining the data
- Overfitting the data
- Look hard enough and you will find something
- Sample sizes tend not to be large enough to find subgroups

6 Large literature on dangers of subgroup analysis
- Pocock et al 2002
- Rothwell et al 2005
- Lagakos 2009
- Brookes et al 2001, 2004
- Cui et al 2002
- Yusuf et al 1991
- Assmann et al 2000
Examples of people finding the sign of the zodiac to be important (Peto et al 1995).
Message: use extreme caution in interpreting subgroups.

7 Consensus opinion: Need a predefined plan for subgroup analysis
- Interpretation 1: Predefine the subgroups you are going to look at
- Interpretation 2: Predefine the strategy for searching for subgroups

8 A Clinicostatistical Tragedy (Feinstein 1998)
- Believes there is a pathophysiological reason for the existence of categories
- Believes statistical doctrines have become too dominant
- The potential tragedy now is that what may seem to be good statistics will be bad science

9 Need for validation
- External validation: ideally find the subgroup in one trial, validate in other trials
- Internal validation: try to give an honest estimate of the quality of the subgroup using the same dataset

10 Notation
- Y = outcome, binary
- T = treatment group, binary
- X_1, ..., X_p = baseline covariates
- P(Y = 1 | T, X)
- A subgroup (A) is a region of the design space, e.g.
  A = {X_1 > 3}
  A = {X_1 > 3 and X_7 < 6}
  A = {2X_1 + 3X_4 < 2}

11 Artificial data
- 1000 observations
- 15 X's
- Generated from a logistic model for P(Y = 1) with terms in the X's and T, a product term involving X_2, and an enhanced-effect term T I(X ∈ A) (coefficients and subscripts not preserved in the transcript)
- A = {X_1 > 0, X_2 < 0}, 25% of observations
- Treatment group response rate and control group response rate: values not preserved in the transcript

12 [Figure: Scatterplot for controls, responders vs. non-responders, with X1 on the horizontal axis]

13 [Figure: Scatterplot for treated individuals, responders vs. non-responders, with X1 on the horizontal axis]

14 [Figure: Response rate by X1 for treated and controls, in bins (-Inf, -1.5], (-1.5, -0.5], (-0.5, 0.5], (0.5, 1.5], (1.5, Inf]]

15 [Figure: Response rate by X2 for treated and controls, in bins (-Inf, -1.5], (-1.5, -0.5], (-0.5, 0.5], (0.5, 1.5], (1.5, Inf]]

16 Enhanced treatment effect: no unique definition
- Difference in absolute risk: P(Y = 1 | T = 1, X) - P(Y = 1 | T = 0, X)
- Relative risk: P(Y = 1 | T = 1, X) / P(Y = 1 | T = 0, X)
- Difference in log-odds: logit(P(Y = 1 | T = 1, X)) - logit(P(Y = 1 | T = 0, X))
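
The three scales can disagree about which patients look "enhanced". A minimal Python sketch of the three definitions follows; the function name `effect_measures` and the example probabilities are illustrative assumptions, not values from the slides.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def effect_measures(p1, p0):
    """Compare p1 = P(Y=1 | T=1, X) with p0 = P(Y=1 | T=0, X) on three scales."""
    return {
        "risk_difference": p1 - p0,
        "relative_risk": p1 / p0,
        "log_odds_difference": logit(p1) - logit(p0),
    }


# Two hypothetical patients with the same absolute risk difference (0.10)
low_risk = effect_measures(p1=0.15, p0=0.05)
mid_risk = effect_measures(p1=0.55, p0=0.45)
```

Both hypothetical patients gain 10 percentage points, yet the low-risk patient has the larger relative risk and the larger log-odds difference, so the choice of scale affects who counts as having an enhanced effect.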

17 Simple model
- Treatment effect
- X_1 is prognostic
- No interaction
logit(P(Y = 1 | T, X_1)) = 1 + T + X_1

18 [Figure: Sample response probabilities with no region A advantage; response probability against X1 for treated and controls]

19 Interactions in statistical models
Main Effects Model:
  logit(P(Y = 1 | T, X)) = α + βT + γ_1 X_1 + γ_2 X_2
Note: no interaction on the logit scale may still mean an interaction on the absolute risk scale.
Main Effects + Interaction:
  logit(P(Y = 1 | T, X)) = α + βT + γ_1 X_1 + γ_2 X_2 + δ T X_1
  logit(P(Y = 1 | T, X)) = α + βT + γ_1 X_1 + γ_2 X_2 + δ T I(X ∈ A)
Need large sample sizes to find interactions.
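
The note about scales can be checked numerically. The sketch below uses a main-effects-only logit model with hypothetical coefficients (`alpha`, `beta`, `gamma1` are assumptions chosen for illustration): the log-odds difference between arms is `beta` for every patient, yet the absolute risk difference changes with X_1.

```python
import math


def expit(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))


def risk_difference(x1, alpha=-1.0, beta=1.0, gamma1=1.0):
    """P(Y=1 | T=1, x1) - P(Y=1 | T=0, x1) under a model with no T-by-X interaction."""
    p_treated = expit(alpha + beta + gamma1 * x1)
    p_control = expit(alpha + gamma1 * x1)
    return p_treated - p_control


# Same log-odds difference (beta) everywhere, different absolute differences
diffs = {x1: round(risk_difference(x1), 3) for x1 in (-2, 0, 2)}
```

Under these assumed coefficients the absolute benefit is largest near the middle of the X_1 range and shrinks toward both extremes, even though the model has no interaction term.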

20 Naive method: Forward stepwise logistic regression
Include:
- Main effects for T and the X's
- Interactions X_j × X_k
- Interactions T × X_j
- Interactions T × X_j × X_k
Estimate P̂_1i = P(Y_i = 1 | T_i = 1, X_i) and P̂_0i = P(Y_i = 1 | T_i = 0, X_i) for each person i.
A new variable Z_i = P̂_1i - P̂_0i is then created.
Subject i is placed in the estimated subgroup Â if Z_i > c (c = cutoff, say 0.15).
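
21 Table 1: Logistic Regression Results
Columns: Coefficient, Estimate, SE, p-value. Rows include main effects of several X's and T, an interaction of X_2 with another X, and the interactions X_1:T and X_7:T. Numeric entries and most subscripts are not preserved in the transcript.

22 X-by-T interaction found
Region Â estimated as Ẑ_i > (cutoff value not preserved in the transcript)

23 [Figure: Histogram of the difference in estimated response probabilities from the logit model; frequency against P̂_1i - P̂_0i]

24 Table 2: Number of subjects in 4 cells (Logit)
Rows: Treatment, Control, Overall. Columns: Â, not Â. Cell values are not preserved in the transcript.

25 Table 3: Response rate in 4 cells (Logit)
Rows: Treatment, Control. Columns: Â, not Â, Overall. Cell values are not preserved in the transcript.

26 Table 4: How close is Â to A (Logit)
Cross-tabulation of Â / not Â against A / not A (cell counts not preserved in the transcript).
- Sensitivity = 0.24
- Specificity = 0.80
- Positive Predictive Value = 0.27
- Negative Predictive Value = (value not preserved in the transcript)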

27 Virtual Twins method
For each person, think about the outcome if they got treatment and the outcome if they got placebo.
Two steps:
- Step 1. Use Random Forests (RF) on all the data
- Step 2. Run the output from RF down a regression tree to find region A

28 Random Forests: A type of non-parametric regression
- A statistical learning algorithm for estimating the function f(.) in the model P(Y = 1) = f(T, X_1, ..., X_p)
- An ensemble method combining 250 trees
  - Combine many simple trees
  - Uses bootstrap samples
  - Uses random subsets of covariates at each split
  - Combine 250 predictions
- A black box
  - Input: values of T and the X's
  - Output: estimate of P(Y = 1 | T, X)

29 Step 1. Apply Random Forests to all the data
- Covariates: X, T, X·I(T = 0), X·I(T = 1)
- Produces a black box predictor
- For each subject, apply the predictor twice
  - Once to (T = 1, X_1i, ..., X_pi)
  - Once to (T = 0, X_1i, ..., X_pi)
- Gives P̂_1i = P(Y_i = 1 | T_i = 1, X_i) and P̂_0i = P(Y_i = 1 | T_i = 0, X_i)
- Form Z_i = P̂_1i - P̂_0i, a measure of the treatment effect for subject i
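
The "apply the predictor twice" step can be sketched independently of how the black box was fitted. In the Python sketch below, `predict` stands for any fitted model that returns an estimate of P(Y = 1 | T, X); `toy_predict` is a hypothetical stand-in used only so the example runs, not a random forest and not the authors' model.

```python
import math
from typing import Callable, List, Sequence, Tuple

# predict(t, x) -> estimated P(Y = 1 | T = t, X = x); any fitted model can play this role
Predictor = Callable[[int, Sequence[float]], float]


def virtual_twin_effects(
    predict: Predictor, covariates: List[Sequence[float]]
) -> List[Tuple[float, float, float]]:
    """Return (p1_hat, p0_hat, z) per subject, where z = p1_hat - p0_hat."""
    results = []
    for x in covariates:
        p1_hat = predict(1, x)  # twin who receives treatment
        p0_hat = predict(0, x)  # twin who receives control
        results.append((p1_hat, p0_hat, p1_hat - p0_hat))
    return results


def toy_predict(t: int, x: Sequence[float]) -> float:
    """Hypothetical stand-in for the fitted forest (assumed form, for illustration only)."""
    in_region = x[0] > 0 and x[1] < 0
    eta = -1.0 + 0.5 * x[0] + (1.5 if t == 1 and in_region else 0.0)
    return 1.0 / (1.0 + math.exp(-eta))


effects = virtual_twin_effects(toy_predict, [(1.0, -1.0), (1.0, 1.0)])
```

With this stand-in, the first subject lies in the region where treatment helps and gets a positive Z, while the second gets Z = 0 because both twins receive the same prediction.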

30 [Figure: Estimated P_1i vs. estimated P_0i]

31 [Figure: Estimated P_0i vs. true P_0i]

32 [Figure: Estimated P_1i vs. true P_1i]

33 [Figure: Estimated P_1i vs. true P_1i, treated and control subjects marked separately]

34 [Figure: Estimated P_1i vs. true P_1i for treated subjects, responders and non-responders marked separately]

35 [Figure: Histogram of the difference in estimated response probabilities; frequency against P̂_1i - P̂_0i]

36 Step 2. Regression Trees
- Find a small number of variables that are most associated with Z_i
- Estimate a regression tree for Z_i with covariates X_1i, ..., X_pi
- The result: a small number of X's with cutpoints
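
To make the idea of "a small number of X's with cutpoints" concrete, here is a minimal Python sketch of the basic building block of a regression tree: an exhaustive search for the single split that minimizes the within-node sum of squared errors of Z. This is a simplified illustration, not the tree software used for the slides; a full tree would apply the search recursively and prune. The names `best_split` and `sum_sq_error` are assumptions.

```python
from typing import List, Optional, Sequence, Tuple


def sum_sq_error(values: List[float]) -> float:
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(
    covariates: List[Sequence[float]], z: List[float]
) -> Optional[Tuple[int, float, float]]:
    """Find the single split (covariate index, cutpoint, total SSE) that best separates z."""
    best = None
    n_covariates = len(covariates[0])
    for j in range(n_covariates):
        levels = sorted({row[j] for row in covariates})
        # Candidate cutpoints are midpoints between adjacent observed values
        for lower, upper in zip(levels, levels[1:]):
            cut = (lower + upper) / 2.0
            left = [zi for row, zi in zip(covariates, z) if row[j] < cut]
            right = [zi for row, zi in zip(covariates, z) if row[j] >= cut]
            score = sum_sq_error(left) + sum_sq_error(right)
            if best is None or score < best[2]:
                best = (j, cut, score)
    return best


# Toy data: z is large exactly when the first covariate is positive
X = [(-2.0, 5.0), (-1.0, 1.0), (1.0, 4.0), (2.0, 2.0)]
Z = [0.0, 0.0, 0.3, 0.3]
split = best_split(X, Z)
```

On the toy data the search picks the first covariate with a cutpoint of 0, since that split separates the two levels of Z perfectly.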

37 [Figure: Virtual twins tree with splits on x1, x7, x12 and x2; cutpoints and most node sizes are not preserved in the transcript (surviving node sizes: n=28, n=233, n=101)]
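
38 Â classification
Table 5: Number of subjects in 4 cells (VT)
Rows: Treatment, Control, Overall. Columns: Â, not Â. Cell values are not preserved in the transcript.

39 [Figure: Scatterplot for control individuals by estimated A membership (A hat vs. not A hat), with X1 on the horizontal axis]

40 [Figure: Scatterplot for treated individuals by estimated A membership (A hat vs. not A hat), with X1 on the horizontal axis]

41 [Figure: Scatterplot for control individuals in A hat, responders vs. non-responders, with X1 on the horizontal axis]

42 [Figure: Scatterplot for treated individuals in A hat, responders vs. non-responders, with X1 on the horizontal axis]

43 [Figure: Scatterplot for control individuals not in A hat, responders vs. non-responders, with X1 on the horizontal axis]

44 [Figure: Scatterplot for treated individuals not in A hat, responders vs. non-responders, with X1 on the horizontal axis]

45 Table 6: Response rate in 4 cells (VT)
Rows: Treatment, Control. Columns: Â, not Â, Overall. Cell values are not preserved in the transcript.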

46 Properties of subgroup
- How big is Â?
- What X's does it depend on?
- Quantify the magnitude of the enhanced treatment effect
If the true A is known:
- Are we finding the correct X's?
- How close is Â to the true A?
  - Sensitivity, Specificity
  - Positive Predictive Value, Negative Predictive Value

47 Table 7: How close is Â to A (VT)
Cross-tabulation of Â / not Â against A / not A (cell counts not preserved in the transcript).
- Sensitivity = 0.68
- Specificity = 0.77
- Positive Predictive Value = 0.48
- Negative Predictive Value = (value not preserved in the transcript)
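
The four agreement measures between Â and the true A can be computed from membership indicators as in the Python sketch below. The function names and the eight-subject example are hypothetical and unrelated to the values in Tables 4 and 7.

```python
from typing import Dict, Sequence


def ratio(numerator: int, denominator: int) -> float:
    """Division that returns NaN when the denominator is zero (e.g. an empty estimated region)."""
    return numerator / denominator if denominator else float("nan")


def region_agreement(in_true: Sequence[bool], in_est: Sequence[bool]) -> Dict[str, float]:
    """Treat membership in the true A as the 'condition' and membership in A-hat as the 'test'."""
    pairs = list(zip(in_true, in_est))
    tp = sum(1 for a, e in pairs if a and e)
    fp = sum(1 for a, e in pairs if not a and e)
    fn = sum(1 for a, e in pairs if a and not e)
    tn = sum(1 for a, e in pairs if not a and not e)
    return {
        "sensitivity": ratio(tp, tp + fn),  # share of A that A-hat captures
        "specificity": ratio(tn, tn + fp),  # share of not-A that A-hat excludes
        "ppv": ratio(tp, tp + fp),          # share of A-hat that is truly in A
        "npv": ratio(tn, tn + fn),          # share of not-A-hat that is truly outside A
    }


# Hypothetical membership indicators for 8 subjects
true_a = [True, True, True, False, False, False, False, False]
est_a = [True, True, False, True, False, False, False, False]
agreement = region_agreement(true_a, est_a)
```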
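
48 Metrics for enhanced treatment effects
Q(Â) = (P(Y = 1 | T = 1, X ∈ Â) - P(Y = 1 | T = 0, X ∈ Â)) - (P(Y = 1 | T = 1) - P(Y = 1 | T = 0))
Table 8: Response rate in 4 cells (VT)
Rows: Treatment, Control. Columns: Â, not Â, Overall. Cell values are not preserved in the transcript.
Q̂(Â)_VT = ( ) - ( ) = (values not preserved in the transcript)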
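
A resubstitution-style estimate of Q(Â) replaces each probability in the definition with an observed response rate. A minimal Python sketch, with hypothetical names (`response_rate`, `q_hat`) and made-up toy data:

```python
from typing import Sequence


def response_rate(y: Sequence[int], keep: Sequence[bool]) -> float:
    """Observed proportion of responders among the subjects flagged in `keep`."""
    selected = [yi for yi, k in zip(y, keep) if k]
    if not selected:
        raise ValueError("empty cell: no subjects selected")
    return sum(selected) / len(selected)


def q_hat(y: Sequence[int], t: Sequence[int], in_region: Sequence[bool]) -> float:
    """Treatment effect inside the region minus the overall treatment effect."""
    treated = [ti == 1 for ti in t]
    control = [ti == 0 for ti in t]
    effect_in_region = response_rate(
        y, [a and r for a, r in zip(treated, in_region)]
    ) - response_rate(y, [c and r for c, r in zip(control, in_region)])
    effect_overall = response_rate(y, treated) - response_rate(y, control)
    return effect_in_region - effect_overall


# Toy data: 8 subjects, the first 4 belong to the estimated region
y = [1, 1, 0, 1, 0, 1, 0, 1]
t = [1, 1, 0, 0, 1, 1, 0, 0]
a_hat = [True, True, True, True, False, False, False, False]
q = q_hat(y, t, a_hat)
```

In the toy data the effect inside the region is 1.0 - 0.5 = 0.5 and the overall effect is 0.75 - 0.5 = 0.25, so `q` equals 0.25.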

49 Table 9: Response rate in 4 cells (Logit)
Rows: Treatment, Control. Columns: Â, not Â, Overall. Cell values are not preserved in the transcript.
Q̂(Â)_Logit = ( ) - ( ) = (values not preserved in the transcript)

50 Notation
- Q(Â) = true value of Q for Â
- Q̂(Â) = estimate of Q for Â
- Want estimates to have low bias and small variability (small SE)

51 Resubstitution estimates
- Q̂(Â)_VT and Q̂(Â)_Logit: values not preserved in the transcript
- Almost certainly optimistically biased estimates
- Need an honest estimate of Q(Â)
- What would Q(Â) be with this Â in the next very large trial?

52 Methods of estimating Q(Â)
- Resubstitution method
- Simulate new data
- Cross-validation of P̂_1i and P̂_0i
- Full cross-validation

53 Simulate new data
- Simulate new data that look like the original data but are independent
- Generate binary Y_i using either P̂_1i or P̂_0i:
  - if T_i = 1 then Y_i ~ Bernoulli(P̂_1i)
  - if T_i = 0 then Y_i ~ Bernoulli(P̂_0i)
- Calculate Q̂(Â) from these new data
- Repeat many times and average
Q̂(Â)_VT = 0.095, Q̂(Â)_Logit = (value not preserved in the transcript)
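
The simulate-new-data estimate can be sketched in Python as follows. The region Â is held fixed, outcomes are redrawn from the fitted probabilities, and the resulting Q̂ values are averaged. Function names, the default of 200 replicates, and the seed are assumptions for illustration.

```python
import random


def rate(y, keep):
    """Response rate among flagged subjects (raises ZeroDivisionError on an empty cell)."""
    selected = [yi for yi, k in zip(y, keep) if k]
    return sum(selected) / len(selected)


def q_hat(y, t, in_region):
    """Treatment effect inside the region minus the overall treatment effect."""
    trt = [ti == 1 for ti in t]
    ctl = [ti == 0 for ti in t]
    inside = rate(y, [a and r for a, r in zip(trt, in_region)]) - rate(
        y, [c and r for c, r in zip(ctl, in_region)]
    )
    return inside - (rate(y, trt) - rate(y, ctl))


def simulate_outcomes(t, p1_hat, p0_hat, rng):
    """Draw Y_i ~ Bernoulli(p1_hat[i]) if T_i = 1, else Bernoulli(p0_hat[i])."""
    return [
        1 if rng.random() < (p1 if ti == 1 else p0) else 0
        for ti, p1, p0 in zip(t, p1_hat, p0_hat)
    ]


def q_from_simulated_data(t, p1_hat, p0_hat, in_region, n_rep=200, seed=0):
    """Average Q-hat over n_rep simulated outcome vectors, keeping the region fixed."""
    rng = random.Random(seed)
    draws = [
        q_hat(simulate_outcomes(t, p1_hat, p0_hat, rng), t, in_region)
        for _ in range(n_rep)
    ]
    return sum(draws) / len(draws)
```

Because treatment assignment and region membership are not redrawn, the four cells keep their original sizes in every replicate; only the outcomes change.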

54 Cross-validation of P̂_1i and P̂_0i
- Same as Simulate New Data, except that P̂_1i and P̂_0i are derived after cross-validation
- Take 9/10 of the data, run Random Forest (or the logit model with forward selection), predict for the left-out 1/10
- Repeat 10 times
Q̂(Â)_VT = 0.124, Q̂(Â)_Logit = (value not preserved in the transcript)

55 Full Cross-validation
- Take 9/10 of the data
- Find region Â_k
- Find Q̂(Â_k) for the left-out 1/10
- Repeat 10 times
- Combine the 10 separate Q̂(Â_k) into a final Q̂(Â)
Q̂(Â)_VT = 0.089, Q̂(Â)_Logit = (value not preserved in the transcript)
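
A sketch of the full cross-validation loop in Python is given below. The region-finding procedure and the Q estimator are passed in as callables (`find_region` and `q_hat` are hypothetical names), so the whole search is repeated inside each fold. The slides do not say how the 10 fold estimates are combined; a simple average is assumed here, and the interleaved fold assignment assumes the rows are already in arbitrary order.

```python
from typing import Callable, List, Sequence

Rule = Callable[[Sequence[float]], bool]


def full_cross_validated_q(
    y: List[int],
    t: List[int],
    covariates: List[Sequence[float]],
    find_region: Callable[[List[Sequence[float]], List[int], List[int]], Rule],
    q_hat: Callable[[List[int], List[int], List[bool]], float],
    k: int = 10,
) -> float:
    """Estimate Q by finding the region on k-1 folds and evaluating it on the left-out fold."""
    n = len(y)
    folds = [list(range(start, n, k)) for start in range(k)]  # interleaved folds
    estimates = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        # Find the region using the training part only
        rule = find_region(
            [covariates[i] for i in train],
            [t[i] for i in train],
            [y[i] for i in train],
        )
        # Evaluate that region on the left-out part
        in_region = [rule(covariates[i]) for i in held_out]
        estimates.append(
            q_hat([y[i] for i in held_out], [t[i] for i in held_out], in_region)
        )
    return sum(estimates) / len(estimates)  # assumption: combine folds by a simple average
```

The key design point is that `find_region` never sees the left-out subjects, which is what distinguishes this scheme from cross-validating only P̂_1i and P̂_0i.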

56 Table 10: Summary of estimates of Q(Â)
Columns: Virtual Twins, Logit. Rows: Resubstitution; Simulate new data; Cross-validation of P̂_1i and P̂_0i; Full cross-validation; True value of Q(Â); True value of Q(A). Cell values are not preserved in the transcript.

57 Sampling variability of Q̂(Â)
It is also desirable to attach standard errors to Q̂(Â). We propose the following:
1. Simulate many datasets using P̂_1i and P̂_0i from the random forest/logistic method
2. For each dataset, estimate a new Â and calculate Q̂(Â)
3. The estimated standard error equals the standard deviation of these Q̂(Â)
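
The three steps above can be sketched as a small Python function. As before, `find_region` and `q_hat` are hypothetical callables standing for the region-finding procedure and the Q estimator; the number of simulated datasets and the seed are arbitrary assumptions.

```python
import random
import statistics


def simulated_standard_error(
    t, covariates, p1_hat, p0_hat, find_region, q_hat, n_sim=100, seed=0
):
    """Standard deviation of Q-hat across datasets simulated from the fitted probabilities."""
    rng = random.Random(seed)
    q_values = []
    for _ in range(n_sim):
        # 1. Simulate a dataset from the fitted probabilities
        y_sim = [
            1 if rng.random() < (p1 if ti == 1 else p0) else 0
            for ti, p1, p0 in zip(t, p1_hat, p0_hat)
        ]
        # 2. Re-estimate the region on the simulated data and recompute Q-hat
        rule = find_region(covariates, t, y_sim)
        in_region = [rule(x) for x in covariates]
        q_values.append(q_hat(y_sim, t, in_region))
    # 3. Standard error = standard deviation across the simulated datasets
    return statistics.stdev(q_values)
```

Re-estimating Â inside the loop means the standard error reflects the variability of the subgroup search itself, not only the sampling noise in the response rates.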

58 Table 11: Standard errors for Q̂(Â)
Columns: Est. and SE for Virtual Twins; Est. and SE for Logit. Rows: Resubstitution; Sim. new data. Cell values are not preserved in the transcript.

59 Null distribution of Q̂(Â) and p-values
- Null distribution: no region of enhanced treatment effect
- The null should allow the possibility of main effects for X and T, but with no interaction
We propose the following:
1. Define V_i = logit(P̂_1i) - logit(P̂_0i) and V̄ = (1/n) Σ V_i
2. Define
   P̂^N_1i = expit( (logit(P̂_1i) + logit(P̂_0i))/2 + V̄/2 )
   P̂^N_0i = expit( (logit(P̂_1i) + logit(P̂_0i))/2 - V̄/2 )
3. Simulate many datasets using P̂^N_1i and P̂^N_0i, and

60 Null distribution of Q̂(Â) and p-values (continued)
4. For each dataset, first estimate Â and calculate Q̂(Â)
5. The p-value is the fraction of these Q̂(Â) that are larger than the observed Q̂(Â)
6. If the original Â is empty, take the p-value to be (value not preserved in the transcript)
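
Steps 1, 2 and 5 are simple enough to sketch directly in Python. The construction keeps each subject's average logit but forces the log-odds difference between the two arms to equal the common value V̄, which removes any T-by-X interaction on the logit scale. Function names are assumptions; steps 3 and 4 (simulating datasets and re-estimating Â) are not shown.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def null_probabilities(p1_hat, p0_hat):
    """Steps 1-2: probabilities with main effects of X and T but no interaction."""
    v = [logit(p1) - logit(p0) for p1, p0 in zip(p1_hat, p0_hat)]
    v_bar = sum(v) / len(v)
    midpoints = [(logit(p1) + logit(p0)) / 2.0 for p1, p0 in zip(p1_hat, p0_hat)]
    p1_null = [expit(m + v_bar / 2.0) for m in midpoints]
    p0_null = [expit(m - v_bar / 2.0) for m in midpoints]
    return p1_null, p0_null


def p_value(observed_q, null_q_values):
    """Step 5: fraction of null Q-hat values larger than the observed one."""
    larger = sum(1 for q in null_q_values if q > observed_q)
    return larger / len(null_q_values)


# Hypothetical fitted probabilities for three subjects
p1_null, p0_null = null_probabilities([0.6, 0.3, 0.8], [0.4, 0.2, 0.5])
```

After the transformation every subject has the same log-odds difference between the treated and control twins, which is the sense in which the null has "no interaction".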

61 Table 12: P-values for Q̂(Â)
Columns: Virtual Twins, Logit. Rows: Resubstitution; Simulate new data. Cell values are not preserved in the transcript.

62 Simulation study
Generate multiple datasets from
  logit(P(Y = 1 | T, X)) = α + βT + γ h(X) + θ T I(X ∈ A)
A is a known region in the design space defined by a small number of X's.
Possible factors to consider:
- sample size
- number of X's
- correlation between X's
- size of the true A
- number of X's that determine the true A
- strength of the enhanced treatment effect
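
A data generator for this kind of simulation might look like the Python sketch below. The region A = {X_1 > 0, X_2 < 0} is the one used in the slides; everything else (independent standard normal covariates, 1:1 randomization, the values of `alpha`, `beta`, `gamma`, and the choice h(X) = X_1 + X_2) is an assumption made so the example runs, not the authors' specification.

```python
import math
import random


def simulate_trial(n, n_covariates, theta, alpha=-1.0, beta=0.1, gamma=0.5, seed=0):
    """Simulate (x, t, y) from logit P(Y=1) = alpha + beta*T + gamma*h(X) + theta*T*I(X in A)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_covariates)]  # assumed covariate law
        t = rng.randint(0, 1)                                    # assumed 1:1 randomization
        in_a = x[0] > 0 and x[1] < 0                             # A = {X_1 > 0, X_2 < 0}
        h = x[0] + x[1]                                          # hypothetical h(X)
        eta = alpha + beta * t + gamma * h + theta * t * in_a
        p = 1.0 / (1.0 + math.exp(-eta))
        y = 1 if rng.random() < p else 0
        data.append({"x": x, "t": t, "y": y, "in_a": in_a})
    return data


trial = simulate_trial(n=1000, n_covariates=15, theta=0.75)
```

With independent, symmetric covariates the region A contains about a quarter of the subjects, in line with the 25% stated for the artificial data earlier in the talk.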

63 Properties of subgroup
Simulations (know the true A):
- Are we finding the correct X's?
- How big is Â?
- How close is Â to the true A?
  - Sensitivity, Specificity
  - Positive Predictive Value, Negative Predictive Value
- Accuracy of estimates of Q(Â)

64 Used model
A logistic model for P(Y = 1) with main effects, a product term 0.5 X_2 X_7, and the enhanced-effect term θ T I{X ∈ A}. The remaining coefficients, signs and subscripts are garbled in the transcript, which reads: "1+0.5X X 2 0.5X T+0.5X 2 X 7".
- 100 datasets generated
- For each, n = (value not preserved in the transcript)

65 Table 13: Selected X's, θ = 0.75, A = {X_1 > 0, X_2 < 0}
Columns: Logit, Virtual Twins. Surviving entries:
- Pct. Found X_3 (null): 0 (Logit), 13 (Virtual Twins)
- Pct. Found X_1 at top of tree: 71
- Pct. Found X 1 top 2 of tree: 62
Values for the other rows (Mean and SD of # unique X's; Pct. Found X_1 (int); Pct. Found X_2 (int); Pct. Found X_7 (main); Pct. Found no int/tree) are not preserved in the transcript.

66 Table 14: Compare A to Â, θ = 0.75, A = {X_1 > 0, X_2 < 0}
Columns: Logit, Virtual Twins. Surviving entries:
- Percent Â empty: 39 (Logit), 7 (Virtual Twins)
Values for the other rows (median size of Â, Sensitivity, Specificity, PPV, NPV, AUC) are not preserved in the transcript.

67 Table 15: Q Estimates, θ = 0.75, A = {X_1 > 0, X_2 < 0}
Columns: Mean and SD for Logit; Mean and SD for Virtual Twins. Rows: Q(A); Q(Â); Q̂(Â) by Resub, SimNewDat, Cr.Val, FullCr.Val. Cell values are not preserved in the transcript.

68 Table 16: Selected X's, Null case
                          Logit   Virtual Twins
Pct. Found X_1 (int)        6         72
Pct. Found X_2 (int)        6         62
Pct. Found X_7 (main)       2         70
Pct. Found X_3 (null)       0          9
Values for Mean (# unique X's), SD (# unique X's) and Pct. Found no tree/int are not preserved in the transcript.

69 Table 17: Properties of Â, Null case
Columns: Logit, Virtual Twins. Rows: percent Â empty; Specificity. Cell values are not preserved in the transcript.

70 Table 18: Q Estimates, Null case
Columns: Mean and SD for Logit; Mean and SD for Virtual Twins. Rows: Q(A); Q(Â); Q̂(Â) by Resub, SimNewDat, Cr.Val, FullCr.Val. Cell values are not preserved in the transcript.

71 Lots of possibilities for adaptation and extension
- Replace Random Forest with another predictor
- Generalize to high-dimensional data (genomic/genetic)
- Build into the study design
- Vary thresholds to make Â smaller or larger
- Use different metrics for Q(A)
- Use different definitions of Z_i
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
More informationData Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
More informationBetter credit models benefit us all
Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
More informationExample: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
More informationEnsemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
More informationData Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
More informationCross Validation. Dr. Thomas Jensen Expedia.com
Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract
More informationKnowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes
Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &
More informationSTATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
More informationModel Combination. 24 Novembre 2009
Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy
More informationA General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center
More informationVI. Introduction to Logistic Regression
VI. Introduction to Logistic Regression We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models
More informationDecision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
More informationLogistic Regression for Spam Filtering
Logistic Regression for Spam Filtering Nikhila Arkalgud February 14, 28 Abstract The goal of the spam filtering problem is to identify an email as a spam or not spam. One of the classic techniques used
More informationInsurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
More informationGerry Hobbs, Department of Statistics, West Virginia University
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
More informationAdequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection
Directions in Statistical Methodology for Multivariable Predictive Modeling Frank E Harrell Jr University of Virginia Seattle WA 19May98 Overview of Modeling Process Model selection Regression shape Diagnostics
More informationCI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
More informationSome Essential Statistics The Lure of Statistics
Some Essential Statistics The Lure of Statistics Data Mining Techniques, by M.J.A. Berry and G.S Linoff, 2004 Statistics vs. Data Mining..lie, damn lie, and statistics mining data to support preconceived
More informationData Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
More informationGeneralizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel
Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Copyright 2008 All rights reserved. Random Forests Forest of decision
More informationSection 6: Model Selection, Logistic Regression and more...
Section 6: Model Selection, Logistic Regression and more... Carlos M. Carvalho The University of Texas McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Model Building
More informationCross-Validation. Synonyms Rotation estimation
Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical
More informationPredictive Modeling Techniques in Insurance
Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics
More informationChapter 7: Simple linear regression Learning Objectives
Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -
More informationOrganizing Your Approach to a Data Analysis
Biost/Stat 578 B: Data Analysis Emerson, September 29, 2003 Handout #1 Organizing Your Approach to a Data Analysis The general theme should be to maximize thinking about the data analysis and to minimize
More informationAzure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
More informationFeature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification
Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde
More informationDistances, Clustering, and Classification. Heatmaps
Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be
More informationTHE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell
THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether
More informationData quality in Accounting Information Systems
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
More informationOn the Effectiveness of Obfuscation Techniques in Online Social Networks
On the Effectiveness of Obfuscation Techniques in Online Social Networks Terence Chen 1,2, Roksana Boreli 1,2, Mohamed-Ali Kaafar 1,3, and Arik Friedman 1,2 1 NICTA, Australia 2 UNSW, Australia 3 INRIA,
More informationStudy Design and Statistical Analysis
Study Design and Statistical Analysis Anny H Xiang, PhD Department of Preventive Medicine University of Southern California Outline Designing Clinical Research Studies Statistical Data Analysis Designing
More informationMethods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL
Paper SA01-2012 Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL ABSTRACT Analysts typically consider combinations
More informationApplied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification
More informationLocation matters. 3 techniques to incorporate geo-spatial effects in one's predictive model
Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Xavier Conort xavier.conort@gear-analytics.com Motivation Location matters! Observed value at one location is
More informationLeveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
More informationStudying Auto Insurance Data
Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.
More informationUsing multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
More informationRobust Tree-based Causal Inference for Complex Ad Effectiveness Analysis
Robust Tree-based Causal Inference for Complex Ad Effectiveness Analysis Pengyuan Wang Wei Sun Dawei Yin Jian Yang Yi Chang Yahoo Labs, Sunnyvale, CA, USA Purdue University, West Lafayette, IN,USA {pengyuan,
More informationAPPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING. Anatoli Nachev
86 ITHEA APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING Anatoli Nachev Abstract: This paper presents a case study of data mining modeling techniques for direct marketing. It focuses to three
More informationPersonalized Predictive Medicine and Genomic Clinical Trials
Personalized Predictive Medicine and Genomic Clinical Trials Richard Simon, D.Sc. Chief, Biometric Research Branch National Cancer Institute http://brb.nci.nih.gov brb.nci.nih.gov Powerpoint presentations
More informationWhy do statisticians "hate" us?
Why do statisticians "hate" us? David Hand, Heikki Mannila, Padhraic Smyth "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data
More informationModel Validation Techniques
Model Validation Techniques Kevin Mahoney, FCAS kmahoney@ travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost
More informationCalculating Effect-Sizes
Calculating Effect-Sizes David B. Wilson, PhD George Mason University August 2011 The Heart and Soul of Meta-analysis: The Effect Size Meta-analysis shifts focus from statistical significance to the direction
More informationClassification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
More informationCar Insurance. Prvák, Tomi, Havri
Car Insurance Prvák, Tomi, Havri Sumo report - expectations Sumo report - reality Bc. Jan Tomášek Deeper look into data set Column approach Reminder What the hell is this competition about??? Attributes
More informationOn Cross-Validation and Stacking: Building seemingly predictive models on random data
On Cross-Validation and Stacking: Building seemingly predictive models on random data ABSTRACT Claudia Perlich Media6 New York, NY 10012 claudia@media6degrees.com A number of times when using cross-validation
More informationStatistical issues in the analysis of microarray data
Statistical issues in the analysis of microarray data Daniel Gerhard Institute of Biostatistics Leibniz University of Hannover ESNATS Summerschool, Zermatt D. Gerhard (LUH) Analysis of microarray data
More informationMS 2007-0081-RR Booil Jo. Supplemental Materials (to be posted on the Web)
MS 2007-0081-RR Booil Jo Supplemental Materials (to be posted on the Web) Table 2 Mplus Input title: Table 2 Monte Carlo simulation using externally generated data. One level CACE analysis based on eqs.
More informationApplied Multivariate Analysis - Big data analytics
Applied Multivariate Analysis - Big data analytics Nathalie Villa-Vialaneix nathalie.villa@toulouse.inra.fr http://www.nathalievilla.org M1 in Economics and Economics and Statistics Toulouse School of
More informationBuilding risk prediction models - with a focus on Genome-Wide Association Studies. Charles Kooperberg
Building risk prediction models - with a focus on Genome-Wide Association Studies Risk prediction models Based on data: (D i, X i1,..., X ip ) i = 1,..., n we like to fit a model P(D = 1 X 1,..., X p )
More informationMultinomial Logistic Regression Ensembles
Multinomial Logistic Regression Ensembles Abstract This article proposes a method for multiclass classification problems using ensembles of multinomial logistic regression models. A multinomial logit model
More informationInstitute of Actuaries of India Subject CT3 Probability and Mathematical Statistics
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in
More informationLatent Class Regression Part II
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this
More informationBeating the MLB Moneyline
Beating the MLB Moneyline Leland Chen llxchen@stanford.edu Andrew He andu@stanford.edu 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series
More informationDistributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
More information1/27/2013. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2
PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 Introduce moderated multiple regression Continuous predictor continuous predictor Continuous predictor categorical predictor Understand
More informationData Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan
Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:
More informationarxiv:1206.6666v1 [stat.ap] 28 Jun 2012
The Annals of Applied Statistics 2012, Vol. 6, No. 2, 772 794 DOI: 10.1214/11-AOAS521 In the Public Domain arxiv:1206.6666v1 [stat.ap] 28 Jun 2012 ANALYZING ESTABLISHMENT NONRESPONSE USING AN INTERPRETABLE
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
More informationMTH 140 Statistics Videos
MTH 140 Statistics Videos Chapter 1 Picturing Distributions with Graphs Individuals and Variables Categorical Variables: Pie Charts and Bar Graphs Categorical Variables: Pie Charts and Bar Graphs Quantitative
More informationQuantitative Methods for Finance
Quantitative Methods for Finance Module 1: The Time Value of Money 1 Learning how to interpret interest rates as required rates of return, discount rates, or opportunity costs. 2 Learning how to explain
More informationPractical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be