Building risk prediction models - with a focus on Genome-Wide Association Studies
Risk prediction models Based on data $(D_i, X_{i1}, \ldots, X_{ip})$, $i = 1, \ldots, n$, we would like to fit a model $P(D = 1 \mid X_1, \ldots, X_p) = f(X_1, \ldots, X_p)$ and evaluate how good the model is. Issues: which X to use (model selection); what form to use for f (non-parametric(?) regression); selection of smoothing/control parameters; estimation of coefficients in the final model; unbiased (?) evaluation of the model.
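For concreteness, the logistic form used for the fits later in this talk writes f as

\[
  P(D = 1 \mid X_1, \ldots, X_p)
    = \frac{\exp\bigl(\beta_0 + \sum_{k=1}^{p} \beta_k X_k\bigr)}
           {1 + \exp\bigl(\beta_0 + \sum_{k=1}^{p} \beta_k X_k\bigr)} .
\]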
What form to use for f Not discussed in this talk! There are many (non-parametric) regression methods. Quite a few of these methods involve variable selection and/or selection of hyper-parameters, in addition to fitting a regression model. In my experience, when there are many predictors, linear models often (not always) work (almost) as well as more complicated models (trees, splines, interactions, ...); they are often easier to explain and come with less variance.
Selection of predictors Selecting predictors on the same data that is used for training and/or evaluating models can (sometimes severely) influence your results, in different ways. Be aware! Questions to ask:
- If certain markers are selected for the prediction because they are reported in the literature as the most significant ones, was the data you have in hand part of the data that was used to select those markers?
- Can you resist the temptation to use your test set for anything until you have selected your final model? Are you sure that the subjects in your test data were not used for the selection mentioned in the previous point? Remember, you cannot even use the test set to choose between the last few models!
- Using the same data to evaluate your model as is used inside your cross-validation procedure biases your results. Carefully examine which steps influence each other.
In an ideal world you have loads of data:
1. Selection data to select which predictors to use.
2. Training data to select control parameters (like λ in the Lasso) and to estimate the coefficients. You may need cross-validation to select λ, and then refit on all training data to estimate the coefficients. Alternatively, you may want separate Training and Validation data.
3. Evaluation or Test data, not used before, to evaluate your rule.
If you don't have enough data, you can sometimes do a second level of cross-validation to effectively increase your sample size. The training and selection data can sometimes be combined, if you are careful about what you do on which data.
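A minimal sketch of such a three-way split (the sample size, proportions, and seed are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)

def three_way_split(n, frac_select=0.3, frac_train=0.3):
    """Randomly split subject indices 0..n-1 into selection, training, and test sets."""
    idx = rng.permutation(n)
    n_sel = int(frac_select * n)
    n_tr = int(frac_train * n)
    return idx[:n_sel], idx[n_sel:n_sel + n_tr], idx[n_sel + n_tr:]

sel_idx, train_idx, test_idx = three_way_split(5000)
# Use sel_idx only to choose predictors, train_idx to tune and fit,
# and touch test_idx exactly once, to evaluate the final rule.
```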
Risk prediction models for GWAS Combine significant SNPs and environmental factors to predict the risk of a disease. Do not worry about cause and effect of SNPs and these factors, as the goal is prediction. Do not worry about the form of the prediction rule: a black box is fine.
What to worry about? Model selection: while there are many SNPs that can be significant (or not), there are many more possible models. We need efficient strategies to select models, as well as fair ways to compare them. Model evaluation: after all the selection we also want to evaluate the quality of the model. We need new data for that. On a practical level - it is getting increasingly easy to get GWAS data from other groups. However, true prediction models also need information on other risk factors. Harmonizing those is a lot of work!
Lasso, LARS, boosting The traditional way to fit a logistic model with predictors $X_1, \ldots, X_p$ is maximum likelihood. This works well provided the predictors are independent and $p < n$, the sample size. The Lasso (Tibshirani) is based on the observation that for prediction it is often better if some of the coefficients are shrunk (maybe all the way down to 0). This can be achieved by maximizing $\ell(X, D; \beta) - \lambda \sum_k |\beta_k|$, where $\ell(X, D; \beta)$ is the logistic log-likelihood with parameters $\beta$, and $\lambda > 0$ is a penalty parameter.
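A minimal sketch of such a penalized fit on simulated data (all dimensions and the value C = 0.1, which plays the role of 1/λ in scikit-learn's parameterization, are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated stand-in for (D_i, X_i1, ..., X_ip); dimensions are illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))              # n = 500 subjects, p = 200 predictors
beta = np.zeros(200)
beta[:5] = 0.8                               # only 5 predictors truly matter
D = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

# L1-penalized (lasso) logistic regression; scikit-learn's C is, up to scaling,
# the inverse of the penalty lambda, so smaller C means heavier shrinkage.
fit = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, D)
print("coefficients shrunk exactly to zero:", int(np.sum(fit.coef_ == 0)))
```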
Lasso, LARS, boosting (cont) There are close relations between the Lasso and (some forms of) boosting, and other related stepwise methods. These relations are partly formalized using the LARS algorithm (Efron, Hastie, Johnstone, Tibshirani). Parameter selection is usually done via cross-validation. (But we still need a test set to evaluate how good the prediction is.) It should be noted that these methods select the best predictors, not necessarily the significantly associated variables.
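A sketch of choosing λ by cross-validation on the training data and then evaluating once on a held-out test set, again on toy data (nothing here reproduces the actual analysis; scikit-learn parameterizes the penalty through its inverse, C):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 100))
D = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))
X_train, D_train, X_test, D_test = X[:400], D[:400], X[400:], D[400:]

# Pick the penalty by 10-fold CV on the training data only;
# the test set is used exactly once, at the very end.
cv_fit = LogisticRegressionCV(
    Cs=np.logspace(-3, 1, 20), cv=10, penalty="l1",
    solver="liblinear", scoring="neg_log_loss",
).fit(X_train, D_train)

p_test = cv_fit.predict_proba(X_test)[:, 1]
test_loglik = np.mean(D_test * np.log(p_test) + (1 - D_test) * np.log(1 - p_test))
print("chosen C (about 1/lambda):", cv_fit.C_[0], "test log-likelihood:", test_loglik)
```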
Lasso and GWAS The code is efficient, but it cannot deal with 100,000s of predictors and still allow comparative simulation studies. The natural approach would appear to be to filter SNPs at a particular α level and only consider those predictors. The clean approach is to select the significant SNPs separately within each cross-validation run (for λ); the dirty approach selects them once, before the cross-validation. Potentially, the dirty way increases bias and decreases variance.
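As a sketch of the distinction, the "dirty" shortcut looks like the following (keep_idx is assumed to have been chosen once, on the full training data; the grid Cs and the use of scikit-learn are illustrative). Because the filter has already seen every CV fold, the cross-validated error estimates are optimistic; this is essentially what approach 3 (Part-CV) below does. The "clean" alternative re-runs the filter inside every fold; see the Full-CV sketch later.

```python
from sklearn.linear_model import LogisticRegressionCV

def dirty_cv_fit(X_train, D_train, keep_idx, Cs):
    """'Dirty' shortcut: filter once on all training data (keep_idx), then
    cross-validate the penalty on the already-filtered matrix."""
    X_filtered = X_train[:, keep_idx]        # keep_idx was chosen using all of X_train
    return LogisticRegressionCV(
        Cs=Cs, cv=10, penalty="l1", solver="liblinear", scoring="neg_log_loss",
    ).fit(X_filtered, D_train)
```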
WTCCC data 3000 common controls; 2000 cases for each of seven diseases (coronary artery disease, type 1 diabetes, type 2 diabetes, Crohn's disease, rheumatoid arthritis, bipolar disorder, hypertension). Affymetrix 5.0 (~500,000 SNPs). We carried out the experiment for T1D, T2D, and Crohn's disease. For each disease, divide the data into a training set of 3000 and a test set of 2000, and apply various prediction methods.
Preprocessing Filter out all SNPs with MAF < 5%, missingness > 5%, or Hardy-Weinberg $P < 10^{-5}$ in controls. Fill in missing data using single (random) imputation. (Makes life much easier.) Many methods can deal with large data sets, but 500,000 predictors - certainly for simulations - is too much. Thus, filter and select the top p predictors. How? Correlations between predictors can get very high - as high as 1.00 - so some methods will need some prior filtering.
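A rough sketch of these filters and of single random imputation, assuming genotypes coded 0/1/2 with NaN for missing; the chi-square Hardy-Weinberg test is a simple stand-in for whatever test was actually used:

```python
import numpy as np
from scipy import stats

def hwe_pvalue(g):
    """Chi-square Hardy-Weinberg test on genotypes 0/1/2 (illustrative; an exact test is preferable)."""
    g = g[~np.isnan(g)]
    obs = np.array([np.sum(g == 0), np.sum(g == 1), np.sum(g == 2)])
    n = obs.sum()
    q = (obs[1] + 2 * obs[2]) / (2 * n)       # alternate allele frequency
    if q == 0 or q == 1:
        return 1.0                            # monomorphic SNP: nothing to test
    exp = n * np.array([(1 - q) ** 2, 2 * q * (1 - q), q ** 2])
    return stats.chisquare(obs, exp, ddof=1).pvalue

def qc_filter(G, controls, maf_min=0.05, miss_max=0.05, hwe_p_min=1e-5):
    """Keep SNPs in the genotype matrix G (n x p) passing MAF, missingness,
    and Hardy-Weinberg (in controls) thresholds; `controls` indexes the control subjects."""
    miss = np.mean(np.isnan(G), axis=0)
    freq = np.nanmean(G, axis=0) / 2.0
    maf = np.minimum(freq, 1 - freq)
    hwe_p = np.array([hwe_pvalue(G[controls, j]) for j in range(G.shape[1])])
    return (maf >= maf_min) & (miss <= miss_max) & (hwe_p >= hwe_p_min)

def impute_random(G):
    """Single random imputation: fill each missing genotype with a draw from
    that SNP's observed genotype distribution."""
    G = G.copy()
    rng = np.random.default_rng(0)
    for j in range(G.shape[1]):
        missing = np.isnan(G[:, j])
        if missing.any():
            G[missing, j] = rng.choice(G[~missing, j], size=missing.sum())
    return G
```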
Five selection procedures For each procedure, set part of the data apart as a test set (we took 40%).
1. Cheat: select the best predictors using all data. Fit using the Lasso. Select the best λ using cross-validation (CV) on the training data. Refit with that λ on all training data.
2. No-Test: select the best predictors using the training data. Fit using the Lasso. Select the best λ by looking at the test data.
3. Part-CV: select the best predictors using the training data. Fit using the Lasso. Select the best λ using CV on the training data. Refit with that λ on all training data.
4. Full-CV: within each CV fold, select the best predictors on 9/10 of the training data, get a score for each λ, and evaluate on the remaining 1/10 of the training data. Select the best λ. Then select the best predictors using all training data and refit with that λ. (A code sketch of this procedure follows after the list.)
5. Wei et al.: divide the training data in two parts. Use one part to select the best predictors. Use the other part to find the best λ using CV. Refit with that λ on the second part of the training data.
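A sketch of approach 4 (Full-CV), with a simple correlation screen standing in for the single-SNP selection step and scikit-learn's C (roughly 1/λ) grid standing in for the Lasso path; details of the actual implementation may differ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def rank_snps(X, D, m):
    """Quick stand-in for single-SNP screening: keep the m SNPs most correlated with D."""
    r = np.abs(np.corrcoef(X.T, D)[-1, :-1])
    return np.argsort(-r)[:m]

def full_cv(X_train, D_train, m, C_grid):
    """Approach 4 (Full-CV): re-select the top m SNPs inside every fold, score each
    candidate penalty on the held-out fold, then refit on all training data."""
    scores = np.zeros(len(C_grid))
    folds = StratifiedKFold(10, shuffle=True, random_state=0)
    for tr, te in folds.split(X_train, D_train):
        keep = rank_snps(X_train[tr], D_train[tr], m)   # filter on 9/10 of the training data
        for i, C in enumerate(C_grid):
            fit = LogisticRegression(penalty="l1", C=C, solver="liblinear")
            fit.fit(X_train[tr][:, keep], D_train[tr])
            p = fit.predict_proba(X_train[te][:, keep])[:, 1]
            scores[i] += np.sum(D_train[te] * np.log(p) + (1 - D_train[te]) * np.log(1 - p))
    best_C = C_grid[np.argmax(scores)]
    keep = rank_snps(X_train, D_train, m)               # final selection on all training data
    final = LogisticRegression(penalty="l1", C=best_C, solver="liblinear")
    return final.fit(X_train[:, keep], D_train), keep
```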
Crohn's disease - Lasso

Approach:            1 Cheat    2 No-Test   3 Part-CV   4 Full-CV
Filter on:           All        Training    Training    CV
Parameters (λ) on:   CV         Test        CV          CV

SNPs considered
      0              1241.77    1241.77     1241.77     1241.77
      1              1233.11    1239.00     1245.28     1239.33
      2              1219.18    1231.46     1237.07     1231.84
      5              1213.11    1228.46     1238.29     1232.09
     10              1203.67    1228.46     1237.36     1232.71
     25              1200.98    1206.54     1212.29     1207.98
     50              1180.92    1196.46     1205.37     1197.59
    100              1151.80    1193.20     1214.62     1193.20
    250              1010.51    1194.67     1360.21     1196.24
    500               904.86    1192.89     1467.79     1195.61
   1000               784.33    1191.17     1658.86     1195.04
   2000               688.18    1191.38     1987.28     1194.78
Crohn's - average log-likelihood [Figure: log-likelihood (training and test) versus number of SNPs considered, for approaches 1, 3, and 4.]
Number of SNPs actually used

GLM (fit the best p predictors; select p like λ):                              17
Filtered GLM (best p predictors; remove SNPs with R² > 0.9; select p like λ):  22
Stepwise GLM (stepwise selection using BIC):                                   23

Lasso:
   10 top SNPs considered      6
   25 top SNPs considered     14
   50 top SNPs considered     25
  100 top SNPs considered     33
  250 top SNPs considered     91
  500 top SNPs considered    155
 1000 top SNPs considered    176
 2000 top SNPs considered    177
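For the Lasso rows, "number of SNPs actually used" is simply the count of non-zero coefficients in the final fit; for instance:

```python
import numpy as np

def n_snps_used(lasso_fit):
    """Number of SNPs actually used = non-zero coefficients in a fitted L1 logistic model."""
    return int(np.sum(lasso_fit.coef_ != 0))
```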
Crohn's [Figure: training and test average log-likelihood and AUC versus number of SNPs considered, with GLM, filtered GLM, and stepwise GLM shown for comparison.]
ROC - Crohn's [Figure: ROC curves for models with 5, 10, 25, 100, 500, and 2000 SNPs considered.]
SNPs used - Crohn's [Figure: which SNPs are used by each method - BIC, AIC, GLM, filtered GLM, and the Lasso with 5 to 2000 top SNPs considered.]
Approach 5 - Wei et al. AJHG 2013
[Diagram: data split for approach 5 - a selection set, a test set, and a remainder of which a fraction α is used for training and a fraction 1 − α is not used.]
Verify results on another GWAS NIDDK Crohn's disease GWAS: Illumina 300K, 792 cases, 932 controls. Refit the selected models on the complete WTCCC data. Impute the essential SNPs in the NIDDK data using MACH; use ten probability samples and average the results. Apply the model to the NIDDK data. Adjust the intercept of the logistic model to correct for the different case/control ratio.
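One standard way to make such an intercept adjustment (a prior-correction; the slides do not spell out the exact formula used) shifts the intercept by the difference in log-odds of the case fractions:

```python
import numpy as np

def adjust_intercept(beta0, case_frac_old, case_frac_new):
    """Shift the logistic intercept so predictions are calibrated to a new case/control
    ratio (standard prior-correction; the original analysis may have differed in detail)."""
    logit = lambda p: np.log(p / (1 - p))
    return beta0 + logit(case_frac_new) - logit(case_frac_old)

# Example: trained on WTCCC (2000 cases / 3000 controls), applied to NIDDK
# (792 cases / 932 controls); beta0 = -1.2 is a hypothetical fitted intercept.
new_b0 = adjust_intercept(beta0=-1.2,
                          case_frac_old=2000 / 5000,
                          case_frac_new=792 / (792 + 932))
```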
This is a high bar Different platform: we have to impute > 90% of the SNPs. Different populations. Different continents. No information on whether disease adjudication is comparable.
AUC - NIDDK and WTCCC comparison [Figure: AUC versus number of SNPs considered, for the NIDDK and WTCCC test data.]
Cross-study experiment [Figure: calibration plot - fraction of subjects that are cases versus fitted probability, with all of WTCCC as training data and NIDDK as test data.]
Conclusions It is possible to develop prediction models with moderate predictive power using GWAS data. These predictive models produce results that are reproducible on other GWAS studies. You have to be honest in cross-validation. Using more SNPs than are identified as significant helps. A shrinkage method like the Lasso helps.
References/Thanks Thanks: Michael LeBlanc, Valerie Obenchain, Li Hsu. References: Kooperberg C, LeBlanc M, Obenchain V (2010). Risk prediction using genome-wide association studies. Genetic Epidemiology, 34, 643-652. Wei Z, Wang W, et al. (2013). Large sample size, wide variant spectrum, and advanced machine-learning technique boost risk prediction for inflammatory bowel disease. American Journal of Human Genetics, 92, 1008-1012.