Building risk prediction models - with a focus on Genome-Wide Association Studies. Charles Kooperberg


1 Building risk prediction models - with a focus on Genome-Wide Association Studies

2 Risk prediction models
Based on data (D_i, X_i1, ..., X_ip), i = 1, ..., n, we would like to fit a model
P(D = 1 | X_1, ..., X_p) = f(X_1, ..., X_p)
and evaluate how good the model is. Issues:
- Which X to use (model selection).
- What form to use for f (non-parametric(?) regression).
- Selection of smoothing/control parameters.
- Estimation of coefficients in the final model.
- Unbiased(?) evaluation of the model.

3 What form to use for f
Not discussed in this talk! There are many (non-parametric) regression methods. Quite a few of these methods involve variable selection and/or selection of some hyper-parameters, in addition to fitting a regression model. In my experience, when there are many predictors, linear models often (not always) work (almost) as well as more complicated models (trees, splines, interactions, ...); they are often easier to explain and come with less variance.

4 Selection of predictors
Selecting predictors on the same data used for training and/or evaluating models can (sometimes severely) influence your results, in different ways. Be aware! Questions to ask:
- If certain markers are selected for the prediction because they are the most significant ones reported in the literature, was the data you have in hand part of the data that was used to select those markers?
- Can you resist the temptation to use your test set for anything until you have selected your final model? Remember, you cannot even use it to select between the last few models!
- Are you sure that the subjects in your test data have not been used for the selection mentioned in the previous point?
- Using the same data to evaluate your model as is used inside your cross-validation procedure biases your results. Carefully examine which steps influence each other.

5 In an ideal world you have loads of data
1 Selection data, to select which predictors to use.
2 Training data, to select control parameters (like λ in the Lasso) and to estimate parameters. You may need to do cross-validation to select λ, and then refit using all training data to fit the parameters. Alternatively, you may want separate data for training and validation.
3 Evaluation or test data, to evaluate your rule; it must not have been used before.
If you don't have enough data, you can sometimes do a second level of cross-validation to effectively increase your sample size. The training and selection data can sometimes be combined, if you are careful about what you do on which data.
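The three-way layout above can be sketched as a simple index split. This is a minimal illustration; the helper name and the fractions are my assumptions, not from the talk.

```python
import random

def three_way_split(n, frac_select=0.3, frac_train=0.3, seed=0):
    """Split indices 0..n-1 into selection, training, and test sets.

    Hypothetical helper mirroring the slide's ideal-world layout:
    the test set is set aside first and never touched until the
    final evaluation. Fractions are illustrative assumptions.
    """
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_sel = int(frac_select * n)
    n_tr = int(frac_train * n)
    selection = idx[:n_sel]                 # pick which predictors to use
    training = idx[n_sel:n_sel + n_tr]      # select lambda, fit coefficients
    test = idx[n_sel + n_tr:]               # evaluate once, at the very end
    return selection, training, test

sel, tr, te = three_way_split(100)
```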

6 Risk prediction models for GWAS
Combine significant SNPs and environmental factors to predict the risk of a disease. Do not worry about cause and effect of the SNPs and these factors, as the goal is prediction. Do not worry about the form of the predictors: a black box is fine.

7 What to worry about
Model selection: while there are many SNPs that can be significant (or not), there are many more possible models. We need efficient strategies to select models, as well as fair ways to compare them.
Model evaluation: after all the selection we also want to evaluate the quality of the model. We need new data for that.
On a practical level: it is getting increasingly easy to get GWAS data from other groups. However, true prediction models also need information on other risk factors, and harmonizing those is a lot of work!

8 Lasso, LARS, boosting
The traditional way to fit a logistic model with predictors X_1, ..., X_p is maximum likelihood. This works well provided the predictors are independent and p < n, the sample size. The Lasso (Tibshirani) is based on the observation that for prediction it is often better if some of the coefficients are shrunk (maybe all the way down to 0). This can be achieved by maximizing
l(X, D; β) − λ Σ_k |β_k|,
where l(X, D; β) is the logistic log-likelihood with parameters β, and λ > 0 is a penalty parameter.
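The penalized fit above can be sketched with a short proximal-gradient loop in NumPy. This is a minimal illustration of maximizing l(X, D; β) − λ Σ_k |β_k|, not the talk's actual software; in practice a coordinate-descent solver such as glmnet would be used.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding: the proximal operator of the l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def logistic_lasso(X, y, lam, n_iter=500, step=None):
    """Sketch: maximize l(X, D; beta) - lam * sum_k |beta_k| by
    proximal gradient (ISTA). The intercept is left unpenalized.
    Illustrative only; not the implementation used in the talk."""
    n, p = X.shape
    Xa = np.hstack([np.ones((n, 1)), X])          # prepend intercept column
    beta = np.zeros(p + 1)
    if step is None:
        # step <= 1/L, with L = ||Xa||_2^2 / 4 a Lipschitz bound for the
        # gradient of the negative logistic log-likelihood
        step = 4.0 / np.linalg.norm(Xa, 2) ** 2
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-Xa @ beta))     # fitted P(D = 1 | X)
        beta = beta - step * (Xa.T @ (mu - y))    # gradient step on -loglik
        beta[1:] = soft_threshold(beta[1:], step * lam)  # shrink the slopes
    return beta
```

For a large λ every slope is shrunk all the way to 0, exactly the behavior the slide describes.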

9 Lasso, LARS, boosting (cont.)
There are close relations between the Lasso and (some forms of) boosting, or other related stepwise methods. These relations are partly formalized using the LARS algorithm (Efron, Hastie, Johnstone, Tibshirani). Parameter selection is usually via cross-validation. (But we still need a test set to evaluate how good the prediction is.) It should be noted that these methods select the best predictors, not necessarily the significantly associated variables.

10 Lasso and GWAS
The code may be efficient, but it cannot deal efficiently with 100,000s of predictors and still allow comparative simulation studies. The natural approach would appear to be to filter at a particular α level and only consider those predictors. The clean approach is to select the significant SNPs separately within each cross-validation run (for λ); the dirty approach selects them once, on all the data. Potentially, the dirty way increases bias and decreases variance.
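The clean strategy can be made concrete with a small sketch that re-runs the SNP filter inside every cross-validation fold, so the held-out samples never inform the filtering. The helper names are mine, and a plain correlation filter stands in here for the SNP-level significance tests.

```python
import numpy as np

def top_k_filter(X, y, k):
    """Rank predictors by absolute (centered) association with the outcome
    and return the indices of the top k. A stand-in for per-SNP tests."""
    yc = y - y.mean()
    score = np.abs((X - X.mean(0)).T @ yc)
    return np.argsort(score)[::-1][:k]

def cv_folds(n, n_folds, rng):
    """Random partition of 0..n-1 into n_folds held-out index sets."""
    return np.array_split(rng.permutation(n), n_folds)

def clean_cv_selection(X, y, k=10, n_folds=10, seed=0):
    """'Clean' approach from the slide: repeat the filtering step inside
    each CV fold, using only that fold's training portion. Returns the
    per-fold selected predictor indices (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    picks = []
    for fold in cv_folds(len(y), n_folds, rng):
        train = np.setdiff1d(np.arange(len(y)), fold)
        picks.append(top_k_filter(X[train], y[train], k))
    return picks
```

The dirty shortcut would call top_k_filter once on all of X and y before the CV loop; the point of the slide is that the two can give different answers.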

11 WTCCC data
3000 common controls; 2000 cases for each of seven diseases (coronary artery disease, type 1 diabetes, type 2 diabetes, Crohn's disease, rheumatoid arthritis, bipolar disorder, hypertension). Affymetrix 5.0 (~500,000 SNPs). We carried out the experiment for T1D, T2D, and Crohn's. For each of the diseases, divide the data into a training set of 3000 and a test set of the remaining subjects; apply various prediction methods.

12 Preprocessing
Filter out all SNPs with MAF < 5%, missingness > 5%, or a Hardy-Weinberg P-value in the controls below a cutoff. Fill in missing data using single (random) imputation (makes life much easier). Many methods can deal with large datasets, but 500,000 predictors (certainly for simulations) is too much; thus, filter and select the top p predictors. How? Correlations between predictors can get very high, so some methods will need some prior filtering.
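The single random imputation mentioned here can be sketched as follows, assuming genotypes coded 0/1/2 with NaN for missing. The function name is mine, and the actual analysis may have imputed differently; drawing each missing value from the SNP's observed genotypes is one simple way to do a single random imputation.

```python
import numpy as np

def random_impute(G, seed=0):
    """Single random imputation sketch: fill each missing genotype by
    sampling from that SNP's observed genotype values (0/1/2 copies).
    G is an (n_subjects, n_snps) array with NaN marking missing calls."""
    rng = np.random.default_rng(seed)
    G = G.astype(float).copy()
    for j in range(G.shape[1]):
        col = G[:, j]
        miss = np.isnan(col)
        if miss.any():
            observed = col[~miss]                 # this SNP's called genotypes
            G[miss, j] = rng.choice(observed, size=miss.sum())
    return G
```

After this step every downstream method sees a complete matrix, which is what "makes life much easier" on the slide.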

13 Five selection procedures
For each procedure, set part of the data apart as a test set (we took 40%).
1 Cheat: select the best predictors using all data. Fit using the Lasso. Select the best λ using cross-validation (CV) on the training data. Refit with that λ on all training data.
2 No-Test: select the best predictors using the training data. Fit using the Lasso. Select the best λ by looking at the test data.
3 Part-CV: select the best predictors using the training data. Fit using the Lasso. Select the best λ using CV on the training data. Refit with that λ on all training data.
4 Full-CV: within CV, select the best predictors on 9/10 of the training data, get a score for each λ, and evaluate on the remaining 1/10 of the training data. Select the best λ. Select the best predictors using all training data. Refit with that λ on all training data.
5 Wei et al.: divide the training data in two parts. Use one part to select the best predictors. Use the other part to find the best λ using CV. Refit with that λ on the second part of the training data.


19 Crohn's disease - Lasso

Approach     Cheat  No-Test   Part-CV   Full-CV
Filter       All    Training  Training  CV
Parameters   CV     Test      CV        CV

20 Crohn's - average log-likelihood
[Figure: average log-likelihood on training and test data for approaches 1 (Cheat), 3 (Part-CV), and 4 (Full-CV), plotted against the number of SNPs considered.]

21 Number of SNPs actually used
GLM: 17 (fitting the best p predictors; select p like λ)
filtered GLM: 22 (fitting the best p predictors; remove SNPs with R² > 0.9; select p like λ)
stepwise GLM: 23 (stepwise selection using BIC)
Lasso: 6 when the top 10 SNPs are considered, up to 177 for the largest number of top SNPs considered.

22 Crohn's - average log-likelihood and AUC
[Figure: training and test log-likelihood and training and test AUC versus the number of SNPs considered, for GLM, filtered GLM, and stepwise GLM.]

23 ROC - Crohn's
[Figure: ROC curves by number of SNPs considered.]

24 SNPs used - Crohn's
[Figure: which SNPs are used or not used by BIC, AIC, GLM, filtered GLM, and the Lasso, as a function of the number of top SNPs considered.]

25 Approach 5 - Wei et al., AJHG 2013
[Diagram: the 13,273 subjects are divided into a Selection set, a Training set (a fraction α of the data), a Test set, and an unused remainder (fraction 1 − α).]

27 Verify results on another GWAS
NIDDK Crohn's disease GWAS: Illumina 300K, 792 cases, 932 controls. Refit the selected models on the complete WTCCC data. Impute the essential SNPs in the NIDDK data using MACH; use ten probability samples and average the results. Apply the model to the NIDDK data. Adjust the intercept of the logistic model to correct for the different case/control ratio.
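The intercept adjustment in the last step can be written down explicitly: under the standard case-control sampling argument, fitting at case fraction π_s and applying at case fraction π_t shifts the intercept by logit(π_t) − logit(π_s). A sketch, with my helper name; the talk may have used a different correction.

```python
import math

def adjust_intercept(beta0, case_frac_train, case_frac_target):
    """Shift a logistic-regression intercept from the case/control ratio
    of the training data to that of the target data. Standard offset
    correction for case-control sampling; the slopes are unchanged."""
    logit = lambda p: math.log(p / (1.0 - p))
    return beta0 + logit(case_frac_target) - logit(case_frac_train)
```

For example, moving from a balanced 50/50 training sample to a rarer-case target population lowers the intercept, so that all fitted probabilities shrink toward the target prevalence.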

28 This is a high bar
Different platform: we have to impute > 90% of the SNPs. Different populations. Different continents. No information on whether disease adjudication is comparable.

29 AUC - NIDDK and WTCCC comparison
[Figure: AUC on the NIDDK and WTCCC data versus the number of SNPs considered.]

30 Cross-study experiment
[Figure: calibration plot of the fraction that is a case against the fitted probability, training on all WTCCC data and testing on NIDDK.]

31 Conclusions
It is possible to develop prediction models with moderate predictive power using GWAS data. These predictive models produce results that are reproducible on other GWAS studies. You have to be honest in cross-validation. Using more SNPs than are identified as significant helps. A shrinkage method like the Lasso helps.

32 References/Thanks
Thanks: Michael LeBlanc, Valerie Obenchain, Li Hsu.
References:
Kooperberg C, LeBlanc M, Obenchain V (2010). Risk prediction using genome-wide association studies. Genetic Epidemiology, 34.
Wei Z, Wang W, ... (2013). Large sample size, wide variant spectrum, and advanced machine-learning technique boost risk prediction for inflammatory bowel disease. American Journal of Human Genetics, 92.


Penalized regression: Introduction Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

More information

Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller

Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller Agenda Introduktion till Prediktiva modeller Beslutsträd Beslutsträd och andra prediktiva modeller Mathias Lanner Sas Institute Pruning Regressioner Neurala Nätverk Utvärdering av modeller 2 Predictive

More information

Local classification and local likelihoods

Local classification and local likelihoods Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor

More information

Least Squares Estimation

Least Squares Estimation Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

More information

This chapter will demonstrate how to perform multiple linear regression with IBM SPSS

This chapter will demonstrate how to perform multiple linear regression with IBM SPSS CHAPTER 7B Multiple Regression: Statistical Methods Using IBM SPSS This chapter will demonstrate how to perform multiple linear regression with IBM SPSS first using the standard method and then using the

More information

Benchmarking of different classes of models used for credit scoring

Benchmarking of different classes of models used for credit scoring Benchmarking of different classes of models used for credit scoring We use this competition as an opportunity to compare the performance of different classes of predictive models. In particular we want

More information

FINDING SUBGROUPS OF ENHANCED TREATMENT EFFECT. Jeremy M G Taylor Jared Foster University of Michigan Steve Ruberg Eli Lilly

FINDING SUBGROUPS OF ENHANCED TREATMENT EFFECT. Jeremy M G Taylor Jared Foster University of Michigan Steve Ruberg Eli Lilly FINDING SUBGROUPS OF ENHANCED TREATMENT EFFECT Jeremy M G Taylor Jared Foster University of Michigan Steve Ruberg Eli Lilly 1 1. INTRODUCTION and MOTIVATION 2. PROPOSED METHOD Random Forests Classification

More information

JetBlue Airways Stock Price Analysis and Prediction

JetBlue Airways Stock Price Analysis and Prediction JetBlue Airways Stock Price Analysis and Prediction Team Member: Lulu Liu, Jiaojiao Liu DSO530 Final Project JETBLUE AIRWAYS STOCK PRICE ANALYSIS AND PREDICTION 1 Motivation Started in February 2000, JetBlue

More information

Statistics in Retail Finance. Chapter 2: Statistical models of default

Statistics in Retail Finance. Chapter 2: Statistical models of default Statistics in Retail Finance 1 Overview > We consider how to build statistical models of default, or delinquency, and how such models are traditionally used for credit application scoring and decision

More information

Linear Classification. Volker Tresp Summer 2015

Linear Classification. Volker Tresp Summer 2015 Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

More information

Package acrm. R topics documented: February 19, 2015

Package acrm. R topics documented: February 19, 2015 Package acrm February 19, 2015 Type Package Title Convenience functions for analytical Customer Relationship Management Version 0.1.1 Date 2014-03-28 Imports dummies, randomforest, kernelfactory, ada Author

More information

[3] Big Data: Model Selection

[3] Big Data: Model Selection [3] Big Data: Model Selection Matt Taddy, University of Chicago Booth School of Business faculty.chicagobooth.edu/matt.taddy/teaching [3] Making Model Decisions Out-of-Sample vs In-Sample performance Regularization

More information

Imputing Missing Data using SAS

Imputing Missing Data using SAS ABSTRACT Paper 3295-2015 Imputing Missing Data using SAS Christopher Yim, California Polytechnic State University, San Luis Obispo Missing data is an unfortunate reality of statistics. However, there are

More information

LASSO Regression. Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Emily Fox February 21 th, 2013.

LASSO Regression. Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Emily Fox February 21 th, 2013. Case Study 3: fmri Prediction LASSO Regression Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Emily Fox February 21 th, 2013 Emily Fo013 1 LASSO Regression LASSO: least

More information

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction

More information

Session 11 PD, Provider Perspectives of Values Based Payment Programs. Moderator: William T. O'Brien, FSA, FCA

Session 11 PD, Provider Perspectives of Values Based Payment Programs. Moderator: William T. O'Brien, FSA, FCA Session 11 PD, Provider Perspectives of Values Based Payment Programs Moderator: William T. O'Brien, FSA, FCA Presenters: Donald Fry, M.D. Lillian Louise Dittrick, FSA, MAAA Colleen Audrey Norris, ASA,

More information

Lasso-based Spam Filtering with Chinese Emails

Lasso-based Spam Filtering with Chinese Emails Journal of Computational Information Systems 8: 8 (2012) 3315 3322 Available at http://www.jofcis.com Lasso-based Spam Filtering with Chinese Emails Zunxiong LIU 1, Xianlong ZHANG 1,, Shujuan ZHENG 2 1

More information

Modelling and added value

Modelling and added value Modelling and added value Course: Statistical Evaluation of Diagnostic and Predictive Models Thomas Alexander Gerds (University of Copenhagen) Summer School, Barcelona, June 30, 2015 1 / 53 Multiple regression

More information

Examining a Fitted Logistic Model

Examining a Fitted Logistic Model STAT 536 Lecture 16 1 Examining a Fitted Logistic Model Deviance Test for Lack of Fit The data below describes the male birth fraction male births/total births over the years 1931 to 1990. A simple logistic

More information

Econometrics Simple Linear Regression

Econometrics Simple Linear Regression Econometrics Simple Linear Regression Burcu Eke UC3M Linear equations with one variable Recall what a linear equation is: y = b 0 + b 1 x is a linear equation with one variable, or equivalently, a straight

More information

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

research/scientific includes the following: statistical hypotheses: you have a null and alternative you accept one and reject the other

research/scientific includes the following: statistical hypotheses: you have a null and alternative you accept one and reject the other 1 Hypothesis Testing Richard S. Balkin, Ph.D., LPC-S, NCC 2 Overview When we have questions about the effect of a treatment or intervention or wish to compare groups, we use hypothesis testing Parametric

More information

Ensemble Approach for the Classification of Imbalanced Data

Ensemble Approach for the Classification of Imbalanced Data Ensemble Approach for the Classification of Imbalanced Data Vladimir Nikulin 1, Geoffrey J. McLachlan 1, and Shu Kay Ng 2 1 Department of Mathematics, University of Queensland v.nikulin@uq.edu.au, gjm@maths.uq.edu.au

More information

" Y. Notation and Equations for Regression Lecture 11/4. Notation:

 Y. Notation and Equations for Regression Lecture 11/4. Notation: Notation: Notation and Equations for Regression Lecture 11/4 m: The number of predictor variables in a regression Xi: One of multiple predictor variables. The subscript i represents any number from 1 through

More information

Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients

Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients by Li Liu A practicum report submitted to the Department of Public Health Sciences in conformity with

More information

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

More information

Statistical modelling with missing data using multiple imputation. Session 4: Sensitivity Analysis after Multiple Imputation

Statistical modelling with missing data using multiple imputation. Session 4: Sensitivity Analysis after Multiple Imputation Statistical modelling with missing data using multiple imputation Session 4: Sensitivity Analysis after Multiple Imputation James Carpenter London School of Hygiene & Tropical Medicine Email: james.carpenter@lshtm.ac.uk

More information

VI. Introduction to Logistic Regression

VI. Introduction to Logistic Regression VI. Introduction to Logistic Regression We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models

More information

Evidence to Action: Use of Predictive Models for Beach Water Postings

Evidence to Action: Use of Predictive Models for Beach Water Postings Evidence to Action: Use of Predictive Models for Beach Water Postings Canadian Society for Epidemiology and Biostatistics Caitlyn Paget, June 4 th 2015 Goal is to improve program delivery Can we improve

More information

Predicting Health Care Costs by Two-part Model with Sparse Regularization

Predicting Health Care Costs by Two-part Model with Sparse Regularization Predicting Health Care Costs by Two-part Model with Sparse Regularization Atsuyuki Kogure Keio University, Japan July, 2015 Abstract We consider the problem of predicting health care costs using the two-part

More information

Causal Leading Indicators Detection for Demand Forecasting

Causal Leading Indicators Detection for Demand Forecasting Causal Leading Indicators Detection for Demand Forecasting Yves R. Sagaert, El-Houssaine Aghezzaf, Nikolaos Kourentzes, Bram Desmet Department of Industrial Management, Ghent University 13/07/2015 EURO

More information

Model Validation Techniques

Model Validation Techniques Model Validation Techniques Kevin Mahoney, FCAS kmahoney@ travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost

More information

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Objectives: To perform a hypothesis test concerning the slope of a least squares line To recognize that testing for a

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma siersma@sund.ku.dk The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

Random Forest Based Imbalanced Data Cleaning and Classification

Random Forest Based Imbalanced Data Cleaning and Classification Random Forest Based Imbalanced Data Cleaning and Classification Jie Gu Software School of Tsinghua University, China Abstract. The given task of PAKDD 2007 data mining competition is a typical problem

More information

Integer Programming: Algorithms - 3

Integer Programming: Algorithms - 3 Week 9 Integer Programming: Algorithms - 3 OPR 992 Applied Mathematical Programming OPR 992 - Applied Mathematical Programming - p. 1/12 Dantzig-Wolfe Reformulation Example Strength of the Linear Programming

More information

Logistic Regression. http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests

Logistic Regression. http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests Logistic Regression http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests Overview Binary (or binomial) logistic regression is a form of regression which is used when the dependent is a dichotomy

More information

11. Analysis of Case-control Studies Logistic Regression

11. Analysis of Case-control Studies Logistic Regression Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

More information

Designing a learning system

Designing a learning system Lecture Designing a learning system Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x4-8845 http://.cs.pitt.edu/~milos/courses/cs750/ Design of a learning system (first vie) Application or Testing

More information

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple

More information

Variable Selection for Health Care Demand in Germany

Variable Selection for Health Care Demand in Germany Variable Selection for Health Care Demand in Germany Zhu Wang Connecticut Children s Medical Center University of Connecticut School of Medicine zwang@connecticutchildrens.org July 23, 2015 This document

More information

Globally, about 9.7% of cancers in men are prostate cancers, and the risk of developing the

Globally, about 9.7% of cancers in men are prostate cancers, and the risk of developing the Chapter 5 Analysis of Prostate Cancer Association Study Data 5.1 Risk factors for Prostate Cancer Globally, about 9.7% of cancers in men are prostate cancers, and the risk of developing the disease has

More information

Fitting Subject-specific Curves to Grouped Longitudinal Data

Fitting Subject-specific Curves to Grouped Longitudinal Data Fitting Subject-specific Curves to Grouped Longitudinal Data Djeundje, Viani Heriot-Watt University, Department of Actuarial Mathematics & Statistics Edinburgh, EH14 4AS, UK E-mail: vad5@hw.ac.uk Currie,

More information

Investigating the genetic basis for intelligence

Investigating the genetic basis for intelligence Investigating the genetic basis for intelligence Steve Hsu University of Oregon and BGI www.cog-genomics.org Outline: a multidisciplinary subject 1. What is intelligence? Psychometrics 2. g and GWAS: a

More information