Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection



Similar documents
Regression Modeling Strategies

Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model

SUMAN DUVVURU STAT 567 PROJECT REPORT

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

BayesX - Software for Bayesian Inference in Structured Additive Regression

How To Understand The Theory Of Probability

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Statistics Graduate Courses

Missing data and net survival analysis Bernard Rachet

Biostatistics: Types of Data Analysis

Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences

Data Mining Introduction

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS

CHAPTER 8 EXAMPLES: MIXTURE MODELING WITH LONGITUDINAL DATA

Overview Classes Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Penalized regression: Introduction

5. Multiple regression

Model Validation Techniques

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone:

Elements of statistics (MATH0487-1)

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg

Parallelization Strategies for Multicore Data Analysis

CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA

Interpretation of Somers D under four simple models

Organizing Your Approach to a Data Analysis

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Nonnested model comparison of GLM and GAM count regression models for life insurance data

Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.

A Basic Introduction to Missing Data

From the help desk: Bootstrapped standard errors

Introduction to Regression and Data Analysis

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

Regression III: Advanced Methods

Penalized Logistic Regression and Classification of Microarray Data

Handling missing data in Stata a whirlwind tour

Regression Analysis: A Complete Example

Chapter 7: Simple linear regression Learning Objectives

Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit scoring, contracting.

Predictor Coef StDev T P Constant X S = R-Sq = 0.0% R-Sq(adj) = 0.

Applying Statistics Recommended by Regulatory Documents

11. Analysis of Case-control Studies Logistic Regression

Local classification and local likelihoods

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics

Simple linear regression

LAGUARDIA COMMUNITY COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF MATHEMATICS, ENGINEERING, AND COMPUTER SCIENCE

A Bayesian hierarchical surrogate outcome model for multiple sclerosis

2013 MBA Jump Start Program. Statistics Module Part 3

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

Statistical Rules of Thumb

SAS Software to Fit the Generalized Linear Model

SAS Certificate Applied Statistics and SAS Programming

Linear regression methods for large n and streaming data

Why High-Order Polynomials Should Not be Used in Regression Discontinuity Designs

Statistics in Retail Finance. Chapter 6: Behavioural models

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

data visualization and regression

Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza

Package smoothhr. November 9, 2015

Developing Risk Adjustment Techniques Using the System for Assessing Health Care Quality in the

College Readiness LINKING STUDY

Analysing Questionnaires using Minitab (for SPSS queries contact -)

Moderator and Mediator Analysis

UNDERGRADUATE DEGREE DETAILS : BACHELOR OF SCIENCE WITH

REGRESSION MODELING STRATEGIES

Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller

Methods for Meta-analysis in Medical Research

Multivariate Logistic Regression

GLM I An Introduction to Generalized Linear Models

Publication List. Chen Zehua Department of Statistics & Applied Probability National University of Singapore

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Algebra 1 Course Information

Examining a Fitted Logistic Model

Linear Models and Conjoint Analysis with Nonlinear Spline Transformations

Generalized Linear Models

Gerry Hobbs, Department of Statistics, West Virginia University

Dealing with Missing Data

Introduction to mixed model and missing data issues in longitudinal studies

Modeling the Claim Duration of Income Protection Insurance Policyholders Using Parametric Mixture Models

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MEU. INSTITUTE OF HEALTH SCIENCES COURSE SYLLABUS. Biostatistics

Basic Statistical and Modeling Procedures Using SAS

An Overview and Evaluation of Decision Tree Methodology

Imputing Values to Missing Data

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Simple Linear Regression

IAPRI Quantitative Analysis Capacity Building Series. Multiple regression analysis & interpreting results

Development and validation of a prediction model with missing predictor data: a practical approach

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

Modelling and added value

Building risk prediction models - with a focus on Genome-Wide Association Studies. Charles Kooperberg

Factors affecting online sales

Additional sources Compilation of sources:

Principles of Hypothesis Testing for Public Health

Simple Linear Regression Inference

Handling attrition and non-response in longitudinal data

Transcription:

Directions in Statistical Methodology for Multivariable Predictive Modeling Frank E Harrell Jr University of Virginia Seattle WA 19May98 Overview of Modeling Process Model selection Regression shape Diagnostics - lack of fit Quantify precision / Inference Validation - quantify accuracy Presentation Harrell et al. Stat in Med 1996, 1998 Accuracy Measures Nonparametric Calibration No binning (ECDF, not histogram) Summary indexes well dev. for binary Y ROC area (concordance prob.) Brier quadratic prob. score, R 2 Nonparametric calibration plot Uses nonparametric regression (loess) Summarize using average pred. - obs. For survival models, rank correlation index can allow censoring; other indexes needed Actual Probability 0.0 0.05 0.10 0.15 0.20 Dxy C (ROC) R2 D U Q Brier Intercept Slope Emax 0.493 0.747 0.132 0.047 0.000 0.047 0.049 0.110 1.002 0.028 0.0 0.05 0.10 0.15 0.20 Predicted Probability Ideal Logistic calibration Nonparametric

Empirical Modeling Tools Nonparametric regression (loess) single X or generalized additive model can transform both X and Y simultaneously Restricted cubic splines (piecewise cubic poly. with linear tails) Tensor spline surfaces in multiple predictors Adequacy of Biomath. Models Can embed pre-specified model into an empirical model, e.g. E(Y X) = 1 - exp(x 2 /h + spline(x)) Test coefficients of spline(x) for significant additional predictive information Model Uncertainty / Selection When many models tested, best model may be rated optimistically Standard errors, P-values assuming final model pre-specified are inappropriate Need to add model uncertainty to confidence intervals Averaging competing models may improve prediction Bayesian Modeling Incorporate prior beliefs about shapes of effects of Xs, parameter values Smoothing toward prior knowledge of mechanisms Exact inference without reliance on largesample χ 2 or Gaussian distributions

Penalized MLE An empirical Bayesian approach Discounts parameter estimates toward 0 Estimate how many d.f. can be reliably estimated C Cross-validation or effective AIC Estimate that many effective d.f. Bootstrap for S.E., C.L. Large sample normal theory may be inadequate for nonlinear models Nonparametric bootstrap can estimate S.E. of individual parameter estimates or of predicted values Involves sampling with replacement from original data, re-fitting model many times Multiple Observations / Subject May not affect model parameter estimates very much Will deflate variances (C.I. too narrow) Can correct variances using cluster sandwich estimator or cluster bootstrap May want to include subject s track record to date as a predictor Bootstrap Model Validation Validate summary indexes (R 2 ) Calibration plot corrected for overfitting

Time-Dependent Covariables Excellent for experimentally controlled conditions Useful for understanding patterns of risk if covariables are internal E.g. understand how to use VGE grade profile Not very useful for prospective use Other Models Family of logistic regression models for ordinal response Uses ordering of Y No grouping of Y required Decide whether time until symptom more important than its severity Accelerated failure time models Other log-logistic formulations Nonparametric regression Other Models, Continued Berridge & Whitehead Stat in Med 1991 Combines ordinal logistic model and Cox survival model to jointly model time until and severity of event Summary Explosion of statistical methodology for model fitting, diagnostics, validation Avoid categorization - use smoothers Software (e.g., S-PLUS) is beginning to keep up with stat. developments Opportunities for merging empirical and biomathematical modeling

Missing Predictor Values Omitting incomplete records causes bias and decrease in precision Imputing missing values with a mean results in bias if predictors are correlated Customized imputation models Full likelihood or Bayesian approach Outline Overview of modeling process Measures of predictive accuracy New empirical modeling tools Testing adequacy of biomath. Models Model selection / uncertainty Bayesian modeling Penalized maximum likelihood estimation Outline Handling missing predictor values Robust s.e., C.L. using bootstrap Multiple observations per subject Bootstrap model validation Time dependent covariables Other models