Modelling and added value
1 Modelling and added value Course: Statistical Evaluation of Diagnostic and Predictive Models Thomas Alexander Gerds (University of Copenhagen) Summer School, Barcelona, June 30 1 / 53
2 Multiple regression Multiple regression can be used to exploit the joint predictive power of several or many variables, and also to assess the added value of new markers in the presence of conventional risk factors. Commonly used modelling techniques: logistic regression for binary outcome; Cox regression for time-to-event (survival) outcome. P-values testing the null hypothesis of no association are not a good measure of predictive power. 2 / 53
3 Example: epo study 1 Anaemia is a deficiency of red blood cells and/or hemoglobin and an additional risk factor for cancer patients. Randomized placebo-controlled trial: does treatment with epoetin beta "epo" (300 U/kg) enhance the hemoglobin concentration level and improve survival chances? Henke et al. identified the c20 expression (erythropoietin receptor status) as a new biomarker for the prognosis of locoregional progression-free survival. 1 Henke et al. Do erythropoietin receptors on cancer cells explain unexpected clinical findings? J Clin Oncol, 24(29). 3 / 53
4 Treatment The study includes head and neck cancer patients with a tumor located in the oropharynx (36%), the oral cavity (27%), the larynx (14%) or the hypopharynx (23%). One of the treatments was radiotherapy following resection. [Table: patient counts with non-missing blood values, by resection status (Complete / Incomplete / No) and treatment arm (Placebo / Epo)] 4 / 53
5 Outcome Blood hemoglobin levels were measured weekly during radiotherapy (7 weeks). Treatment with epoetin beta was defined as successful when the hemoglobin level increased sufficiently. For patient i set Y_i = 1 if the treatment was successful and Y_i = 0 if the treatment failed. 5 / 53
6 Target [Table: for each patient, the observed treatment status Y_i (1 = successful, 0 = failed) and the predicted probability P_i of treatment success] 6 / 53
7 Predictors Age min: 41 y, median: 59 y, max: 80 y. Gender male: 85%, female: 15%. Baseline hemoglobin mean: g/dl, std: 1.45. Treatment epo: 50%, placebo: 50%. Resection complete: 48%, incomplete: 19%, no resection: 34%. Epo receptor status neg: 32%, pos: 68%. 7 / 53
8 Logistic regression Response: treatment successful yes/no
Factor OddsRatio StandardError CI.95 pvalue
(Intercept) p <
Age [0.91; 1.03]
Sex:female [0.91; 26.02]
HbBase [1.99; 5.91] p <
Treatment:Epo [23.9; ] p <
Resection:Incompl [0.36; 9.03]
Resection:Compl [1.13; 17.36]
Receptor:positive [1.72; 23.39]
8 / 53
9 The model provides general information Treatment with epo significantly increases the chance (odds) of reaching the target hemoglobin level (95% CI for the odds ratio: [23.9; 493.4]) in the overall study population. Does that mean everyone should be treated? 9 / 53
10 The model provides information for a single patient For example: the predicted probability that a 51-year-old man with complete tumor resection and baseline hemoglobin level 12.6 g/dl reaches the target hemoglobin level (Y_i = 1) is 97.4% in the Epo group and 29.2% in the Placebo group. If a similar patient has baseline hemoglobin level 14.8 g/dl, then the model predicts 99.8% (Epo) and 84.7% (Placebo). 10 / 53
11 Predictions and Brier score for logistic regression [Table: for each patient, the treatment success Y_i, the predicted probability P_i (%), the residual Y_i − P_i, and the Brier contribution (Y_i − P_i)², with the column sum Σ in the last row] 11 / 53
12 The model behind the table log(P_i / (1 − P_i)) = β_0 + β_1 x_{1,i} + ... + β_k x_{k,i}, equivalently P_i = exp{β_0 + β_1 x_{1,i} + ... + β_k x_{k,i}} / (1 + exp{β_0 + β_1 x_{1,i} + ... + β_k x_{k,i}}). Here P_i is the probability of successful treatment; x_{1,i} is the first predictor for subject i (e.g. age = 50); x_{2,i} the second predictor (e.g. gender = male); ...; x_{k,i} the k'th predictor (e.g. eporeceptor = pos). β_0, ..., β_k are regression coefficients that are estimated based on the epo study. 12 / 53
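The arithmetic behind such predicted probabilities is just the inverse logit of the linear predictor. A minimal sketch (in Python rather than the course's R, with made-up coefficients that are not the fitted epo-study values):

```python
import math

def predict_prob(beta0, betas, x):
    """Inverse logit of the linear predictor:
    P = 1 / (1 + exp(-(beta0 + beta1*x1 + ... + betak*xk)))."""
    lp = beta0 + sum(b * xj for b, xj in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-lp))

# Hypothetical coefficients; a patient with age 50 and treatment = 1:
p = predict_prob(-2.0, [0.03, 1.5], [50, 1])
print(round(p, 3))  # 0.731
```

With an empty covariate vector the function returns the inverse logit of the intercept alone, which is how the prevalence ("null") model on a later slide arises.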
13 Predicted treatment success probability (logistic regression) For a treated man with no resection possible and negative epo receptor status. [Figure: predicted risk (0%–100%) as a function of age (years, x-axis) and baseline hemoglobin (9–14 g/dl, y-axis)] 13 / 53
14 Nomogram [Figure: nomogram assigning points to age, sex (female/male), HbBase, Treat (Epo/Placebo), Resection (Incompl/No/Compl) and eporec; the total points are mapped via the linear predictor to the chance of treatment success] 14 / 53
15 Nomogram: R-code
library(rms)
f7 <- lrm(Y~age+sex+HbBase+Treat+Resection+eporec,data=Epo,x=TRUE,y=TRUE)
dd <- datadist(Epo)
options(datadist="dd")
nom7 <- nomogram(f7, fun=function(x)1/(1+exp(-x)),
                 fun.at=c(.001,.01,.05,0.25,0.75,.95,.99,.999),
                 funlabel="Chance of treatment success")
plot(nom7)
library(DynNom)
f7 <- glm(Y~age+sex+HbBase+Treat+Resection+eporec,data=Epo,family=binomial())
DynNom(f7,Epo,clevel=0.95)
15 / 53
16 Tools for evaluating prediction accuracy For each subject we have a predicted risk based on multiple predictors. To evaluate the prediction performance of the logistic regression model we consider the following tools: Prediction accuracy: Brier score (lack of calibration and lack of spread of predictions). Discrimination: ROC curve, c-index = AUC (lack of spread of predictions). Calibration plot (lack of calibration). Re-classification scatterplot/table (changes of risk predictions). Brier score: the squared difference between the observed status and the predicted risk, averaged over subjects. AUC: the fraction of randomly selected pairs of patients where the predicted risk was higher for the diseased subject than for the non-diseased subject. 16 / 53
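The two summary measures defined above can be computed in a few lines. A minimal Python illustration of the definitions (not the course's R tooling):

```python
def brier_score(y, p):
    """Average squared difference between observed status and predicted risk."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)

def auc(y, p):
    """Fraction of (event, non-event) pairs in which the event subject
    received the higher predicted risk; ties count as 1/2."""
    pairs = [(pi, pj) for pi, yi in zip(p, y) if yi == 1
                      for pj, yj in zip(p, y) if yj == 0]
    wins = sum(1.0 if pi > pj else 0.5 if pi == pj else 0.0
               for pi, pj in pairs)
    return wins / len(pairs)

y = [1, 0, 1, 0]
p = [0.9, 0.2, 0.6, 0.4]
print(round(brier_score(y, p), 4), auc(y, p))  # 0.0925 1.0
```

A constant prediction (the prevalence model of the next slide) gives AUC = 50% by construction, since every event/non-event pair is tied.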
17 Brier score for null model in the Epo study [Table: as before, treatment success Y_i, predicted probability P_i (%), residual Y_i − P_i, and Brier contribution (Y_i − P_i)², summed in the last row] The predicted probability is the prevalence of patients with treatment success in the data set. 17 / 53
18 Prevalence model [Figure: calibration plot, observed proportion (0%–100%) against the predicted probability of treatment success; performance of the null model: Brier=24.7, AUC=50.0] 18 / 53
19 Univariate logistic regression models Categorical predictors
library(rms)
resecmodel <- lrm(Y~Resection,data=Epo,x=TRUE,y=TRUE)
sexmodel <- lrm(Y~sex,data=Epo,x=TRUE,y=TRUE)
treatmodel <- lrm(Y~Treat,data=Epo,x=TRUE,y=TRUE)
## or via glm
treatmodel <- glm(Y~Treat,data=Epo,family="binomial")
Continuous predictors
library(rms)
basehbmodel <- lrm(Y~HbBase,data=Epo,x=TRUE,y=TRUE)
agemodel <- glm(Y~age,data=Epo,family="binomial")
19 / 53
20 Categorical predictors [Tables: cross-tabulations of treatment success (0/1) against resection status (No / Incompl / Compl), gender (male / female), and treatment arm; for the Placebo arm, 66 failures vs 8 successes survive in the transcript] 20 / 53
21 Categorical predictors: Resection status, gender, treatment [Figure: calibration plot, observed proportion against predicted probability of treatment success. Null model: Brier=24.7, AUC=50.0; Gender model: Brier=24.7, AUC=50.3; Resection model: Brier=24.0, AUC=58.7; Treatment model: Brier=13.6] 21 / 53
22 Continuous predictors: Baseline hemoglobin, Age [Figure: scatter plot of age (years) against baseline hemoglobin (g/dl), marking treatment success vs treatment failure] 22 / 53
23 Continuous predictors: Baseline hemoglobin, Age [Figure: calibration plot. Null model: Brier=24.7, AUC=50.0; Age model: Brier=24.7, AUC=51.2; Baseline hemoglobin model: Brier=19.3] 23 / 53
24 Continuous predictors: Baseline hemoglobin, Age [Figure: ROC curves, sensitivity against 1 − specificity. Null model: Brier=24.7, AUC=50.0; Age model: Brier=24.7, AUC=51.2; Baseline hemoglobin model: Brier=19.3] 24 / 53
25 Continuous predictors: Baseline hemoglobin, Age [Figure: re-classification plot, predicted chance from the age model against predicted chance from the hemoglobin model, marking treatment success vs treatment failure] 25 / 53
26 Multiple logistic regression Model excluding epo receptor status
add <- lrm(Y~age+sex+HbBase+Treat+Resection,data=Epo,x=TRUE,y=TRUE)
Model including epo receptor status
add.epor <- lrm(Y~age+sex+HbBase+Treat+Resection+eporec,data=Epo,x=TRUE,y=TRUE)
26 / 53
27 Multiple logistic regression [Figure: re-classification plot, predicted chance excluding receptor status against predicted chance including receptor status, marking treatment success vs treatment failure] 27 / 53
28 Multiple logistic regression [Figure: calibration plot, observed proportion against predicted event probability. Null model: Brier=24.7, AUC=50.0; all variables: Brier=9.6, AUC=93.3; all variables + receptor status: Brier=8.7] 28 / 53
29 Multiple logistic regression [Figure: ROC curves. All variables: Brier=9.6, AUC=93.3; all variables + receptor status: Brier=8.7] 29 / 53
30 Exercises
1. Do the tutorial 'Added value of new marker'.
2. Split the IVF data (see link on course homepage) at random into two parts (60% for learning, 40% for evaluation). Then build a multiple logistic regression model to predict response. Include the following covariates: antfoll, smoking, fsh, ovolume, bmi.
3. Produce a table which shows the odds ratios with confidence limits (hint: Publish::publish.glm(t)) and write a caption which explains the table.
4. Produce a calibration plot and write a caption (hint: ModelGood::calPlot2).
5. Produce a ROC curve, add the Brier score and AUC as a legend, and write a caption.
6. Build a second logistic regression model where you include the above variables and add the variable cyclelen.
7. Evaluate the added value of cyclelen: re-classification table and plot (hint: ModelGood::reclass), difference in Brier scores and AUC with appropriate tests. Describe the underlying null hypotheses.
8. For each subject in the test data compute the difference of the predictions between the model which excludes cyclelen and the model that includes cyclelen. Consider this difference as a new continuous marker and produce the corresponding ROC curve and AUC. Describe the interpretation of AUC for this specific ROC curve in words and comment.
30 / 53
31 Model selection Very many different 'logistic regression models' can be constructed by selecting subsets of variables and transformations/groupings of variables. Standard multiple (logistic) regression works if the number of predictors is not too large, and substantially smaller than the sample size, and the decision maker has a priori knowledge about which variables to put into the model. Ad-hoc model selection algorithms, like automated backward elimination, do not lead to reproducible prediction models. 31 / 53
32 A Conversation of Richard Olshen with Leo Breiman 3 ... Olshen: What about arcing, bagging and boosting? Breiman: Okay. Yeah. This is fascinating stuff, Richard. In the last five years, there have been some really big breakthroughs in prediction. And I think combining predictors is one of the two big breakthroughs. And the idea of this was, okay, that suppose you take CART, which is a pretty good classifier, but not a great classifier. I mean, for instance, neural nets do a much better job. Olshen: Well, suitably trained? Breiman: Suitably trained. Olshen: Against an untrained CART? Breiman: Right. Exactly. And I think I was thinking about this. I had written an article on subset selection in linear regression. I had realized then that subset selection in linear regression is really a very unstable procedure. If you tamper with the data just a little bit, the first best five-variable regression may change to another set of five variables. And so I thought, Okay. We can stabilize this by just perturbing the data a little and get the best five-variable predictor. Perturb it again. Get the best five-variable predictor and then average all these five-variable predictors. And sure enough, that worked out beautifully. This was published in an article in the Annals (Breiman, 1996b). 3 Statist. Sci. Volume 16, Issue 2 (2001). 32 / 53
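Breiman's recipe — perturb the data, refit the unstable predictor, average — can be sketched on a toy scale. The Python code below is a hypothetical illustration (not from the course material): it bags a one-split "stump", one of the simplest unstable predictors.

```python
import random

def stump_fit(xs, ys):
    """One split x < c vs x >= c: pick the cutoff where the two group
    means of the outcome differ most; fall back to the overall mean."""
    best = None
    for c in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < c]
        right = [y for x, y in zip(xs, ys) if x >= c]
        if not left or not right:
            continue
        gap = abs(sum(left) / len(left) - sum(right) / len(right))
        if best is None or gap > best[0]:
            best = (gap, c, sum(left) / len(left), sum(right) / len(right))
    if best is None:  # degenerate resample: no valid split exists
        m = sum(ys) / len(ys)
        return (float("inf"), m, m)
    return best[1:]  # (cutoff, mean below, mean at/above)

def bagged_predict(xs, ys, xnew, B=200, seed=1):
    """Bagging: refit the stump on B bootstrap resamples, average predictions."""
    rng = random.Random(seed)
    preds = []
    for _ in range(B):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
        c, mlo, mhi = stump_fit([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(mlo if xnew < c else mhi)
    return sum(preds) / B

xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
```

Averaging over resamples smooths out the jumpy single-stump fit — exactly the stabilization Breiman describes for subset selection and for CART.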
33 33 / 53
34 Backward elimination On full data (n=149):
library(rms)
data(Epo)
f7 <- lrm(Y~age+sex+HbBase+Treat+Resection+eporec,data=Epo,x=TRUE,y=TRUE)
fastbw(f7)
[Output: fastbw deletes age and Resection. Factors in the final model: sex, HbBase, Treat, eporec]
34 / 53
35 Backward elimination On reduced data (n=130):
library(rms)
data(Epo)
set.seed(1731)
f7a <- lrm(Y~age+sex+HbBase+Treat+Resection+eporec,data=Epo[sample(1:149,replace=FALSE,size=130),],x=TRUE,y=TRUE)
fastbw(f7a)
[Output: fastbw now deletes age, sex and Resection. Factors in the final model: HbBase, Treat, eporec]
35 / 53
36 Guided model selection The hope of conventional regression modelling is that the better the model fits, the better it predicts. But the model should predict new patients. Prostate Cancer Risk Calculator: We used multivariable logistic regression to model the risk of prostate cancer by considering all possible combinations of main effects and interactions. The models chosen were those that minimized the Bayesian information criterion (BIC) and maximized the average out-of-sample area under the receiver operating characteristic curve (via 4-fold cross-validation). 36 / 53
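The out-of-sample evaluation used by the risk calculator can be sketched generically. Below is a Python illustration of 4-fold cross-validation with the Brier score as the out-of-sample criterion (function and variable names are made up for the sketch):

```python
import random

def kfold(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_brier(xs, ys, fit, predict, k=4):
    """Average out-of-sample Brier score over k folds:
    fit on k-1 folds, score predictions on the held-out fold."""
    scores = []
    for fold in kfold(len(xs), k):
        train = [i for i in range(len(xs)) if i not in fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(sum((ys[i] - predict(model, xs[i])) ** 2
                          for i in fold) / len(fold))
    return sum(scores) / k

# Toy check with the prevalence ("null") model, which ignores x entirely:
fit_null = lambda xs, ys: sum(ys) / len(ys)
predict_null = lambda model, x: model
```

Guided model selection then compares this cross-validated score (or the cross-validated AUC) across candidate models instead of comparing their in-sample fit.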
37 The two cultures 4 4 L. Breiman. Statistical modeling: The two cultures. Statistical Science, 16(3), 2001. 37 / 53
38 The two cultures 38 / 53
39 Classification trees A tree model is a form of recursive partitioning. It lets the data decide which variables are important and where to place cut-offs in continuous variables. In general terms, the purpose of the analyses via tree-building algorithms is to determine a set of splits that permit accurate prediction or classification of cases. In other words: a tree is a combination of many medical tests. 39 / 53
40 Epo study [Figure: classification tree. Root split on treatment arm (Placebo vs Epo); the Placebo branch splits on Resection ({No, Incomplete} vs Complete), the Epo branch on HbBase (up to 11.3 vs > 11.3 g/dl). Terminal nodes: Node 3 (n=39), Node 4 (n=35), Node 6 (n=19), Node 7 (n=56)] 40 / 53
41 Roughly, the algorithm works as follows: 1. Find the predictor, and the split on that predictor, which optimize some statistical criterion over all possible splits on all predictors. 2. For ordinal and continuous predictors, the split is of the form X < c versus X >= c. 3. Repeat step 1 within each previously formed subset. 4. Proceed until fewer than k observations remain to be split, or until nothing is gained from further splitting, i.e. the tree is fully grown. 5. The tree is pruned according to some criterion. 41 / 53
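Steps 1–2 above amount to an exhaustive scan over predictors and cutoffs. A minimal Python sketch, using Gini impurity as the statistical criterion (one common choice; the slides do not commit to a particular criterion):

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(X, y):
    """Scan every predictor j and every cutoff c (split: X_j < c vs X_j >= c)
    and return the split with the smallest size-weighted impurity."""
    best = None
    n = len(y)
    for j in range(len(X[0])):
        for c in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] < c]
            right = [yi for row, yi in zip(X, y) if row[j] >= c]
            if not left or not right:
                continue
            imp = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or imp < best[0]:
                best = (imp, j, c)
    return best  # (impurity, predictor index, cutoff)

X = [[1, 5], [2, 6], [3, 1], [4, 2]]
y = [0, 0, 1, 1]
print(best_split(X, y))  # (0.0, 0, 3): split on predictor 0 at cutoff 3
```

Growing the full tree is then step 3: call best_split recursively on each of the two resulting subsets until one of the stopping rules in steps 4–5 applies.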
42 Characteristics of classification trees Trees are specifically designed for accurate classification/prediction. Results have a graphical representation and are easy to interpret. No model assumptions. Recursive partitioning can identify complex interactions. One can introduce different costs of misclassification in the tree. But: Trees are not robust against even small perturbations of the data. It is quite easy to overfit the data. 42 / 53
43 More complex tree (overfitting?) [Figure: a deeper classification tree splitting on treatment arm, Resection, HbBase (cut-offs 11.3 and 12.1 g/dl) and epo receptor status (positive/negative). Terminal nodes: Node 3 (n=39), Node 5 (n=25), Node 6 (n=10), Node 8 (n=19), Node 10 (n=18), Node 12 (n=27), Node 13 (n=11)] 43 / 53
44 Comparing the different predictions [Table: per-patient predicted probabilities (%) of treatment success from the logistic regression model (LRM), the simple tree and the complex tree] 44 / 53
45 Comparing the different predictions [Table: Brier score and AUC for the simple tree, logistic regression, the complex tree and a random forest] Note: These numbers are estimated by using the same data that were used to construct the models. 45 / 53
46 Dilemma: Both logistic regression with automated variable selection (e.g. backward elimination) and decision trees are notoriously unstable (they overfit). How shall we proceed? 46 / 53
47 In search of a solution Genuine algorithms to obtain a useful prediction model, mapping the covariates X_i to a prediction Fhat(y | X_i): Neural Nets; Support Vector Machines; Bump hunting and LASSO; Ridge regression and boosting; Random Forests; Logic regression. All these algorithms can be applied in high-dimensional settings, i.e., when there are more candidate predictor variables than subjects. 47 / 53
48 Penalized likelihood regression (works for the logistic and the Cox partial likelihood) Ridge regression: betahat_ridge = argmax{ likelihood(β) − λ Σ_j β_j² } — shrinks. LASSO regression: betahat_LASSO = argmax{ likelihood(β) − λ Σ_j |β_j| } — shrinks and selects. Elastic net: combines the L1 and L2 norms. 48 / 53
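To make the ridge objective concrete, here is a tiny gradient-ascent sketch of penalized logistic regression in Python. This is an illustration of the principle only, not how glmnet works internally (glmnet uses coordinate descent), and the data are made up:

```python
import math

def fit_ridge_logistic(X, y, lam, steps=3000, lr=0.1):
    """Maximize (1/n)*loglik(beta) - lam * sum(beta_j^2) by gradient
    ascent; the intercept b0 is left unpenalized."""
    p = len(X[0])
    b0, beta = 0.0, [0.0] * p
    n = len(y)
    for _ in range(steps):
        g0, g = 0.0, [0.0] * p
        for row, yi in zip(X, y):
            lp = b0 + sum(bj * xj for bj, xj in zip(beta, row))
            r = yi - 1 / (1 + math.exp(-lp))  # residual on probability scale
            g0 += r
            for j in range(p):
                g[j] += r * row[j]
        b0 += lr * g0 / n
        for j in range(p):
            beta[j] += lr * (g[j] / n - 2 * lam * beta[j])
    return b0, beta

# Perfectly separable toy data: without a penalty the slope keeps growing,
# with lam > 0 it is shrunk toward zero.
X, y = [[0], [1], [2], [3]], [0, 0, 1, 1]
_, (b_free,) = fit_ridge_logistic(X, y, lam=0.0)
_, (b_pen,) = fit_ridge_logistic(X, y, lam=1.0)
```

Replacing the squared penalty by λ Σ|β_j| (with a subgradient or soft-thresholding step) gives the LASSO, which can set coefficients exactly to zero and hence also selects variables.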
49 Package glmnet
library(ModelGood)
library(glmnet)
g1a <- glmnet(y=as.numeric(Epo$Y)-1,x=model.matrix(~-1+age+HbBase+Treat+Resection+eporec+sex,data=Epo),alpha=0.1)
g1 <- ElasticNet(Y~age+HbBase+Treat+Resection+eporec+sex,data=Epo,alpha=0.1)
plot(g1a)
print(g1)
$call
ElasticNet(formula = Y ~ age + HbBase + Treat + Resection + eporec + sex, data = Epo, alpha = 0.1)
$enet
Call: glmnet(x = covariates, y = response, alpha = 0.1, lambda = optlambda)
Df %Dev Lambda
[1,]
$Lambda
[1]
attr(,"class")
[1] "ElasticNet"
49 / 53
50 Shrunken regression coefficients [Figure: coefficient paths plotted against the L1 norm of the coefficient vector] 50 / 53
51 A function of the penalization parameter λ [Figure: coefficient paths plotted against log λ] 51 / 53
52 Summary Predicted probabilities for the unknown current or future event status of a subject can be obtained from a penalized or unpenalized logistic regression model. Predictions can also be obtained from a decision tree or random forest. Re-classification plots, calibration plots, ROC curves, the Brier score and AUC can be used to assess and compare the performance of different models. The apparent comparison using the same data that were used to select and fit the models is not fair and may be grossly misleading. Advanced algorithmic methods have tuning parameters which are optimized for obtaining accurate predictions. 52 / 53
53 Exercise 2.2 Consider the results of Exercise 2.1. Change the seed used to split the IVF data several times and repeat the analysis. Report the Monte Carlo error in the AUC of the two models. Introduce a random normal noise variable into the IVF data set and analyse its added value. Repeat with 10 such variables to see if any of these random noise variables has higher added value than cyclelen. 53 / 53
Overview Classes 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) 2-4 Loglinear models (8) 5-4 15-17 hrs; 5B02 Building and
More information5. Multiple regression
5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful
More informationChapter 7: Simple linear regression Learning Objectives
Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -
More informationCART 6.0 Feature Matrix
CART 6.0 Feature Matri Enhanced Descriptive Statistics Full summary statistics Brief summary statistics Stratified summary statistics Charts and histograms Improved User Interface New setup activity window
More informationInsurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
More informationAdditional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
More informationPredicting Health Care Costs by Two-part Model with Sparse Regularization
Predicting Health Care Costs by Two-part Model with Sparse Regularization Atsuyuki Kogure Keio University, Japan July, 2015 Abstract We consider the problem of predicting health care costs using the two-part
More informationUnivariate Regression
Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is
More informationDeveloping Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@
Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@ Yanchun Xu, Andrius Kubilius Joint Commission on Accreditation of Healthcare Organizations,
More information!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"!"#"$%&#'()*+',$$-.&#',/"-0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:
More informationLinda K. Muthén Bengt Muthén. Copyright 2008 Muthén & Muthén www.statmodel.com. Table Of Contents
Mplus Short Courses Topic 2 Regression Analysis, Eploratory Factor Analysis, Confirmatory Factor Analysis, And Structural Equation Modeling For Categorical, Censored, And Count Outcomes Linda K. Muthén
More informationDecision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
More informationMachine Learning Algorithms for Predicting Severe Crises of Sickle Cell Disease
Machine Learning Algorithms for Predicting Severe Crises of Sickle Cell Disease Clara Allayous Département de Biologie, Université des Antilles et de la guyane Stéphan Clémençon MODALX - Univesité Paris
More informationSome Essential Statistics The Lure of Statistics
Some Essential Statistics The Lure of Statistics Data Mining Techniques, by M.J.A. Berry and G.S Linoff, 2004 Statistics vs. Data Mining..lie, damn lie, and statistics mining data to support preconceived
More informationRegression Analysis: A Complete Example
Regression Analysis: A Complete Example This section works out an example that includes all the topics we have discussed so far in this chapter. A complete example of regression analysis. PhotoDisc, Inc./Getty
More informationStatistical Models in R
Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova
More informationHandling missing data in Stata a whirlwind tour
Handling missing data in Stata a whirlwind tour 2012 Italian Stata Users Group Meeting Jonathan Bartlett www.missingdata.org.uk 20th September 2012 1/55 Outline The problem of missing data and a principled
More informationSPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg
SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg IN SPSS SESSION 2, WE HAVE LEARNT: Elementary Data Analysis Group Comparison & One-way
More informationECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node
Enterprise Miner - Regression 1 ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node 1. Some background: Linear attempts to predict the value of a continuous
More informationR 2 -type Curves for Dynamic Predictions from Joint Longitudinal-Survival Models
Faculty of Health Sciences R 2 -type Curves for Dynamic Predictions from Joint Longitudinal-Survival Models Inference & application to prediction of kidney graft failure Paul Blanche joint work with M-C.
More informationRisk pricing for Australian Motor Insurance
Risk pricing for Australian Motor Insurance Dr Richard Brookes November 2012 Contents 1. Background Scope How many models? 2. Approach Data Variable filtering GLM Interactions Credibility overlay 3. Model
More informationStatistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees
Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.
More informationI L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Beckman HLM Reading Group: Questions, Answers and Examples Carolyn J. Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Linear Algebra Slide 1 of
More informationLogistic Regression (a type of Generalized Linear Model)
Logistic Regression (a type of Generalized Linear Model) 1/36 Today Review of GLMs Logistic Regression 2/36 How do we find patterns in data? We begin with a model of how the world works We use our knowledge
More informationChapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -
Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create
More informationStudents' Opinion about Universities: The Faculty of Economics and Political Science (Case Study)
Cairo University Faculty of Economics and Political Science Statistics Department English Section Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study) Prepared
More informationEvent driven trading new studies on innovative way. of trading in Forex market. Michał Osmoła INIME live 23 February 2016
Event driven trading new studies on innovative way of trading in Forex market Michał Osmoła INIME live 23 February 2016 Forex market From Wikipedia: The foreign exchange market (Forex, FX, or currency
More information10. Analysis of Longitudinal Studies Repeat-measures analysis
Research Methods II 99 10. Analysis of Longitudinal Studies Repeat-measures analysis This chapter builds on the concepts and methods described in Chapters 7 and 8 of Mother and Child Health: Research methods.
More informationComparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
More informationData Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression
Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction
More informationStatistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
More informationOrdinal Regression. Chapter
Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe
More informationNCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationBiostatistics Short Course Introduction to Longitudinal Studies
Biostatistics Short Course Introduction to Longitudinal Studies Zhangsheng Yu Division of Biostatistics Department of Medicine Indiana University School of Medicine Zhangsheng Yu (Indiana University) Longitudinal
More informationData Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan
Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:
More informationRidge Regression. Patrick Breheny. September 1. Ridge regression Selection of λ Ridge regression in R/SAS
Ridge Regression Patrick Breheny September 1 Patrick Breheny BST 764: Applied Statistical Modeling 1/22 Ridge regression: Definition Definition and solution Properties As mentioned in the previous lecture,
More informationEnsemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
More informationCross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.
Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models. Dr. Jon Starkweather, Research and Statistical Support consultant This month
More informationCategorical Data Analysis
Richard L. Scheaffer University of Florida The reference material and many examples for this section are based on Chapter 8, Analyzing Association Between Categorical Variables, from Statistical Methods
More informationModeling Lifetime Value in the Insurance Industry
Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting
More information5.1 CHI-SQUARE TEST OF INDEPENDENCE
C H A P T E R 5 Inferential Statistics and Predictive Analytics Inferential statistics draws valid inferences about a population based on an analysis of a representative sample of that population. The
More informationLecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
More informationLearning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal
Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether
More information