Modelling and added value


1 Modelling and added value
Course: Statistical Evaluation of Diagnostic and Predictive Models
Thomas Alexander Gerds (University of Copenhagen)
Summer School, Barcelona, June 30
1 / 53

2 Multiple regression
Multiple regression can be used to exploit the joint predictive power of several or many variables, and also to assess the added value of new markers in the presence of conventional risk factors.
Commonly used modelling techniques:
logistic regression for binary outcome
Cox regression for time-to-event (survival) outcome
P-values testing the null hypothesis of no association are not a good measure of predictive power.
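The last point can be made concrete with a small simulation: with enough subjects, a marker whose distributions in the two outcome groups overlap almost completely is still "highly significant". The course code is R; the Python sketch below, with invented data, is only an illustration of the idea, not part of the course material.

```python
import math
import random

random.seed(1)

# Invented data: a weak marker with a small mean shift between outcome groups
n = 2000
y = [random.random() < 0.5 for _ in range(n)]
x = [random.gauss(0.3 if yi else 0.0, 1.0) for yi in y]

cases = [xi for xi, yi in zip(x, y) if yi]
controls = [xi for xi, yi in zip(x, y) if not yi]

def mean(v):
    return sum(v) / len(v)

def variance(v):
    m = mean(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)

# z-statistic for the null hypothesis of no association between marker and outcome
z = (mean(cases) - mean(controls)) / math.sqrt(
    variance(cases) / len(cases) + variance(controls) / len(controls))

# AUC: fraction of case/control pairs where the case has the higher marker value
wins = sum(1.0 if c > d else 0.5 if c == d else 0.0
           for c in cases for d in controls)
auc = wins / (len(cases) * len(controls))
```

Here z lands far beyond any conventional significance threshold while the AUC stays close to 50%: a tiny p-value says the association is unlikely to be exactly zero, not that the marker discriminates well.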

3 Example: epo study 1
Anaemia is a deficiency of red blood cells and/or hemoglobin and an additional risk factor for cancer patients.
Randomized placebo controlled trial: does treatment with epoetin beta epo (300 U/kg) enhance hemoglobin concentration level and improve survival chances?
Henke et al identified the c20 expression (erythropoietin receptor status) as a new biomarker for the prognosis of locoregional progression-free survival.
1 Henke et al. Do erythropoietin receptors on cancer cells explain unexpected clinical findings? J Clin Oncol, 24(29), 2006.

4 Treatment
The study includes head and neck cancer patients with a tumor located in the oropharynx (36%), the oral cavity (27%), the larynx (14%) or in the hypopharynx (23%).
One of the treatments was radiotherapy following resection.
[Table: patients with non-missing blood values, cross-classified by resection status (Complete, Incomplete, No) and treatment arm (Placebo, Epo).]

5 Outcome
Blood hemoglobin levels were measured weekly during radiotherapy (7 weeks).
Treatment with epoetin beta was defined successful when the hemoglobin level increased sufficiently.
For patient i set
Y_i = 1 if treatment was successful, and Y_i = 0 if treatment failed.

6 Target
[Table: for patients no. 1 to 7, the observed treatment success Y_i (0/1) and the predicted probabilities P_1, ..., P_7.]

7 Predictors
Age: min 41 y, median 59 y, max 80 y
Gender: male 85%, female 15%
Baseline hemoglobin: mean g/dl, std 1.45
Treatment: epo 50%, placebo 50%
Resection: complete 48%, incomplete 19%, no resection 34%
Epo receptor status: neg 32%, pos 68%

8 Logistic regression
Response: treatment successful yes/no

Factor             OddsRatio  StandardError  CI.95           pvalue
(Intercept)                                                  <
Age                                          [0.91; 1.03]
Sex:female                                   [0.91; 26.02]
HbBase                                       [1.99; 5.91]    <
Treatment:Epo                                [23.9; 493.4]   <
Resection:Incompl                            [0.36; 9.03]
Resection:Compl                              [1.13; 17.36]
Receptor:positive                            [1.72; 23.39]

9 The model provides general information
Treatment with epo significantly increases the chance (odds) of reaching the target hemoglobin level (CI 95%: [23.9; 493.4]) in the overall study population.
Does that mean everyone should be treated?

10 The model provides information for a single patient
For example: the predicted probability that a 51 year old man with complete tumor resection and baseline hemoglobin level 12.6 g/dl reaches the target hemoglobin level (Y_i = 1) is
Epo group: 97.4%    Placebo: 29.2%
If a similar patient has baseline hemoglobin level 14.8 g/dl then the model predicts:
Epo group: 99.8%    Placebo: 84.7%

11 Predictions and Brier score for logistic regression
[Table: for each patient no., the treatment success Y_i, the predicted probability P_i (%), the residual Y_i - P_i, and the Brier score contribution (Y_i - P_i)^2; the row Σ gives the total.]

12 The model behind the table
log( P_i / (1 - P_i) ) = β_0 + β_1 x_{1,i} + ... + β_k x_{k,i}
P_i = exp{β_0 + β_1 x_{1,i} + ... + β_k x_{k,i}} / (1 + exp{β_0 + β_1 x_{1,i} + ... + β_k x_{k,i}})
P_i: the probability of successful treatment
x_{1,i}: first predictor for subject i (e.g. age = 50)
x_{2,i}: second predictor for subject i (e.g. gender = male)
x_{k,i}: k'th predictor for subject i (e.g. eporeceptor = pos)
β_0, ..., β_k are regression coefficients that are estimated based on the epo study
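In code, the two formulas amount to one inverse-logit evaluation of the linear predictor. The sketch below is a language-neutral Python illustration (the course itself uses R), and the coefficient values are made up, not the fitted epo-study estimates; the predictors are assumed to be age, baseline hemoglobin, and a 0/1 treatment indicator.

```python
import math

def predict_prob(beta0, betas, x):
    """Inverse logit: turn the linear predictor eta into a probability P_i."""
    eta = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return math.exp(eta) / (1.0 + math.exp(eta))

# Hypothetical coefficients (NOT the fitted epo-study values):
# intercept, age, baseline hemoglobin, treatment (1 = epo)
beta0, betas = -20.0, [-0.02, 1.2, 4.0]

p_epo = predict_prob(beta0, betas, [51, 12.6, 1])
p_placebo = predict_prob(beta0, betas, [51, 12.6, 0])
```

Changing only the treatment indicator shifts the linear predictor by the treatment coefficient, which is why the same model can produce patient-specific predictions under both arms.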

13 Predicted treatment success probability (logistic regression)
For a treated man with no resection possible and negative epo receptor status.
[Figure: predicted risk (0% to 100%) as a function of age (years) and baseline hemoglobin (roughly 9 to 14 g/dl).]

14 Nomogram
[Figure: nomogram with scales for points, age, sex (female/male), HbBase, treatment (Epo/Placebo), resection (Incompl/No/Compl), epo receptor status, total points, linear predictor, and the chance of treatment success.]

15 Nomogram: R-code
library(rms)
f7 <- lrm(Y ~ age + sex + HbBase + Treat + Resection + eporec,
          data = Epo, x = TRUE, y = TRUE)
dd <- datadist(Epo)
options(datadist = "dd")
nom7 <- nomogram(f7, fun = function(x) 1/(1 + exp(-x)),
                 fun.at = c(.001, .01, .05, 0.25, 0.75, .95, .99, .999),
                 funlabel = "Chance of treatment success")
plot(nom7)
library(DynNom)
f7 <- glm(Y ~ age + sex + HbBase + Treat + Resection + eporec,
          data = Epo, family = binomial())
DynNom(f7, Epo, clevel = 0.95)

16 Tools for evaluating prediction accuracy
For each subject we have a predicted risk based on multiple predictors. To evaluate the prediction performance of the logistic regression model we consider the following tools:
Prediction accuracy: Brier score (lack of calibration and lack of spread of predictions)
Discrimination: ROC curve, c-index = AUC (lack of spread of predictions)
Calibration plot (lack of calibration)
Re-classification scatterplot/table (changes of risk predictions)
Brier score: the squared difference between the observed status and the predicted risk, averaged over subjects.
AUC: the fraction of randomly selected pairs of patients where the predicted risk was higher for the diseased subject compared to the non-diseased subject.
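Both summary measures can be computed directly from the definitions above. A minimal Python sketch (the course uses R packages for this; the outcomes and predicted risks here are invented toy values):

```python
from itertools import product

def brier_score(y, p):
    """Average squared difference between observed status and predicted risk."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)

def auc(y, p):
    """Fraction of (diseased, non-diseased) pairs in which the diseased
    subject received the higher predicted risk; ties count 1/2."""
    case_risks = [pi for yi, pi in zip(y, p) if yi == 1]
    control_risks = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if c > d else 0.5 if c == d else 0.0
               for c, d in product(case_risks, control_risks))
    return wins / (len(case_risks) * len(control_risks))

# Invented toy data: 5 subjects, observed status and predicted risk
y = [1, 0, 1, 0, 1]
p = [0.9, 0.2, 0.7, 0.4, 0.6]
```

Note that the AUC only uses the ranking of the predicted risks, whereas the Brier score also rewards predictions that are close to the observed 0/1 status, which is why the two measures can disagree.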

17 Brier score for null model in the Epo study
[Table: as on slide 11, but with every patient given the same predicted probability; columns Y_i, P_i (%), residual Y_i - P_i, and (Y_i - P_i)^2, with the sum Σ in the last row.]
The predicted probability is the prevalence of patients with treatment success in the data set.
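Because the null model predicts the same prevalence p̄ for everybody, its Brier score reduces to p̄(1 − p̄). A minimal check on invented outcomes (Python used only for illustration; the course code is R):

```python
def null_brier(y):
    """Brier score of the prevalence (null) model: every subject is
    predicted the overall event frequency p_bar."""
    p_bar = sum(y) / len(y)
    return sum((yi - p_bar) ** 2 for yi in y) / len(y)

# Invented 0/1 outcomes with prevalence 3/8
y = [1, 0, 0, 1, 0, 0, 0, 1]
```

This is the benchmark that any model using the predictors has to beat.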

18 Prevalence model
[Calibration plot: observed proportion (0% to 100%) against predicted probability of treatment success. Performance of the null model: Brier=24.7, AUC=50.0.]

19 Univariate logistic regression models
Categorical predictors
library(rms)
resecmodel <- lrm(Y ~ Resection, data = Epo, x = TRUE, y = TRUE)
sexmodel <- lrm(Y ~ sex, data = Epo, x = TRUE, y = TRUE)
treatmodel <- lrm(Y ~ Treat, data = Epo, x = TRUE, y = TRUE)
## or via glm
treatmodel <- glm(Y ~ Treat, data = Epo, family = "binomial")
Continuous predictors
library(rms)
basehbmodel <- lrm(Y ~ HbBase, data = Epo, x = TRUE, y = TRUE)
agemodel <- glm(Y ~ age, data = Epo, family = "binomial")

20 Categorical predictors
[Tables: treatment success (0/1) cross-classified by resection status (No, Incompl, Compl), by gender (male, female), and by treatment; in the treatment table the placebo row reads 66 failures and 8 successes.]

21 Categorical predictors: Resection status, gender, treatment
[Calibration plot: observed proportion against predicted probability of treatment success. Null model: Brier=24.7, AUC=50.0. Gender model: Brier=24.7, AUC=50.3. Resection model: Brier=24.0, AUC=58.7. Treatment model: Brier=13.6.]

22 Continuous predictors: Baseline hemoglobin, Age
[Scatter plot: age (years) against baseline hemoglobin (g/dl), with markers for treatment success and treatment failure.]

23 Continuous predictors: Baseline hemoglobin, Age
[Calibration plot: observed proportion against predicted probability of treatment success. Null model: Brier=24.7, AUC=50.0. Age model: Brier=24.7, AUC=51.2. Baseline hemoglobin model: Brier=19.3.]

24 Continuous predictors: Baseline hemoglobin, Age
[ROC curves: sensitivity against 1 - specificity. Null model: Brier=24.7, AUC=50.0. Age model: Brier=24.7, AUC=51.2. Baseline hemoglobin model: Brier=19.3.]

25 Continuous predictors: Baseline hemoglobin, Age
[Re-classification plot: predicted chance under the hemoglobin model against predicted chance under the age model, with markers for treatment success and treatment failure.]

26 Multiple logistic regression
Model excluding epo receptor status
add <- lrm(Y ~ age + sex + HbBase + Treat + Resection,
           data = Epo, x = TRUE, y = TRUE)
Model including epo receptor status
add.epor <- lrm(Y ~ age + sex + HbBase + Treat + Resection + eporec,
                data = Epo, x = TRUE, y = TRUE)

27 Multiple logistic regression
[Re-classification plot: predicted chance including receptor status against predicted chance excluding receptor status, with markers for treatment success and treatment failure.]

28 Multiple logistic regression
[Calibration plot: observed proportion against predicted event probability. Null model: Brier=24.7, AUC=50.0. All variables: Brier=9.6, AUC=93.3. All + receptor status: Brier=8.7.]

29 Multiple logistic regression
[ROC curves: sensitivity against 1 - specificity. All variables: Brier=9.6, AUC=93.3. All + receptor status: Brier=8.7.]

30 Exercises
1. Do the tutorial 'Added value of new marker'.
2. Split the IVF data (see link on course homepage) at random into two parts (60% for learning, 40% for evaluation). Then, build a multiple logistic regression model to predict response. Include the following covariates: antfoll, smoking, fsh, ovolume, bmi.
3. Produce a table which shows the odds ratios with confidence limits (hint: Publish::publish.glm(t)) and write a caption which explains the table.
4. Produce a calibration plot and write a caption (hint: ModelGood::calPlot2).
5. Produce a ROC curve, add the Brier score and AUC as a legend, and write a caption.
6. Build a second logistic regression model where you include the above variables and add the variable cyclelen.
7. Evaluate the added value of cyclelen: re-classification table and plot (hint: ModelGood::reclass), difference in Brier scores and AUC with appropriate tests. Describe the underlying null hypotheses.
8. For each subject in the test data compute the difference of the predictions between the model which excludes cyclelen and the model that includes cyclelen. Consider this difference as a new continuous marker and produce the corresponding ROC curve and AUC. Describe the interpretation of AUC for this specific ROC curve in words and comment.

31 Model selection
Very many different 'logistic regression models' can be constructed by selecting subsets of variables and transformations/groupings of variables.
Standard multiple (logistic) regression works if
the number of predictors is not too large, and substantially smaller than the sample size
the decision maker has a priori knowledge about which variables to put into the model
Ad-hoc model selection algorithms, like automated backward elimination, do not lead to reproducible prediction models.

32 A Conversation of Richard Olshen with Leo Breiman 3
... Olshen: What about arcing, bagging and boosting?
Breiman: Okay. Yeah. This is fascinating stuff, Richard. In the last five years, there have been some really big breakthroughs in prediction. And I think combining predictors is one of the two big breakthroughs. And the idea of this was, okay, that suppose you take CART, which is a pretty good classifier, but not a great classifier. I mean, for instance, neural nets do a much better job.
Olshen: Well, suitably trained?
Breiman: Suitably trained.
Olshen: Against an untrained CART?
Breiman: Right. Exactly. And I think I was thinking about this. I had written an article on subset selection in linear regression. I had realized then that subset selection in linear regression is really a very unstable procedure. If you tamper with the data just a little bit, the first best five variable regression may change to another set of five variables. And so I thought, Okay. We can stabilize this by just perturbing the data a little and get the best five variable predictor. Perturb it again. Get the best five variable predictor and then average all these five variable predictors. And sure enough, that worked out beautifully. This was published in an article in the Annals (Breiman, 1996b)...
3 Statist. Sci. Volume 16, Issue 2 (2001)


34 Backward elimination
On full data (n=149):
library(rms)
data(Epo)
f7 <- lrm(Y ~ age + sex + HbBase + Treat + Resection + eporec,
          data = Epo, x = TRUE, y = TRUE)
fastbw(f7)
[Output: table of deleted factors (age, Resection) with Chi-Sq, d.f., P, Residual, AIC, followed by approximate estimates (Coef, S.E., Wald Z, P) after deleting factors, for Intercept, sex=female, HbBase, Treat=Epo, eporec.]
Factors in Final Model: sex HbBase Treat eporec

35 Backward elimination
On reduced data (n=130):
library(rms)
data(Epo)
set.seed(1731)
f7a <- lrm(Y ~ age + sex + HbBase + Treat + Resection + eporec,
           data = Epo[sample(1:149, replace = FALSE, size = 130), ],
           x = TRUE, y = TRUE)
fastbw(f7a)
[Output: now age, sex and Resection are deleted; approximate estimates remain for Intercept, HbBase, Treat=Epo, eporec.]
Factors in Final Model: HbBase Treat eporec

36 Guided model selection
The hope of conventional regression modelling is that the better the model fits the better it predicts. But, the model should predict new patients.
Prostate Cancer Risk Calculator: We used multivariable logistic regression to model the risk of prostate cancer by considering all possible combinations of main effects and interactions. The models chosen were those that minimized the Bayesian information criterion (BIC) and maximized the average out-of-sample area under the receiver operating characteristic curve (via 4-fold cross-validation).

37 The two cultures 4
4 L. Breiman. Statistical modeling: The two cultures. Statistical Science, 16(3), 2001.

38 The two cultures
[Figure.]

39 Classification trees
A tree model is a form of recursive partitioning. It lets the data decide which variables are important and where to place cut-offs in continuous variables.
In general terms, the purpose of the analyses via tree-building algorithms is to determine a set of splits that permit accurate prediction or classification of cases.
In other words: a tree is a combination of many medical tests.

40 Epo study
[Classification tree: the root (node 1) splits on arm (Placebo vs Epo). On the placebo side, node 2 splits on Resection ({No, Incomplete} vs Complete) into node 3 (n = 39) and node 4 (n = 35). On the epo side, node 5 splits on HbBase (<= 11.3 vs > 11.3) into node 6 (n = 19) and node 7 (n = 56).]

41 Roughly, the algorithm works as follows:
1. Find the predictor so that the best possible split on that predictor optimizes some statistical criterion over all possible splits on the other predictors.
2. For ordinal and continuous predictors, the split is of the form X < c versus X >= c.
3. Repeat step 1 within each previously formed subset.
4. Proceed until fewer than k observations remain to be split, or until nothing is gained from further splitting, i.e. the tree is fully grown.
5. The tree is pruned according to some criterion.
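Steps 1 and 2, the exhaustive search for the best cut-off, can be sketched for a single continuous predictor. The slides leave the statistical criterion unspecified; Gini impurity is used here as one common choice, and the hemoglobin values and outcomes are invented (Python used only for illustration):

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0 = pure node)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(x, y):
    """Try every cut-off X < c versus X >= c and return the cut-off that
    minimizes the size-weighted impurity of the two resulting subsets."""
    best = (None, float("inf"))
    for c in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi < c]
        right = [yi for xi, yi in zip(x, y) if xi >= c]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if left and right and score < best[1]:
            best = (c, score)
    return best

# Invented toy data: low hemoglobin goes with failure, high with success
hb = [9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
yy = [0, 0, 0, 1, 1, 1]
```

Repeating this search over all predictors picks the winning split (step 1), and recursing into each subset (step 3) grows the tree.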

42 Characteristics of classification trees
Trees are specifically designed for accurate classification/prediction
Results have a graphical representation and are easy to interpret
No model assumptions
Recursive partitioning can identify complex interactions
One can introduce different costs of misclassification in the tree
But:
Trees are not robust against even small perturbations of the data.
It is quite easy to over-fit the data.

43 More complex tree (overfitting?)
[Classification tree with further splits on arm (Placebo vs Epo), Resection ({No, Incomplete} vs Complete; No vs {Incomplete, Complete}), HbBase (cut-offs 11.3 and 12.1) and epo receptor status (positive vs negative). Terminal nodes: 3 (n = 39), 5 (n = 25), 6 (n = 10), 8 (n = 19), 10 (n = 18), 12 (n = 27), 13 (n = 11).]

44 Comparing the different predictions
[Table: for each patient no., the observed treatment success and the predicted probability (%) from the logistic regression model (LRM), the simple tree and the complex tree.]

45 Comparing the different predictions
[Table: Brier score and AUC for the simple tree, logistic regression, the complex tree and a random forest.]
Note: These numbers are estimated by using the same data that were used to construct the models.

46 Dilemma:
Both logistic regression with automated variable selection, e.g. backward elimination, and also decision trees are notoriously unstable (overfit).
How shall we proceed?

47 In search of a solution
Genuine algorithms to obtain a useful prediction model: X_i -> F̂(y | X_i)
Neural Nets
Support Vector Machines
Bump hunting and LASSO
Ridge regression and boosting
RandomForests
Logic regression
All these algorithms can be applied in high dimensional settings, i.e., when there are more candidate predictor variables than subjects.

48 Penalized likelihood regression
(works for logistic and Cox partial likelihood)
Ridge regression (shrinks):
β̂_ridge = argmax{ likelihood(β) - λ Σ_j β_j² }
LASSO regression (shrinks and selects):
β̂_LASSO = argmax{ likelihood(β) - λ Σ_j |β_j| }
Elastic net: combines L1 and L2 norm
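The two objectives differ only in the penalty term. The minimal Python sketch below just evaluates them for a given β, assuming a logistic likelihood and ignoring the intercept and the actual maximization (which glmnet does in R); the data and β values are invented:

```python
import math

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood of a logistic model (no intercept for brevity)."""
    ll = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, xi))
        ll += yi * eta - math.log1p(math.exp(eta))
    return ll

def ridge_objective(beta, X, y, lam):
    """likelihood(beta) - lambda * sum_j beta_j^2  (shrinks)."""
    return log_likelihood(beta, X, y) - lam * sum(b * b for b in beta)

def lasso_objective(beta, X, y, lam):
    """likelihood(beta) - lambda * sum_j |beta_j|  (shrinks and selects)."""
    return log_likelihood(beta, X, y) - lam * sum(abs(b) for b in beta)

# Tiny invented data set: one predictor column per subject row
X = [[0.5], [1.5], [-1.0], [2.0]]
y = [0, 1, 0, 1]
```

For |β_j| > 1 the quadratic ridge penalty dominates the L1 penalty, while near zero the L1 penalty stays linear; this is what lets the LASSO push small coefficients exactly to zero (selection) where ridge only shrinks them.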

49 Package glmnet
library(ModelGood)
library(glmnet)
g1a <- glmnet(y = as.numeric(Epo$Y) - 1,
              x = model.matrix(~ -1 + age + HbBase + Treat + Resection + eporec + sex,
                               data = Epo),
              alpha = 0.1)
g1 <- ElasticNet(Y ~ age + HbBase + Treat + Resection + eporec + sex,
                 data = Epo, alpha = 0.1)
plot(g1a)
print(g1)
$call
ElasticNet(formula = Y ~ age + HbBase + Treat + Resection + eporec + sex, data = Epo, alpha = 0.1)
$enet
Call: glmnet(x = covariates, y = response, alpha = 0.1, lambda = optlambda)
[Output: Df, %Dev and Lambda of the selected fit; $Lambda gives the optimal penalty.]
attr(,"class")
[1] "ElasticNet"

50 Shrunken regression coefficients
[Figure: coefficient paths plotted against the L1 norm.]

51 A function of the penalization parameter λ
[Figure: coefficient paths plotted against log(λ).]

52 Summary
Predicted probabilities for the unknown current or future event status of a subject can be obtained from penalized or unpenalized logistic regression models.
Predictions can also be obtained from a decision tree or random forest.
Re-classification plots, calibration plots, ROC curves, Brier score and AUC can be used to assess and compare the performance of different models.
The apparent comparison using the same data that were used to select and fit the models is not fair and may be grossly misleading.
Advanced algorithmic methods have tuning parameters which are optimized for obtaining accurate predictions.

53 Exercise 2.2
Consider the results of Exercise 2.1.
Change the seed used to split the IVF data several times and repeat the analysis. Report the Monte Carlo error in the AUC of the two models.
Introduce a random normal noise variable into the IVF data set and analyse its added value.
Repeat with 10 such variables to see if any of these random noise variables has higher added value than cyclelen.


More information

Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541

Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541 Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541 libname in1 >c:\=; Data first; Set in1.extract; A=1; PROC LOGIST OUTEST=DD MAXITER=100 ORDER=DATA; OUTPUT OUT=CC XBETA=XB P=PROB; MODEL

More information

Tips for surviving the analysis of survival data. Philip Twumasi-Ankrah, PhD

Tips for surviving the analysis of survival data. Philip Twumasi-Ankrah, PhD Tips for surviving the analysis of survival data Philip Twumasi-Ankrah, PhD Big picture In medical research and many other areas of research, we often confront continuous, ordinal or dichotomous outcomes

More information

Multinomial and Ordinal Logistic Regression

Multinomial and Ordinal Logistic Regression Multinomial and Ordinal Logistic Regression ME104: Linear Regression Analysis Kenneth Benoit August 22, 2012 Regression with categorical dependent variables When the dependent variable is categorical,

More information

Classification and Regression by randomforest

Classification and Regression by randomforest Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many

More information

Latent Class Regression Part II

Latent Class Regression Part II This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

Penalized Logistic Regression and Classification of Microarray Data

Penalized Logistic Regression and Classification of Microarray Data Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification

More information

Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients

Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients by Li Liu A practicum report submitted to the Department of Public Health Sciences in conformity with

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

This can dilute the significance of a departure from the null hypothesis. We can focus the test on departures of a particular form.

This can dilute the significance of a departure from the null hypothesis. We can focus the test on departures of a particular form. One-Degree-of-Freedom Tests Test for group occasion interactions has (number of groups 1) number of occasions 1) degrees of freedom. This can dilute the significance of a departure from the null hypothesis.

More information

Binary Logistic Regression

Binary Logistic Regression Binary Logistic Regression Main Effects Model Logistic regression will accept quantitative, binary or categorical predictors and will code the latter two in various ways. Here s a simple model including

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize

More information

Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) Overview Classes 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) 2-4 Loglinear models (8) 5-4 15-17 hrs; 5B02 Building and

More information

5. Multiple regression

5. Multiple regression 5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful

More information

Chapter 7: Simple linear regression Learning Objectives

Chapter 7: Simple linear regression Learning Objectives Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -

More information

CART 6.0 Feature Matrix

CART 6.0 Feature Matrix CART 6.0 Feature Matri Enhanced Descriptive Statistics Full summary statistics Brief summary statistics Stratified summary statistics Charts and histograms Improved User Interface New setup activity window

More information

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4. Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics

More information

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

More information

Predicting Health Care Costs by Two-part Model with Sparse Regularization

Predicting Health Care Costs by Two-part Model with Sparse Regularization Predicting Health Care Costs by Two-part Model with Sparse Regularization Atsuyuki Kogure Keio University, Japan July, 2015 Abstract We consider the problem of predicting health care costs using the two-part

More information

Univariate Regression

Univariate Regression Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is

More information

Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@

Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@ Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@ Yanchun Xu, Andrius Kubilius Joint Commission on Accreditation of Healthcare Organizations,

More information

!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"

!!!#$$%&'()*+$(,%!#$%$&'()*%(+,'-*&./#-$&'(-&(0*.$#-$1(2&.3$'45 !"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"!"#"$%&#'()*+',$$-.&#',/"-0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:

More information

Linda K. Muthén Bengt Muthén. Copyright 2008 Muthén & Muthén www.statmodel.com. Table Of Contents

Linda K. Muthén Bengt Muthén. Copyright 2008 Muthén & Muthén www.statmodel.com. Table Of Contents Mplus Short Courses Topic 2 Regression Analysis, Eploratory Factor Analysis, Confirmatory Factor Analysis, And Structural Equation Modeling For Categorical, Censored, And Count Outcomes Linda K. Muthén

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

Machine Learning Algorithms for Predicting Severe Crises of Sickle Cell Disease

Machine Learning Algorithms for Predicting Severe Crises of Sickle Cell Disease Machine Learning Algorithms for Predicting Severe Crises of Sickle Cell Disease Clara Allayous Département de Biologie, Université des Antilles et de la guyane Stéphan Clémençon MODALX - Univesité Paris

More information

Some Essential Statistics The Lure of Statistics

Some Essential Statistics The Lure of Statistics Some Essential Statistics The Lure of Statistics Data Mining Techniques, by M.J.A. Berry and G.S Linoff, 2004 Statistics vs. Data Mining..lie, damn lie, and statistics mining data to support preconceived

More information

Regression Analysis: A Complete Example

Regression Analysis: A Complete Example Regression Analysis: A Complete Example This section works out an example that includes all the topics we have discussed so far in this chapter. A complete example of regression analysis. PhotoDisc, Inc./Getty

More information

Statistical Models in R

Statistical Models in R Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

More information

Handling missing data in Stata a whirlwind tour

Handling missing data in Stata a whirlwind tour Handling missing data in Stata a whirlwind tour 2012 Italian Stata Users Group Meeting Jonathan Bartlett www.missingdata.org.uk 20th September 2012 1/55 Outline The problem of missing data and a principled

More information

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg IN SPSS SESSION 2, WE HAVE LEARNT: Elementary Data Analysis Group Comparison & One-way

More information

ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node

ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node Enterprise Miner - Regression 1 ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node 1. Some background: Linear attempts to predict the value of a continuous

More information

R 2 -type Curves for Dynamic Predictions from Joint Longitudinal-Survival Models

R 2 -type Curves for Dynamic Predictions from Joint Longitudinal-Survival Models Faculty of Health Sciences R 2 -type Curves for Dynamic Predictions from Joint Longitudinal-Survival Models Inference & application to prediction of kidney graft failure Paul Blanche joint work with M-C.

More information

Risk pricing for Australian Motor Insurance

Risk pricing for Australian Motor Insurance Risk pricing for Australian Motor Insurance Dr Richard Brookes November 2012 Contents 1. Background Scope How many models? 2. Approach Data Variable filtering GLM Interactions Credibility overlay 3. Model

More information

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

More information

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Beckman HLM Reading Group: Questions, Answers and Examples Carolyn J. Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Linear Algebra Slide 1 of

More information

Logistic Regression (a type of Generalized Linear Model)

Logistic Regression (a type of Generalized Linear Model) Logistic Regression (a type of Generalized Linear Model) 1/36 Today Review of GLMs Logistic Regression 2/36 How do we find patterns in data? We begin with a model of how the world works We use our knowledge

More information

Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -

Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 - Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create

More information

Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study)

Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study) Cairo University Faculty of Economics and Political Science Statistics Department English Section Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study) Prepared

More information

Event driven trading new studies on innovative way. of trading in Forex market. Michał Osmoła INIME live 23 February 2016

Event driven trading new studies on innovative way. of trading in Forex market. Michał Osmoła INIME live 23 February 2016 Event driven trading new studies on innovative way of trading in Forex market Michał Osmoła INIME live 23 February 2016 Forex market From Wikipedia: The foreign exchange market (Forex, FX, or currency

More information

10. Analysis of Longitudinal Studies Repeat-measures analysis

10. Analysis of Longitudinal Studies Repeat-measures analysis Research Methods II 99 10. Analysis of Longitudinal Studies Repeat-measures analysis This chapter builds on the concepts and methods described in Chapters 7 and 8 of Mother and Child Health: Research methods.

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

More information

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information

Ordinal Regression. Chapter

Ordinal Regression. Chapter Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Biostatistics Short Course Introduction to Longitudinal Studies

Biostatistics Short Course Introduction to Longitudinal Studies Biostatistics Short Course Introduction to Longitudinal Studies Zhangsheng Yu Division of Biostatistics Department of Medicine Indiana University School of Medicine Zhangsheng Yu (Indiana University) Longitudinal

More information

Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan

Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:

More information

Ridge Regression. Patrick Breheny. September 1. Ridge regression Selection of λ Ridge regression in R/SAS

Ridge Regression. Patrick Breheny. September 1. Ridge regression Selection of λ Ridge regression in R/SAS Ridge Regression Patrick Breheny September 1 Patrick Breheny BST 764: Applied Statistical Modeling 1/22 Ridge regression: Definition Definition and solution Properties As mentioned in the previous lecture,

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.

Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models. Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models. Dr. Jon Starkweather, Research and Statistical Support consultant This month

More information

Categorical Data Analysis

Categorical Data Analysis Richard L. Scheaffer University of Florida The reference material and many examples for this section are based on Chapter 8, Analyzing Association Between Categorical Variables, from Statistical Methods

More information

Modeling Lifetime Value in the Insurance Industry

Modeling Lifetime Value in the Insurance Industry Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting

More information

5.1 CHI-SQUARE TEST OF INDEPENDENCE

5.1 CHI-SQUARE TEST OF INDEPENDENCE C H A P T E R 5 Inferential Statistics and Predictive Analytics Inferential statistics draws valid inferences about a population based on an analysis of a representative sample of that population. The

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

More information