Predicting Health Care Costs by Two-part Model with Sparse Regularization
Atsuyuki Kogure
Keio University, Japan
July 2015

Abstract

We consider the problem of predicting health care costs using the two-part model with sparse regularization. The data we use are health care insurance claims data and health screening data of 10,000 individuals over the period from 2010 through 2012, randomly taken from several private health care plans in Japan. We construct a predictive model using the year 2011 health care cost variable as the response variable and all the year 2010 variables as explanatory variables. We then measure the accuracy of the model for predicting the year 2012 health care costs by comparing the actual year 2012 health care cost data with the predictions from the model with the explanatory variables replaced by those of the year 2011. The results show that the sparse regularization techniques improve the prediction accuracy.

1 Introduction

We consider the problem of predicting health care costs using sparse regularized regression techniques. Our aim is to explore the potential of sparse techniques, applied to the two-part model, for improving the accuracy of predicting an individual's cost in the next year. The data we use are health care insurance claims data and health screening data of 10,000 individuals over the period from 2010 through 2012, which were randomly taken from several private health care plans in Japan [1]. For our purpose we estimate two-part models using the total medical expenditures in the year 2011 as the response with all the variables in the year 2010 as explanatory variables. There are 27 variables available in each year. To account for possible nonlinearities in the regression equations, we estimate the model with all the two-way interactions between the covariates, giving 378 explanatory variables in total.

We construct a predictive model using the year 2011 health care cost variable as the response variable and the year 2010 variables as explanatory variables. We then measure the accuracy of the model for predicting the 2012 health care costs by comparing the actual 2012 health care cost data with the predictions based on the model with the explanatory variables replaced by those of the year 2011. The results show that the sparse techniques improve the prediction accuracy, similar to the findings reported in Loginov et al. (2013). However, we have applied only the basic form of the lasso regularization at present. Since the original lasso was proposed, many advancements have been made, including the elastic net (Zou and Hastie, 2005), the adaptive lasso (Zou, 2006) and the Bayesian lasso (Park and Casella, 2008). In spite of the improvement, we note that the prediction error is still large under the sparse regularization. This problem mainly arises from the difficulty of predicting the inpatient medical expenditure: the prediction error is made considerably smaller by excluding the inpatient medical expenditure from the total medical expenditure.

[1] The data were kindly supplied by Japan Medical Data Center (JMDC).
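The count of 378 explanatory variables follows directly from the 27 base covariates plus all pairwise interactions; a quick arithmetic check (shown in R, which is assumed for all code sketches below):

```r
p_base  <- 27
p_total <- p_base + choose(p_base, 2)  # 27 main effects + 351 pairwise interactions
p_total                                # 378
```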
2 Two-part models

Health care insurance claims data take nonnegative values but have a substantial proportion of values at zero. Modeling such zero-inflated data is challenging, and many methods have been proposed, including the Tobit model (Tobin, 1958), the two-part model (Duan et al., 1983), the sample selection model (Heckman, 1979) and the Tweedie model (Jorgensen, 1987). Among these we choose the two-part model, which appears thus far to be the most popular method for modeling health care expenditures.

2.1 Regression models

The two-part model uses two regression equations to separate the modeling of the insurance claim data into two parts. The first part refers to whether the claim outcome is positive. Conditional on its being positive, the second part refers to its level. To be more specific, let $Y_i$ denote the claim amount for individual $i$ ($i = 1, 2, \ldots, n$) and let $x_i$ denote the vector of explanatory variables associated with it. Then the conditional distribution of $Y_i$ given $x_i$ is

$$
f_Y(y_i; \theta \mid x_i) =
\begin{cases}
\Pr(Y_i = 0; \theta_1 \mid x_i) & \text{if } y_i = 0 \\
f_Y(y_i; \theta_2 \mid x_i, y_i > 0)\, \Pr(Y_i > 0; \theta_1 \mid x_i) & \text{if } y_i > 0
\end{cases}
\tag{1}
$$

where $\theta = (\theta_1, \theta_2)$ denotes the pair of parameter vectors of the first and the second parts. To estimate $\Pr(Y_i > 0; \theta_1 \mid x_i)$ for the first part in (1), we choose a logistic regression

$$
\Pr(Y_i > 0; \theta_1 \mid x_i) = \frac{1}{1 + \exp\{-\theta_1' x_i\}}.
$$

Several specifications have been applied for estimating $f_Y(y_i; \theta_2 \mid x_i, y_i > 0)$ for the second part in (1). Given $Y_i > 0$, one might simply assume the standard linear normal model

$$
Y_i = \theta_2' x_i + \varepsilon_i,
$$

where the $\varepsilon_i$ are errors assumed to be independent and identically distributed as a normal distribution with zero mean and constant variance $\sigma^2$. As a way to alleviate the severe nonnormality of the response variable, many authors turn to the log normal specification:

$$
\log Y_i = \theta_2' x_i + \varepsilon_i.
$$

Some authors have pointed out a difficulty due to the transformation of $Y_i$ and suggested using a generalized linear model (GLM); see, for example, Blough et al. (1999). A natural GLM specification is the Gamma regression

$$
f_Y(y_i; \theta_2 \mid x_i, y_i > 0) = \frac{(s/\mu_i)^s}{\Gamma(s)}\, y_i^{s-1} \exp\left(-\frac{s y_i}{\mu_i}\right)
$$

with the link function

$$
\log(\mu_i) = \theta_2' x_i,
$$

where $\mu_i = E[Y_i]$ and $s$ is the shape parameter.
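As a concrete illustration, the two parts can be fitted separately with standard GLM tools. The following is a minimal sketch, assuming a data frame `dat` with claim amount `y` and illustrative covariates `x1`, `x2` (hypothetical names, not those of the actual data set):

```r
# Two-part model, fitted separately: a minimal sketch.
dat$positive <- as.integer(dat$y > 0)

# Part 1: logistic regression for Pr(Y_i > 0 | x_i)
fit1 <- glm(positive ~ x1 + x2, data = dat, family = binomial)

# Part 2: Gamma GLM with log link, fitted on the positive claims only
fit2 <- glm(y ~ x1 + x2, data = subset(dat, y > 0),
            family = Gamma(link = "log"))
```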
2.2 Prediction

Let $\hat{Y}$ denote a prediction of $Y$ based on $x$, and adopt the mean squared prediction error

$$
E\left[\left(Y - \hat{Y}\right)^2\right]
$$

to measure the prediction accuracy. Then the optimal prediction is given by the conditional mean of $Y$ given $x$. For the two-part model, it is

$$
\begin{aligned}
E[Y; \theta \mid x] &= E[Y; \theta_2 \mid Y > 0, x]\, \Pr(Y > 0; \theta_1 \mid x) + E[Y; \theta_2 \mid Y = 0, x]\, \Pr(Y = 0; \theta_1 \mid x) \\
&= E[Y; \theta_2 \mid Y > 0, x]\, \Pr(Y > 0; \theta_1 \mid x)
\end{aligned}
\tag{2}
$$

For the normal case, (2) becomes

$$
E[Y; \theta \mid x] = E[x'\theta_2 + \varepsilon]\, \Pr(Y > 0 \mid x) = \frac{x'\theta_2}{1 + \exp\{-\theta_1' x\}}
\tag{3}
$$

and for the Gamma GLM case the numerator $x'\theta_2$ is replaced by the GLM mean $\exp\{\theta_2' x\}$. For the log normal case, (2) becomes

$$
E[Y; \theta \mid x] = E[\exp\{x'\theta_2\} \exp\{\varepsilon\}]\, \Pr(Y > 0 \mid x) = \frac{\exp\{x'\theta_2 + \sigma^2/2\}}{1 + \exp\{-\theta_1' x\}}.
$$

2.3 Estimation

In practice, the parameters must be estimated from the data. The standard practice is to maximize the likelihood

$$
\prod_{i=1}^n f_Y(y_i; \theta \mid x_i) = \prod_{i=1}^n \left[\Pr(Y_i = 0; \theta_1 \mid x_i)^{b_i}\, \Pr(Y_i > 0; \theta_1 \mid x_i)^{1-b_i}\right] \prod_{i \in N_1} f_Y(y_i; \theta_2 \mid x_i, y_i > 0),
\tag{4}
$$

where $b_i$ is defined as

$$
b_i = \begin{cases} 1 & \text{if } y_i = 0 \\ 0 & \text{if } y_i > 0 \end{cases}
$$

and $N_1$ denotes the set of observations with $y_i > 0$. Thus the likelihood factors into two parts, and estimation can be done separately for each factor.

3 Sparse regularized regression

Let $p$ denote the number of parameters in the model. Given a fixed parameter size $p$, it is well known that the maximum likelihood estimation method of Section 2.3 is asymptotically optimal as the sample size $n$ becomes large. In practice, however, $n$ is fixed, and the issue of over-fitting arises: given a fixed sample size $n$, the in-sample fit of a larger model always exceeds that of a smaller model. One approach to dealing with this issue is to select a subset of explanatory variables that we believe to be related to the response variable. In this approach we fit a separate regression model for each possible combination of the $p$ explanatory variables and select a single best model using a criterion such as AIC or BIC from among the $2^p$ possibilities. However, this quickly becomes infeasible as $p$ gets large (an NP-hard problem).
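To see how quickly exhaustive subset selection becomes infeasible at the dimensions considered here, a two-line check suffices:

```r
# Number of candidate models in best-subset selection
2^27   # about 1.3e8 subsets with the 27 base covariates
2^378  # about 6.2e113 subsets once two-way interactions are included
```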
Lasso (Tibshirani, 1996) is a relatively new approach that prevents such over-fitting by penalizing models with larger parameter values. It is a shrinkage regression method but, unlike traditional shrinkage methods such as ridge regression, it makes some of the parameters exactly zero, and is thus called a sparse regularization. Owing to this sparsity property, lasso is used as a selection method for explanatory variables in high-dimensional data environments where traditional criteria such as AIC and BIC run into computational difficulty. See Vidaurre et al. (2013) for a review of lasso regularization.

We use the health care insurance claims data and health screening data with 10,000 individuals from 2010 to 2012. To estimate regression models for the insurance claim as the response variable, 27 covariates are available in each year. To account for possible nonlinearity, we incorporate all the two-way interactions between the variables into the regression equations, which inflates the number of explanatory variables to 378. To deal with such high dimensionality, we apply the lasso method to the first and second parts separately. The lasso regularization is applied by maximizing the objective function

$$
\begin{aligned}
l(\theta) &= \frac{1}{n} \sum_{i=1}^n \log f_Y(y_i; \theta \mid x_i) - \pi_1(\theta_1) - \pi_2(\theta_2) \\
&= \frac{1}{n} \sum_{i=1}^n \log\left[\Pr(Y_i = 0; \theta_1 \mid x_i)^{b_i}\, \Pr(Y_i > 0; \theta_1 \mid x_i)^{1-b_i}\right] - \pi_1(\theta_1) \\
&\quad + \frac{1}{n} \sum_{i \in N_1} \log f_Y(y_i; \theta_2 \mid x_i, y_i > 0) - \pi_2(\theta_2)
\end{aligned}
\tag{5}
$$

where $\pi_i(\theta_i)$ is a penalty term for $\theta_i = \{\theta_j^{(i)}, j = 1, \ldots, p_i\}$ defined as

$$
\pi_i(\theta_i) = \lambda_i \sum_{j=1}^{p_i} |\theta_j^{(i)}|, \quad i = 1, 2.
$$

Each $\lambda_i$ is a positive quantity that tunes the strength of the regularization, and is thus called a tuning parameter. Several methods have been proposed to select the tuning parameter; we use cross validation for the selection.
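Because (5) separates into the two parts, each part can be fitted and cross-validated independently. The paper does not name its software; the sketch below uses R's glmnet package as one standard lasso solver, assuming a design matrix `X` and cost vector `y` (hypothetical names):

```r
library(glmnet)

# Part 1: lasso-penalized logistic regression, lambda_1 by cross validation
cv1 <- cv.glmnet(X, as.integer(y > 0), family = "binomial", alpha = 1)

# Part 2: lasso-penalized linear regression on the positive claims,
# lambda_2 by cross validation
pos <- y > 0
cv2 <- cv.glmnet(X[pos, ], y[pos], family = "gaussian", alpha = 1)

# Coefficients at the selected lambdas; many are exactly zero
coef(cv1, s = "lambda.min")
coef(cv2, s = "lambda.min")
```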
Since the original lasso was proposed, many advancements have been made. The elastic net (Zou and Hastie, 2005) uses the penalty term

$$
\pi(\theta) = \lambda \left[ \alpha \sum_{j=1}^{p} |\theta_j| + (1 - \alpha) \sum_{j=1}^{p} \frac{\theta_j^2}{2} \right]
$$

to overcome the situation where $p > n$. Here $\alpha$ is a value between 0 and 1: if $\alpha = 1$ the elastic net becomes the standard lasso, and if $\alpha = 0$ it reduces to ridge regression. The adaptive lasso (Zou, 2006) adopts the penalty term

$$
\pi(\theta) = \lambda \sum_{j=1}^{p} w_j |\theta_j| \quad \text{with} \quad w_j = 1/|\hat{\theta}_j|^\gamma, \quad \gamma > 0,
$$

where $\hat{\theta}_j$ is a consistent estimator of $\theta_j$; the intent is to ease the penalty on parameters with large values of $|\hat{\theta}_j|$. Park and Casella (2008) noted that the penalty term in the lasso corresponds to the prior distribution

$$
f(\theta \mid \sigma, \lambda) = \prod_{j=1}^{p} \frac{\lambda}{2} \exp\left\{ -\frac{\lambda |\theta_j|}{\sigma} \right\}
$$

under the standard Bayesian framework, and proposed a Bayesian procedure called the Bayesian lasso.

4 Analysis of Japanese Medical Data

In this section we present a simple data analysis. We estimated the two-part model with the logistic regression as the first part and the standard linear model as the second part, with and without the lasso regularization. We considered two cases. For Case 1 we used the 27 variables described in the next subsection as explanatory variables. For Case 2 we used the 27 variables plus all the possible two-way interactions between them, which results in 378 explanatory variables in total.

4.1 Data

For each year from 2010 to 2012 we have the following observations on each individual:

Demographic variables: sex (SEX), age (AGE);

Health screening variables: body mass index (BMI), systolic blood pressure (SBP), diastolic blood pressure (DBP), neutral fat (NF), HDL cholesterol (HDLC), LDL cholesterol (LDLC), glutamate oxaloacetic transaminase (GOT), glutamate pyruvate transaminase (GPT), gamma-glutamyl transpeptidase (GGT), fasting blood sugar (FBS), hemoglobin A1c (HbA1c), urinal sugar (US);

Frequency and severity variables: total medical expenses (TME), inpatient medical expenses (IME), outpatient medical expenses (OME), pharmaceutical expenses (PE), hospital inpatient days (HID), outpatient visits (OV);

Medical practice variables: emergency medical care (EMC), cholesterol medication (CM), diabetes medication (DM), blood pressure medication (BPM);

Disease type variables: hyperlipidemia (HL), diabetes (DB), high blood pressure (HBP), liver disease (LD).

We omit the outpatient medical expenses (OME) variable from the estimation because there is an exact linear relationship among total medical expenses (TME), inpatient medical expenses (IME), outpatient medical expenses (OME), and pharmaceutical expenses (PE), namely TME = IME + OME + PE.

The target for the prediction is the total medical expenses (TME). The summary statistics of the target variable in years 2010 to 2012 are given in Table 1.

            Min.   1st Qu.   Median   Mean     3rd Qu.   Max          SD
TME.2010    0      …,590     29,820   91,920   92,950    12,570,…     …,740
TME.2011    0      …,868     38,…     …        …,000     9,389,…      …,642
TME.2012    0      …,178     39,…     …        …,800     6,767,000    …,818

Table 1: Summary statistics of total medical expenditures (… denotes digits missing from the source)
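For reference, the Case 2 design matrix with all two-way interactions and OME dropped is straightforward to construct; a sketch, where `covars2010` is a hypothetical data frame of the 28 variables listed above:

```r
# Sketch: Case 2 design matrix from the year-2010 covariates.
# OME is dropped because TME = IME + OME + PE makes it collinear.
covars <- subset(covars2010, select = -OME)
X <- model.matrix(~ .^2, data = covars)[, -1]  # main effects + two-way interactions
ncol(X)  # 378 when all 27 remaining covariates are numeric
```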
4.2 Case 1

We use the 27 variables described in the section above as explanatory variables.

4.2.1 Analysis without the lasso regularization

Table 2 shows the results of the logistic regression of the binary variable

$$
I_i = \begin{cases} 1 & \text{if TME.2011}_i > 0 \\ 0 & \text{if TME.2011}_i = 0 \end{cases}
$$

on the 27 variables in year 2010. The marks on the far right of the table indicate significance levels. Table 3 shows the results of the linear regression of $Y_i = \text{TME.2011}_i$, for $\text{TME.2011}_i > 0$, on the 27 variables in year 2010.

[Table 2: Coefficients of the logistic regression (Estimate, Std. Error, z value, p value); significant covariates: SEX (***), PE (***), HID (*), OV (***), EMC (*), HBP (**)]

We construct a predictive model using the year 2011 health care cost variable as the response variable and the year 2010 variables as explanatory variables. We then measure the accuracy of the model for predicting the 2012 health care costs by comparing the actual 2012 health care cost data and the predicted values based on the constructed model with the explanatory variables replaced by those of year 2011.
[Table 3: Coefficient estimates of the linear regression (Estimate, Std. Error, t value, p value); significant covariates: US (***), TME (***), IME (***), OV (***), BPM (**), LD (*)]

We calculated the predicted values of TME.2012 based on the conditional mean (3), using the estimated parameter values and the year 2011 explanatory variables. Where a predicted value was negative, we converted it to zero. The summary statistics are given in Table 4. To measure the prediction accuracy we adopt the mean squared prediction error, calculated as

$$
\frac{1}{n} \sum_{i=1}^{n} \left(\text{TME.2012}_i - \text{PRED.2012}_i\right)^2.
$$

            Min.   1st Qu.   Median   Mean   3rd Qu.   Max
PRED.2012   0      …,920     62,…     …      …,600     7,005,000
TME.2012    0      …,178     39,…     …      …,800     6,767,000

Table 4: Summary statistics of PRED.2012 and TME.2012: Case 1 without lasso
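The prediction step itself is mechanical once the two parts are fitted. A sketch reusing `fit1` and `fit2` from the earlier snippets (for Case 1 the second part is the standard linear model), with `newx` a hypothetical data frame of year-2011 covariates and `tme2012` the realized year-2012 costs:

```r
# Prediction by the conditional mean (3)
p_pos <- predict(fit1, newdata = newx, type = "response")  # Pr(Y > 0 | x)
level <- predict(fit2, newdata = newx, type = "response")  # E[Y | Y > 0, x]
pred  <- pmax(p_pos * level, 0)  # negative predictions converted to zero

mspe <- mean((tme2012 - pred)^2)  # mean squared prediction error
sqrt(mspe)                        # root mean squared prediction error
```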
4.2.2 Analysis with the lasso regularization

We estimated the logistic regression with the lasso regularization for the first part and the standard linear regression with the lasso regularization for the second part. The value of $\lambda$ chosen by cross validation for the logistic regression turned out to be very small, which resulted in the same estimation results as the logistic regression without regularization. In contrast, the $\lambda$ for the linear regression is considerable, which made the estimates of many coefficients exactly zero, as seen in Table 5. Figure 1 shows the mean squared error for different values of $\lambda$.

[Figure 1: Mean squared error for different values of λ]

We predict each individual's cost for year 2012 using the estimated model and compare the prediction errors. The summary statistics are given in Table 6. The root mean squared prediction error is compared with those of the other specifications in Table 9.

4.3 Case 2

We incorporate all the two-way interactions between the variables into the regression equations, which inflates the number of covariates to 378.

4.3.1 Analysis without the lasso regularization

The R squared of the linear regression increased with the inclusion of the interaction terms. The summary statistics are given in Table 7. The root mean squared prediction error is reported in Table 10.

4.3.2 Analysis with the lasso regularization

We used the logistic regression without the lasso regularization for the first part and the linear regression with the lasso regularization for the second part. The $\lambda$ chosen by cross validation is considerable, which set the coefficients of 353 of the 378 explanatory variables exactly to zero. Figure 2 shows the mean squared error for different values of $\lambda$. We predict each individual's cost for year 2012 using the estimated model and compare the prediction errors. The summary statistics are given in Table 8. The root mean squared prediction error is reported in Table 10.
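The cross-validation curves in Figures 1 and 2 and the induced sparsity can be read directly off the fitted object; a sketch under the same glmnet assumption as before (`cv2` from the Section 3 snippet):

```r
# Cross-validation curve, as in Figures 1 and 2, and the induced sparsity
plot(cv2)          # mean squared error against log(lambda)
cv2$lambda.min     # tuning parameter selected by cross validation

b <- coef(cv2, s = "lambda.min")
sum(b[-1] == 0)    # number of coefficients shrunk exactly to zero
```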
[Table 5: Comparison of estimates (with z values) of the linear regressions with and without the lasso regularization; under the lasso only the intercept and the coefficients of AGE, DBP, US, TME, IME, BPM and LD remain nonzero]

            Min.   1st Qu.   Median   Mean   3rd Qu.   Max
PRED.2012   0      …,760     63,…     …      …,400     6,678,000
TME.2012    0      …,178     39,…     …      …,800     6,767,000

Table 6: Summary statistics of PRED.2012 and TME.2012: Case 1 with lasso

            Min.   1st Qu.   Median   Mean   3rd Qu.   Max
PRED.2012   0      …,040     63,…     …      …,600     9,848,000
TME.2012    0      …,178     39,…     …      …,800     6,767,000

Table 7: Summary statistics of PRED.2012 and TME.2012: Case 2 without lasso

            Min.   1st Qu.   Median   Mean   3rd Qu.   Max
PRED.2012   0      …,620     68,…     …      …,400     9,669,000
TME.2012    0      …,178     39,…     …      …,800     6,767,000

Table 8: Summary statistics of PRED.2012 and TME.2012: Case 2 with lasso

4.4 Comparison of prediction errors

Tables 9 and 10 compare the mean squared prediction errors between the models with and without the lasso regularization for Case 1 ($p = 27$) and Case 2 ($p = 378$).
[Figure 2: Mean squared error for different values of λ]

In the tables the prediction errors for the ridge regression and the elastic net are also listed. We see that the lasso regularization improves the prediction accuracy the most in both cases.

[Table 9: Comparison of prediction errors for Case 1: variance of TME.2012 and prediction errors with no regularization, the lasso, the ridge, and the elastic net with α = 0.5]

[Table 10: Comparison of prediction errors for Case 2: variance of TME.2012 and prediction errors with no regularization, the lasso, the ridge, and the elastic net with α = 0.5]

In spite of the improvement, we note that the prediction error is still large with the sparse regularization. This problem mainly arises from the difficulty of predicting the inpatient medical expenditure (IME). Table 11 shows the mean squared prediction errors of TME minus IME using the two-part model with and without the sparse regularization. We note that the mean squared prediction error relative to the variance becomes considerably smaller.

[Table 11: Comparison of prediction errors of TME minus IME for Case 2: variance of (TME.2012 - IME.2012) and prediction errors with no regularization and with the lasso]

For the year 2011 data we note that only 4% of the IME values are positive, but the positive values tend to be rather large. We also note that positive IME is strongly related to the hospital inpatient days (HID). We are currently developing an extended two-part model to predict the IME by taking the HID into consideration.
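The ridge and elastic net entries in Tables 9 and 10 correspond to varying the mixing parameter of the same penalty family; a sketch under the same assumed objects as before:

```r
# Ridge (alpha = 0), elastic net (alpha = 0.5) and lasso (alpha = 1)
# for the second-part regression, each with cross-validated lambda
fits <- lapply(c(ridge = 0, enet = 0.5, lasso = 1), function(a)
  cv.glmnet(X[pos, ], y[pos], family = "gaussian", alpha = a))

# Cross-validated mean squared error at the selected lambda
sapply(fits, function(f) min(f$cvm))
```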
5 Conclusions

We have considered the problem of predicting health care costs using the two-part model with sparse regression techniques. We constructed a predictive model using the year 2011 health care cost variable as the response variable and the year 2010 variables as explanatory variables. We then measured the accuracy of the model for predicting the 2012 health care costs by comparing the actual 2012 health care cost data and the predicted values based on the constructed model with the explanatory variables replaced by those of year 2011. The results show that the sparse techniques improve the prediction accuracy.

References

Blough, D.K., Madden, C.W., and Hornbrook, M.C. (1999), Modeling risk using generalized linear models. Journal of Health Economics, 18, 153-171.

Duan, N., Manning, W.G. Jr., Morris, C.N., and Newhouse, J.P. (1983), A comparison of alternative models for the demand for medical care (Corr: V2 P413). Journal of Business and Economic Statistics, 1, 115-126.

Heckman, J. (1979), Sample selection bias as a specification error. Econometrica, 47, 153-161.

Jorgensen, B. (1987), Exponential dispersion models. Journal of the Royal Statistical Society, Series B, 49, 127-162.

Loginov et al. (2013), Predictive Modeling in Healthcare Costs Using Regression Models, ARCH Proceedings.

Park, T. and Casella, G. (2008), The Bayesian lasso. Journal of the American Statistical Association, 103, 681-686.

Tibshirani, R. (1996), Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288.

Tobin, J. (1958), Estimation of relationships for limited dependent variables. Econometrica, 26, 24-36.

Vidaurre, D., Bielza, C. and Larranaga, P. (2013), A Survey of L1 Regression. International Statistical Review, 81.

Zou, H. (2006), The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418-1429.

Zou, H. and Hastie, T. (2005), Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society, Series B, 67, 301-320.