A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic




Report prepared for
Brandon Slama
Department of Health Management and Informatics
University of Missouri, Columbia

By
Tyler Cook
Chathuri Daluwatte

Under the direction of
Lori Thombs, Ph.D.
Director, Social Sciences Statistics Center
Department of Statistics
University of Missouri, Columbia

Executive Summary

The goal of this report is to identify variables and derive a model to predict the probability of a no show at a free health care clinic using an observed data set. To estimate this probability we performed logistic regression, discriminant analysis and univariate analyses on the data set. However, the two multivariate methods (logistic regression and discriminant analysis) fell victim to the high number of missing values in the data set. The large amount of incomplete data decreased the power of the analysis, so determining an accurate prediction model with reasonable error rates was almost impossible. Nevertheless, our results suggest that patients who did not show up for a scheduled appointment tended to have a larger average number of days between when the visit was scheduled and the date of the appointment. Also, the patients who confirmed their appointment during the reminder call were much more likely to actually arrive.

Goal of the Study

Health care is the provision of diagnosis, treatment and prevention of disease. Health care systems are organizations established to meet these health needs in target populations; they are owned and operated by different entities and to a variety of standards. Free clinics are health care systems in which services are provided to the public free of charge. The clinic of interest in this study is one such free clinic, run by volunteers and volunteer physicians.

This is an observational study of no shows for scheduled appointments at this free health clinic. The goal of the study is to predict the probability of a no show using the information about scheduled patients that is stored in the clinic database. Such a model would help the clinic management to intervene and try to reduce the likelihood of no shows, and thus save valuable time for the volunteer staff and physicians.

Data Set

Because the data set came from an existing database, data cleaning was required prior to analysis. The data set included one dependent variable, visit status, which can take three values: Arrived, Cancelled and No show. Since the goal of the study was to model the probability of a no show, we removed the cancelled appointments from the data set, leaving visit status with two values.

The data set included three continuous independent variables: age, distance to the clinic, and days from the date the appointment was set up to the appointment date. The data set also had six categorical variables: previous visit status, patient status, visit type, reminder call status, clinic type and scheduled by.

Previous visit status had five levels, as shown in Table 1, but due to lack of data we removed the levels "pending" and "rescheduled" from the data set.

Table 1

Reminder call status also had five levels. As with previous visit status, we removed the levels "cancelled" and "rescheduled" (Table 2).

Table 2

Clinic type had four levels, but as shown in Table 3, three of the clinic types (Diabetes care, Dermatology, MSK night) had very few observations compared to the MedZou Clinic. We therefore redefined the variable by combining all clinic types other than the MedZou Clinic into one level named "Non MedZou".

Table 3

Visit type had five levels but, as with clinic type, one level ("full visit") had a very high frequency, as shown in Table 4. We therefore redefined the variable to have two levels, "full visit" and "non full visit".

Table 4

Patient status had two levels, "new" and "return", which we used as is (Table 5). The scheduled by variable, which identifies the person who scheduled the appointment by a number, had too many levels with low counts at each level, so we decided not to use it in our analysis.

Table 5
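The recoding of clinic type and visit type described above amounts to collapsing sparse levels into a single catch-all level. A minimal sketch of that step follows, written in Python rather than in the SAS environment the report used; the function name `collapse_levels` and the example records are hypothetical, and only the level names come from the report.

```python
from collections import Counter


def collapse_levels(values, keep, other_label):
    """Map every level not in `keep` to `other_label`; missing values (None) stay missing."""
    return [v if (v is None or v in keep) else other_label for v in values]


# Hypothetical clinic-type values: the level names follow the report, the records are invented.
clinic_type = ["MedZou", "Diabetes care", "MedZou", "Dermatology", None, "MSK night", "MedZou"]

# Everything other than the MedZou Clinic becomes the single level "Non MedZou".
clinic_binary = collapse_levels(clinic_type, keep={"MedZou"}, other_label="Non MedZou")

print(Counter(clinic_binary))
```

The same helper would apply to visit type with `keep={"full visit"}` and `other_label="non full visit"`. Missing values are deliberately left untouched so that the recoding does not hide how much data is incomplete.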

Logistic Regression

In order to predict the probability of a no show in the dependent variable visit status, we attempted to fit a logistic regression model. We implemented the model using the GENMOD procedure (PROC GENMOD) in SAS and first fit a model containing all the independent variables. This model identified reminder call status, days and age as significant predictors of the probability of a no show.

The predictive ability of the model was assessed by calculating error rates, using 0.5 as the threshold on the predicted probability. Since the model predicts the probability of a no show, a predicted probability greater than 0.5 was treated as a predicted no show, while a probability less than or equal to 0.5 was treated as a predicted arrival. The results are summarized in Table 6. As is evident from Table 6, 73.68% of the true no shows were misclassified as arrivals, which is unacceptably high.

PROC GENMOD deletes observations with missing values. Although the data set supplied to the procedure had 771 observations, 497 of them were not used in fitting the model because of missing values, which drastically degraded the model's ability to predict.

Logistic Regression Prediction Error Rates for the Full Model

                     Predicted
True          Arrived    No Show    Total
Arrived       163        16         179
              91.06%     8.94%      100.00%
No Show       70         25         95
              73.68%     26.32%     100.00%
Total         233        41         274
              85.04%     14.96%     100.00%

Table 6

By trying various combinations of variables in the logistic regression we identified the best-fitting model, which used the following seven variables: age, days to appointment, distance, patient status, visit type, reminder call status and clinic type. This model selected patient status, reminder call status and age as significant predictors of the probability of a no show. The error rates of the model are reported in Table 7, but its performance is not very different from that of the full model. Removing the variable previous visit status increased the number of observations used in the model by 167, but 330 observations were still unused because of missing values.
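The 0.5 cutoff rule and the row-wise error rates can be sketched in a few lines. This is an illustrative Python sketch, not the SAS code behind the report: the function names `classify` and `row_rates` are hypothetical, and the counts are copied from Table 6.

```python
def classify(p_no_show, threshold=0.5):
    """Predict a no show only when the probability is strictly greater than the threshold."""
    return "No Show" if p_no_show > threshold else "Arrived"


def row_rates(counts):
    """Row percentages of a {true label: {predicted label: count}} classification table."""
    rates = {}
    for true_label, row in counts.items():
        total = sum(row.values())
        rates[true_label] = {pred: 100.0 * n / total for pred, n in row.items()}
    return rates


# Counts from Table 6 (full model): outer key is the true status, inner key the predicted status.
table6 = {
    "Arrived": {"Arrived": 163, "No Show": 16},
    "No Show": {"Arrived": 70, "No Show": 25},
}

rates = row_rates(table6)
print(classify(0.5), classify(0.51))
print(f"True no shows predicted to arrive: {rates['No Show']['Arrived']:.2f}%")
```

Because each row is normalized by the number of patients with that true status, the figure of interest for an intervention program is the "No Show" row: the share of real no shows that the model fails to flag.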

Logistic Regression Prediction Error Rates for the Best Fit Model

                     Predicted
True          Arrived    No Show    Total
Arrived       272        32         304
              89.47%     10.53%     100.00%
No Show       99         38         137
              72.26%     27.74%     100.00%
Total         371        70         441
              84.13%     15.87%     100.00%

Table 7

Prediction Ability of Individual Independent Variables

Since the logistic regression models did a poor job of predicting the probability of a no show, we examined the predictive ability of each independent variable by plotting visit status against it. We first report the plots for the significant predictors.

Reminder call status was selected by both logistic regression models as a good predictor. As shown in Fig. 1, when reminder call status has the value "confirmed" the no show rate is low, while for the other two values (especially "no answer") the no show rate is high. Thus reminder call status shows moderate predictive ability.

Fig. 1

Fig. 2 shows the distribution of the days variable. The percentage of no shows is lower than that of arrivals toward the low end of the days axis and higher than that of arrivals toward the high end. This reflects the moderate predictive ability of the days to appointment variable.

Fig. 2

The behavior of visit status with respect to age is shown in Fig. 3. Toward the younger end of the plot the no show percentage is smaller than that of arrivals, but between ages 30 and 45 the no show percentage is higher than that of arrivals.

Fig. 3

Figs. 4 to 8 show the variation of visit status with the remaining independent variables: patient status, previous visit status, visit type, clinic type and distance. These graphs illustrate that the value of each of these variables does not explain the variation in visit status in the way the good predictor variables above (reminder call status, days and age) do. In other words, the no show percentage stays low for every value of the independent variable.

Fig. 4, Fig. 5, Fig. 6, Fig. 7, Fig. 8

Discriminant Analysis

We also attempted a discriminant analysis in order to classify visit status. We investigated normal distribution based methods as well as nonparametric methods using PROC DISCRIM in SAS. The goal was to find the best subset of independent variables for accurately predicting whether or not a patient would fail to show up for a scheduled appointment. The predictive ability of each model was assessed using cross validation error rates; models with lower error rates are preferred. It is also important to keep in mind that this procedure deletes observations with any missing values. This limits the amount of available data and potentially harms the model's ability to predict when there is a large amount of missing data, as in this study.

Our first approach used the normal distribution method. This technique assumes that the independent variables follow a multivariate normal distribution. The independent variables in our analysis are a mix of continuous and categorical random variables, so the assumption of multivariate normality might not be appropriate. Nevertheless, this method is attractive because the discriminant function is available in the output. The discriminant function could then be used to classify new observations easily and determine which patients are likely not to show up for their appointment.

The best normal based discriminant analysis included five of the independent variables: days, reminder call status, distance, patient status and age. The cross validation error rates for this model are in Table 8.

Normal Based Discriminant Analysis Cross Validation Error Rates

                     Predicted
True          Arrived    No Show    Total
Arrived       292        28         320
              91.25%     8.75%      100.00%
No Show       108        32         140
              77.14%     22.86%     100.00%
Total         400        60         460
              86.96%     13.04%     100.00%

Table 8

Several conclusions can be drawn from these results. The model does a good job of classifying those who arrived for their appointment, misclassifying only 8.75% of these patients. However, it does a very poor job of classifying the patients who failed to show up. The model correctly classified only 22.86% of the patients whose true status was no show, so 77.14% of these no show patients were incorrectly predicted to arrive. This is the worst error we could make, since our goal is to identify patients who will not show up in order to target them for some kind of intervention. Unfortunately, these results indicate that the normal based discriminant analysis cannot adequately classify visit status.

Next we attempted a nonparametric discriminant analysis. This method is more flexible since it places no distributional assumptions on the independent variables. The downside is that one cannot obtain the discriminant function, so it is not very practical to use these results to classify new observations. The best nonparametric discriminant analysis used five of the independent variables: days, reminder call status, distance, patient status and age. In this case the age variable only marginally improves the model, but given the overall poor performance of the other methods we decided to include age even though it means we no longer have a parsimonious model. The cross validation error rates for this model are in Table 9.

Nonparametric Discriminant Analysis Cross Validation Error Rates

                     Predicted
True          Arrived    No Show    Total
Arrived       291        29         320
              90.94%     9.06%      100.00%
No Show       54         86         140
              38.57%     61.43%     100.00%
Total         345        115        460
              75.00%     25.00%     100.00%

Table 9

These results are a notable improvement over the normal based method. Once again the model does a good job of classifying the arrived patients, misclassifying only 9.06% of them. The nonparametric model correctly classified 61.43% of the no show patients. While this is a substantial improvement over the normal method, it is still not very useful. When the main goal is to predict no show patients, it is imperative to have a very low error rate for this category, and misclassifying about 4 out of every 10 no show patients is disappointing.

Univariate Analyses

Since the logistic regression and discriminant analysis results were unsatisfactory, we decided to examine the relationship of each independent variable individually with the visit status outcome. T tests were performed for the continuous variables to test whether there was a mean difference between no show and arrived patients. Chi-square tests of association were performed for each of the categorical variables. The results of these analyses can be found in Table 10.

              Arrival               No Show
              Mean      Std Err     Mean      Std Err     T Statistic   p value
Days          17.94     0.89        24.33     1.64        3.73          0.0002
Distance      9.81      0.76        6.28      0.65        2.90          0.0038
Age           40.40     0.56        38.58     0.71        1.94          0.0533

                          Chi square   p value
Reminder                  34.53        <0.0001
Previous Visit Status     0.11         0.9470
Patient Status            14.63        0.0001
Visit Type                3.52         0.0605
MedZou Clinic             0.76         0.3837

Table 10

From Table 10 we can see that four of the tests are significant at alpha = 0.05 (days, distance, reminder and patient status). Age and visit type are marginally significant, with p values below 0.10. The tests for previous visit status and clinic type are not significant at any reasonable alpha level, so there is insufficient evidence to conclude that previous visit status and clinic type are related to visit status.

The test for the days variable has a p value of 0.0002. We reject the null hypothesis that the mean number of days is the same for no show and arrived patients, and conclude that there is a statistically significant difference between the two means. The means show that patients who attend their appointments have a lower mean number of days from the date the appointment is scheduled until the date of the appointment. This is an intuitive result: the longer a patient has to wait, the more likely they are to forget the appointment or to have other important things arise that cause them not to show up.

Next we look more closely at the distance variable. This test was significant, with a p value of 0.0038, so we reject the null hypothesis that the mean distance is the same for the two groups of patients. The estimated means indicate that the patients who arrived for their appointments lived farther from the clinic on average, which is a slightly surprising result.

The final continuous variable is age. The test was marginally significant, with a p value of 0.0533, so there is some evidence that the no show patients and arrived patients have different mean ages. It appears that the mean age of the arrived patients is higher than that of the no show patients. It is important to note that this marginally significant difference might not be a practically significant one. The difference between the means is less than 2 years, which raises some questions about whether age can really be used to distinguish between no show and arrived patients.

The first of the significant categorical variables is the reminder call. The chi-square test indicates that there is sufficient evidence to conclude that the reminder call is associated with visit status. A look at the contingency table provides some insight. Table 11 gives the counts for each combination of visit status and reminder call status.

                  Arrived    No Show    Total
Left Message      78         60         138
Confirmed         237        68         305
No Answer         25         30         55
Total             340        158        498

Table 11

One can see from the table that for "left message" and "no answer" the counts are about even between arrivals and no shows. The real difference is in the "confirmed" row. Of the patients who confirmed their appointment, 77.7% did actually arrive. So a good indicator that a patient will arrive for an appointment is that they confirmed when given a reminder call.

Next we examine patient status. Once again the chi-square test indicates that there is a statistical association between patient status and arrival status. Table 12 is the contingency table for these variables.

                  Arrived    No Show    Total
New Patient       161        47         208
Return Patient    316        189        505
Total             477        236        713

Table 12

Table 12 indicates that the majority of both new patients and return patients did attend their scheduled appointments. Interestingly, a lower percentage of new patients than of return patients failed to show up for their appointments.
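The chi-square statistics reported for reminder call and patient status can be checked against the counts in Tables 11 and 12. The sketch below is a plain Python implementation of the Pearson chi-square test of association, not the SAS procedure the authors ran; the function names are hypothetical, and the closed-form p value helper is only valid for 1 or 2 degrees of freedom, which is all these two tables need.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def p_value(stat, dof):
    """Upper-tail chi-square probability, using closed forms that hold only for dof 1 or 2."""
    if dof == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if dof == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only covers 1 or 2 degrees of freedom")


# Rows: Left Message, Confirmed, No Answer; columns: Arrived, No Show (Table 11).
reminder = [[78, 60], [237, 68], [25, 30]]
# Rows: New Patient, Return Patient; columns: Arrived, No Show (Table 12).
patient = [[161, 47], [316, 189]]

for name, table in [("Reminder", reminder), ("Patient status", patient)]:
    stat, dof = chi_square(table)
    print(f"{name}: chi-square = {stat:.2f}, dof = {dof}, p = {p_value(stat, dof):.4g}")
```

The statistics computed this way should land close to the values in Table 10 (34.53 and 14.63). A small difference in the last digit is possible, for instance if the reported test was run on a slightly different set of observations than the one tabulated.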

Finally we consider the results for visit type. Here we are examining whether a full visit versus a not full visit is associated with arrival status. The counts for each combination can be found in Table 13.

                  Arrived    No Show    Total
Full Visit        452        233        685
Not full Visit    24         5          29
Total             476        238        714

Table 13

The first thing to notice in Table 13 is the low counts for the not full visit row. With only 5 patients failing to arrive for a not full visit, it is difficult to draw any conclusions or to use these results for classification. Also, the majority of full visit patients did arrive for their appointment. So this appears to be another example of a (marginally) statistically significant result that does not have much practical application.

Conclusions

The statistical methods used in this study fell victim to the high number of missing values in the data set. The large amount of incomplete data decreased power and made accurate prediction almost impossible. There are potential ways to remedy this issue that might be useful in a supplementary investigation. One possibility is to use imputation to fill in the missing values. The ideal solution would be to acquire a larger sample of complete observations. However, this might not be possible given that the clinic is staffed by volunteers with limited time and resources.
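The contrast between deleting incomplete observations, as the SAS procedures in this study did, and filling them in by imputation can be illustrated with a toy example. The sketch below is in Python with invented records; it uses simple mean imputation purely for illustration, which is an assumption, since the report does not say which imputation method it has in mind. Single mean imputation is known to understate variability, so a real follow-up would more likely use multiple imputation.

```python
from statistics import mean

# Hypothetical records: None marks a missing value. Field names follow the report's variables.
records = [
    {"age": 34, "days": 12, "distance": 5.0},
    {"age": None, "days": 30, "distance": 8.5},
    {"age": 51, "days": None, "distance": None},
    {"age": 45, "days": 7, "distance": 2.0},
]


def complete_cases(rows):
    """Listwise deletion: keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def mean_impute(rows):
    """Replace each missing numeric value with the mean of the observed values in its column."""
    columns = list(rows[0].keys())
    col_means = {c: mean(r[c] for r in rows if r[c] is not None) for c in columns}
    return [{c: (r[c] if r[c] is not None else col_means[c]) for c in columns} for r in rows]


kept = complete_cases(records)
imputed = mean_impute(records)
print(f"{len(records)} records, {len(kept)} complete cases, {len(imputed)} after imputation")
```

In this toy data set half of the records are lost to listwise deletion, which mirrors on a small scale the loss of 497 of 771 observations in the full logistic model.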

We employed only two methods out of a wide variety of statistical tools. Future research on this issue might benefit from using additional methods. Classification and regression trees would provide another way to predict observations. Also, a multinomial logistic regression would be able to model a status outcome that has more than two categories.

Even though our methods did not perform as well as we would have liked, there is still some useful information to come out of this study. Several of the independent variables did have statistically significant relationships with the visit status outcome. In particular, patients who did not show up for a scheduled appointment tended to have a larger average number of days between when the visit was scheduled and the date of the appointment. Also, the patients who confirmed their appointment during the reminder call were much more likely to actually arrive. Therefore, we would recommend giving additional reminder calls to patients who have a large number of days between the scheduling date and the appointment date. Moreover, it might be beneficial to make repeated reminder calls until the patient either confirms or indicates that they will be canceling.

The aim of this analysis was to develop a procedure that could be used to predict which patients will fail to show up for a scheduled appointment. To accomplish this goal we first fit a logistic regression model and then performed discriminant analysis. Unfortunately, neither of these methods classified patients with reasonable error rates. Obtaining a larger sample of complete observations, or handling the missing values in some other way, might provide the power needed to get useful results in a future study.
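The performance of the four classifiers on the category that matters most, true no shows, can be condensed from the classification tables (Tables 6 to 9). The short Python sketch below is an illustration using only the counts reported there; the dictionary layout and the function name `detection_rate` are hypothetical. Note that the report describes the discriminant analysis rates as cross validated and does not say so for the logistic regression rates, so the comparison is only indicative.

```python
# True no-show rows from Tables 6 to 9: (predicted arrived, predicted no show).
no_show_rows = {
    "Logistic, full model": (70, 25),
    "Logistic, best fit": (99, 38),
    "Normal based discriminant": (108, 32),
    "Nonparametric discriminant": (54, 86),
}


def detection_rate(missed, caught):
    """Percentage of true no shows that the model predicted as no shows."""
    return 100.0 * caught / (missed + caught)


rates = {name: detection_rate(*row) for name, row in no_show_rows.items()}
best = max(rates, key=rates.get)

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}% of no shows detected")
print(f"Highest detection rate: {best}")
```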