A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic

Report prepared for
Brandon Slama
Department of Health Management and Informatics
University of Missouri, Columbia

By
Tyler Cook
Chathuri Daluwatte

Under the direction of
Lori Thombs, Ph.D.
Director, Social Sciences Statistics Center
Department of Statistics
University of Missouri, Columbia

Executive Summary

The goal of this report is to identify variables and derive a model to predict the no show probability for scheduled appointments at a free health care clinic using an observed data set. To determine the probability of a no show we performed logistic regression, discriminant analysis and univariate analysis on the data set. However, the multivariate statistical methods (logistic regression and discriminant analysis) fell victim to the high number of missing values in the dataset. The large amount of incomplete data decreased the power of the analysis, so determining an accurate prediction model with reasonable error rates was almost impossible. Nevertheless, our results suggest that patients who did not show up for a scheduled appointment tended to have a larger average number of days between when the visit was scheduled and the date of the appointment. Also, the patients who confirmed their appointment during the reminder call were much more likely to actually arrive.

Goal of the Study

Health care is the provision of diagnosis, treatment and prevention of diseases. Health care systems are organizations established to meet these health needs in target populations; they are owned and operated by different entities under a variety of standards. Free clinics are health care systems whose services are provided to the public for free. The clinic of interest in this study is such a free clinic, run by volunteers and volunteer physicians. This is an observational study of no shows for scheduled appointments at this clinic. The goal of the study is to predict the probability of a no show using information about scheduled patients that is stored in the clinic database. Identifying such a model would help the clinic management to intervene and try to reduce the likelihood of no shows, thus saving valuable time of the volunteer staff and physicians.

Data Set

As the dataset provided was from an existing database, data cleaning was required prior to data analysis. The provided data set consisted of one dependent variable, visit status, which can take three values: Arrived, Cancelled and No show. Since the goal of the study was to model the probability of a no show, we removed the cancelled appointments from the data set. The data set included three continuous independent variables: age, distance to the clinic, and days from when the appointment was set up to the appointment date. The data set also had six categorical variables: previous visit status, patient status, visit type, reminder call status, clinic type and scheduled by.

Previous visit status had five levels as shown in table 1, but due to lack of data we removed the levels pending and rescheduled from the dataset.

Table 1

Reminder call status also had five levels; as with previous visit status, we removed the levels cancelled and rescheduled (table 2).

Table 2

Clinic type had four levels, but as shown in table 3, three of the clinic types (Diabetes care, Dermatology, MSK night) had very few observations compared to the MedZou Clinic. We therefore redefined the variable by combining all clinic types other than the MedZou Clinic into one level named Non MedZou.

Table 3

The variable visit type had five levels but, as with clinic type, the observations were concentrated in one level, full visit, as shown in table 4. Thus we redefined the variable to have two levels, full visit and non full visit.

Table 4

The variable patient status had two levels, new and return, which we used as is (table 5). The scheduled by variable, which identifies by number the person who scheduled the appointment, had too many levels with low counts at each level. Thus we decided not to use the scheduled by variable in our analysis.

Table 5
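The recoding steps described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`visit_status`, `clinic_type`, `visit_type`, `scheduled_by`), the sample records and the "Lab only" visit type are hypothetical and do not come from the clinic database.

```python
def clean_appointments(records):
    """Drop cancelled visits and collapse sparse levels (field names are hypothetical)."""
    cleaned = []
    for rec in records:
        if rec["visit_status"] == "Cancelled":
            continue  # only Arrived and No show are modeled
        row = dict(rec)
        # Collapse every clinic other than MedZou into a single level.
        row["clinic_type"] = "MedZou" if rec["clinic_type"] == "MedZou" else "Non MedZou"
        # Collapse every visit type other than a full visit into a single level.
        row["visit_type"] = "Full visit" if rec["visit_type"] == "Full visit" else "Non full visit"
        # Scheduled by has too many sparse levels, so it is not carried forward.
        row.pop("scheduled_by", None)
        cleaned.append(row)
    return cleaned


# Made-up records, for illustration only.
records = [
    {"visit_status": "Arrived", "clinic_type": "MedZou", "visit_type": "Full visit", "scheduled_by": 12},
    {"visit_status": "Cancelled", "clinic_type": "Dermatology", "visit_type": "Full visit", "scheduled_by": 7},
    {"visit_status": "No show", "clinic_type": "Diabetes care", "visit_type": "Lab only", "scheduled_by": 3},
]
cleaned = clean_appointments(records)
```

Collapsing levels this way trades detail for usable cell counts, which is the same reasoning the report gives for the Non MedZou and non full visit levels.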

Logistic Regression

In order to predict the probability of a no show in the dependent variable visit status, we attempted to fit a logistic regression model. We implemented the model using PROC GENMOD in SAS, and we first tried to fit a model with all the independent variables. The model suggested that the variables Reminder Call Status, Days and Age are significant predictors of the probability of a no show.

The predictive ability of the model was analyzed by calculating the error rates, using 0.5 as the threshold to determine whether the predicted probability suggests an arrival or a no show. Since the model predicts the probability of a no show, a predicted probability greater than 0.5 was treated as a predicted no show, while a probability less than or equal to 0.5 was treated as a predicted arrival. The results of the model are summarized in table 6. As is evident from table 6, the misclassified no show rate is 73.68%, which is unacceptably high.

PROC GENMOD deletes observations with missing values. Thus, even though the dataset we provided had 771 observations, 497 observations were not used in fitting this model due to missing values, which drastically degraded the model's ability to predict.
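The 0.5 threshold rule and the row-wise error rates described above can be sketched as follows. The function names and the probabilities are hypothetical; in the study the probabilities came from the fitted SAS model, which is not reproduced here.

```python
def classify(prob_no_show, threshold=0.5):
    """Predict 'No Show' only when the probability is strictly above the threshold."""
    return "No Show" if prob_no_show > threshold else "Arrived"


def error_table(true_labels, probs, threshold=0.5):
    """For each true status, return the percentage predicted as each status."""
    counts = {t: {"Arrived": 0, "No Show": 0} for t in ("Arrived", "No Show")}
    for truth, p in zip(true_labels, probs):
        counts[truth][classify(p, threshold)] += 1
    table = {}
    for truth, row in counts.items():
        total = sum(row.values())
        # Leave a row empty if no observation has this true status.
        table[truth] = {pred: 100.0 * n / total for pred, n in row.items()} if total else {}
    return table


# Made-up true statuses and predicted no show probabilities, for illustration only.
true_labels = ["Arrived", "Arrived", "Arrived", "Arrived",
               "No Show", "No Show", "No Show", "No Show"]
probs = [0.10, 0.30, 0.50, 0.70, 0.20, 0.40, 0.45, 0.90]
rates = error_table(true_labels, probs)
```

Here `rates["No Show"]["Arrived"]` plays the role of the misclassified no show rate quoted in the text. A threshold below 0.5 would flag more patients as likely no shows at the cost of more false alarms among arrivals; the report itself only used 0.5.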

Logistic Regression Prediction Error Rates for the Full Model

                 Predicted
True       Arrived   No Show   Total
Arrived    %         8.94%     %
No Show    %         26.32%    %
Total      %         14.96%    %

Table 6

By trying various combinations of variables in the logistic regression we identified the best model fit, which used the following seven variables: Age, Days to appointment, Distance, Patient Status, Visit Type, Reminder Call Status and Clinic type. This model selected Patient Status, Reminder Call Status and Age as significant predictors of the probability of a no show. The error rates of the model are reported in table 7, but its performance is not very different from that of the full model. By removing the variable Previous Visit Status we increased the number of observations used in the model by 167, but 330 observations are still unused due to missing values.
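The observation counts above follow from listwise deletion: of 771 observations, 771 - 497 = 274 were usable in the full model, and dropping Previous Visit Status raised that to 274 + 167 = 441, leaving 771 - 441 = 330 unused. The sketch below shows the same effect on a toy data set. The rows and field names are hypothetical, and `None` stands in for a missing value.

```python
def complete_cases(rows, variables):
    """Keep only rows with no missing (None) value among the chosen variables."""
    return [r for r in rows if all(r.get(v) is not None for v in variables)]


# Made-up rows, for illustration only.
rows = [
    {"age": 34, "days": 5, "reminder": "Confirmed", "previous_visit": "Arrived"},
    {"age": 51, "days": 20, "reminder": "No Answer", "previous_visit": None},
    {"age": None, "days": 12, "reminder": "Left Message", "previous_visit": None},
    {"age": 27, "days": 3, "reminder": "Confirmed", "previous_visit": None},
]

# With every variable in the model, a single missing field discards the row.
full = complete_cases(rows, ["age", "days", "reminder", "previous_visit"])
# Dropping the variable with the most missing values recovers rows.
reduced = complete_cases(rows, ["age", "days", "reminder"])
```

Counting complete cases for each candidate variable subset before fitting makes the trade-off between model size and usable sample explicit.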

Logistic Regression Prediction Error Rates for the Best Fit Model

                 Predicted
True       Arrived   No Show   Total
Arrived    %         10.53%    %
No Show    %         27.74%    %
Total      %         15.87%    %

Table 7

Prediction ability of individual independent variables

Since the logistic regression models did a poor job of predicting the probability of a no show, we analyzed the prediction ability of each independent variable by plotting the visit status against it. We first report the plots for the significant predictors.

Reminder call status was selected by both logistic regression models as a good predictor. As shown in Fig. 1, when reminder call status has the value confirmed the no show rate is low, while for the other two values (especially no answer) the no show rate is high. Thus reminder call status shows moderate prediction ability.

Fig. 1

Fig. 2 shows the distribution of the days variable: the percentage of no shows is lower than that of arrivals toward the fewer-days end of the graph, while it is higher than that of arrivals toward the more-days end. This reflects the moderate prediction ability of the days to appointment variable.

Fig. 2

Visit status behavior with respect to age is reported in Fig. 3, where you can see that towards the younger end of the plot the no show rate is comparatively smaller than arrivals but between the no show percentage is higher than arrivals.

Fig. 3

Figs. 4 to 8 show the variation of visit status with the independent variables patient status, previous visit status, visit type, clinic type and distance. In these graphs the value of the independent variable does not explain the variation in visit status the way the good predictor variables above (reminder call status, days and age) did. In other words, for every value of the independent variable, the no show percentage stays low.

Fig. 4

Fig. 5

Fig. 6

Fig. 7

Fig. 8

Discriminant Analysis

We also attempted a discriminant analysis in order to classify visit status (arrived or no show). We investigated normal distribution based methods as well as nonparametric methods using PROC DISCRIM in SAS. The goal was to find the best subset of independent variables able to accurately predict whether or not a patient would fail to show up to their scheduled appointment. The predictive ability of each model was assessed based on cross validation error rates; models with lower error rates are preferred. Also, it is important to keep in mind that this statistical procedure deletes observations with any missing values. This limits the amount of available data and potentially harms the model's ability to predict when there is a large amount of missing data, as in this study.

Our first approach used the normal distribution method. This technique assumes that the independent variables follow a multivariate normal distribution. The independent variables in our analysis are a mix of continuous and categorical random variables, so the assumption of multivariate normality might not be appropriate. Nevertheless, this method is attractive because the discriminant function is available in the output. This is desirable because the discriminant function could then be used to easily classify new observations and determine which patients are likely to not show up for their appointment.

The best normal based discriminant analysis included five of the independent variables: days, reminder, distance, patient status, and age. The cross validation error rates for this model are in table 8.
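To make the idea of a cross validation error rate concrete, here is a minimal sketch assuming leave-one-out cross validation: each observation is held out in turn, the classifier is built from the rest, and the held-out observation is classified. The classifier is a deliberately simple nearest-class-mean rule on a single variable, used only as a stand-in; it is not the discriminant function fitted in the study, and the data are made up.

```python
def nearest_mean_predict(train, x):
    """Assign x to the class whose training mean is closest (stand-in classifier)."""
    means = {}
    for label in {lab for _, lab in train}:
        values = [v for v, lab in train if lab == label]
        means[label] = sum(values) / len(values)
    # sorted() makes tie-breaking deterministic.
    return min(sorted(means), key=lambda lab: abs(x - means[lab]))


def loo_error_rates(data):
    """Leave-one-out error rate for each true class."""
    misses, totals = {}, {}
    for i, (x, truth) in enumerate(data):
        train = data[:i] + data[i + 1:]  # everything except observation i
        pred = nearest_mean_predict(train, x)
        totals[truth] = totals.get(truth, 0) + 1
        misses[truth] = misses.get(truth, 0) + (pred != truth)
    return {lab: misses[lab] / totals[lab] for lab in totals}


# Hypothetical (days to appointment, visit status) pairs, for illustration only.
data = [(2, "Arrived"), (4, "Arrived"), (6, "Arrived"), (30, "Arrived"),
        (20, "No Show"), (25, "No Show"), (28, "No Show"), (5, "No Show")]
rates = loo_error_rates(data)
```

Reporting the error rate separately for each true class, as the tables in this report do, matters here because a low overall error rate can hide a high error rate in the smaller no show class.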

Normal Based Discriminant Analysis Cross Validation Error Rates

                 Predicted
True       Arrived   No Show   Total
Arrived    %         8.75%     %
No Show    %         22.86%    %
Total      %         13.04%    %

Table 8

Several conclusions can be drawn from the results. This model does a good job of classifying those who arrived for their appointment, misclassifying only 8.75% of these patients. However, it does a very poor job of classifying the patients who failed to show up to their appointment: it correctly classified only 22.86% of the patients whose true status was no show. Therefore, 77.14% of these no show patients were incorrectly predicted to arrive. This is the worst error we could make, since our goal is to identify patients who will no show in order to target them for some kind of intervention. Unfortunately, these results indicate that the normal based discriminant analysis is not adequately able to classify visit status.

Next we attempted a nonparametric discriminant analysis. This method is more flexible since it places no distributional assumptions on the independent variables. The downside is that one cannot obtain the discriminant function, so it is not very practical to apply these results when classifying new observations. The best nonparametric discriminant analysis used five of the independent variables: days, reminder, distance, patient status, and age. In this case, the age variable only marginally improves the model, but given the overall poor performance of the other methods we decided to include age even though it means we no longer have a parsimonious model. The cross validation error results for this model are in table 9.

Nonparametric Discriminant Analysis Cross Validation Error Rates

                 Predicted
True       Arrived   No Show   Total
Arrived    %         9.06%     %
No Show    %         61.43%    %
Total      %         25.00%    %

Table 9

These results are a notable improvement over the normal based method. Once again the model does a good job classifying the arrived patients, misclassifying only 9.06% of them. The nonparametric model correctly classified 61.43% of the no show patients. While this is a substantial improvement over the normal method, it is still not very useful. When the main goal is to predict no show patients it is imperative to have a very low error rate for this category, and misclassifying about 4 out of every 10 no show patients is disappointing.

Univariate Analyses

Since the logistic regression and discriminant analysis results were unsatisfactory, we decided to examine each of the independent variables individually against the status outcome. T tests were performed on the continuous variables to test whether there was a mean difference between no show and arrived patients. Also, chi-square tests of association were performed for each of the categorical variables. The results of these analyses can be found in table 10.
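The two kinds of test statistic used in the univariate analyses can be sketched as below. The report does not say whether the t tests assumed equal variances, so the unequal-variance (Welch) form here is an assumption. The function names and data are hypothetical, and only the statistics are computed, not the p values.

```python
from math import sqrt


def welch_t(a, b):
    """Two-sample t statistic without assuming equal variances (Welch form)."""
    def mean_var(x):
        m = sum(x) / len(x)
        return m, sum((v - m) ** 2 for v in x) / (len(x) - 1)  # sample variance
    ma, va = mean_var(a)
    mb, vb = mean_var(b)
    return (ma - mb) / sqrt(va / len(a) + vb / len(b))


def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # count expected under independence
            stat += (observed - expected) ** 2 / expected
    return stat


# Made-up days to appointment for arrived and no show patients.
t_days = welch_t([2, 4, 6, 8], [10, 12, 14, 16])
# Made-up 2x2 table: rows are levels of a categorical variable,
# columns are Arrived and No Show counts.
chi_stat = chi_square([[30, 10], [20, 20]])
```

A negative `t_days` in this toy example means the first group (arrived) has the smaller mean number of days, which is the direction the report describes for the days variable.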

Mean Std Err Arrival No Show T Statistic p value
Days
Distance
Age

Chi square p value
Reminder <
Previous Visit Status
Patient Status
Visit Type
MedZou Clinic

Table 10

From table 10 we can see that four of the tests are significant at alpha=0.05 (days, distance, reminder, and patient status). Also, age and visit type are marginally significant with p values <0.10. The tests for previous visit status and clinic type are not significant at any reasonable alpha level. Therefore, there is insufficient evidence to conclude that previous visit status and clinic type are related to visit status.

The test for the days variable has a corresponding p value of We reject the null hypothesis that the mean days for no show and arrived patients are the same. We conclude that there is a statistically significant difference in the mean days of arrived patients and no show patients. By examining the means we can see that patients who attend their appointments have a lower mean number of days from the date the appointment is scheduled until the date of the appointment. This is an intuitive result. The longer a patient has to wait, the more likely they are to forget their appointment or have other important things arise, causing them to not show up.

Next we will look closer at the distance variable. This test was significant with p value so we reject the null hypothesis that the mean distance is the same for the two groups of patients. The estimates of the means indicate that the patients who arrived for their appointments had a larger mean distance from the clinic, which is a slightly surprising result.

The final continuous variable is age. The test was marginally significant with a p value of so there is some evidence that the no show patients and arrived patients have different mean ages. Moreover, it appears that the mean age of the arrived patients is higher than the mean age of the no show patients. It is important to note that this statistically significant difference might not be a practically significant difference. The difference between the means is less than 2 years, which raises some questions about whether age can really be used to distinguish between no show and arrived patients.

The first of the significant categorical variables is the reminder call. The chi-square test indicates that there is sufficient evidence to conclude that the reminder call is associated with visit status. A look at the contingency table provides some insight. Table 11 shows the total counts for each combination of visit status and reminder call.

               Arrived   No Show   Total
Left Message
Confirmed
No Answer
Total

Table 11

One can see from the table that for left message and no answer the counts are about even between arrived and no shows. The real difference that stands out is in the confirmed row. Of the patients who confirmed their appointment, 77.7% did actually arrive. So a good indicator that a patient will arrive for their appointment is that they confirmed when given a reminder call.

Next we will examine patient status. Once again the chi-square test indicates that there is some statistical association between patient status and arrival status. Table 12 is the contingency table for these variables:

                 Arrived   No Show   Total
New Patient
Return Patient
Total

Table 12

Table 12 indicates that the majority of both new patients and return patients did attend their scheduled appointments. Interestingly, a lower percentage of new patients than of return patients failed to show up for their appointments.

Finally we will consider the results for visit type. Here we are examining whether a full visit versus a not full visit is associated with arrival status. The counts for each combination can be found in table 13.

                 Arrived   No Show   Total
Full Visit
Not full Visit
Total

Table 13

The first thing to notice in table 13 is the low counts for the not full visit. With only 5 patients failing to arrive for a not full visit, it is difficult to draw any conclusions or to use these results for classification. Also, the majority of full visit patients did arrive for their appointment. So this appears to be another example of a statistically significant result that does not have much practical application.

Conclusions

The statistical methods used in this study fell victim to the high number of missing values in the dataset. The large amount of incomplete data decreased power and made accurate prediction almost impossible. There are potential ways to remedy this issue that might be useful in a supplementary investigation. One possibility is to use imputation to fill in the missing values. The ideal solution would be to acquire a larger sample of complete observations. However, this might not be possible given that the clinic is staffed by volunteers with limited time and resources.

We employed only two methods out of a wide variety of statistical tools. Future research on this issue might benefit from using additional methods. Classification and regression trees would provide another way to predict observations. Also, a multinomial logistic regression would be able to model a status outcome that has more than two categories.

Even though our methods did not perform as well as we would have liked, there is still some useful information to come out of this study. Several of the independent variables did have statistically significant relationships with the status outcome. In particular, patients who did not show up for a scheduled appointment tended to have a larger average number of days between when the visit was scheduled and the date of the appointment. Also, the patients who confirmed their appointment during the reminder call were much more likely to actually arrive. Therefore, when attempting to identify a patient as a potential no show, we would recommend giving additional reminder calls to patients who have a large number of days from the scheduled date until the appointment date. Moreover, it might be beneficial to make repeated reminder calls until the patient either confirms or indicates that they will be canceling.

The aim of this analysis was to develop a procedure that could be used to predict patients who will fail to show up for a scheduled appointment. To accomplish this goal we first fit a logistic regression model and then performed discriminant analysis. Unfortunately, neither of these methods provided satisfactory results that were able to classify patients with reasonable error rates. Obtaining a larger sample of complete observations or handling the missing values in some other way might provide the needed power to get useful results in a future study.