A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic


Report prepared for
Brandon Slama
Department of Health Management and Informatics
University of Missouri, Columbia

By
Tyler Cook
Chathuri Daluwatte

Under the direction of
Lori Thombs, Ph.D.
Director, Social Sciences Statistics Center
Department of Statistics
University of Missouri, Columbia

Executive Summary

The goal of this report is to identify variables and derive a model that predicts the probability of a no show at a free health care clinic, using an observed data set. To estimate this probability we performed logistic regression, discriminant analysis and univariate analyses on the data set. However, the multivariate methods (logistic regression and discriminant analysis) fell victim to the high number of missing values in the data set. The large amount of incomplete data decreased the power of the analysis, so determining an accurate prediction model with reasonable error rates was almost impossible. Nevertheless, our results suggest that patients who did not show up for a scheduled appointment tended to have a larger average number of days between the date the visit was scheduled and the date of the appointment. Also, patients who confirmed their appointment during the reminder call were much more likely to actually arrive.

Goal of the Study

Health care is the provision of diagnosis, treatment and prevention of disease. Health care systems are organizations established to meet these health needs in target populations; they are owned and operated by different entities and to varying standards. Free clinics are health care systems whose services are provided to the public at no charge. The clinic of interest in this study is such a free clinic, run by volunteers and volunteer physicians.

This is an observational study of no shows for scheduled appointments at that clinic. The goal is to predict the probability of a no show using information about scheduled patients that is stored in the clinic database. Such a model would help the clinic management to intervene and try to reduce the likelihood of no shows, thus saving valuable time for the volunteer staff and physicians.

Data Set

Because the data set came from an existing database, data cleaning was required before analysis. The data set had one dependent variable, visit status, which can take three values: Arrived, Cancelled and No show. Since the goal of the study was to model the probability of a no show, we removed the cancelled appointments from the data set, leaving visit status with two values. The data set included three continuous independent variables: age, distance to the clinic, and days from when the appointment was set up to the appointment date. The data set also had six categorical variables: previous visit status, patient status, visit type, reminder call status, clinic type and scheduled by.

Previous visit status had five levels, as shown in table 1, but due to lack of data we removed the levels pending and rescheduled from the data set.

Table 1

Reminder call status also had five levels; as with previous visit status, we removed two of them, cancelled and rescheduled (table 2).

Table 2

Clinic type had four levels, but as shown in table 3, three of the clinic types (Diabetes care, Dermatology, MSK night) had very few observations compared to the MedZou Clinic. We therefore redefined the variable by combining all clinic types other than the MedZou Clinic into one level named Non MedZou.

Table 3

The variable visit type had five levels but, as with clinic type, the observations were concentrated at one level, full visit, as shown in table 4. We therefore redefined the variable to have two levels, full visit and non full visit.

Table 4

The variable patient status had two levels, new and return, which we used as is (table 5). The scheduled by variable, which identifies by number the person who scheduled the appointment, had too many levels with low counts at each level, so we decided not to use it in our analysis.

Table 5
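The cleaning steps described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the record layout and the level label "Lab only" are hypothetical, since the report does not show the clinic's database schema or list the minor visit types.

```python
# Hypothetical appointment records. Field names and the "Lab only" level are
# illustrative assumptions, not the clinic's actual schema.
raw_rows = [
    {"visit_status": "Arrived", "clinic_type": "MedZou Clinic", "visit_type": "Full visit"},
    {"visit_status": "No show", "clinic_type": "Dermatology", "visit_type": "Full visit"},
    {"visit_status": "Cancelled", "clinic_type": "MedZou Clinic", "visit_type": "Full visit"},
    {"visit_status": "Arrived", "clinic_type": "Diabetes care", "visit_type": "Lab only"},
]


def clean(rows):
    """Drop cancelled visits and collapse sparse levels into two-level variables."""
    cleaned = []
    for row in rows:
        if row["visit_status"] == "Cancelled":
            continue  # cancelled appointments are excluded from the analysis
        new_row = dict(row)  # copy so the raw records stay untouched
        new_row["clinic_type"] = (
            "MedZou" if row["clinic_type"] == "MedZou Clinic" else "Non MedZou"
        )
        new_row["visit_type"] = (
            "Full visit" if row["visit_type"] == "Full visit" else "Non full visit"
        )
        cleaned.append(new_row)
    return cleaned


cleaned_rows = clean(raw_rows)
```

Collapsing rare levels in this way trades detail for stability: a level with only a handful of observations contributes little information but can make model estimates unreliable.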

Logistic Regression

To predict the probability of a no show in the dependent variable visit status, we fit a logistic regression model using PROC GENMOD in SAS. We first fit a model containing all of the independent variables. This model identified reminder call status, days and age as significant predictors of the probability of a no show.

We assessed the predictive ability of the model by calculating error rates with 0.5 as the classification threshold. Since the model predicts the probability of a no show, a predicted probability greater than 0.5 was treated as a predicted no show, and a probability less than or equal to 0.5 as a predicted arrival. The results are summarized in table 6. As table 6 shows, 73.68% of the true no shows were misclassified as arrivals, which is unacceptably high.

PROC GENMOD deletes observations with missing values. Thus, although the data set we provided had 771 observations, 497 of them were not used in fitting the model because of missing values, leaving only 274. This drastically degraded the model's ability to predict.
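The thresholding rule and the row-wise error rates described above can be illustrated with a short Python sketch. The fitted probabilities and true statuses below are made up for the example; they are not the study's data, and the function names are hypothetical.

```python
from collections import Counter

LABELS = ("Arrived", "No Show")


def classify(prob_no_show, threshold=0.5):
    """Predict a no show only when the probability is strictly above the threshold."""
    return "No Show" if prob_no_show > threshold else "Arrived"


def row_percentages(true_labels, predicted_probs, threshold=0.5):
    """Cross-tabulate true vs. predicted status as percentages of each true class."""
    counts = Counter(
        (truth, classify(prob, threshold))
        for truth, prob in zip(true_labels, predicted_probs)
    )
    table = {}
    for truth in LABELS:
        row_total = sum(counts[(truth, pred)] for pred in LABELS)
        table[truth] = {
            pred: (100.0 * counts[(truth, pred)] / row_total if row_total else 0.0)
            for pred in LABELS
        }
    return table


# Made-up example: four true arrivals followed by four true no shows.
true_status = ["Arrived"] * 4 + ["No Show"] * 4
fitted_prob = [0.10, 0.30, 0.50, 0.70, 0.20, 0.40, 0.60, 0.45]
rates = row_percentages(true_status, fitted_prob)
```

In this toy example `rates["No Show"]["Arrived"]` is the share of true no shows predicted to arrive, which corresponds to the misclassified no show rate discussed in the text.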

Logistic Regression Prediction Error Rates for the Full Model

True        Predicted Arrived   Predicted No Show   Total
Arrived     [missing]           8.94%               [missing]
No Show     73.68%              26.32%              [missing]
Total       [missing]           14.96%              [missing]

Table 6

By trying various combinations of variables in the logistic regression, we identified the best-fitting model, which used the following seven variables: age, days to appointment, distance, patient status, visit type, reminder call status and clinic type. This model selected patient status, reminder call status and age as significant predictors of the probability of a no show. The error rates of this model are reported in table 7; its performance is not very different from that of the full model. Removing the variable previous visit status increased the number of observations used in the model by 167, but 330 observations were still excluded due to missing values.
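The effect of deleting every observation that has a missing value (listwise deletion) can be sketched as follows. The rows are made up and the variable names are hypothetical; only the counts in the last four lines (771, 497, 167) come from the report.

```python
def complete_cases(rows, variables):
    """Keep only rows with no missing (None) value in any of the listed variables."""
    return [row for row in rows if all(row.get(v) is not None for v in variables)]


# Made-up rows; None marks a missing value.
rows = [
    {"age": 34, "days": 12, "reminder": "Confirmed", "previous_visit": "Arrived"},
    {"age": 51, "days": 3, "reminder": "Left Message", "previous_visit": None},
    {"age": None, "days": 20, "reminder": "No Answer", "previous_visit": "No Show"},
    {"age": 27, "days": 7, "reminder": "Confirmed", "previous_visit": None},
    {"age": 45, "days": 15, "reminder": None, "previous_visit": None},
]

full_model = ["age", "days", "reminder", "previous_visit"]
reduced_model = ["age", "days", "reminder"]  # drops the variable with most gaps

n_full = len(complete_cases(rows, full_model))
n_reduced = len(complete_cases(rows, reduced_model))

# The same bookkeeping with the counts reported in the text.
total_obs = 771
used_full = total_obs - 497                  # observations used by the full model
used_reduced = used_full + 167               # after dropping previous visit status
excluded_reduced = total_obs - used_reduced  # still lost to missing values
```

With the reported figures this gives 274 observations for the full model and 441 for the reduced model, leaving the 330 excluded observations mentioned above.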

Logistic Regression Prediction Error Rates for the Best Fit Model

True        Predicted Arrived   Predicted No Show   Total
Arrived     [missing]           10.53%              [missing]
No Show     [missing]           27.74%              [missing]
Total       [missing]           15.87%              [missing]

Table 7

Prediction ability of individual independent variables

Since the logistic regression models did a poor job of predicting the probability of a no show, we examined the predictive ability of each independent variable by plotting visit status against it. We first report the plots for the significant predictors. Reminder call status was selected as a significant predictor by both logistic regression models. As shown in Fig. 1, when reminder call status is confirmed the no show rate is low, while for the other two values (especially no answer) the no show rate is high. Thus reminder call status shows moderate predictive ability.

Fig. 1

Fig. 2 shows the distribution of the days variable. The percentage of no shows is lower than that of arrivals toward the fewer-days end of the graph and higher than that of arrivals toward the more-days end. This reflects the moderate predictive ability of the variable days to appointment.

Fig. 2

The behavior of visit status with respect to age is shown in Fig. 3. Toward the younger end of the plot the no show percentage is smaller than the arrival percentage, but between [ages missing] the no show percentage is higher than that of arrivals.

Fig. 3

Figs. 4 to 8 show the variation of visit status with the remaining independent variables: patient status, previous visit status, visit type, clinic type and distance. These graphs illustrate that the value of the independent variable does not explain the variation in visit status in the way the good predictors above (reminder call status, days and age) do. In other words, for all values of these independent variables, visit status shows a consistently low no show percentage.

Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Discriminant Analysis

We also attempted a discriminant analysis to classify arrival status. We investigated normal distribution based methods as well as nonparametric methods using PROC DISCRIM in SAS. The goal was to find the best subset of independent variables for accurately predicting whether a patient would fail to show up for a scheduled appointment. The predictive ability of each model was assessed by its cross validation error rates; models with lower error rates are preferred. It is important to keep in mind that this procedure also deletes observations with any missing values. This limits the amount of available data and can harm the model's ability to predict when, as in this study, a large amount of data is missing.

Our first approach used the normal distribution method. This technique assumes that the independent variables follow a multivariate normal distribution. The independent variables in our analysis are a mix of continuous and categorical variables, so the assumption of multivariate normality might not be appropriate. Nevertheless, this method is attractive because the discriminant function is available in the output and can then be used to easily classify new observations and determine which patients are likely not to show up for their appointment. The best normal based discriminant analysis included five of the independent variables: days, reminder, distance, patient status, and age. The cross validation error rates for this model are in table 8.
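Before turning to the results, the idea behind a normal based discriminant rule and its cross validation error rates can be sketched in Python. This is a deliberately reduced illustration, not the authors' analysis: it uses a single continuous predictor (days) with a pooled variance instead of the five variables named above, it assumes the cross validation is of the leave-one-out kind, and the data are made up.

```python
import math


def fit_lda_1d(xs, ys):
    """Estimate class priors, class means and a pooled variance for one predictor."""
    classes = sorted(set(ys))
    n = len(xs)
    stats = {}
    pooled_ss = 0.0
    for c in classes:
        values = [x for x, y in zip(xs, ys) if y == c]
        mean = sum(values) / len(values)
        pooled_ss += sum((v - mean) ** 2 for v in values)
        stats[c] = (len(values) / n, mean)  # (prior, class mean)
    pooled_var = pooled_ss / (n - len(classes))
    return stats, pooled_var


def predict_lda_1d(model, x):
    """Assign x to the class with the largest normal-theory discriminant score."""
    stats, pooled_var = model

    def score(c):
        prior, mean = stats[c]
        return math.log(prior) - (x - mean) ** 2 / (2.0 * pooled_var)

    return max(stats, key=score)


def loo_error_rates(xs, ys):
    """Leave-one-out: refit without observation i, then classify observation i."""
    wrong = {c: 0 for c in set(ys)}
    total = {c: 0 for c in set(ys)}
    for i in range(len(xs)):
        model = fit_lda_1d(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total[ys[i]] += 1
        if predict_lda_1d(model, xs[i]) != ys[i]:
            wrong[ys[i]] += 1
    return {c: wrong[c] / total[c] for c in total}


# Made-up, well separated data: days between scheduling and the appointment.
days = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0, 21.0, 22.0, 23.0, 24.0]
status = ["Arrived"] * 5 + ["No Show"] * 5
cv_rates = loo_error_rates(days, status)
```

Because each observation is classified by a model that never saw it, the resulting per-class error rates are less optimistic than rates computed on the training data itself, which is why they are the natural yardstick for comparing models here.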

Normal Based Discriminant Analysis Cross Validation Error Rates

True        Predicted Arrived   Predicted No Show   Total
Arrived     [missing]           8.75%               [missing]
No Show     77.14%              22.86%              [missing]
Total       [missing]           13.04%              [missing]

Table 8

Several conclusions can be drawn from the results. This model does a good job of classifying those who arrived for their appointment, misclassifying only 8.75% of these patients. However, it does a very poor job of classifying the patients who failed to show up. It correctly classified only 22.86% of the patients whose true status was no show; the other 77.14% of no show patients were incorrectly predicted to arrive. This is the worst error we could make, since our goal is to identify patients who will no show in order to target them for some kind of intervention. Unfortunately, these results indicate that the normal based discriminant analysis cannot adequately classify arrival status.

Next we attempted a nonparametric discriminant analysis. This method is more flexible since it places no distributional assumptions on the independent variables. The downside is that it does not yield a discriminant function, so it is not very practical to apply the results when classifying new observations. The best nonparametric discriminant analysis used five of the independent variables: days, reminder, distance, patient status, and age. In this case the age variable only marginally improves the model, but given the overall poor performance of the other methods we decided to include age even though it means the model is no longer parsimonious. The cross validation error rates for this model are in table 9.

Nonparametric Discriminant Analysis Cross Validation Error Rates

True        Predicted Arrived   Predicted No Show   Total
Arrived     [missing]           9.06%               [missing]
No Show     [missing]           61.43%              [missing]
Total       [missing]           25.00%              [missing]

Table 9

These results are a notable improvement over the normal based method. Once again the model does a good job of classifying the arrived patients, misclassifying only 9.06% of them. The nonparametric model correctly classified 61.43% of the no show patients. While this is a substantial improvement over the normal method, it is still not very useful. When the main goal is to predict no show patients, it is imperative to have a very low error rate for this category, and misclassifying about 4 out of every 10 no show patients is disappointing.

Univariate Analyses

Since the logistic regression and discriminant analysis results were unsatisfactory, we decided to examine each of the independent variables individually against the status outcome. T tests were performed on the continuous variables to test whether there was a mean difference between no show and arrived patients, and chi squared tests of association were performed for each of the categorical variables. The results of these analyses can be found in table 10.

[Table 10: for the continuous variables days, distance and age, the mean and standard error for arrival and no show, the t statistic and the p value; for the categorical variables reminder, previous visit status, patient status, visit type and MedZou clinic, the chi square statistic and the p value]

Table 10

From table 10 we can see that four of the tests are significant at alpha = 0.05 (days, distance, reminder, and patient status). Also, age and visit type are marginally significant with p values < 0.10. The tests for previous visit status and clinic type are not significant at any reasonable alpha level, so there is insufficient evidence to conclude that previous visit status and clinic type are related to arrival status.

The test for the days variable has a corresponding p value of [value missing]. We reject the null hypothesis that the mean days for no show and arrived patients are the same, and conclude that there is a statistically significant difference between the mean days of arrived patients and no show patients. By examining the means we can see that patients who attend their appointments have a lower mean number of days from the date the appointment is scheduled until the date of the appointment. This is an intuitive result: the longer a patient has to wait, the more likely they are to forget the appointment or to have other important things arise that keep them from showing up.

Next we look more closely at the distance variable. This test was significant with p value [value missing], so we reject the null hypothesis that the mean distance is the same for the two groups of patients. The estimates of the means indicate that the patients who arrived for their appointments had a larger mean distance from the clinic, which is a slightly surprising result.

The final continuous variable is age. The test was marginally significant with a p value of [value missing], so there is some evidence that the no show patients and arrived patients have different mean ages. Moreover, it appears that the mean age of the arrived patients is higher than the mean age of the no show patients. It is important to note that this marginally significant difference might not be a practically significant one. The difference between the means is less than 2 years, which raises some questions about whether age can really be used to distinguish between no show and arrived patients.

The first of the significant categorical variables is the reminder call. The chi square test indicates that there is sufficient evidence to conclude that the reminder call is associated with arrival status. A look at the contingency table provides some insight. Table 11 gives the counts for each combination of arrival status and reminder call.

[Table 11: counts of Arrived, No Show and Total for the rows Left Message, Confirmed, No Answer and Total]

Table 11

One can see from the table that for left message and no answer the counts are about even between arrived and no shows. The real difference is in the confirmed row: of the patients who confirmed their appointment, 77.7% did actually arrive. So a good indicator that a patient will arrive for their appointment is that they confirmed when given a reminder call.

Next we examine patient status. Once again the chi square test indicates that there is some statistical association between patient status and arrival status. The contingency table for these variables is table 12:

[Table 12: counts of Arrived, No Show and Total for the rows New Patient, Return Patient and Total]

Table 12

Table 12 indicates that the majority of both new patients and return patients did attend their scheduled appointments. Interestingly, a lower percentage of new patients than of return patients failed to show up for their appointments.
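The two kinds of univariate test used in this section can be sketched in plain Python. Several choices here are assumptions rather than details from the report: the t statistic uses Welch's unequal-variance form (the report does not say which form was used), the p value helper covers only 1 or 2 degrees of freedom, where a closed form exists, and all numbers are made up.

```python
import math


def welch_t(sample_a, sample_b):
    """Welch two-sample t statistic and its approximate degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / na
    mean_b = sum(sample_b) / nb
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (nb - 1)
    se2 = var_a / na + var_b / nb  # squared standard error of the difference
    t = (mean_a - mean_b) / math.sqrt(se2)
    df = se2 ** 2 / ((var_a / na) ** 2 / (na - 1) + (var_b / nb) ** 2 / (nb - 1))
    return t, df


def chi_square(table):
    """Pearson chi square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi_square_p_value(stat, df):
    """Upper-tail probability, using the closed forms for df = 1 and df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


# Made-up days to appointment for arrived and no show patients.
t_stat, t_df = welch_t([2.0, 4.0, 6.0, 8.0], [10.0, 12.0, 14.0, 16.0])

# Made-up counts; rows: Left Message, Confirmed, No Answer; columns: Arrived, No Show.
reminder_counts = [[10, 10], [30, 10], [10, 10]]
chi_stat, chi_df = chi_square(reminder_counts)
chi_p = chi_square_p_value(chi_stat, chi_df)
```

A negative `t_stat` in this toy example means the arrived group has the smaller mean, which mirrors the direction reported for the days variable. For a three-by-two table such as reminder call by arrival status, the test has 2 degrees of freedom.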

Finally we consider the results for visit type. Here we examine whether a full visit versus a not full visit is associated with arrival status. The counts for each combination can be found in table 13.

[Table 13: counts of Arrived, No Show and Total for the rows Full Visit, Not full Visit and Total]

Table 13

The first thing to notice in table 13 is the low counts for the not full visit. With only 5 patients failing to arrive for a not full visit, it is difficult to draw any conclusions or to use these results for classification. Also, the majority of full visit patients did arrive for their appointment. So this appears to be another example of a marginally significant result that does not have much practical application.

Conclusions

The statistical methods used in this study fell victim to the high number of missing values in the data set. The large amount of incomplete data decreased power and made accurate prediction almost impossible. There are potential remedies that might be useful in a supplementary investigation. One possibility is to use imputation to fill in the missing values. The ideal solution would be to acquire a larger sample of complete observations. However, this might not be possible given that the clinic is staffed by volunteers with limited time and resources.
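As a rough illustration of the imputation idea, the sketch below fills missing continuous values with the observed mean and missing categorical values with the most frequent observed level. This is the simplest possible form and is shown only as an assumption-laden example: the report does not specify an imputation method, the variable names and rows are made up, and single imputation of this kind tends to understate uncertainty, so a method such as multiple imputation would usually be preferred in practice.

```python
from collections import Counter


def impute(rows, continuous, categorical):
    """Fill None values: mean for continuous variables, mode for categorical ones."""
    filled = [dict(row) for row in rows]  # copy so the input stays unchanged
    for var in continuous:
        observed = [r[var] for r in rows if r[var] is not None]
        mean = sum(observed) / len(observed)
        for r in filled:
            if r[var] is None:
                r[var] = mean
    for var in categorical:
        observed = [r[var] for r in rows if r[var] is not None]
        mode = Counter(observed).most_common(1)[0][0]
        for r in filled:
            if r[var] is None:
                r[var] = mode
    return filled


# Made-up rows; None marks a missing value.
incomplete = [
    {"age": 30, "reminder": "Confirmed"},
    {"age": None, "reminder": "Confirmed"},
    {"age": 50, "reminder": None},
]
completed = impute(incomplete, continuous=["age"], categorical=["reminder"])
```

After imputation every row is usable, so procedures that delete incomplete observations would no longer discard them.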

We employed only two methods out of a wide variety of statistical tools. Future research on this issue might benefit from using additional methods. Classification and regression trees would provide another way to predict observations. Also, a multinomial logistic regression would be able to model a status outcome that has more than two categories.

Even though our methods did not perform as well as we would have liked, some useful information did come out of this study. Several of the independent variables had statistically significant relationships with the status outcome. In particular, patients who did not show up for a scheduled appointment tended to have a larger average number of days between the date the visit was scheduled and the date of the appointment. Also, the patients who confirmed their appointment during the reminder call were much more likely to actually arrive. Therefore, when attempting to identify a patient as a potential no show, we would recommend giving additional reminder calls to patients who have a large number of days between the scheduling date and the appointment date. Moreover, it might be beneficial to make repeated reminder calls until the patient either confirms or indicates that they will be cancelling.

The aim of this analysis was to develop a procedure that could be used to predict which patients will fail to show up for a scheduled appointment. To accomplish this goal we first fit a logistic regression model and then performed a discriminant analysis. Unfortunately, neither method was able to classify patients with reasonable error rates. Obtaining a larger sample of complete observations, or handling the missing values in some other way, might provide the power needed to get useful results in a future study.

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

More information

13.2 The Chi Square Test for Homogeneity of Populations The setting: Used to compare distribution of proportions in two or more populations.

13.2 The Chi Square Test for Homogeneity of Populations The setting: Used to compare distribution of proportions in two or more populations. 13.2 The Chi Square Test for Homogeneity of Populations The setting: Used to compare distribution of proportions in two or more populations. Data is organized in a two way table Explanatory variable (Treatments)

More information

Statistical matching: Experimental results and future research questions

Statistical matching: Experimental results and future research questions Statistical matching: Experimental results and future research questions 2015 19 Ton de Waal Content 1. Introduction 4 2. Methods for statistical matching 5 2.1 Introduction to statistical matching 5 2.2

More information

Research Methods & Experimental Design

Research Methods & Experimental Design Research Methods & Experimental Design 16.422 Human Supervisory Control April 2004 Research Methods Qualitative vs. quantitative Understanding the relationship between objectives (research question) and

More information

How to Conduct a Hypothesis Test

How to Conduct a Hypothesis Test How to Conduct a Hypothesis Test The idea of hypothesis testing is relatively straightforward. In various studies we observe certain events. We must ask, is the event due to chance alone, or is there some

More information

Inferential Statistics

Inferential Statistics Inferential Statistics Sampling and the normal distribution Z-scores Confidence levels and intervals Hypothesis testing Commonly used statistical methods Inferential Statistics Descriptive statistics are

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Handling missing data in Stata a whirlwind tour

Handling missing data in Stata a whirlwind tour Handling missing data in Stata a whirlwind tour 2012 Italian Stata Users Group Meeting Jonathan Bartlett www.missingdata.org.uk 20th September 2012 1/55 Outline The problem of missing data and a principled

More information

Statistics and research

Statistics and research Statistics and research Usaneya Perngparn Chitlada Areesantichai Drug Dependence Research Center (WHOCC for Research and Training in Drug Dependence) College of Public Health Sciences Chulolongkorn University,

More information

Fairfield Public Schools

Fairfield Public Schools Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity

More information

II. DISTRIBUTIONS distribution normal distribution. standard scores

II. DISTRIBUTIONS distribution normal distribution. standard scores Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,

More information

TRANSCRIPT: In this lecture, we will talk about both theoretical and applied concepts related to hypothesis testing.

TRANSCRIPT: In this lecture, we will talk about both theoretical and applied concepts related to hypothesis testing. This is Dr. Chumney. The focus of this lecture is hypothesis testing both what it is, how hypothesis tests are used, and how to conduct hypothesis tests. 1 In this lecture, we will talk about both theoretical

More information

Paper Let the Data Speak: New Regression Diagnostics Based on Cumulative Residuals

Paper Let the Data Speak: New Regression Diagnostics Based on Cumulative Residuals Paper 255-28 Let the Data Speak: New Regression Diagnostics Based on Cumulative Residuals Gordon Johnston and Ying So SAS Institute Inc. Cary, North Carolina, USA Abstract Residuals have long been used

More information

Hypothesis Testing: General Framework 1 1

Hypothesis Testing: General Framework 1 1 Hypothesis Testing: General Framework Lecture 2 K. Zuev February 22, 26 In previous lectures we learned how to estimate parameters in parametric and nonparametric settings. Quite often, however, researchers

More information

SPSS: Descriptive and Inferential Statistics. For Windows

SPSS: Descriptive and Inferential Statistics. For Windows For Windows August 2012 Table of Contents Section 1: Summarizing Data...3 1.1 Descriptive Statistics...3 Section 2: Inferential Statistics... 10 2.1 Chi-Square Test... 10 2.2 T tests... 11 2.3 Correlation...

More information

The Chi-Square Test. STAT E-50 Introduction to Statistics

The Chi-Square Test. STAT E-50 Introduction to Statistics STAT -50 Introduction to Statistics The Chi-Square Test The Chi-square test is a nonparametric test that is used to compare experimental results with theoretical models. That is, we will be comparing observed

More information

BIOS 665: Analysis of Categorical Data

BIOS 665: Analysis of Categorical Data BIOS 665: Analysis of Categorical Data Course Syllabus Fall 2016 Meeting Times Lecture: Tuesdays & Thursdays, 11:00am-12:15pm, Michael Hooker Research Center 0001 Recitation Session Hours: Tuesdays 3:30-4:30pm,

More information

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma siersma@sund.ku.dk The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association

More information

Logistic Regression. Introduction. The Purpose Of Logistic Regression

Logistic Regression. Introduction. The Purpose Of Logistic Regression Logistic Regression...1 Introduction...1 The Purpose Of Logistic Regression...1 Assumptions Of Logistic Regression...2 The Logistic Regression Equation...3 Interpreting Log Odds And The Odds Ratio...4

More information

Chi Square Analysis. When do we use chi square?

Chi Square Analysis. When do we use chi square? Chi Square Analysis When do we use chi square? More often than not in psychological research, we find ourselves collecting scores from participants. These data are usually continuous measures, and might

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

Programme du parcours Clinical Epidemiology 2014-2015. UMR 1. Methods in therapeutic evaluation A Dechartres/A Flahault

Programme du parcours Clinical Epidemiology 2014-2015. UMR 1. Methods in therapeutic evaluation A Dechartres/A Flahault Programme du parcours Clinical Epidemiology 2014-2015 UR 1. ethods in therapeutic evaluation A /A Date cours Horaires 15/10/2014 14-17h General principal of therapeutic evaluation (1) 22/10/2014 14-17h

More information

Logistic regression diagnostics

Logistic regression diagnostics Logistic regression diagnostics Biometry 755 Spring 2009 Logistic regression diagnostics p. 1/28 Assessing model fit A good model is one that fits the data well, in the sense that the values predicted

More information

Addressing Analytics Challenges in the Insurance Industry. Noe Tuason California State Automobile Association

Addressing Analytics Challenges in the Insurance Industry. Noe Tuason California State Automobile Association Addressing Analytics Challenges in the Insurance Industry Noe Tuason California State Automobile Association Overview Two Challenges: 1. Identifying High/Medium Profit who are High/Low Risk of Flight Prospects

More information

Predicting Defaults of Loans using Lending Club s Loan Data

Predicting Defaults of Loans using Lending Club s Loan Data Predicting Defaults of Loans using Lending Club s Loan Data Oleh Dubno Fall 2014 General Assembly Data Science Link to my Developer Notebook (ipynb) - http://nbviewer.ipython.org/gist/odubno/0b767a47f75adb382246

More information

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate

More information

Variables and Data A variable contains data about anything we measure. For example; age or gender of the participants or their score on a test.

Variables and Data A variable contains data about anything we measure. For example; age or gender of the participants or their score on a test. The Analysis of Research Data The design of any project will determine what sort of statistical tests you should perform on your data and how successful the data analysis will be. For example if you decide

More information

Logistic Regression With SAS

Logistic Regression With SAS Logistic Regression With SAS Please read my introductory handout on logistic regression before reading this one. The introductory handout can be found at. Run the program LOGISTIC.SAS from my SAS programs

More information

Binary Logistic Regression
