A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic
- Stewart Fowler
Report prepared for
Brandon Slama
Department of Health Management and Informatics
University of Missouri, Columbia

By
Tyler Cook
Chathuri Daluwatte

Under the direction of
Lori Thombs, Ph.D.
Director, Social Sciences Statistics Center
Department of Statistics
University of Missouri, Columbia
Executive Summary

The goal of this report is to identify variables and derive a model that predicts the probability of a no show at a free health care clinic, using an observed data set. To estimate this probability we performed logistic regression, discriminant analysis and univariate analyses on the data set. However, the multivariate methods (logistic regression and discriminant analysis) fell victim to the high number of missing values in the data set. The large amount of incomplete data decreased the power of the analysis, so determining an accurate prediction model with reasonable error rates was almost impossible. Nevertheless, our results suggest that patients who did not show up for a scheduled appointment tended to have a larger average number of days between the date the visit was scheduled and the date of the appointment. Also, patients who confirmed their appointment during the reminder call were much more likely to actually arrive.
Goal of the Study

Health care is the provision of diagnosis, treatment and prevention of disease. Health care systems are organizations established to meet these health needs in target populations; they are owned and operated by different entities and to a variety of standards. Free clinics are health care systems whose services are provided to the public for free. The clinic of interest in this study is one such free clinic, run by volunteers and volunteer physicians. This is an observational study of no shows for scheduled appointments at that clinic. The goal of the study is to predict the probability of a no show using the information about scheduled patients that is stored in the clinic database. Such a model would help the clinic management to intervene and try to reduce the likelihood of no shows, thus saving valuable time for the volunteer staff and physicians.

Data Set

Because the data set came from an existing database, data cleaning was required prior to analysis. The data set contained one dependent variable, visit status, which can take three values: Arrived, Cancelled and No show. Since the goal of the study was to model the probability of a no show, we removed the cancelled appointments from the data set, leaving visit status with two values. The data set included three continuous independent variables: age, distance to the clinic, and days from the date the appointment was set up to the appointment date. The data set also had six categorical
variables: previous visit status, patient status, visit type, reminder call status, clinic type and scheduled by. Previous visit status had five levels, as shown in table 1, but due to lack of data we removed the levels pending and rescheduled from the data set.

Table 1

Reminder call status also had five levels; as with previous visit status, we removed the levels cancelled and rescheduled (table 2).

Table 2

Clinic type had four levels, but as shown in table 3, three of the clinic types (Diabetes care, Dermatology, MSK night) had very few observations compared to the MedZou Clinic. We therefore redefined the variable by combining all clinic types other than the MedZou Clinic into one level named Non MedZou.
Table 3

The variable visit type had five levels but, like clinic type, had a very high frequency at one level, full visit, as shown in table 4. We therefore redefined the variable to have two levels, full visit and non full visit.

Table 4

The variable patient status had two levels, new and return, which we used as is (table 5). The scheduled by variable, which uses a number to identify the person who scheduled the appointment, had too many levels with low counts at each level, so we decided not to use it in our analysis.

Table 5
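The cleaning steps described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`visit_status`, `clinic_type`, `visit_type`, `scheduled_by`), the level name "Lab Only" and the sample records are hypothetical, and the analysis in this report was carried out in SAS rather than Python.

```python
# Hypothetical records; field names and values are illustrative assumptions,
# not the clinic's actual database schema.
appointments = [
    {"visit_status": "Arrived", "clinic_type": "MedZou Clinic",
     "visit_type": "Full Visit", "scheduled_by": 12},
    {"visit_status": "Cancelled", "clinic_type": "Dermatology",
     "visit_type": "Full Visit", "scheduled_by": 7},
    {"visit_status": "No Show", "clinic_type": "Diabetes Care",
     "visit_type": "Lab Only", "scheduled_by": 3},
    {"visit_status": "Arrived", "clinic_type": "MSK Night",
     "visit_type": "Full Visit", "scheduled_by": 12},
]


def clean(records):
    """Drop cancelled visits and collapse sparse categorical levels."""
    cleaned = []
    for rec in records:
        if rec["visit_status"] == "Cancelled":
            continue  # the outcome is modeled as Arrived vs. No Show only
        row = dict(rec)  # copy so the input records are left untouched
        # Collapse the low-count clinic types into a single level.
        if row["clinic_type"] != "MedZou Clinic":
            row["clinic_type"] = "Non MedZou"
        # Collapse visit type into full visit vs. everything else.
        if row["visit_type"] != "Full Visit":
            row["visit_type"] = "Non Full Visit"
        # Scheduled-by has too many sparse levels, so it is dropped.
        row.pop("scheduled_by", None)
        cleaned.append(row)
    return cleaned


cleaned = clean(appointments)
```

Collapsing sparse levels trades detail for stability: a level with only a handful of observations contributes little information but can make model estimates unreliable.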
Logistic Regression

To predict the probability of a no show in the dependent variable visit status, we fit a logistic regression model using PROC GENMOD in SAS. We first fit a model containing all of the independent variables. This model suggested that reminder call status, days and age are significant predictors of the probability of a no show. We assessed the predictive ability of the model by calculating error rates, using 0.5 as the threshold for deciding whether a predicted probability indicates an arrival or a no show. Since the model predicts the probability of a no show, a predicted probability greater than 0.5 was treated as a predicted no show, while a probability less than or equal to 0.5 was treated as a predicted arrival. The results are summarized in table 6. As is evident from table 6, the misclassification rate for no shows is 73.68%, which is unacceptably high. PROC GENMOD deletes observations with missing values. Thus, although the data set we supplied had 771 observations, 497 of them were not used in fitting the model because of missing values, which drastically degraded the model's ability to predict.
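The thresholding rule and the row-wise error rates described above can be illustrated with a short Python sketch. The function names and the predicted probabilities below are hypothetical and are not output from the fitted SAS model; the sketch only shows how a 0.5 cutoff turns probabilities into the kind of table reported in this section.

```python
def classify(prob_no_show, threshold=0.5):
    """Predict 'No Show' only when the probability exceeds the threshold."""
    return "No Show" if prob_no_show > threshold else "Arrived"


def error_table(true_labels, probs, threshold=0.5):
    """For each true status, the percentage predicted into each class."""
    counts = {t: {"Arrived": 0, "No Show": 0} for t in ("Arrived", "No Show")}
    for truth, p in zip(true_labels, probs):
        counts[truth][classify(p, threshold)] += 1
    table = {}
    for truth, row in counts.items():
        total = sum(row.values())
        # Leave the row empty if a class has no observations.
        table[truth] = (
            {pred: 100.0 * n / total for pred, n in row.items()} if total else {}
        )
    return table


# Hypothetical predicted probabilities of a no show, not values from the study.
true_labels = ["Arrived"] * 4 + ["No Show"] * 4
probs = [0.10, 0.30, 0.50, 0.70, 0.20, 0.40, 0.45, 0.80]
rates = error_table(true_labels, probs)
```

In this made-up example three of the four true no shows receive a probability at or below 0.5 and are therefore predicted to arrive, which is the same type of error that dominates table 6.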
Logistic Regression Prediction Error Rates for the Full Model

                 Predicted
True         Arrived    No Show    Total
Arrived      %          8.94%      %
No Show      %          26.32%     %
Total        %          14.96%     %

Table 6

By trying various combinations of variables in the logistic regression we identified the best fitting model, which used the following seven variables: age, days to appointment, distance, patient status, visit type, reminder call status and clinic type. In this model patient status, reminder call status and age were significant predictors of the probability of a no show. The error rates of the model are reported in table 7; its performance is not very different from that of the full model. Removing the variable previous visit status increased the number of observations used in the model by 167, but 330 observations were still excluded because of missing values.
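The effect of dropping a variable on the number of usable observations can be seen by counting complete cases for each candidate set of variables. The sketch below is a minimal illustration with hypothetical field names and records (`None` marks a missing value); it mimics listwise deletion in general and is not the SAS procedure itself.

```python
def complete_cases(records, variables):
    """Count records with no missing (None) value among the chosen variables."""
    return sum(
        all(rec.get(v) is not None for v in variables) for rec in records
    )


# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "days": 5, "reminder": "Confirmed", "previous_visit": "Arrived"},
    {"age": 51, "days": 12, "reminder": "No Answer", "previous_visit": None},
    {"age": None, "days": 3, "reminder": "Confirmed", "previous_visit": "No Show"},
    {"age": 27, "days": 20, "reminder": None, "previous_visit": None},
    {"age": 45, "days": 8, "reminder": "Left Message", "previous_visit": None},
]

full_model = ["age", "days", "reminder", "previous_visit"]
reduced_model = ["age", "days", "reminder"]

n_full = complete_cases(records, full_model)
n_reduced = complete_cases(records, reduced_model)
```

Here the variable with the most missing values is the one whose removal recovers the most observations, which mirrors the gain reported above when previous visit status was left out.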
Logistic Regression Prediction Error Rates for the Best Fit Model

                 Predicted
True         Arrived    No Show    Total
Arrived      %          10.53%     %
No Show      %          27.74%     %
Total        %          15.87%     %

Table 7

Prediction ability of individual independent variables

Since the logistic regression models did a poor job of predicting the probability of a no show, we examined the predictive ability of each independent variable by plotting visit status against it. We first report the plots for the significant predictors. Reminder call status was selected as a good predictor by both logistic regression models. As shown in Fig. 1, when reminder call status is confirmed the no show rate is low, while for the other two values (especially no answer) the no show rate is high. Thus reminder call status shows moderate predictive ability.
Fig. 1

Fig. 2 shows the distribution of the days variable. The percentage of no shows is lower than the percentage of arrivals toward the end of the graph with fewer days, while the percentage of no shows is higher than the percentage of arrivals toward the end with more days. This indicates that the variable days to appointment has moderate predictive ability.

Fig. 2
The behavior of visit status with respect to age is shown in Fig. 3, where you can see that toward the younger end of the plot the no show rate is comparatively smaller than arrivals but between the no show percentage is higher than arrivals.

Fig. 3

Figs. 4 to 8 show the variation of visit status with the remaining independent variables: patient status, previous visit status, visit type, clinic type and distance. These graphs illustrate that the value of the independent variable does not describe the variation in visit status in the way we saw earlier for the good predictor variables (reminder call status, days and age). In other words, for every value of the independent variable, visit status has a low no show percentage.
Fig. 4, Fig. 5, Fig. 6, Fig. 7, Fig. 8
Discriminant Analysis

We also attempted a discriminant analysis in order to classify visit status. We investigated normal distribution based methods as well as nonparametric methods using PROC DISCRIM in SAS. The goal was to find the best subset of independent variables for accurately predicting whether or not a patient would fail to show up for a scheduled appointment. The predictive ability of each model was assessed using cross validation error rates; models with lower error rates are preferred. It is also important to keep in mind that this procedure deletes observations with any missing values. This limits the amount of available data and potentially harms the model's ability to predict when there is a large amount of missing data, as in this study.

Our first approach used the normal distribution method. This technique assumes that the independent variables follow a multivariate normal distribution. The independent variables in our analysis are a mix of continuous and categorical random variables, so the assumption of multivariate normality might not be appropriate. Nevertheless, this method is attractive because the discriminant function is available in the output. The discriminant function could then be used to classify new observations easily and determine which patients are likely not to show up for their appointment. The best normal based discriminant analysis included five of the independent variables: days, reminder, distance, patient status and age. The cross validation error rates for this model are in table 8.
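To make the idea of cross validation error rates for a normal based discriminant rule concrete, the following Python sketch uses a deliberately simplified setting: a single continuous predictor, a pooled variance, equal prior probabilities and leave-one-out cross validation. All of these choices, the function names and the data values are assumptions made for illustration; the analysis in this report used several predictors in PROC DISCRIM.

```python
import math


def fit_lda(xs, ys):
    """Estimate class means and a pooled variance for one continuous predictor."""
    classes = sorted(set(ys))
    means = {}
    ss = 0.0
    for c in classes:
        vals = [x for x, y in zip(xs, ys) if y == c]
        means[c] = sum(vals) / len(vals)
        ss += sum((v - means[c]) ** 2 for v in vals)
    pooled_var = ss / (len(xs) - len(classes))
    return means, pooled_var


def predict_lda(x, means, pooled_var, priors):
    """Assign x to the class with the largest linear discriminant score."""
    def score(c):
        m = means[c]
        return x * m / pooled_var - m * m / (2 * pooled_var) + math.log(priors[c])
    return max(means, key=score)


def loo_error_rates(xs, ys, priors):
    """Leave-one-out: refit without each observation, then classify it."""
    wrong = {c: 0 for c in set(ys)}
    total = {c: 0 for c in set(ys)}
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        means, var = fit_lda(train_x, train_y)
        total[ys[i]] += 1
        if predict_lda(xs[i], means, var, priors) != ys[i]:
            wrong[ys[i]] += 1
    return {c: wrong[c] / total[c] for c in total}


# Hypothetical days-to-appointment values, not data from the study.
days = [1, 2, 3, 4, 5, 23, 20, 21, 22, 24, 25]
status = ["Arrived"] * 6 + ["No Show"] * 5
priors = {"Arrived": 0.5, "No Show": 0.5}  # equal priors are an assumption
rates = loo_error_rates(days, status, priors)
```

Because each observation is classified by a rule fitted without it, the resulting error rates are less optimistic than rates computed on the same data used for fitting.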
Normal Based Discriminant Analysis Cross Validation Error Rates

                 Predicted
True         Arrived    No Show    Total
Arrived      %          8.75%      %
No Show      %          22.86%     %
Total        %          13.04%     %

Table 8

Several conclusions can be drawn from these results. The model does a good job of classifying those who arrived for their appointment, misclassifying only 8.75% of these patients. However, it does a very poor job of classifying the patients who failed to show up for their appointment: it correctly classified only 22.86% of the patients whose true status was no show. Therefore, 77.14% of these no show patients were incorrectly predicted to arrive. This is the worst error we could make, since our goal is to identify patients who will not show up in order to target them for some kind of intervention. Unfortunately, these results indicate that the normal based discriminant analysis is not able to classify visit status adequately.

Next we attempted a nonparametric discriminant analysis. This method is more flexible since it places no distributional assumptions on the independent
variables. The downside of this method is that the discriminant function is not available, so it is not very practical to use these results to classify new observations. The best nonparametric discriminant analysis used five of the independent variables: days, reminder, distance, patient status and age. In this case the age variable only marginally improves the model, but given the overall poor performance of the other methods we decided to include age even though it makes the model less parsimonious. The cross validation error rates for this model are in table 9.

Nonparametric Discriminant Analysis Cross Validation Error Rates

                 Predicted
True         Arrived    No Show    Total
Arrived      %          9.06%      %
No Show      %          61.43%     %
Total        %          25.00%     %

Table 9

These results are a notable improvement over the normal based method. Once again the model does a good job of classifying the arrived patients, misclassifying only 9.06% of these patients. The nonparametric model correctly classified 61.43% of
the no show patients. While this is a substantial improvement over the normal based method, it is still not very useful. When the main goal is to predict no show patients, it is imperative to have a very low error rate for this category, and misclassifying about 4 out of every 10 no show patients is disappointing.

Univariate Analyses

Since the logistic regression and discriminant analysis results were unsatisfactory, we decided to examine the relationship between each independent variable and the status outcome individually. T tests were performed for the continuous variables to test whether there was a difference in means between no show and arrived patients. Chi squared tests of association were performed for each of the categorical variables. The results of these analyses can be found in table 10.
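As an illustration of the two-sample comparison described above, the following Python sketch computes a t statistic for the difference in mean days between two groups. It uses Welch's unequal-variance form, which is an assumption here because the report does not say which variant of the t test was run; the data values are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch two-sample t statistic and approximate degrees of freedom."""
    # Squared standard errors of the two sample means.
    va = variance(sample_a) / len(sample_a)
    vb = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Hypothetical days-to-appointment values, not data from the study.
days_arrived = [2.0, 5.0, 7.0, 9.0, 12.0]
days_no_show = [10.0, 14.0, 18.0, 21.0, 27.0]
t_stat, df = welch_t(days_arrived, days_no_show)
```

A negative statistic here means the first group (arrived) has the smaller mean number of days; the p value would then come from a t distribution with the computed degrees of freedom.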
Continuous variables (t tests)
Columns: Mean, Std Err, Arrival, No Show, T Statistic, p value
Rows: Days, Distance, Age

Categorical variables (chi square tests)
Columns: Chi square, p value
Rows: Reminder (<), Previous Visit Status, Patient Status, Visit Type, MedZou Clinic

Table 10

From table 10 we can see that four of the tests are significant at alpha = 0.05 (days, distance, reminder and patient status). Also, age and visit type are marginally significant, with p values < 0.10. The tests for previous visit status and clinic type are not significant at any reasonable alpha level. Therefore, there is insufficient evidence to conclude that previous visit status and clinic type are related to status.

The test for the days variable has a corresponding p value of We reject the null hypothesis that the mean days for no show and arrived patients are the same. We conclude that there is a statistically significant difference between the mean days of arrived patients and no show patients. By examining the means we can see that patients who attend their appointments have a lower mean number of days
from the date the appointment is scheduled until the date of the appointment. This is an intuitive result. The longer a patient has to wait, the more likely they are to forget their appointment or to have other important things arise that cause them not to show up.

Next we will look more closely at the distance variable. This test was significant with p value so we reject the null hypothesis that the mean distance is the same for the two groups of patients. The estimates of the means indicate that the patients who arrived for their appointments had a larger mean distance from the clinic, which is a slightly surprising result.

The final continuous variable is age. The test was marginally significant with a p value of so there is some evidence that the no show patients and arrived patients have different mean ages. Moreover, it appears that the mean age of the arrived patients is higher than the mean age of the no show patients. It is important to note that this statistically significant difference might not be a practically significant difference. The difference between the means is less than 2 years, which raises some questions about whether age can really be used to distinguish between no show and arrived patients.

The first of the significant categorical variables is the reminder call. The chi square test indicates that there is sufficient evidence to conclude that the reminder call is associated with visit status. A look at the contingency table provides some insight. Table 11 gives the total counts for each combination of visit status and reminder call.
                 Arrived    No Show    Total
Left Message
Confirmed
No Answer
Total

Table 11

One can see from the table that for left message and no answer the counts are about even between arrived and no show. The real difference is in the confirmed row. Of the patients who confirmed their appointment, 77.7% did actually arrive. So a good indicator that a patient will arrive for their appointment is that they confirmed when given a reminder call.

Next we will examine patient status. Once again the chi square test indicates that there is some statistical association between patient status and arrival status. Table 12 is the contingency table for these variables:

                 Arrived    No Show    Total
New Patient
Return Patient
Total

Table 12

Table 12 indicates that the majority of both new patients and return patients did attend their scheduled appointments. Interestingly, a lower percentage of new patients failed to show up for their appointments.
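The chi square test of association used for these contingency tables can be illustrated with a small Python sketch. The counts below are hypothetical and are not the counts from table 11 or table 12; the function names are also illustrative.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def row_percentages(table):
    """Percentage of each row falling in each column."""
    return [[100.0 * cell / sum(row) for cell in row] for row in table]


# Hypothetical counts. Rows: Left Message, Confirmed, No Answer.
# Columns: Arrived, No Show.
counts = [
    [20, 20],
    [60, 20],
    [10, 10],
]
stat, df = chi_square(counts)
pct = row_percentages(counts)
```

With these made-up counts the statistic is about 9.33 on 2 degrees of freedom, which exceeds the 0.05 critical value of 5.991, and the row percentages show that the association is driven by the confirmed row.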
Finally we will consider the results for visit type. Here we are examining whether a full visit versus a not full visit is associated with arrival status. The counts for each combination can be found in table 13.

                 Arrived    No Show    Total
Full Visit
Not full Visit
Total

Table 13

The first thing to notice in table 13 is the low counts for the not full visit. With only 5 patients failing to arrive for a not full visit, it is difficult to draw any conclusions or to use these results for classification. Also, the majority of full visit patients did arrive for their appointment. So this appears to be another example of a statistically significant result that does not have much practical application.

Conclusions

The statistical methods used in this study fell victim to the high number of missing values in the data set. The large amount of incomplete data decreased power and made accurate prediction almost impossible. There are potential ways to remedy this issue that might be useful in a supplementary investigation. One possibility is to use imputation to fill in the missing values. The ideal solution would be to acquire a larger sample of complete observations. However, this might not be possible given that the clinic is staffed by volunteers with limited time and resources.
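As a rough illustration of what imputation means in practice, the sketch below fills missing values with the observed mean for a numeric field and the most common level for a categorical field. This simple single imputation is an assumption chosen for brevity; it tends to understate uncertainty, and multiple imputation is generally preferred for formal analysis. Field names and records are hypothetical.

```python
from collections import Counter
from statistics import mean


def impute(records, numeric_fields, categorical_fields):
    """Fill None values with the observed mean (numeric) or mode (categorical)."""
    fills = {}
    for field in numeric_fields:
        fills[field] = mean([r[field] for r in records if r[field] is not None])
    for field in categorical_fields:
        observed = [r[field] for r in records if r[field] is not None]
        fills[field] = Counter(observed).most_common(1)[0][0]
    # Build new records so the originals are left unchanged.
    return [
        {k: (fills[k] if v is None and k in fills else v) for k, v in r.items()}
        for r in records
    ]


# Hypothetical records; None marks a missing value.
records = [
    {"age": 30.0, "reminder": "Confirmed"},
    {"age": None, "reminder": "Confirmed"},
    {"age": 50.0, "reminder": None},
    {"age": 40.0, "reminder": "No Answer"},
]
filled = impute(records, ["age"], ["reminder"])
```

After imputation every record is complete, so procedures that delete observations with missing values would use the full sample.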
We employed only two methods out of a wide variety of statistical tools. Future research on this issue might benefit from using additional methods. Classification and regression trees would provide another way to predict observations. Also, a multinomial logistic regression would be able to model a status outcome that has more than two categories.

Even though our methods did not perform as well as we would have liked, there is still some useful information to come out of this study. Several of the independent variables did have statistically significant relationships with the status outcome. In particular, patients who did not show up for a scheduled appointment tended to have a larger average number of days between the date the visit was scheduled and the date of the appointment. Also, the patients who confirmed their appointment during the reminder call were much more likely to actually arrive. Therefore, to reduce potential no shows, we would recommend giving additional reminder calls to patients who have a large number of days between the scheduling date and the appointment date. Moreover, it might be beneficial to make repeated reminder calls until the patient either confirms or indicates that they will be canceling.

The aim of this analysis was to develop a procedure that could be used to predict which patients will fail to show up for a scheduled appointment. To accomplish this goal we first fit a logistic regression model and then performed discriminant analysis. Unfortunately, neither of these methods was able to classify patients with reasonable error rates. Obtaining a larger sample of complete observations or handling the missing values
in some other way might provide the needed power to get useful results in a future study.
More informationUnit 26 Estimation with Confidence Intervals
Unit 26 Estimation with Confidence Intervals Objectives: To see how confidence intervals are used to estimate a population proportion, a population mean, a difference in population proportions, or a difference
More informationBuilding risk prediction models - with a focus on Genome-Wide Association Studies. Charles Kooperberg
Building risk prediction models - with a focus on Genome-Wide Association Studies Risk prediction models Based on data: (D i, X i1,..., X ip ) i = 1,..., n we like to fit a model P(D = 1 X 1,..., X p )
More informationPsychology 60 Fall 2013 Practice Exam Actual Exam: Next Monday. Good luck!
Psychology 60 Fall 2013 Practice Exam Actual Exam: Next Monday. Good luck! Name: 1. The basic idea behind hypothesis testing: A. is important only if you want to compare two populations. B. depends on
More informationASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS
DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.
More informationSection 12 Part 2. Chi-square test
Section 12 Part 2 Chi-square test McNemar s Test Section 12 Part 2 Overview Section 12, Part 1 covered two inference methods for categorical data from 2 groups Confidence Intervals for the difference of
More informationPredict the Popularity of YouTube Videos Using Early View Data
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationTHE USE OF LONGITUDINAL DATA FROM THE EU SILC IN MONITORING THE EMPLOYMENT STRATEGY
THE USE OF LONGITUDINAL DATA FROM THE EU SILC IN MONITORING THE EMPLOYMENT STRATEGY Iveta Stankovičová Ľudmila Ivančíková Róbert Vlačuha Abstract The National Employment Strategy for the period until 2020,
More informationTesting for Granger causality between stock prices and economic growth
MPRA Munich Personal RePEc Archive Testing for Granger causality between stock prices and economic growth Pasquale Foresti 2006 Online at http://mpra.ub.uni-muenchen.de/2962/ MPRA Paper No. 2962, posted
More informationX X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)
CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.
More informationSTATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
More informationHandling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza
Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and
More informationPart 3. Comparing Groups. Chapter 7 Comparing Paired Groups 189. Chapter 8 Comparing Two Independent Groups 217
Part 3 Comparing Groups Chapter 7 Comparing Paired Groups 189 Chapter 8 Comparing Two Independent Groups 217 Chapter 9 Comparing More Than Two Groups 257 188 Elementary Statistics Using SAS Chapter 7 Comparing
More informationMachine Learning Logistic Regression
Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.
More informationNon-Parametric Tests (I)
Lecture 5: Non-Parametric Tests (I) KimHuat LIM lim@stats.ox.ac.uk http://www.stats.ox.ac.uk/~lim/teaching.html Slide 1 5.1 Outline (i) Overview of Distribution-Free Tests (ii) Median Test for Two Independent
More informationTopic 8. Chi Square Tests
BE540W Chi Square Tests Page 1 of 5 Topic 8 Chi Square Tests Topics 1. Introduction to Contingency Tables. Introduction to the Contingency Table Hypothesis Test of No Association.. 3. The Chi Square Test
More informationComparing Multiple Proportions, Test of Independence and Goodness of Fit
Comparing Multiple Proportions, Test of Independence and Goodness of Fit Content Testing the Equality of Population Proportions for Three or More Populations Test of Independence Goodness of Fit Test 2
More informationIBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA
CALIFORNIA STATE UNIVERSITY, LOS ANGELES INFORMATION TECHNOLOGY SERVICES IBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA Summer 2013, Version 2.0 Table of Contents Introduction...2 Downloading the
More informationMULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance
More informationCREDIT SCORING MODEL APPLICATIONS:
Örebro University Örebro University School of Business Master in Applied Statistics Thomas Laitila Sune Karlsson May, 2014 CREDIT SCORING MODEL APPLICATIONS: TESTING MULTINOMIAL TARGETS Gabriela De Rossi
More informationPart 2: Analysis of Relationship Between Two Variables
Part 2: Analysis of Relationship Between Two Variables Linear Regression Linear correlation Significance Tests Multiple regression Linear Regression Y = a X + b Dependent Variable Independent Variable
More informationIntroduction to Statistics Used in Nursing Research
Introduction to Statistics Used in Nursing Research Laura P. Kimble, PhD, RN, FNP-C, FAAN Professor and Piedmont Healthcare Endowed Chair in Nursing Georgia Baptist College of Nursing Of Mercer University
More informationStudents' Opinion about Universities: The Faculty of Economics and Political Science (Case Study)
Cairo University Faculty of Economics and Political Science Statistics Department English Section Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study) Prepared
More informationUNDERSTANDING THE TWO-WAY ANOVA
UNDERSTANDING THE e have seen how the one-way ANOVA can be used to compare two or more sample means in studies involving a single independent variable. This can be extended to two independent variables
More informationCOMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES.
277 CHAPTER VI COMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES. This chapter contains a full discussion of customer loyalty comparisons between private and public insurance companies
More informationIntroduction to Hypothesis Testing. Hypothesis Testing. Step 1: State the Hypotheses
Introduction to Hypothesis Testing 1 Hypothesis Testing A hypothesis test is a statistical procedure that uses sample data to evaluate a hypothesis about a population Hypothesis is stated in terms of the
More informationTesting Research and Statistical Hypotheses
Testing Research and Statistical Hypotheses Introduction In the last lab we analyzed metric artifact attributes such as thickness or width/thickness ratio. Those were continuous variables, which as you
More information10. Comparing Means Using Repeated Measures ANOVA
10. Comparing Means Using Repeated Measures ANOVA Objectives Calculate repeated measures ANOVAs Calculate effect size Conduct multiple comparisons Graphically illustrate mean differences Repeated measures
More informationTwo Correlated Proportions (McNemar Test)
Chapter 50 Two Correlated Proportions (Mcemar Test) Introduction This procedure computes confidence intervals and hypothesis tests for the comparison of the marginal frequencies of two factors (each with
More informationThe Effect of a Carve-out Advanced Access Scheduling System on No-show Rates
Practice Management Vol. 41, No. 1 51 The Effect of a Carve-out Advanced Access Scheduling System on No-show Rates Kevin J. Bennett, PhD; Elizabeth G. Baxley, MD Background and Objectives: The relationship
More informationVI. Introduction to Logistic Regression
VI. Introduction to Logistic Regression We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models
More informationSPSS Explore procedure
SPSS Explore procedure One useful function in SPSS is the Explore procedure, which will produce histograms, boxplots, stem-and-leaf plots and extensive descriptive statistics. To run the Explore procedure,
More informationHYPOTHESIS TESTING WITH SPSS:
HYPOTHESIS TESTING WITH SPSS: A NON-STATISTICIAN S GUIDE & TUTORIAL by Dr. Jim Mirabella SPSS 14.0 screenshots reprinted with permission from SPSS Inc. Published June 2006 Copyright Dr. Jim Mirabella CHAPTER
More informationClocking In Facebook Hours. A Statistics Project on Who Uses Facebook More Middle School or High School?
Clocking In Facebook Hours A Statistics Project on Who Uses Facebook More Middle School or High School? Mira Mehta and Joanne Chiao May 28 th, 2010 Introduction With Today s technology, adolescents no
More informationIntroduction to Statistics and Quantitative Research Methods
Introduction to Statistics and Quantitative Research Methods Purpose of Presentation To aid in the understanding of basic statistics, including terminology, common terms, and common statistical methods.
More informationRandom Forest Based Imbalanced Data Cleaning and Classification
Random Forest Based Imbalanced Data Cleaning and Classification Jie Gu Software School of Tsinghua University, China Abstract. The given task of PAKDD 2007 data mining competition is a typical problem
More informationA Comparative Analysis of Speech Recognition Platforms
Communications of the IIMA Volume 9 Issue 3 Article 2 2009 A Comparative Analysis of Speech Recognition Platforms Ore A. Iona College Follow this and additional works at: http://scholarworks.lib.csusb.edu/ciima
More informationIs it statistically significant? The chi-square test
UAS Conference Series 2013/14 Is it statistically significant? The chi-square test Dr Gosia Turner Student Data Management and Analysis 14 September 2010 Page 1 Why chi-square? Tests whether two categorical
More informationIntroduction to Fixed Effects Methods
Introduction to Fixed Effects Methods 1 1.1 The Promise of Fixed Effects for Nonexperimental Research... 1 1.2 The Paired-Comparisons t-test as a Fixed Effects Method... 2 1.3 Costs and Benefits of Fixed
More informationSAS Software to Fit the Generalized Linear Model
SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling
More informationMultiple Imputation for Missing Data: A Cautionary Tale
Multiple Imputation for Missing Data: A Cautionary Tale Paul D. Allison University of Pennsylvania Address correspondence to Paul D. Allison, Sociology Department, University of Pennsylvania, 3718 Locust
More informationData Mining Part 5. Prediction
Data Mining Part 5. Prediction 5.7 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Linear Regression Other Regression Models References Introduction Introduction Numerical prediction is
More informationNCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
More informationMultiple logistic regression analysis of cigarette use among high school students
Multiple logistic regression analysis of cigarette use among high school students ABSTRACT Joseph Adwere-Boamah Alliant International University A binary logistic regression analysis was performed to predict
More informationVariables Control Charts
MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. Variables
More informationStudy Guide for the Final Exam
Study Guide for the Final Exam When studying, remember that the computational portion of the exam will only involve new material (covered after the second midterm), that material from Exam 1 will make
More informationUsing Stata for Categorical Data Analysis
Using Stata for Categorical Data Analysis NOTE: These problems make extensive use of Nick Cox s tab_chi, which is actually a collection of routines, and Adrian Mander s ipf command. From within Stata,
More informationStepwise Regression. Chapter 311. Introduction. Variable Selection Procedures. Forward (Step-Up) Selection
Chapter 311 Introduction Often, theory and experience give only general direction as to which of a pool of candidate variables (including transformed variables) should be included in the regression model.
More informationElementary Statistics Sample Exam #3
Elementary Statistics Sample Exam #3 Instructions. No books or telephones. Only the supplied calculators are allowed. The exam is worth 100 points. 1. A chi square goodness of fit test is considered to
More informationAn Introduction to Path Analysis. nach 3
An Introduction to Path Analysis Developed by Sewall Wright, path analysis is a method employed to determine whether or not a multivariate set of nonexperimental data fits well with a particular (a priori)
More informationLean Six Sigma Analyze Phase Introduction. TECH 50800 QUALITY and PRODUCTIVITY in INDUSTRY and TECHNOLOGY
TECH 50800 QUALITY and PRODUCTIVITY in INDUSTRY and TECHNOLOGY Before we begin: Turn on the sound on your computer. There is audio to accompany this presentation. Audio will accompany most of the online
More information