Application of Predictive Analytics to Higher Degree Research Course Completion Times
Application of Decision Theory to PhD Course Completions (2006-2013)
Rachna Dhand, Senior Strategic Information Analyst
Higher Degree Research (HDR) students carry significant costs for universities. When students fail to complete on time, or at all, resources are used sub-optimally and government grant allocations and ratings are affected.
Objective: The objective of this project was to analyse historical completion times for ECU PhD candidates and identify their primary determinants, so that intervention strategies can be implemented to help future students complete on time.
Methodology: Classification decision-science models are used to predict completion times for HDR candidates. Shortlisted models include CHAID, QUEST and C5.0.
What is Predictive Analytics?
Data Mining: Introduction, Modelling and Prediction Accuracy
The use of data-mining strategies to identify who is at risk of dropping out, or who is likely to take longer to finish a degree, is not a new subject in Institutional Research (IR). Explanatory models based on regression and path analysis have contributed substantially to our understanding of student retention (Adam and Gaither, 2005; Pascarella and Terenzini, 2005; Braxton, 2000). Published studies on the use and prediction accuracy of data-mining approaches in IR are few, however. Luan (2002) explained the application of neural-network and decision-tree analysis in predicting the transfer of college students to four-year institutions. Byers Gonzalez and DesJardins (2002) showed that a neural-network model predicts with better accuracy than binary logistic regression. Prediction accuracy does not depend solely on the type of model chosen; it also depends on the independent variables chosen, their measurement levels and the size of the data set.
Predicting Completion Status for Higher Degree Research Candidates
Is it possible to predict the completion status of an HDR student given:
1) Variable set of demographic and course information
2) Research experience and nature of research project
3) Faculty and school information
4) Supervisor information...
HDR Completion Status Prediction: Analysis and Modelling Approach
1. Objective: The main objective of the model is to predict the Completion Time for future applicants, where Completion Time is defined as the period from the candidate's commencement of the HDR degree to its completion.
2. Analysis: The analysis uses PhD students from 2006 to 2013, from all faculties of ECU, with their research performance and demographic information, as the historical data set. The completion time is estimated.
3. Information Value Analysis (IVA): The research performance and demographic variables are pooled and screened through IVA to retain the variables most strongly associated with Completion Time; these form the modelling dataset.
4. Methodology: The dataset and estimated Completion Time are modelled using decision-based predictive modelling. The tested models are C5.0, CHAID and QUEST, following a classification-based paradigm for modelling and scoring.
HDR Completion Status Prediction: Completion Time Prediction Modelling
1. Historical data analysis filters PhD candidacy outcomes for ECU with reference years from 2006 to 2012.
2. Information Value Analysis identifies the primary determinants correlated with Completion Time, which are then targeted in model building.
3. Target definition is based on the historical outcome, and model building is initiated using the classification models CHAID, C5.0 and QUEST.
4. The model with the best result and the most accurate emulation of the actual target is chosen to score future candidates.
Predicting Completion Status for Higher Degree Research Candidates: Historical Analysis and Understanding Data Used for Prediction...
HDR Cohort Analysis: Count by Citizenship (2006-2012)
Note: The small cohort size for HDR enrolments constrains the modelling process, which makes classification models more suitable for model building and training.
HDR Candidacy Time Distribution: Domestic Enrolled Candidates (2006-2013)
Candidacy Time is the time from commencement of the PhD degree until the student reaches a final state: completion of the degree, discontinuation of the degree, or remaining enrolled for a longer duration.
HDR Course Attempt Distribution by Final Outcome: Domestic vs International
HDR Completion Status Estimation: Scope of Modelling
Input data:
1) HDR Research Report Data (set of 45 variables related to student research experience and candidature progress)
2) Student Course Details (set of 215 variables)
3) Research Data (milestone status, ABS research classifications and scholarship status data)
4) Student Demographic and Course Information
Modelling flow:
1) Domestic PhD + International PhD candidates only
2) Target definition for modelling (completion outcome)
3) Discard Inactive and Intermittent status
4) Modelling dataset: Training Dataset (90%) (2006-2012); Testing & Scoring Dataset (10%) (2013)
5) C5.0, CHAID or QUEST Decision Tree Model
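The training and testing datasets above are separated by year rather than at random. A minimal Python sketch of such a split follows. It assumes each record is a dict with a hypothetical `reference_year` field; the field name and the sample records are illustrative and not taken from the ECU data.

```python
def split_by_reference_year(records, last_training_year=2012):
    """Split records into (training, testing) by reference year.

    Records up to and including last_training_year go to training;
    later records go to testing and scoring.
    """
    training = [r for r in records if r["reference_year"] <= last_training_year]
    testing = [r for r in records if r["reference_year"] > last_training_year]
    return training, testing


# Hypothetical records for illustration only (not ECU data).
records = [
    {"student_id": "A", "reference_year": 2006},
    {"student_id": "B", "reference_year": 2010},
    {"student_id": "C", "reference_year": 2012},
    {"student_id": "D", "reference_year": 2013},
]
training, testing = split_by_reference_year(records)
```

A year-based split keeps the most recent cohort out of training, so the test set resembles the future candidates the model is meant to score.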
HDR Completion Status Estimation: Target Definition
Target: T_ATTEMPT_STATUS (actual candidacy status)
Completion Time (calculated in years):
   Completed candidates: date_years_difference(D_COURSE_COMMENCEMENT_DT, D_COURSE_COMPLETION_DT)
   Enrolled candidates (T_COURSE_ATTEMPT_STATUS matches "ENROL*"): 2013 - D_INTAKE_PERIOD
   Discontinued candidates: date_years_difference(D_COURSE_COMMENCEMENT_DT, T_COURSE_DISCONTINUED_DT)
Target classes:
1. WILL_COMPLETE (Candidacy <= 4 Years)
2. WILL_COMPLETE_LATE (Candidacy > 4 Years)
3. STILL_ENROLLED (Candidacy > 3.5 Years)
4. WILL_DISCONTINUE (Attrition Flags set for all Teaching Periods)
5. IMMATURE VINTAGE (Candidacy < 3.5 Years) [Discarded]
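The target rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the original stream logic: the status codes `"COMPLETED"`, `"ENROLLED"` and `"DISCONTINUED"` are hypothetical simplifications of the status and attrition-flag checks, `years_between` approximates `date_years_difference` with 365.25-day years, and the slide does not say how a candidacy of exactly 3.5 years is handled.

```python
from datetime import date

REFERENCE_YEAR = 2013  # scoring year used in the target definition


def years_between(start, end):
    """Approximate elapsed years between two dates (365.25-day years)."""
    return (end - start).days / 365.25


def derive_target(status, commenced=None, completed=None, intake_period=None):
    """Return the target class, or None for a discarded immature vintage.

    status: "COMPLETED", "ENROLLED" or "DISCONTINUED" (hypothetical codes).
    """
    if status == "DISCONTINUED":
        return "WILL_DISCONTINUE"
    if status == "COMPLETED":
        candidacy = years_between(commenced, completed)
        return "WILL_COMPLETE" if candidacy <= 4 else "WILL_COMPLETE_LATE"
    if status == "ENROLLED":
        candidacy = REFERENCE_YEAR - intake_period
        # Exactly 3.5 years is not covered by the rules; treated here as immature.
        return "STILL_ENROLLED" if candidacy > 3.5 else None
    raise ValueError(f"unknown status: {status}")


on_time = derive_target("COMPLETED", date(2006, 3, 1), completed=date(2009, 12, 1))
late = derive_target("COMPLETED", date(2006, 3, 1), completed=date(2011, 3, 1))
enrolled = derive_target("ENROLLED", intake_period=2008)
immature = derive_target("ENROLLED", intake_period=2011)
```

Records for which the function returns `None` correspond to the discarded IMMATURE VINTAGE group and would be dropped before training.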
Predicting Completion Status for Higher Degree Research Candidates: Using Classification Models for Prediction...
Decision Tree Models
Because the indicators have non-linear relationships with the target, and the target outcome is nominal, Decision Tree Models were selected to predict the Completion Time outcomes for currently enrolled domestic and international students. The outcome from the model is:
1) Target Prediction (STILL_ENROLLED, WILL_COMPLETE, WILL_DISCONTINUE, WILL_COMPLETE_LATE).
2) Confidence Score for each enrolled student (ranges between 0 and 1).
Rule induction models are categorised into C5.0, Chi-Square Automatic Interaction Detection (CHAID), QUEST and Classification and Regression (C&R) Tree. The C5.0 model handles nominal or flag targets with all predictor types (nominal, continuous or flag).
Decision Tree Models
Model Criteria | C5.0 | CHAID | QUEST
Type of Split for Categorical Targets | Multiple | Multiple | Binary
Continuous Target | No | Yes | No
Continuous Predictors | Yes | No | Yes
Criteria for Predictor Selection | Information Measure | Chi-Square; F-Test for Continuous | Statistical
Supports Bagging/Boosting | Yes | Yes | Yes
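The comparison lists an "Information Measure" as the C5.0 criterion for predictor selection. As a rough illustration of what such a measure can look like, the sketch below computes plain information gain (the reduction in target entropy) for a categorical predictor. This is an assumption for illustration: the exact measure used by the C5.0 implementation is not given in these slides, and the sample values are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(predictor, target):
    """Reduction in target entropy after splitting on a categorical predictor."""
    total = len(target)
    groups = {}
    for value, label in zip(predictor, target):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(target) - weighted


# Hypothetical values for illustration only.
target = ["WILL_COMPLETE", "WILL_COMPLETE", "WILL_DISCONTINUE", "WILL_DISCONTINUE"]
informative = ["FT", "FT", "PT", "PT"]    # separates the classes perfectly
uninformative = ["FT", "PT", "FT", "PT"]  # carries no information about the class
gain_informative = information_gain(informative, target)
gain_uninformative = information_gain(uninformative, target)
```

A predictor that separates the classes perfectly recovers the full entropy of the target, while one that is independent of the target yields a gain of zero.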
Predictor Importance: CHAID (F-Test Association with Target)
Predictors shown: Milestones Achieved, School Name, Load Completed, Field of Education, Course Fraction Completed, Meeting Frequency, Basis of Admission, Research Literature, Funding Category Changed, Annual Leaves Availed.
CHAID predicts the target with the best accuracy. CHAID performs Chi-Square tests for predictor importance and variable reduction. The test tends to give higher importance to continuous variables than to nominal or categorical ones.
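To make the Chi-Square test of association concrete, the sketch below computes the Pearson chi-square statistic for one categorical predictor against the target. It is a simplified illustration with hypothetical values: it covers only the statistic itself, not the category merging or significance adjustments that a full CHAID implementation applies.

```python
from collections import Counter


def chi_square(predictor, target):
    """Pearson chi-square statistic for a predictor-by-target contingency table."""
    n = len(target)
    observed = Counter(zip(predictor, target))
    row_totals = Counter(predictor)
    col_totals = Counter(target)
    stat = 0.0
    for row, row_n in row_totals.items():
        for col, col_n in col_totals.items():
            expected = row_n * col_n / n  # count expected under independence
            stat += (observed[(row, col)] - expected) ** 2 / expected
    return stat


# Hypothetical values for illustration only.
target = ["WILL_COMPLETE", "WILL_COMPLETE", "WILL_DISCONTINUE", "WILL_DISCONTINUE"]
associated = ["FT", "FT", "PT", "PT"]
independent = ["FT", "PT", "FT", "PT"]
stat_associated = chi_square(associated, target)
stat_independent = chi_square(independent, target)
```

Larger statistics indicate a stronger association with the target; a predictor whose distribution is the same in every target class scores zero.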
Predictor Importance: C5.0 (Information Value Analysis)
Predictors shown: Course Fraction Completed, Literature Review Feedback, Field of Education, Milestone Achieved, Mode of Attendance, Age at Enrolment, NESB Indicator.
C5.0 uses the Information Value (IV) and Weight of Evidence (WoE) method for variable reduction. WoE analyses the predictive power of a variable in relation to the targeted outcome, while IV assesses the overall predictive power of the variable being considered.
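A minimal sketch of how WoE and IV are commonly computed for a categorical predictor follows. It rests on several assumptions not stated in the slides: the target is reduced to a binary flag (for example, discontinued versus not), WoE is taken as ln(share of non-events / share of events), and a smoothing constant of 0.5 is added to each cell to avoid taking the log of zero. The sample values are hypothetical.

```python
from collections import Counter
from math import log


def woe_iv(predictor, is_event, smoothing=0.5):
    """Return ({category: WoE}, IV) for a categorical predictor and a binary flag.

    smoothing is added to every cell count to avoid log(0); its value is an
    assumption made for this sketch.
    """
    events = Counter(v for v, e in zip(predictor, is_event) if e)
    non_events = Counter(v for v, e in zip(predictor, is_event) if not e)
    categories = set(predictor)
    total_events = sum(events[c] + smoothing for c in categories)
    total_non_events = sum(non_events[c] + smoothing for c in categories)
    woe, iv = {}, 0.0
    for c in categories:
        p_event = (events[c] + smoothing) / total_events
        p_non_event = (non_events[c] + smoothing) / total_non_events
        woe[c] = log(p_non_event / p_event)
        iv += (p_non_event - p_event) * woe[c]
    return woe, iv


# Hypothetical values: is_event marks candidates who discontinued.
mode = ["PT"] * 4 + ["FT"] * 4
discontinued = [True, True, True, False, False, False, False, True]
woe, iv = woe_iv(mode, discontinued)
```

WoE is reported per category and its sign shows the direction of the relationship, while IV is a single non-negative number per variable, which is what makes it usable for ranking and filtering predictors.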
Completion Status Modelling: Decision Tree Models Used
Completion Status Estimation: C5.0 Decision Tree Model, CHAID, QUEST
C5.0 Decision Tree Model: Actual vs Predicted Candidature Status
Target | Actual % | Actual count | Predicted % | Predicted count
STILL_ENROLLED | 18.0% | 101 | 33.51% | 188
WILL_COMPLETE | 14.08% | 79 | 8.91% | 50
WILL_COMPLETE_LATE | 14.8% | 83 | 4.46% | 25
WILL_DISCONTINUE | 53.12% | 298 | 53.12% | 298
C5.0 Decision Tree Model: Model Evaluation & Analysis
Results for output field T_ATTEMPT_STATUS
Comparing $C-T_ATTEMPT_STATUS with T_ATTEMPT_STATUS:
   Correct: 409 (72.91%)
   Wrong: 152 (27.09%)
   Total: 561
Performance Evaluation:
   STILL_ENROLLED: 1.0
   WILL_COMPLETE: 1.545
   WILL_COMPLETE_LATE: 1.636
   WILL_DISCONTINUE: 0.515
Confidence Values Report for $CC-T_ATTEMPT_STATUS:
   Range: 0.35-0.906
   Mean Correct: 0.728
   Mean Incorrect: 0.553
   Always Correct Above: 0.906 (0% of cases)
   Always Incorrect Below: 0.35 (0% of cases)
   85.56% Accuracy Above: 0.478
   2.0 Fold Correct Above: 0.86 (53.57% of cases)
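The "Accuracy Above" entry in the confidence report relates a confidence threshold to the accuracy of the predictions that exceed it. The sketch below shows one way to compute that relationship and the mean confidence of correct and incorrect predictions. The scored pairs are hypothetical and the function names are illustrative; this is not the evaluation tool's own procedure.

```python
def accuracy_above(scored, threshold):
    """Accuracy among predictions whose confidence exceeds threshold.

    scored: list of (confidence, is_correct) pairs.
    Returns None when no prediction exceeds the threshold.
    """
    selected = [ok for conf, ok in scored if conf > threshold]
    if not selected:
        return None
    return sum(selected) / len(selected)


def mean_confidence(scored, correct):
    """Mean confidence of the correct (or incorrect) predictions."""
    values = [conf for conf, ok in scored if ok == correct]
    return sum(values) / len(values) if values else None


# Hypothetical (confidence, is_correct) pairs for illustration only.
scored = [(0.9, True), (0.8, True), (0.6, False), (0.5, True), (0.4, False)]
overall = accuracy_above(scored, 0.0)
confident = accuracy_above(scored, 0.55)
mean_correct = mean_confidence(scored, True)
mean_incorrect = mean_confidence(scored, False)
```

When mean confidence is clearly higher for correct than for incorrect predictions, as in the report above, restricting attention to high-confidence predictions raises accuracy at the cost of covering fewer candidates.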
CHAID Decision Tree Model: Actual vs Predicted Candidature Status
Target | Actual % | Actual count | Predicted % | Predicted count
STILL_ENROLLED | 18.0% | 101 | 11.51% | 65
WILL_COMPLETE | 14.08% | 79 | 14.26% | 80
WILL_COMPLETE_LATE | 14.8% | 83 | 13.9% | 78
WILL_DISCONTINUE | 53.12% | 298 | 60.25% | 338
CHAID Decision Tree Model: Model Evaluation & Analysis
Results for output field T_ATTEMPT_STATUS
Comparing $R-T_ATTEMPT_STATUS with T_ATTEMPT_STATUS:
   Correct: 412 (73.44%)
   Wrong: 149 (26.56%)
   Total: 561
Performance Evaluation:
   STILL_ENROLLED: 1.566
   WILL_COMPLETE: 1.242
   WILL_COMPLETE_LATE: 1.192
   WILL_DISCONTINUE: 0.441
Confidence Values Report for $RC-T_ATTEMPT_STATUS:
   Range: 0.3-0.978
   Mean Correct: 0.775
   Mean Incorrect: 0.459
   Always Correct Above: 0.978 (0% of cases)
   Always Incorrect Below: 0.3 (0% of cases)
   76.05% Accuracy Above: 0.379
   2.0 Fold Correct Above: 0.875 (42.4% of cases)
QUEST Decision Tree Model: Actual vs Predicted Candidature Status
Target | Actual % | Actual count | Predicted % | Predicted count
STILL_ENROLLED | 18.0% | 101 | 25.18% | 141
WILL_COMPLETE | 14.08% | 79 | 12.32% | 69
WILL_COMPLETE_LATE | 14.8% | 83 | 2.5% | 14
WILL_DISCONTINUE | 53.12% | 298 | 60.0% | 336
QUEST Decision Tree Model: Model Evaluation & Analysis
Results for output field T_ATTEMPT_STATUS
Comparing $R-T_ATTEMPT_STATUS with T_ATTEMPT_STATUS:
   Correct: 374 (66.79%)
   Wrong: 186 (33.21%)
   Total: 560
Performance Evaluation:
   STILL_ENROLLED: 1.082
   WILL_COMPLETE: 0.902
   WILL_COMPLETE_LATE: 1.349
   WILL_DISCONTINUE: 0.404
Confidence Values Report for $RC-T_ATTEMPT_STATUS:
   Range: 0.321-0.808
   Mean Correct: 0.687
   Mean Incorrect: 0.542
   Always Correct Above: 0.808 (0% of cases)
   Always Incorrect Below: 0.321 (0% of cases)
   77% Accuracy Above: 0.482
Predicting Completion Status for Higher Degree Research Candidates: Validating Prediction Accuracy...
Model Comparison: Confidence Level Distributions
CHAID (Strong): predicts the target with the best accuracy.
QUEST (Weak): predicts the target with the weakest accuracy of the three models used.
C5.0 (Average): predicts the target with average accuracy.
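The ranking can be checked against the correct and total counts reported on the three model evaluation slides. The short sketch below recomputes overall accuracy from those counts; the variable names are illustrative.

```python
# (correct, total) counts as reported on the model evaluation slides.
results = {
    "C5.0": (409, 561),
    "CHAID": (412, 561),
    "QUEST": (374, 560),
}

# Overall accuracy = correct predictions / total scored records.
accuracy = {model: correct / total for model, (correct, total) in results.items()}

# Models ordered from most to least accurate.
ranking = sorted(accuracy, key=accuracy.get, reverse=True)

for model in ranking:
    print(f"{model}: {accuracy[model]:.2%}")
```

The recomputed values match the reported 73.44% (CHAID), 72.91% (C5.0) and 66.79% (QUEST). The gap between CHAID and C5.0 is three records out of 561, so the "best" and "average" labels rest on a small margin.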
Completion Status Estimation: Prediction Accuracy by Target Values
Field | STILL_ENROLLED* | WILL_COMPLETE* | WILL_COMPLETE_LATE* | WILL_DISCONTINUE* | Importance
$RC-T_ATTEMPT_STATUS | 0.666 | 0.472 | 0.460 | 0.822 | 1.000 Important
$CC-T_ATTEMPT_STATUS | 0.492 | 0.584 | 0.543 | 0.809 | 1.000 Important
$RC-T_ATTEMPT_STATUS | 0.521 | 0.527 | 0.524 | 0.740 | 1.000 Important
Model labels shown on the slide: QUEST, C5.0, CHAID
Completion Status Modelling: Conclusion
HDR data quality (DQ) standards need to be raised. The data has good predictor strength, but it should be consistently populated over the time span used for prediction.
The model has good prediction accuracy, even though ECU's HDR cohort is very small (approximately 700 students).
A limitation of the modelling process was that only classification models could be used because of the limited size of the cohort; Neural Net and Logistic Regression modelling could not be applied.
The next phase will be to design the reporting standards and intervention strategies, so that the modelling outcome can be used effectively to reduce the completion time for future students.
References
1. Adam, J., and Gaither, G. H. Retention in Higher Education: A Selective Resource Guide. In G. H. Gaither (ed.), Minority Retention: What Works? New Directions for Institutional Research, no. 125. San Francisco: Jossey-Bass, 2005.
2. Pascarella, E., and Terenzini, P. How College Affects Students. San Francisco: Jossey-Bass, 2005.
3. Braxton, J. Reworking the Student Departure Puzzle. Nashville, Tenn.: Vanderbilt University Press, 2000.
4. Luan, J. Data Mining and its Applications in Higher Education. In A. M. Serban and J. Luan (eds.), Knowledge Management: Building a Competitive Advantage in Higher Education. New Directions for Institutional Research, no. 113. San Francisco: Jossey-Bass, 2002.
5. Byers Gonzalez, J., and DesJardins, S. Artificial Neural Networks: A New Approach for Predicting Application Behaviour. Research in Higher Education, 2002, 43 (2), 235-258.
Questions