Application of Predictive Analytics to Higher Degree Research Course Completion Times




Application of Decision Theory to PhD Course Completions (2006-2013)
Rachna Dhand, Senior Strategic Information Analyst

Higher Degree Research (HDR) students carry significant costs for universities. When students fail to complete on time, or at all, resources are used sub-optimally and government grant allocations and ratings are affected.

Objective
The objective of this project was to analyse historical completion times for ECU PhD candidates and identify their primary determinants, so that intervention strategies can be put in place to help future students complete on time.

Methodology
Classification models from decision science are used to predict completion times for HDR candidates. The shortlisted models are CHAID, QUEST and C5.0.

What is Predictive Analytics?

Data Mining: Introduction, Modelling and Prediction Accuracy
Using data mining strategies to identify who is at risk of dropping out, or who is likely to take longer to finish a degree, is not a new subject in Institutional Research (IR). Explanatory models based on regression and path analysis have contributed substantially to our understanding of student retention (Adam and Gaither, 2005; Pascarella and Terenzini, 2005; Braxton, 2000). Published studies on the use and prediction accuracy of data mining approaches in IR, however, are few. Luan (2002) explained the application of neural network and decision tree analysis to predicting the transfer of college students to four-year institutions. Byers Gonzalez and DesJardins (2002) showed that a neural network model predicts with better accuracy than binary logistic regression.
Prediction accuracy does not depend solely on the type of model chosen; it also depends on the independent variables chosen, their measurement levels and the size of the data.

Predicting Completion Status for Higher Degree Research Candidates
Is it possible to predict the completion status of an HDR student, given the variable sets below?
1) Demographic and course information
2) Research experience and nature of the research project
3) Faculty and school information
4) Supervisor information...

HDR Completion Status Prediction: Analysis and Modeling Approach
1. Objective: The main objective of the model is to predict the Completion Time for future applicants, where Completion Time is defined as the period from commencement of the HDR degree to its completion.
2. Analysis: The history data set covers PhD students from 2006 to 2013 across all faculties of ECU, with their research performance and demographic information. The completion time is estimated from this data.
3. Information Value Analysis (IVA): The research performance and demographic variables are screened through IVA to keep the variables most strongly associated with Completion Time; these form the modeling dataset.
4. Methodology: The dataset and the estimated Completion Time are modeled using decision tree based predictive modeling. The tested models are C5.0, CHAID and QUEST, following a classification paradigm for modeling and scoring.

HDR Completion Status Prediction: Completion Time Prediction Modelling
1. Historical data analysis filters PhD candidacy outcomes for ECU with reference years from 2006 to 2012.
2. Information Value Analysis screens out the primary determinants correlated with Completion Time, which is the target for model building.
3. The target definition is based on the historical outcome, and model building is initiated using the classification models CHAID, C5.0 and QUEST.
4. The model with the best result, i.e. the one that most accurately reproduces the actual target, is chosen to score future candidates.

Predicting Completion Status for Higher Degree Research Candidates
Historical Analysis and Understanding Data used for Prediction...

HDR Cohort Analysis: Count by Citizenship (2006-2012)
Note: The small cohort size for HDR enrolments constrains the modelling process, which makes classification models more suitable for model building and training.

HDR Candidacy Time Distribution: Domestic Enrolled Candidates (2006-2013)
Candidacy Time is the time a student has spent since commencing the PhD degree until reaching a final state: completion, discontinuation, or remaining enrolled for a longer duration.

HDR Course Attempt Distribution by Final Outcome: Domestic vs International

HDR Completion Status Estimation: Scope of Modeling
Data sources:
- HDR Research Report Data (set of 45 variables related to student research experience and candidature progress)
- Student Course Details (set of 215 variables)
- Research Data (milestone status, ABS research classifications and scholarship status data)
Processing steps:
1. Keep domestic and international PhD candidates only.
2. Define the target for modelling (completion outcome).
3. Discard records with Inactive and Intermittent status.
4. Build the modelling dataset (student demographic and course information), split into a training dataset (90%, 2006-2012) and a testing and scoring dataset (10%, 2013).
5. Fit a C5.0, CHAID or QUEST decision tree model.
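The year-based split between training and testing data can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the record layout and the field name `intake_year` are assumptions, and the 90%/10% proportions on the slide are treated as a property of the data rather than something the code enforces.

```python
def split_by_intake_year(records, last_training_year=2012):
    """Split records into training (intake up to last_training_year) and testing sets."""
    training = [r for r in records if r["intake_year"] <= last_training_year]
    testing = [r for r in records if r["intake_year"] > last_training_year]
    return training, testing


# Hypothetical records: two candidates per intake year from 2006 to 2013.
records = [{"student_id": i, "intake_year": 2006 + i % 8} for i in range(16)]
training, testing = split_by_intake_year(records)
print(len(training), len(testing))
```

Splitting by year rather than at random keeps the test set strictly later than the training set, which mirrors how the model would be used to score future candidates.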

HDR Completion Status Estimation: Target Definition
Target field: T_ATTEMPT_STATUS, derived from the actual candidacy status and the Completion Time (calculated in years).
Completion Time calculation:
- Completed candidates: date_years_difference(D_COURSE_COMMENCEMENT_DT, D_COURSE_COMPLETION_DT)
- Enrolled candidates (T_COURSE_ATTEMPT_STATUS matches "ENROL*"): 2013 - D_INTAKE_PERIOD
- Discontinued candidates: date_years_difference(D_COURSE_COMMENCEMENT_DT, T_COURSE_DISCONTINUED_DT)
Target values:
1. WILL_COMPLETE (Candidacy <= 4 Years)
2. WILL_COMPLETE_LATE (Candidacy > 4 Years)
3. STILL_ENROLLED (Candidacy > 3.5 Years)
4. WILL_DISCONTINUE (Attrition Flags set for all Teaching Periods)
5. IMMATURE VINTAGE (Candidacy < 3.5 Years) [Discarded]
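The target rules on this slide can be expressed as a small function. The sketch below is an interpretation under stated assumptions: the status strings other than the "ENROL" prefix are hypothetical, the intake period is assumed to be a calendar year, the 3.5-year boundary itself is assumed to fall in the immature group, and `years_between` is a stand-in for the slide's `date_years_difference`.

```python
from datetime import date

REFERENCE_YEAR = 2013  # reference year used on the slide for enrolled candidates


def years_between(start, end):
    """Approximate difference between two dates in years."""
    return (end - start).days / 365.25


def derive_target(status, commenced, intake_year, completed=None, discontinued=None):
    """Return (target_label, candidacy_years) following the slide's rules."""
    if status.startswith("ENROL"):
        candidacy = float(REFERENCE_YEAR - intake_year)
        # Assumption: exactly 3.5 years is treated as immature and discarded.
        label = "STILL_ENROLLED" if candidacy > 3.5 else "IMMATURE_VINTAGE"
        return label, candidacy
    if discontinued is not None:
        return "WILL_DISCONTINUE", years_between(commenced, discontinued)
    if completed is not None:
        candidacy = years_between(commenced, completed)
        label = "WILL_COMPLETE" if candidacy <= 4 else "WILL_COMPLETE_LATE"
        return label, candidacy
    raise ValueError("record has no completion, discontinuation or enrolment state")


on_time = derive_target("COMPLETED", date(2007, 3, 1), 2007, completed=date(2010, 9, 1))
late = derive_target("COMPLETED", date(2006, 3, 1), 2006, completed=date(2011, 3, 1))
enrolled = derive_target("ENROLLED", date(2008, 3, 1), 2008)
immature = derive_target("ENROLLED", date(2011, 3, 1), 2011)
dropped = derive_target("DISCONTINUED", date(2009, 3, 1), 2009, discontinued=date(2010, 3, 1))
print(on_time[0], late[0], enrolled[0], immature[0], dropped[0])
```

Records labelled IMMATURE_VINTAGE would be dropped before model building, as the slide indicates.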

Predicting Completion Status for Higher Degree Research Candidates
Using Classification Models for Prediction...

Decision Tree Models
Because the indicators have non-linear relationships with the target and the target outcome is nominal, decision tree models were selected to predict Completion Time outcomes for currently enrolled domestic and international students. The model outputs are:
1) Target prediction (STILL_ENROLLED, WILL_COMPLETE, WILL_DISCONTINUE, WILL_COMPLETE_LATE).
2) Confidence score for each enrolled student (ranges between 0 and 1).
Rule induction models fall into four basic categories: C5.0, Chi-Square Automatic Interaction Detection (CHAID), QUEST, and Classification and Regression (C&R) Tree. The C5.0 model handles nominal or flag targets with all predictor types (nominal, continuous or flag).

Decision Tree Models: Comparison

| Model Criteria | C5.0 | CHAID | QUEST |
|---|---|---|---|
| Type of Split for Categorical Targets | Multiple | Multiple | Binary |
| Continuous Target | No | Yes | No |
| Continuous Predictors | Yes | No | Yes |
| Criteria for Predictor Selection | Information Measure | Chi-Square; F-Test for Continuous | Statistical |
| Supports Bagging/Boosting | Yes | Yes | Yes |

Predictor Importance: CHAID (F-Test Association with Target)
CHAID predicts the target with the best accuracy. Predictors shown:
- Milestones Achieved
- School Name
- Load Completed
- Field of Education
- Course Fraction Completed
- Meeting Frequency
- Basis of Admission
- Research Literature
- Funding Category Changed
- Annual Leaves Availed
CHAID performs Chi-Square tests for predictor importance and variable reduction. The test tends to give higher importance to continuous variables than to nominal or categorical ones.
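To illustrate the chi-square test that CHAID relies on, the sketch below computes the Pearson chi-square statistic for a predictor-by-target contingency table. The counts are hypothetical, and the function assumes every row and column total is non-zero. A full CHAID implementation also merges predictor categories and adjusts p-values, which is not shown here.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of predictor and target.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are predictor categories, columns are target classes.
associated = [[30, 10], [10, 30]]
unrelated = [[20, 20], [20, 20]]
print(chi_square_statistic(associated), chi_square_statistic(unrelated))
```

A larger statistic indicates a stronger association with the target, so a predictor like the first table would be preferred for splitting over one like the second.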

Predictor Importance: C5.0 (Information Value Analysis)
Predictors shown:
- Course Fraction Completed
- Literature Review Feedback
- Field of Education
- Milestone Achieved
- Mode of Attendance
- Age at Enrolment
- NESB Indicator
C5.0 uses the Information Value (IV) and Weight of Evidence (WoE) method for variable reduction. WoE analyzes the predictive power of a variable in relation to the targeted outcome, while IV assesses the overall predictive power of the variable being considered.
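A minimal sketch of how WoE and IV are commonly computed for one categorical predictor follows. It assumes the usual binary formulation, so the four-class target is taken to be collapsed into an event and a non-event (for example, completed versus not completed); the counts and category names are hypothetical, and every category is assumed to contain at least one event and one non-event.

```python
import math


def woe_and_iv(bins):
    """bins maps category -> (events, non_events); returns (WoE per category, IV)."""
    total_events = sum(events for events, _ in bins.values())
    total_non_events = sum(non_events for _, non_events in bins.values())
    woe = {}
    iv = 0.0
    for category, (events, non_events) in bins.items():
        event_share = events / total_events
        non_event_share = non_events / total_non_events
        # WoE: log ratio of the category's share of events to its share of non-events.
        woe[category] = math.log(event_share / non_event_share)
        # IV: WoE weighted by the difference in shares, summed over categories.
        iv += (event_share - non_event_share) * woe[category]
    return woe, iv


# Hypothetical (completed, not completed) counts by mode of attendance.
mode_of_attendance = {"FULL_TIME": (60, 40), "PART_TIME": (20, 80)}
woe, iv = woe_and_iv(mode_of_attendance)
print(woe, iv)
```

Variables with a higher IV would be retained for the modeling dataset, while the per-category WoE shows in which direction each category pushes the outcome.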

Completion Status Modeling: Decision Tree Models Used
Completion status estimation uses three decision tree models: C5.0, CHAID and QUEST.

C5.0 Decision Tree Model: Actual vs Predicted Candidature Status

| Status | Actual % | Actual count | Predicted % | Predicted count |
|---|---|---|---|---|
| STILL_ENROLLED | 18.0% | 101 | 33.51% | 188 |
| WILL_COMPLETE | 14.08% | 79 | 8.91% | 50 |
| WILL_COMPLETE_LATE | 14.8% | 83 | 4.46% | 25 |
| WILL_DISCONTINUE | 53.12% | 298 | 53.12% | 298 |

C5.0 Decision Tree Model: Model Evaluation & Analysis
Results for output field T_ATTEMPT_STATUS, comparing $C-T_ATTEMPT_STATUS with T_ATTEMPT_STATUS:
- Correct: 409 (72.91%)
- Wrong: 152 (27.09%)
- Total: 561
Performance Evaluation:
- STILL_ENROLLED: 1.0
- WILL_COMPLETE: 1.545
- WILL_COMPLETE_LATE: 1.636
- WILL_DISCONTINUE: 0.515
Confidence Values Report for $CC-T_ATTEMPT_STATUS:
- Range: 0.35-0.906
- Mean Correct: 0.728
- Mean Incorrect: 0.553
- Always Correct Above: 0.906 (0% of cases)
- Always Incorrect Below: 0.35 (0% of cases)
- 85.56% Accuracy Above: 0.478
- 2.0 Fold Correct Above: 0.86 (53.57% of cases)
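The "accuracy above a confidence value" lines in the report can be read as the accuracy among only those predictions whose confidence exceeds a threshold. The sketch below illustrates that reading with hypothetical (confidence, correct) pairs; it is not the reporting tool's own computation.

```python
def accuracy_above(predictions, threshold):
    """Accuracy among predictions whose confidence exceeds the threshold."""
    selected = [correct for confidence, correct in predictions if confidence > threshold]
    if not selected:
        return None  # no predictions are this confident
    return sum(selected) / len(selected)


# Hypothetical (confidence score, prediction was correct) pairs.
predictions = [
    (0.90, True), (0.80, True), (0.70, False),
    (0.60, True), (0.40, False), (0.35, False),
]
print(accuracy_above(predictions, 0.0), accuracy_above(predictions, 0.5))
```

Restricting attention to high-confidence predictions typically raises accuracy at the cost of covering fewer students, which is the trade-off an intervention programme would need to weigh.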

CHAID Decision Tree Model: Actual vs Predicted Candidature Status

| Status | Actual % | Actual count | Predicted % | Predicted count |
|---|---|---|---|---|
| STILL_ENROLLED | 18.0% | 101 | 11.51% | 65 |
| WILL_COMPLETE | 14.08% | 79 | 14.26% | 80 |
| WILL_COMPLETE_LATE | 14.8% | 83 | 13.9% | 78 |
| WILL_DISCONTINUE | 53.12% | 298 | 60.25% | 338 |

CHAID Decision Tree Model: Model Evaluation & Analysis
Results for output field T_ATTEMPT_STATUS, comparing $R-T_ATTEMPT_STATUS with T_ATTEMPT_STATUS:
- Correct: 412 (73.44%)
- Wrong: 149 (26.56%)
- Total: 561
Performance Evaluation:
- STILL_ENROLLED: 1.566
- WILL_COMPLETE: 1.242
- WILL_COMPLETE_LATE: 1.192
- WILL_DISCONTINUE: 0.441
Confidence Values Report for $RC-T_ATTEMPT_STATUS:
- Range: 0.3-0.978
- Mean Correct: 0.775
- Mean Incorrect: 0.459
- Always Correct Above: 0.978 (0% of cases)
- Always Incorrect Below: 0.3 (0% of cases)
- 76.05% Accuracy Above: 0.379
- 2.0 Fold Correct Above: 0.875 (42.4% of cases)

QUEST Decision Tree Model: Actual vs Predicted Candidature Status

| Status | Actual % | Actual count | Predicted % | Predicted count |
|---|---|---|---|---|
| STILL_ENROLLED | 18.0% | 101 | 25.18% | 141 |
| WILL_COMPLETE | 14.08% | 79 | 12.32% | 69 |
| WILL_COMPLETE_LATE | 14.8% | 83 | 2.5% | 14 |
| WILL_DISCONTINUE | 53.12% | 298 | 60.0% | 336 |

QUEST Decision Tree Model: Model Evaluation & Analysis
Results for output field T_ATTEMPT_STATUS, comparing $R-T_ATTEMPT_STATUS with T_ATTEMPT_STATUS:
- Correct: 374 (66.79%)
- Wrong: 186 (33.21%)
- Total: 560
Performance Evaluation:
- STILL_ENROLLED: 1.082
- WILL_COMPLETE: 0.902
- WILL_COMPLETE_LATE: 1.349
- WILL_DISCONTINUE: 0.404
Confidence Values Report for $RC-T_ATTEMPT_STATUS:
- Range: 0.321-0.808
- Mean Correct: 0.687
- Mean Incorrect: 0.542
- Always Correct Above: 0.808 (0% of cases)
- Always Incorrect Below: 0.321 (0% of cases)
- 77% Accuracy Above: 0.482

Predicting Completion Status for Higher Degree Research Candidates
Validating Prediction Accuracy...

Model Comparison: Confidence Level Distributions
- CHAID (Strong): predicts the target with the best accuracy.
- QUEST (Weak): predicts the target with the weakest accuracy of the three models used.
- C5.0 (Average): predicts the target with average accuracy.
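The ranking can be checked against the correct and total counts reported in the three model evaluation slides. The short sketch below recomputes overall accuracy from those counts; only the variable names are introduced here.

```python
# (correct, total) counts as reported in the model evaluation slides.
results = {"C5.0": (409, 561), "CHAID": (412, 561), "QUEST": (374, 560)}

accuracy = {model: correct / total for model, (correct, total) in results.items()}
ranking = sorted(accuracy, key=accuracy.get, reverse=True)

for model in ranking:
    print(f"{model}: {accuracy[model]:.2%}")
```

CHAID and C5.0 differ by only three correctly classified candidates out of 561, so the gap between "strong" and "average" is small in terms of overall accuracy.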

Completion Status Estimation: Prediction Accuracy by Target Values

| Field | STILL_ENROLLED* | WILL_COMPLETE* | WILL_COMPLETE_LATE* | WILL_DISCONTINUE* | Importance |
|---|---|---|---|---|---|
| $RC-T_ATTEMPT_STATUS | 0.666 | 0.472 | 0.460 | 0.822 | 1.000 (Important) |
| $CC-T_ATTEMPT_STATUS | 0.492 | 0.584 | 0.543 | 0.809 | 1.000 (Important) |
| $RC-T_ATTEMPT_STATUS | 0.521 | 0.527 | 0.524 | 0.740 | 1.000 (Important) |

Model labels on the slide, in order: QUEST, C5.0, CHAID.

Completion Status Modeling: Conclusion
- HDR DQ (data quality) standards need to be raised. The data has good predictor strength, but it should be populated consistently over the time span used for prediction.
- The model has good prediction accuracy, even though ECU's HDR cohort is very small (approximately 700 students).
- The limitation of the modeling process was that only classification models could be used because of the limited size of the cohort. Neural network and logistic regression modeling could not be applied.
- The next phase will be to design the reporting standards and intervention strategies, so that the modeling outcome can be used effectively to reduce completion time for future students.

References
1. Adam, J., and Gaither, G. H. Retention in Higher Education: A Selective Resource Guide. In G. H. Gaither (ed.), Minority Retention: What Works? New Directions for Institutional Research, no. 125. San Francisco: Jossey-Bass, 2005.
2. Pascarella, E., and Terenzini, P. How College Affects Students. San Francisco: Jossey-Bass, 2005.
3. Braxton, J. Reworking the Student Departure Puzzle. Nashville, Tenn.: Vanderbilt University Press, 2000.
4. Luan, J. Data Mining and its Applications in Higher Education. In A. M. Serban and J. Luan (eds.), Knowledge Management: Building a Competitive Advantage in Higher Education. New Directions for Institutional Research, no. 113. San Francisco: Jossey-Bass, 2002.
5. Byers Gonzalez, J., and DesJardins, S. Artificial Neural Networks: A New Approach for Predicting Application Behaviour. Research in Higher Education, 2002, 43 (2), 235-258.

Questions