What is Predictive Analytics?




What is Predictive Analytics?
Analytics is "the use of data, statistical analysis, and explanatory and predictive models to gain insights and act on complex issues" (EDUCAUSE Center for Applied Research). Predictive analytics is simply predictive modeling: creating the statistical model that best predicts the probability of an outcome. Examples include:
- Enrollment forecasting
- Classifying/scoring at-risk students
- Predicting gainful employment

Predictive analytics helps answer questions like:
- Which student variables are most useful for predicting outcomes like retention, degree completion, or gainful employment?
- What is the best combination of variables to optimize predictions in the sample?
- How useful is this combination for identifying at-risk students?

Examples of higher education institutions using predictive analytics: University of Nevada, Reno; Florida State University; Missouri State University; San Jose State University; Virginia Tech; Arizona State University; University of Texas at Austin; the California State University system; and others.

UH Manoa Example
Goal: build and implement a predictive model for estimating freshman retention outcomes.

Campus context: approximately 2,000 first-time freshmen enter Manoa every fall, with a 79% fall-to-fall retention rate, versus a 79% predicted retention rate and an 82% peer average.

Four Steps to Modeling Retention
1. Get freshman data (here, the fall 2009, 2010, and 2011 cohorts were used to build a training data set).
2. Build the model.
3. Apply the model's parameter estimates to new data (model validation, or scoring).
4. Check the actual 2012 retention outcomes to see how well the model performed.
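The four steps can be sketched end to end in code. This is a minimal illustrative sketch with a hand-rolled logistic fit on made-up data, not the actual UH Manoa model; the feature names (hs_gpa, credits) and all values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(X, y, lr=0.01, iters=5000):
    """Step 2: fit intercept + coefficients by batch gradient ascent on the
    log-likelihood (a stand-in for the SPSS/SAS/R logistic procedure)."""
    w = [0.0] * (len(X[0]) + 1)          # w[0] is the intercept
    for _ in range(iters):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(w[0] + sum(a * b for a, b in zip(w[1:], xi)))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj for wj, gj in zip(w, grad)]
    return w

def score(w, xi):
    """Step 3: apply the parameter estimates to new (holdout) students."""
    return sigmoid(w[0] + sum(a * b for a, b in zip(w[1:], xi)))

# Step 1: toy training data -- [hs_gpa, credits] per student, 1 = returned
train_X = [[3.8, 15], [3.5, 15], [3.2, 14], [2.4, 12], [2.1, 12], [2.0, 11]]
train_y = [1, 1, 1, 0, 0, 0]
w = fit_logit(train_X, train_y)

# Step 4: check predictions against actual outcomes in a holdout cohort
holdout_X = [[3.6, 15], [2.2, 12]]
holdout_y = [1, 0]
preds = [1 if score(w, xi) >= 0.5 else 0 for xi in holdout_X]
```

In practice the fit would come from a statistics package, and the holdout step corresponds to scoring the fall 2012 cohort with the parameter estimates from the 2009-2011 training set.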

Relevant Previous Research
Astin, A. W. (1993). What matters in college? Four critical years revisited. San Francisco: Jossey-Bass.
Bean, J. P. (1985). Interaction effects based on class level in an explanatory model of college student dropout syndrome. American Educational Research Journal, 22(1), 35-64.
Caison, A. L. (2006). Analysis of institutionally specific retention research: A comparison between survey and institutional database methods. Research in Higher Education, 48(4), 435-451.
Herzog, S. (2006). Estimating student retention and degree-completion time: Decision trees and neural networks vis-à-vis regression. New Directions for Institutional Research, 131, 17-33.
Olson, D. L. (white paper). Data set balancing. University of Nebraska, Department of Management. Available at: http://cbafiles.unl.edu/public/cbainternal/facstaffuploads/chinaolson.pdf
Pascarella, E., & Terenzini, P. (2005). How college affects students: Volume 2, a third decade of research. San Francisco: Jossey-Bass.
Sujitparapitaya, S. (2006). Considering student mobility in retention outcomes. New Directions for Institutional Research, 131, 35-51.
Tinto, V. (1975). Dropout from higher education: A theoretical synthesis of recent research. Review of Educational Research, 45(1), 89-125.

Methodology
Binary logistic regression: model a student's binary choice to return for a second fall semester while controlling for the student's characteristics. Types of covariates included in the model:
- Demographic
- Pre-collegiate experience
- Academic experience
- Campus experience
- Financial aid
- Achievement milestones
- Interaction variables

Examples of Student Variables Analyzed

DEMOGRAPHICS: Gender; Age; Ethnicity; Residency; Geographic Origin
PRE-COLLEGE ACADEMIC: High School GPA & Rank; SAT; AP; CLEP; Educational Goals; Transfer GPA; # Transfer Credits; Dual Enrollment
ACADEMIC EXPERIENCE: Major; Credit Load; Credits Earned; First-Term GPA; Distance Education; High-Failure-Rate Courses; Courses Taken (including Math & English)
CAMPUS EXPERIENCE: On-Campus Employment; Housing; Student Life Activities; Athletics; STAR Usage; Average Class Size; Ratio of Successful Adds to Drops
FINANCIAL NEED: Need-Based Aid; Non-Need-Based Aid; Pell Grant; Work Study; % of Aid Met
MILESTONES: Credits Earned; Credits Attempted; Credit Completion Ratio; Math/English Enrollment/Completion; Continuous Enrollment; Milestone Metrics
INTERACTIONS: Ethnicity by Geographic Origin; Employment by Housing; High School GPA by First-Term GPA; Residency by Need-Based Aid
Outcome: RETENTION / GRADUATION

Data Description
Data sources: matriculation system (Banner / Banner ODS).
Student cohorts: new full-time freshmen; fall 2009, 2010, and 2011 entrants for model development (training set, N = 2,470); fall 2012 entrants for model validation (holdout set, N = 1,912).
Data elements at the start of the first semester: student demographics (residency); academic preparation (high school GPA / test score index); financial aid profile (unmet need); credits enrolled; campus housing (y/n); first-year experience class (y/n); educational goals questionnaire (y/n).
Data elements after the start of the first semester: credits completed; end-of-semester GPA; second educational goals questionnaire.

Data Management Tasks
Exploratory data analysis:
- Variable selection (bivariate regression on the outcome variable)
- Variable coding (continuous / categorical / dummy in the logit model)
- Missing-data imputation and constant-dollar conversion (financial aid data)
- Composite variable(s) not used in today's model: Acad prep index = (HSGPA * 12.5) + (ACTM * .69) + (ACTE * .69)
- Variables excluded: age, gender, ethnicity, college remediation, ACT/SAT test date

Logistic regression model:
- Maximize model fit (-2LL test/score, pseudo R-squared, Hosmer-Lemeshow sig.)
- Create a balanced sample in the training dataset to optimize the correct classification rate (CCR) for enrollees vs. non-enrollees (i.e., model sensitivity vs. specificity): all non-enrollees plus a random sample of enrollees of approximately equal N

Scoring of relative dropout/retention risk:

    p = exp(a + b1x1 + b2x2 + b3x3 + b4x4 + ...) / (1 + exp(a + b1x1 + b2x2 + b3x3 + b4x4 + ...))

where:
- p = probability of enrollment / non-enrollment
- exp = base of natural logarithms (~2.72)
- a = constant/intercept of the equation
- b = coefficients of the predictors (parameter estimates)
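The balancing step described above (all non-enrollees plus an approximately equal-sized random sample of enrollees) can be sketched as follows; the record layout and the "retained" field are illustrative assumptions, not the actual Banner schema.

```python
import random

def balance_training_set(rows, outcome_key="retained", seed=42):
    """Keep all non-returners and a random, equal-sized sample of returners.

    On heavily skewed retention data, a logistic model tends to classify
    everyone as 'retained'; balancing trades some specificity for much
    better sensitivity to dropouts.
    """
    rng = random.Random(seed)
    non_returners = [r for r in rows if r[outcome_key] == 0]
    returners = [r for r in rows if r[outcome_key] == 1]
    sampled = rng.sample(returners, k=min(len(non_returners), len(returners)))
    balanced = non_returners + sampled
    rng.shuffle(balanced)
    return balanced

# Skewed toy cohort: ~80% retained, mirroring the ~79% fall-to-fall rate
cohort = [{"id": i, "retained": 1} for i in range(80)] + \
         [{"id": 100 + i, "retained": 0} for i in range(20)]
balanced = balance_training_set(cohort)
```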

Example: Early Warning Model (Balanced Model Parameter Estimates)

Variable            B        S.E.   Wald      df   Sig.   Exp(B)
hawaii               1.223   .106   133.474   1    .000   3.398
fye_class             .280   .116     5.793   1    .016   1.323
housing               .071   .103      .478   1    .489   1.073
ed_goals              .325   .133     5.988   1    .014   1.384
credits_attempted     .054   .026     4.322   1    .038   1.056
hs_gpa                .905   .125    52.395   1    .000   2.471
sat_m                 .002   .001     7.881   1    .005   1.002
sat_r                -.002   .001    11.714   1    .001    .998
ap                    .475   .128    13.739   1    .000   1.608
unmet_need           -.221   .119     3.425   1    .064    .802
Constant            -4.546   .632    51.784   1    .000    .011
a. Single-step variable entry. Pseudo R-squared = .17.

A More Graphical View of Parameter Estimates (Early Warning Model)
Ranked by strength, the strongest predictors of retention are HS GPA, AP credit, and Hawaii residency, followed by SAT R, SAT M, and Ed Goals, with the FYE class the weakest. These variables account for approximately 17% of the variance in a student's likelihood of returning for a third semester (pseudo R-squared = .168). Wald statistic significance is flagged at 0.05 (*), 0.01 (**), and 0.001 (***). The Wald statistic, rather than the coefficient, was used to indicate variable strength: because of the nature of logistic regression, the coefficient is not a standardized beta and is not easily interpreted as a measure of strength.

CCR: Out-of-Sample Classification Table Comparison (Training vs. Validation Dataset)

                      Training Dataset (Predicted)    Validation Dataset (Predicted)
Observed RETAINED     No     Yes    % Correct         No     Yes    % Correct
No                    849    317    72.8              266    110    70.7
Yes                   492    677    57.9              483    826    63.1
Overall Percentage                  65.4                            64.8
a. The cut value is .550; Nagelkerke R-squared = .17.

Example: End-of-Semester Updated Model (Balanced Model Parameter Estimates)

Variable            B        S.E.   Wald      df   Sig.   Exp(B)
hawaii               1.376   .119   133.951   1    .000   3.960
fye_class             .055   .128      .184   1    .668   1.056
housing               .030   .116      .067   1    .796   1.030
ed_goals2             .627   .100    38.994   1    .000   1.872
credits_attempted    -.075   .035     4.647   1    .031    .928
hs_gpa               -.069   .146      .220   1    .639    .934
sat_m                 .002   .001     7.512   1    .006   1.002
sat_r                -.003   .001    15.361   1    .000    .997
ap                    .300   .142     4.440   1    .035   1.350
unmet_need           -.186   .132     1.988   1    .159    .830
CREDITS EARNED        .121   .026    22.342   1    .000   1.128
EOS GPA               .835   .087    92.213   1    .000   2.304
Constant            -3.072   .696    19.459   1    .000    .046
a. Single-step variable entry. Pseudo R-squared = .37.
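A classification table like the ones above can be reproduced from any set of predicted probabilities and observed outcomes. Below is a sketch using the slide's .550 cut value; the ten-student holdout sample and its probabilities are made up for illustration.

```python
def classification_table(y_true, p_hat, cut=0.55):
    """Cross-tab of observed vs. predicted retention at a given cut value.

    Returns cell counts, percent correct per observed class, and the
    overall correct classification rate (CCR), mirroring an SPSS-style
    classification table.
    """
    pred = [1 if p >= cut else 0 for p in p_hat]
    cells = {(o, q): 0 for o in (0, 1) for q in (0, 1)}
    for o, q in zip(y_true, pred):
        cells[(o, q)] += 1
    pct = {}
    for o in (0, 1):
        row_total = cells[(o, 0)] + cells[(o, 1)]
        pct[o] = 100.0 * cells[(o, o)] / row_total if row_total else 0.0
    overall = 100.0 * (cells[(0, 0)] + cells[(1, 1)]) / len(y_true)
    return cells, pct, overall

# Toy holdout: 4 non-returners, 6 returners, with assumed probabilities
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
p = [0.30, 0.40, 0.60, 0.50, 0.70, 0.80, 0.52, 0.90, 0.65, 0.58]
cells, pct, overall = classification_table(y, p)
```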

Gauging the Value of End-of-Semester Data

                      Early Warning       End-of-Semester
Predictor             Wald      Sig.      Wald      Sig.
hawaii                133.474   ***       133.951   ***
fye_class               5.793   *            .184
housing                  .478                .067
ed_goals                5.988   *          38.994   ***
credits_attempted       4.322   *           4.647   *
hs_gpa                 52.395   ***          .220
sat_m                   7.881   **          7.512   **
sat_r                  11.714   **         15.361   ***
ap                     13.739   **          4.440   *
unmet_need              3.425               1.988
CREDITS EARNED                             22.342   ***
EOS GPA                                    92.213   ***
Nagelkerke R-squared      .17                 .37
CCR of at-risk           71%                 73%

Scoring Students
Scoring of relative dropout/retention risk:

    p = exp(a + b1x1 + b2x2 + b3x3 + b4x4 + ...) / (1 + exp(a + b1x1 + b2x2 + b3x3 + b4x4 + ...))

where:
- p = probability of enrollment / non-enrollment
- exp = base of natural logarithms (~2.72)
- a = constant/intercept of the equation
- b = coefficients of the predictors (parameter estimates)
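Plugging the early-warning parameter estimates from the slide into the scoring formula gives a working risk scorer. The coefficients below are the slide's B column; the example student's inputs (SAT scores, the coding of unmet_need) are hypothetical assumptions for illustration, not the actual UH Manoa coding.

```python
import math

# Early-warning parameter estimates (B column from the slide)
coef = {
    "hawaii": 1.223, "fye_class": 0.280, "housing": 0.071, "ed_goals": 0.325,
    "credits_attempted": 0.054, "hs_gpa": 0.905, "sat_m": 0.002,
    "sat_r": -0.002, "ap": 0.475, "unmet_need": -0.221,
}
intercept = -4.546

def retention_probability(student):
    """p = exp(a + sum(b_i * x_i)) / (1 + exp(a + sum(b_i * x_i)))"""
    z = intercept + sum(coef[k] * v for k, v in student.items())
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical mainland student with a 3.0 HS GPA and 12 credits; the
# SAT values and unmet_need coding here are guesses for illustration
student = {"hawaii": 0, "fye_class": 0, "housing": 0, "ed_goals": 0,
           "credits_attempted": 12, "hs_gpa": 3.0, "sat_m": 500,
           "sat_r": 500, "ap": 0, "unmet_need": 0.35}
p_return = retention_probability(student)   # low predicted return probability
```

A student with this profile scores well below a .5 return probability, in the spirit of the at-risk "John" example on the next slide.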

Example: John Is at Risk of Dropping
John:
- is from the continental U.S. (0)
- has a below-average high school GPA (3.0)
- is enrolled in 12 credits (12)
- has a low % of financial need met (.65)
- isn't working on campus (0)
- isn't enrolled in CAS 110 (0)
- didn't specify any educational goals in the survey (0)
Probability of dropping: 0.75

Sample Data for Advisors / Success Coaches

UH ID  LAST NAME  FIRST NAME  CURRENT EMAIL  CREDITS  RESIDENT  AP/CLEP  HS GPA  WORK ON CAMPUS  1st YR EXP CLASS  % FIN NEED MET  STAR LOGINS  ADVISOR PREVIOUS CONTACT
001                                          15       HI        6        3.80    Y               Y                 77%             5            Y
002                                          14       HI        0        3.33    N               Y                 63%             3            N
003                                          12       CA        6        3.00    N               N                 45%             0            N

UH ID  AGE  GENDER  ETHNICITY  COLLEGE  MAJOR  DEGREE  ED GOAL SPECIFIED  RELATIVE RISK VALUE  RISK LEVEL
001    18   F       CH         CA&H     ART    BA      Yes                14.92                LOW
002    18   F       HW         CSS      SOC    BA      Yes                36.88                MEDIUM
003    18   M       UNDEC      UNDEC    UNDEC  UNDEC   No                 89.18                HIGH

Classification Accuracy: Unbalanced vs. Balanced Model
Classification Table Comparison (Validation Datasets)

                      Unbalanced Dataset (Predicted)  Balanced Dataset (Predicted)
Observed RETAINED     No     Yes    % Correct         No     Yes    % Correct
No                    47     329    12.5              266    110    70.7
Yes                   42     1267   96.8              483    826    63.1
Overall Percentage                  78.0                            64.8
a. The cut value is .550.

Model Degeneracy
A degenerate predictive model classifies all samples into the dominant category. The more skewed the data set, the higher the overall classification rate, but the model doesn't help: it performs poorly at classifying dropped students, which are the very students we want to identify.
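The degeneracy effect is easy to demonstrate numerically. A toy cohort with roughly the holdout's retention rate shows how a model that predicts "retained" for everyone scores well overall while identifying no dropouts:

```python
# Toy cohort: 78% retained, similar to the 2012 holdout cohort
y_true = [1] * 78 + [0] * 22

# Degenerate, skew-driven "model": everyone is predicted to return
always_retained = [1] * len(y_true)

# Overall CCR looks respectable...
overall_ccr = sum(p == t for p, t in zip(always_retained, y_true)) / len(y_true)

# ...but sensitivity to the students we actually want to find is zero
dropouts = [(p, t) for p, t in zip(always_retained, y_true) if t == 0]
dropout_ccr = sum(p == t for p, t in dropouts) / len(dropouts)
```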

Model Performance: Distribution of Estimates by Decile Group, Fall 2012 Freshmen (Early Warning Model)
Average retention rates by risk decile (groups 10 through 1, ~170 students per group): 52%, 60%, 73%, 70%, 74%, 85%, 84%, 91%, 92%, 95%. Average retention climbs from 52% in the highest-risk group to 95% in the lowest-risk group.

Model Performance: Distribution of Estimates by Decile Group, Fall 2012 Freshmen (End-of-Semester Model)
Average retention rates by risk decile (groups 10 through 1, ~170 students per group): 25%, 63%, 70%, 74%, 82%, 90%, 91%, 94%, 95%, 96%.
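A decile chart like the ones above can be computed by ranking students on their predicted probabilities and averaging actual outcomes within each group. Here is a sketch on made-up data (five groups, so the ten-student sample divides evenly):

```python
def group_retention(p_return, retained, n_groups=5):
    """Rank students by predicted return probability, split into equal-sized
    groups, and report each group's actual retention rate, highest risk
    first. Illustrative sketch; the real chart used 10 groups of ~170.
    """
    ranked = sorted(zip(p_return, retained))        # lowest predicted p first
    size = len(ranked) // n_groups
    return [
        sum(r for _, r in ranked[g * size:(g + 1) * size]) / size
        for g in range(n_groups)
    ]

# Made-up cohort of 10 students: predicted probabilities and actual outcomes
p = [0.20, 0.30, 0.40, 0.45, 0.55, 0.60, 0.70, 0.80, 0.85, 0.90]
r = [0,    0,    1,    0,    1,    1,    1,    1,    1,    1]
rates = group_retention(p, r)   # retention climbs as predicted risk falls
```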

Classification Accuracy: Early Warning vs. End-of-Semester Model
Classification Table Comparison (Validation Datasets)

                      Balanced Early Warning (Predicted)  Balanced End-of-Semester (Predicted)
Observed RETAINED     No     Yes    % Correct             No     Yes    % Correct
No                    266    110    70.7                  272    102    72.7
Yes                   483    826    63.1                  348    961    73.4
Overall Percentage                  64.8                                73.3
a. The cut value is .550.

Model Performance: Distribution of Estimates by Decile Group, Fall 2012 Freshmen (Early Warning vs. End-of-Semester)
Average retention rates by risk decile (groups 10 through 1, ~170 students per group):
Early Warning Model: 52%, 60%, 73%, 70%, 74%, 85%, 84%, 91%, 92%, 95%
End-of-Semester Model: 25%, 63%, 70%, 74%, 82%, 90%, 91%, 94%, 95%, 96%

Impact on Campus
422 of 1,912 freshmen from the 2012 cohort dropped out in the first year. Retaining 40 of those students would have improved Mānoa's freshman retention rate from 78% to 80%, with additional tuition and fee revenue of $450,224 (for 28 Hawaii-resident and 12 WUE students; excludes out-of-state).

Takeaways on Implementation
- Early-alert data are key; identify results that are actionable.
- Support student advising and first-year experience staff; involve colleges and departments.
- Use prediction data as one component of a student dropout risk assessment plan.
- Increase awareness of retention and graduation rates through campaigns and by showing the impact on the bottom line.

Challenges to Implementation
- Culture change: wariness about misuse of data
- Questions about the data used in the model to generate risk scores
- Students' rights to access their risk scores
- More accountability
- Faculty buy-in

Summary
Predicting students at risk:
- Keep the prediction model parsimonious.
- Keep prediction data for student advising intuitive and simple (actionable).
- Triangulate prediction data with multiple sources of information.
- Use prediction data as one component of student dropout risk assessment.
- Follow best practices in predictive analytics and keep abreast of changes in analytical and data-reporting tools.

Using prediction data:
- Embrace the use of available data.
- Ensure users conceptually understand what's behind the data.
- Use the data as a complementary piece of information when advising students.
- Timing can be critical, both for student intervention and for maximizing advising resources.
- Stay abreast of new research on predictive analytics, e.g., Analytics in Higher Education by J. Bichsel, EDUCAUSE, 2012.

Mahalo!
John Stanley, Institutional Analyst, University of Hawaii at Manoa
Questions: jstanley@hawaii.edu
Link to this presentation: http://manoa.hawaii.edu/ovcaa/mir/pdf/datasummit2013.pdf