College Tuition: Data mining and analysis




CS105 College Tuition: Data mining and analysis By Jeanette Chu & Khiem Tran 4/28/2010

Introduction

College tuition rises steadily every year. According to the college pricing trends report released by the College Board, the average increase in tuition and fees at four-year colleges is around 6 percent. Roughly 15 percent of students who attend four-year colleges have experienced a tuition increase of 15 percent or more. How do colleges determine their financial worth? We hypothesized that several factors contribute to this large figure, which numerous students and parents cover with loans that they pledge 20 years of their lives to paying off. The main contributors include the following attributes:

Attribute | Reasoning
Size of the Student Body | The students supply the revenue that keeps the college running
Size of Faculty and Staff | The college needs to pay the salaries; a larger budget is needed for a larger staff
Location of the school | The college needs to account for cost of maintenance

Dataset Description

The entire table was compiled using data from the U.S. News and World Report College Rankings of National Colleges and the complete University Guide found towards the back of the survey. The University Guide included the setting, room and board expenses, and financial aid packages, all of which we believed would affect the tuition costs.

Attribute | Description | Hypothesis
Rank | Based on the survey's overall college scoring | Higher-ranked schools may charge more for the prestige and reputation
Average Freshman Retention | % of the freshmen that continue at that college | A higher retention rate would mean more students to pay the tuition. This might lower the tuition per student.
Percent of Class Sizes under 20 | % of classes with fewer than 20 students | Fewer students in the classroom might mean higher tuition, since smaller classes require more teachers per student
Percent of Class Sizes over 50 | % of classes with more than 50 students | More students may decrease tuition costs since there are more students paying
Percent of Full-time Faculty | % of faculty that work full time | Tuition may be higher if this percentage is higher, so that the college can pay salaries
Freshmen in top 10% of HS | % of freshmen that graduated in the top 10% of their high schools | Recruiting more of the smarter students may lead to an increase in tuition for an overall perception of a better school
College Acceptance Rate | % of applicants who are accepted into the college | A higher acceptance rate may lead to an increase in tuition (supply increases, demand increases)
Alumni Contribution Rate | % of alumni contributing to the college | A greater contribution from alumni may lead to a decrease in tuition
Room & Board | Room and board expenses at the college | A greater cost of room and board may lead to an increase in tuition (cost of living)?
Percent with Financial Need | % of students with determined financial assistance | If more students are determined to need financial assistance, tuition may increase
Average Aid Package | Average financial aid package awarded to students | A larger aid package may lead to higher tuition costs, raking in more revenue from students
Setting | Location of the college: rural, suburban, urban (text) | The cost of living varies; the cost of land, building, and maintenance may be higher in urban areas compared to rural areas
Size of Full-time Students | Size of the full-time student population | More full-time students may lead to a decrease in tuition since more revenue is coming in from current students

Data preparation

Formatting the data

We briefly swept through the data after eliminating the attributes we chose not to use, and corrected minor issues such as apostrophes and spelling mistakes. We formatted the data to exclude symbols such as percent signs and dollar signs, and made sure to eliminate unique identifiers, in this case the individual college names. We chose to run numeric estimation, which requires numeric inputs to deliver numeric outputs. Therefore, to use Setting, we had to encode it with numbers: 3 = urban, 2 = suburban, 1 = rural.
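The formatting steps described above can be sketched in Python roughly as follows. This is only an illustration: the helper names (clean_number, encode_setting) and the sample values are hypothetical, and converting percentages to fractions is an assumption based on how the calculator later multiplies rates by 100 for display.

```python
# Hypothetical helpers for the formatting step; not part of the original project code.

# Numeric coding for Setting, as given in the report.
SETTING_CODES = {"rural": 1, "suburban": 2, "urban": 3}


def clean_number(raw):
    """Strip dollar signs, commas, and percent signs, then parse as a float.

    Assumption: percentages are stored as fractions (e.g. "45%" -> 0.45).
    """
    text = raw.strip()
    is_percent = text.endswith("%")
    text = text.replace("$", "").replace(",", "").replace("%", "")
    value = float(text)
    return value / 100.0 if is_percent else value


def encode_setting(raw):
    """Map the text setting to its numeric code (case-insensitive)."""
    return SETTING_CODES[raw.strip().lower()]


# Illustrative values only, not taken from the dataset.
print(clean_number("$38,000"))
print(clean_number("45%"))
print(encode_setting("Urban"))
```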

Before running the tests, we randomized our main data (133 instances) and split it into 1/3 test data and 2/3 training data. We eliminated the following attributes: Overall Score, Peer Score, Predicted Retention Rate, Actual Retention Rate. We felt that these four did not have a direct relationship with the attribute Tuition. We also wanted to use attributes that were more accessible and readily available to students and users.

Creating the Database

createcollegedatadb.py: This program uses SQL within Python to create the college data database, which contains a single table with the attributes. It also parses the comma-separated values (CSV) file and inserts the values into the correct columns. This is the database that our tuition calculator pulls data from, so it only contains attributes relevant to the main model that we chose. (See Appendix A for full code.)

1. Connected to the database (essentially creating the database file)
2. Created a handler to execute queries
   a. Executed the following query to create the table:
      CREATE TABLE collegedata (
          CollegeName text,
          Rank int,
          Classes20 numeric,
          AlumniRate numeric,
          Tuition numeric,
          Board numeric,
          AvgAid numeric,
          SizeFT numeric
      )
3. Connected to the CSV file containing the data
4. Read in each row of the data, splitting the string by the commas
   a. Executed the following query in Python to insert the values:
      for record in readcsv:
          record = record.strip().split(',')
          sql = "INSERT INTO collegedata VALUES (?,?,?,?,?,?,?,?)"
          parameters = (record[0], record[1], record[2], record[3], record[4], record[5], record[6], record[7])
          cursor.execute(sql, parameters)

tuitioncalculator.py: This program projects tuition costs for any college, given the data inputs used in the model. The calculator is based on the Linear Regression model. (See Appendix B for full code.)

1. Connects to the database
2. Prompts user to enter a college name
   a. If the college appears in the table, the tuition is automatically calculated using the model
   b. If the college does not, the user is prompted to enter the attributes, and this is added to the table
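The randomized split into 1/3 test data and 2/3 training data mentioned at the start of this section can be sketched as below. The report does not say how the split was actually performed, so the function name, the fixed seed, and the rounding rule are all assumptions made for illustration.

```python
import random


def split_train_test(rows, test_fraction=1 / 3, seed=0):
    """Shuffle the rows and return (training rows, test rows).

    Hypothetical helper: the seed and rounding are illustrative choices,
    not the authors' procedure.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 133 instances: row indices instead of real records.
train, test = split_train_test(range(133))
print(len(train), len(test))
```

With 133 instances and this rounding rule, 44 rows go to the test set and 89 to the training set.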

Data Analysis

Our goal in this project is to understand the weights and relationships of the variables used by U.S. News in determining the ranking of the Top National Universities, and their effect on tuition. Using data from the U.S. News college ranking, we ran five different numeric estimation models in Weka and derived the equations and regression tree that estimate tuition based on the given attributes. We used ten-fold cross-validation on all data runs. The following table gives the results of the five models in Weka:

Approach | Test Data Correlation Coefficient | Training Data Correlation Coefficient
Linear Regression | 0.8140 | 0.8789
M5P | 0.8783 | 0.8799
LeastMedSq | 0.7723 | 0.8725
DecisionTable | 0.7115 | 0.6566
SMOreg | 0.8556 | 0.8878

The following models were used for analysis (descriptions from the Weka documentation):

Linear Regression Model: Class for using linear regression for prediction. Uses the Akaike criterion for model selection, and is able to deal with weighted instances. This model gives good correlation and yields a simple regression equation that is much more useful than the output of the other models. We selected this approach for further analysis.

M5 Pruned Model Tree (M5P): Splits the parameter space into areas (subspaces) and builds a linear regression model in each of them. This model follows a decision tree approach, but uses linear regression at the leaves. In this case, the tree breaks down into many levels of nodes. Even though it yields good correlation, this model isn't as practical as the Linear Regression.

Least Median Squared (LeastMedSq): Implements a least median squared linear regression, utilizing the existing Weka LinearRegression class to form predictions. The algorithm is based on robust regression and outlier detection.

DecisionTable: Class for building and using a simple decision table majority classifier. The result came up with 10 rules for the data. There is a loss of accuracy on the training data. We believe this model is too simple for the data and its many attributes.

SMOreg: Sequential minimal optimization algorithm for training a support vector regression using polynomial or RBF kernels. This implementation globally replaces all missing values and transforms nominal attributes into binary ones. This model has the best correlation on the training data. However, the algorithm normalizes all attributes.

Results
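As a reference for reading the correlation coefficients reported above, the sketch below computes a correlation between actual and predicted values. It assumes the reported figure is the Pearson correlation between actual and predicted tuition; the function name and the sample numbers are hypothetical and are not taken from the project's Weka output.

```python
import math


def correlation(actual, predicted):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Illustrative numbers only: actual tuition vs. a model's predictions.
actual = [20000.0, 25000.0, 31000.0, 38000.0]
predicted = [21500.0, 24000.0, 33000.0, 36500.0]
print(correlation(actual, predicted))
```

A value near 1 means the predictions rise and fall with the actual tuition; it does not by itself say how large the prediction errors are in dollars.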

We selected the Linear Regression Model as our main model for its correlation, simplicity, and practicality. The following result and analysis show the relationships between the attributes and their effect on Tuition. The regression equation gives us several insights about the data.

Tuition = -89.1861 * Rank + -8758.1325 * %ClassesUnder20 + -16707.8015 * AlumniGivingRate + 1.0977 * BoardCost + 0.3953 * AvgFinAid + -0.2794 * SizeFullTimeStudents + 27317.312

Term | Effect on Tuition | Analysis
-89.1861 * Rank | Negative | Better rankings (lower number) make tuition more expensive.
-8758.1325 * %ClassesUnder20 | Negative | Schools with a larger share of small classes tend to be cheaper.
-16707.8015 * AlumniGivingRate | Negative | Schools where a larger share of alumni donate tend to be cheaper.
1.0977 * BoardCost | Positive | Higher boarding cost means higher tuition cost as well.
0.3953 * AvgFinAid | Positive | A higher financial aid package means tuition will be higher as well.
-0.2794 * SizeFullTimeStudents | Negative | More students decrease tuition.
+ 27317.312 | (constant) |

From this table, we can see the weights of these 6 attributes in determining tuition. They give insights about colleges that one might not think about. Better-ranked schools have higher tuition, but they also give higher financial aid. The linear regression reveals relationships that are not obvious from the raw data.
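To make the equation concrete, here is a small worked example that evaluates it term by term. The coefficients are copied from the equation above; the dictionary keys, the helper names, and the example school's values are hypothetical and chosen only for illustration. Rates are assumed to be fractions (0.5 means 50%).

```python
# Coefficients copied from the regression equation in the report.
COEFFICIENTS = {
    "rank": -89.1861,
    "classes_under_20": -8758.1325,
    "alumni_rate": -16707.8015,
    "board": 1.0977,
    "avg_aid": 0.3953,
    "size_full_time": -0.2794,
}
INTERCEPT = 27317.312


def term_contributions(school):
    """Dollar contribution of each attribute to the predicted tuition."""
    return {name: COEFFICIENTS[name] * school[name] for name in COEFFICIENTS}


def predict_tuition(school):
    """Intercept plus the sum of all term contributions."""
    return INTERCEPT + sum(term_contributions(school).values())


# Hypothetical school, not a row from the dataset.
example_school = {
    "rank": 10,
    "classes_under_20": 0.5,
    "alumni_rate": 0.2,
    "board": 10000,
    "avg_aid": 20000,
    "size_full_time": 5000,
}

for name, dollars in term_contributions(example_school).items():
    print("%-18s %12.2f" % (name, dollars))
print("predicted tuition: %.2f" % predict_tuition(example_school))
```

Printing the contributions separately shows which attributes push the estimate up (board cost, average aid) and which pull it down (rank number, small-class share, alumni rate, size).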

Graphic 1: Tuition (color) vs. Rank (size) Treemap: Better ranking, higher tuition for colleges
http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/rank-vs-tuition-treemap

The smaller boxes indicate better rankings. The darker the color, the more expensive the school. In the Urban segment, the cluster of better-ranked colleges is noticeably more expensive. When a college improves its ranking, it has more leverage in the market to increase its tuition. When its tuition increases, the college can increase its financial aid as well, and both increases make the school look better and more prestigious.

Graphic 2: Tuition vs. Rank on the Size of the school
http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/tuition-vs-rank-on-size-of-school

This is a different representation of the treemap above. In addition, the larger the dot, the larger the school. From this depiction, there seems to be no large school with tuition above $30,000. This is probably due to the differences between public and private schools in terms of ranking. State schools tend to be cheaper and larger, and they dominate the $20,000-$30,000 range.

Graphic 3: Average Financial Aid (color) vs. Acceptance Rate (size)
http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/average-financial-aid-vs-acceptanc

The larger boxes indicate a higher acceptance rate. The darker boxes indicate better financial assistance. The graphic makes it apparent that there is a correlation between the two attributes: more selective schools give out better financial aid packages. In turn, more selective schools are likely to be better ranked, with higher tuition. The regression model also tells us that higher tuition goes with higher financial aid. The regression model and the graphics above tie together the relationships among the attributes and how a college sets its tuition.

Conclusion

From our analysis, we learn that better-ranked and more selective schools are not simply more expensive. Many attributes factor into the cost of tuition. Our analysis also shows the natural relationship of costs in attending college:

Tuition Cost + Board = Total Cost
Total Cost - Financial Aid = Net Cost

Here, we can see that schools with a high Total Cost compensate by offering better financial assistance. We would advise high school seniors not to be afraid to apply to better-ranked schools that look expensive and prestigious, since they may turn out to have a lower Net Cost.
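The two cost relationships can be illustrated with a short calculation. The two schools and all dollar amounts below are hypothetical, chosen only to show how a school with a higher sticker price can end up with a lower Net Cost.

```python
def net_cost(tuition, board, aid):
    """Total Cost = tuition + board; Net Cost = Total Cost - financial aid."""
    total_cost = tuition + board
    return total_cost - aid


# Hypothetical figures, not taken from the dataset.
expensive_school = net_cost(tuition=38000, board=11000, aid=30000)
cheaper_school = net_cost(tuition=22000, board=9000, aid=8000)

print("higher-priced school net cost:", expensive_school)
print("lower-priced school net cost:", cheaper_school)
```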

We also conclude that colleges operate on economies of scale. Schools with greater economies of scale can reduce their average cost of operations and allocate that surplus to financial aid for students.

Extensions

To increase the accuracy of our model, we acknowledge that we could have further categorized our data into private and public 4-year colleges, and private and public 2-year colleges. Public colleges tend to charge less for overall tuition, and this factor may change our variables.

Possible extensions to the Python modules:

1. We could make the calculator more customizable by calculating the tuition based on other criteria
   a. E.g., the user wants the predicted tuition for colleges with a specific class size
   b. The calculator could then return all colleges and predicted tuition costs that fall within the specified class size
2. We could turn this into a web page calculator
   a. Have Python output part of the text as HTML
   b. Use forms, either a drop-down to select a specific college or fields to input search criteria
   c. Select the necessary attribute information based on the user input
   d. Code in the model to calculate the tuition cost
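The first extension, returning predicted tuition for all colleges that meet a class-size criterion, might look something like the sketch below. It reuses the collegedata table layout and the regression coefficients from the report, but the function names, the threshold parameter, and the two example rows are hypothetical, and an in-memory database stands in for collegedata.db.

```python
import sqlite3


def predict_tuition(rank, classes20, alumnirate, board, avgaid, sizeft):
    """Regression equation from the report."""
    return (-89.1861 * rank - 8758.1325 * classes20 - 16707.8015 * alumnirate
            + 1.0977 * board + 0.3953 * avgaid - 0.2794 * sizeft + 27317.312)


def colleges_by_class_size(conn, min_classes20):
    """Return (college name, predicted tuition) for every college whose share
    of classes under 20 students is at least min_classes20 (a fraction)."""
    sql = """SELECT CollegeName, Rank, Classes20, AlumniRate, Board, AvgAid, SizeFT
             FROM collegedata WHERE Classes20 >= ? ORDER BY CollegeName"""
    return [(name, predict_tuition(*attrs))
            for name, *attrs in conn.execute(sql, (min_classes20,))]


# In-memory stand-in for collegedata.db with two made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE collegedata (
    CollegeName text, Rank int, Classes20 numeric, AlumniRate numeric,
    Tuition numeric, Board numeric, AvgAid numeric, SizeFT numeric)""")
conn.executemany("INSERT INTO collegedata VALUES (?,?,?,?,?,?,?,?)", [
    ("Example College A", 10, 0.70, 0.30, 38000, 11000, 30000, 5000),
    ("Example College B", 80, 0.35, 0.10, 22000, 9000, 8000, 25000),
])

matches = colleges_by_class_size(conn, 0.5)
for name, tuition in matches:
    print("%s: $%.2f" % (name, tuition))
```

Filtering in SQL and applying the model in Python keeps the equation in one place, so the same function could later back a web form.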

Appendices

Appendix A: createcollegedatadb.py

# CS105 project: Predicting College Tuition
# by: Jeanette Chu & Khiem Tran
# name: createcollegedatadb.py
# description: create the college data database and insert all data into the table
# note: written for Python 3

# import necessary modules
import sqlite3 as db

# create/connect to the database
filename = "collegedata.db"
conn = db.connect(filename)
cursor = conn.cursor()

# create the table (does nothing if it already exists)
cursor.execute("""
    CREATE TABLE IF NOT EXISTS collegedata (
        CollegeName text,
        Rank int,
        Classes20 numeric,
        AlumniRate numeric,
        Tuition numeric,
        Board numeric,
        AvgAid numeric,
        SizeFT numeric
    );
""")
conn.commit()

# open the CSV file and insert one table row per record
sql = """INSERT INTO collegedata VALUES (?,?,?,?,?,?,?,?)"""
with open('collegetuitiondataset-forpythoncalculator.csv', 'r') as readcsv:
    for record in readcsv:
        # strip the trailing newline, then split the line on commas
        record = record.strip().split(",")
        parameters = (record[0], record[1], record[2], record[3],
                      record[4], record[5], record[6], record[7])
        cursor.execute(sql, parameters)

# commit changes and close the cursor and connection
conn.commit()
cursor.close()
conn.close()

Appendix B: tuitioncalculator.py

# CS105 project: Predicting College Tuition
# by: Jeanette Chu & Khiem Tran
# name: tuitioncalculator.py
# description: calculates the tuition based on selected model
# note: written for Python 3

# import necessary modules
import sqlite3 as db

###############################################################################
# to format dollars, e.g. 36190.8 -> "$36,190.80"
def dollarformat(total):
    return "${:,.2f}".format(float(total))

###############################################################################
# to format whole numbers with thousands separators, e.g. 12345 -> "12,345"
def numformat(number):
    return "{:,}".format(int(number))

###############################################################################
# to format a fraction as a percent, e.g. 0.45 -> "45.00%"
def percentformat(decimal):
    return "%.2f%%" % (decimal * 100)

###############################################################################
def main():
    # connect to the database and create handler
    filename = "collegedata.db"
    conn = db.connect(filename)
    cursor = conn.cursor()

    # prompt for college name
    collegename = input("enter name of college: ")

    # look the college up in the table
    sql = "SELECT * FROM collegedata WHERE CollegeName=?"
    cursor.execute(sql, [collegename])
    count = 0
    for row in cursor:
        rank = row[1]
        classes20 = row[2]
        alumnirate = row[3]
        board = row[5]
        avgaid = row[6]
        sizeft = row[7]
        count = count + 1

    if count == 0:
        # college not found: prompt user for college info (rates as decimals)
        rank = int(input("enter the school rank: "))
        classes20 = float(input("enter % of class sizes under 20 as a decimal (0 if N/A): "))
        alumnirate = float(input("enter alumni contribution rate as a decimal: "))
        board = float(input("enter room & board expenses: "))
        avgaid = float(input("enter average financial aid package: "))
        sizeft = int(input("enter size of full-time students: "))

    # apply model equation (for both stored and newly entered colleges)
    calctuition = (-89.1861 * rank + -8758.1325 * classes20
                   + -16707.8015 * alumnirate + 1.0977 * board
                   + 0.3953 * avgaid + -0.2794 * float(sizeft) + 27317.312)

    if count == 0:
        # update the table to include the entered data
        sql = "INSERT INTO collegedata VALUES (?,?,?,?,?,?,?,?)"
        parameters = (collegename, rank, classes20, alumnirate,
                      calctuition, board, avgaid, sizeft)
        cursor.execute(sql, parameters)
        # commit changes
        conn.commit()
    cursor.close()
    conn.close()

    # output results to user
    print("The predicted cost for %s is: %s\n" % (collegename, dollarformat(calctuition)))
    print("*" * 50)

    # more results: hit enter
    print("To see the factors affecting the cost hit enter:")
    input(">")
    print("""
%s
-------------------------------------
School Rank: %d
%% of Classes < 20 Students: %s
Alumni Contribution Rate: %s
Room & Board Expenses: %s
Average Financial Aid Package: %s
Size of Full-time Students: %s
""" % (collegename, rank, percentformat(classes20), percentformat(alumnirate),
       dollarformat(board), dollarformat(avgaid), numformat(sizeft)))

if __name__ == "__main__":
    main()