College Tuition: Data mining and analysis

CS105
By Jeanette Chu & Khiem Tran
4/28/2010

Introduction

College tuition rises steadily every year. According to the college pricing trends report released by the College Board, the average annual increase in tuition and fees at four-year colleges is around 6 percent. Roughly 15 percent of students attending four-year colleges have experienced an increase of 15 percent or more in tuition. How do colleges determine what to charge? We hypothesized that several factors contribute to this large figure, which is often covered by loans that numerous students and parents spend 20 years paying off. The main contributors include the following attributes:

Attribute: Size of the Student Body
    Reasoning: The students supply the revenue that keeps the college running.
Attribute: Size of Faculty and Staff
    Reasoning: The college needs to pay salaries; a larger staff requires a larger budget.
Attribute: Location of the school
    Reasoning: The college needs to account for the cost of maintenance.

Dataset Description

The entire table was compiled using data from the U.S. News and World Report College Rankings of National Colleges and the complete University Guide found toward the back of the survey. The university guide included the setting, room and board expenses, and financial aid packages, all of which we believed would affect tuition costs.

Attribute: Rank
    Description: Based on the survey's overall college scoring
    Hypothesis: Higher-ranked schools may charge more for their prestige and reputation.
Attribute: Average Freshman Retention
    Description: % of the freshmen that continue at that college
    Hypothesis: A higher retention rate would mean more students to pay the tuition. This might lower the tuition per student.
Attribute: Percent of Class Sizes under 20
    Description: % of classes with fewer than 20 students
    Hypothesis: Fewer students in the classroom might mean higher tuition, since this requires more teachers per student.
Attribute: Percent of Class Sizes over 50
    Description: % of classes with more than 50 students
    Hypothesis: More students per class may decrease tuition costs, since there are more students paying.
Attribute: Percent of Full-time Faculty
    Description: % of faculty that work full time
    Hypothesis: Tuition may be higher if this percentage is higher, so that the college can pay salaries.

Attribute: Freshmen in top 10% of HS
    Description: % of freshmen that graduated in the top 10% of their high schools
    Hypothesis: Recruiting more of the strongest students may lead to an increase in tuition, since the school is perceived as better overall.
Attribute: College Acceptance Rate
    Description: % of applicants who are accepted into the college
    Hypothesis: A higher acceptance rate may lead to an increase in tuition (supply increases, demand increases).
Attribute: Alumni Contribution Rate
    Description: % of alumni contributing to the college
    Hypothesis: A greater contribution from alumni may lead to a decrease in tuition.
Attribute: Room & Board
    Description: Room and board expenses at the college
    Hypothesis: A greater cost of room and board may lead to an increase in tuition (cost of living?).
Attribute: Percent with Financial Need
    Description: % of students with determined financial assistance
    Hypothesis: If more students are determined to need financial assistance, tuition may increase.
Attribute: Average Aid Package
    Description: Average financial aid package awarded to students
    Hypothesis: A larger aid package may lead to higher tuition costs, with the college raking in more revenue from students.
Attribute: Setting
    Type: Text
    Description: Location of the college: rural, suburban, urban
    Hypothesis: The cost of living varies; the cost of land, buildings, and maintenance may be higher in urban areas than in rural areas.
Attribute: Size of Full-time Students
    Description: Size of the full-time student population
    Hypothesis: More full-time students may lead to a decrease in tuition, since more revenue is coming in from current students.

Data preparation

Formatting the data

After eliminating attributes, we briefly swept through the data and corrected minor issues such as apostrophes and spelling mistakes. We formatted the data to exclude symbols such as percent signs and dollar signs, and made sure to eliminate unique identifiers, in this case the individual college names. We chose to run numeric estimation, which requires numeric inputs to deliver numeric outputs. To use Setting, we therefore had to encode it numerically: 3 = urban, 2 = suburban, 1 = rural.
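The formatting steps above can be sketched in Python roughly as follows. This is only an illustration, not the procedure the authors actually used: the helper names, the sample row, and the assumption that dollar amounts may contain thousands separators are all hypothetical. Only the 3/2/1 coding for Setting comes from the text, and whether percentages are stored as whole numbers or as fractions is left open.

```python
# Numeric codes for Setting, as stated in the text (3 = urban, 2 = suburban, 1 = rural).
SETTING_CODES = {"rural": 1, "suburban": 2, "urban": 3}


def clean_number(raw):
    """Strip dollar signs, percent signs, commas, and whitespace, then parse as a float."""
    cleaned = raw.strip().replace("$", "").replace("%", "").replace(",", "")
    return float(cleaned)


def encode_setting(raw):
    """Map a Setting label to its numeric code, ignoring case and surrounding spaces."""
    return SETTING_CODES[raw.strip().lower()]


def prepare_row(row):
    """Return a numeric-only copy of a row, dropping the unique identifier."""
    prepared = {}
    for key, value in row.items():
        if key == "CollegeName":
            continue  # unique identifiers are removed before modelling
        if key == "Setting":
            prepared[key] = encode_setting(value)
        else:
            prepared[key] = clean_number(value)
    return prepared


# Hypothetical raw row, invented for illustration only.
example_row = {
    "CollegeName": "Example University",
    "Setting": "Urban",
    "Board": "$10,500",
    "Classes20": "62%",
}
prepared = prepare_row(example_row)
print(prepared)
```

Keeping the Setting codes in one dictionary makes the ordinal assumption explicit: the model will treat urban as "three times" rural, which is a modelling choice rather than a property of the data.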

Before running the tests, we randomized our main data (133 instances) and split it into one third for testing and two thirds for training. We eliminated the following attributes: Overall Score, Peer Score, Predicted Retention Rate, and Actual Retention Rate. We felt that these four did not have a direct relationship with the attribute Tuition. We also wanted to use attributes that were accessible and readily available to students and users.

Creating the Database

createcollegedatadb.py: This program uses SQL within Python to create the college data database, which contains one table holding the attributes. It also parses the comma-separated values (CSV) file and inserts the values into the correct columns. This is the database that our tuition calculator pulls data from, so it contains only the attributes relevant to the main model that we chose. (See Appendix A for full code.)

1. Connected to the database (essentially creating the database file)
2. Created a handler to execute queries
   a. Executed the following query to create the table:
      CREATE TABLE collegedata (
          CollegeName text,
          Rank int,
          Classes20 numeric,
          AlumniRate numeric,
          Tuition numeric,
          Board numeric,
          AvgAid numeric,
          SizeFT numeric
      )
3. Connected to the CSV file containing the data
4. Read in each row of the data, splitting the string at the commas
   a. Executed the following query in Python to insert the values:
      for record in readcsv:
          record = record.strip().split(',')
          sql = """INSERT INTO collegedata VALUES (?,?,?,?,?,?,?,?)"""
          parameters = (record[0], record[1], record[2], record[3],
                        record[4], record[5], record[6], record[7])
          cursor.execute(sql, parameters)

tuitioncalculator.py: This program projects tuition costs for any college, given the data inputs used in the model. The calculator is based on the Linear Regression model. (See Appendix B for full code.)

1. Connects to the database
2. Prompts the user to enter a college name
   a. If the college appears in the table, the tuition is calculated automatically using the model
   b. If the college does not appear, the user is prompted to enter the attributes, and the new record is added to the table
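The randomized split of the 133 instances described earlier in this section could be reproduced along these lines. This is a minimal sketch, assuming a simple shuffle followed by a cut; the function name, the fixed seed, and the stand-in records are hypothetical, and the report does not say exactly how its split was performed.

```python
import random


def split_train_test(records, test_fraction=1.0 / 3.0, seed=0):
    """Shuffle a copy of the records and cut it into (training, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 133 instances; integers are used here instead of real rows.
records = list(range(133))
train, test = split_train_test(records)
print(len(train), len(test))
```

With 133 instances, one third rounds to 44 test rows, which leaves 89 rows for training.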

Data Analysis

The goal of this project is to understand the weights and relationships of the variables used by U.S. News in ranking the Top National Universities, and their effect on tuition. Using data from the U.S. News college rankings, we ran five numeric models and derived the equations and regression tree that estimate tuition from the given attributes. We used ten-fold cross-validation on all data runs. The following table gives the results of the models in Weka:

Approach             Test Data Correlation Coefficient    Training Data Correlation Coefficient
Linear Regression
M5P
LeastMedSqd
DecisionTable
SMOreg

The following models were used for analysis (descriptions from the Weka documentation):

Linear Regression Model: Class for using linear regression for prediction. Uses the Akaike criterion for model selection, and is able to deal with weighted instances. This model gives us strong correlation and yields a simple regression equation that is much more useful. We selected this approach for further analysis.

M5 Pruned Model Tree: Splits the parameter space into areas (subspaces) and builds a linear regression model in each of them. This model follows a decision tree approach, but uses linear regression at the leaves. In this case, the tree breaks down into many levels of nodes. Even though it yields good correlation, this model isn't practical in comparison to Linear Regression.

Least Median Squared: Implements a least median squared linear regression, using the existing Weka LinearRegression class to form predictions. The algorithm is based on Rousseeuw and Leroy, Robust Regression and Outlier Detection.

DecisionTable: Class for building and using a simple decision table majority classifier. The result came up with 10 rules for the data. There is a loss of accuracy on the training data. We believe this model is too simple for the data and its many attributes.

SMOreg: Sequential minimal optimization algorithm for training a support vector regression using polynomial or RBF kernels. This implementation globally replaces all missing values and transforms nominal attributes into binary ones. This model has the best correlation. However, the algorithm normalizes all attributes, so its weights do not apply directly to the raw attribute values.

Results

We selected the Linear Regression model as our main model because of its correlation, simplicity, and practicality. The following results and analysis show the relationships between the attributes and their effect on Tuition. The regression equation gives us many insights into the data.

Tuition =

[coefficient] * Rank +
    Effect on Tuition: Negative
    Analysis: Better rankings (lower numbers) make tuition more expensive.
[coefficient] * %ClassesUnder20 +
    Effect on Tuition: Negative
    Analysis: Schools with more small classes tend to be cheaper.
[coefficient] * AlumniGivingRate +
    Effect on Tuition: Negative
    Analysis: Schools where alumni donate a lot of money tend to be cheaper.
[coefficient] * BoardCost +
    Effect on Tuition: Positive
    Analysis: A higher boarding cost means a higher tuition cost as well.
[coefficient] * AvgFinAid +
    Effect on Tuition: Positive
    Analysis: A higher financial aid package means tuition will be higher as well.
[coefficient] * SizeFullTimeStudents
    Effect on Tuition: Negative
    Analysis: More students decrease tuition.

From this table, we can see the weights of these six attributes in determining tuition. They give insights about colleges that one might not otherwise think about. Better-ranked schools have higher tuition, but they also give higher financial aid. The linear regression reveals relationships that are hidden in the raw data.
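To make the reading of the signs concrete, the sketch below evaluates a linear model of the same shape. It is an illustration only: the intercept and every weight are invented placeholders (only the signs follow the effects listed above), the presence of an intercept is an assumption, and the function name and sample school are hypothetical. None of these numbers are the fitted values from the project.

```python
def predict_tuition(intercept, weights, features):
    """Evaluate intercept + sum(weight * value) over the named attributes."""
    missing = sorted(set(weights) - set(features))
    if missing:
        raise KeyError("missing attributes: %s" % ", ".join(missing))
    return intercept + sum(weights[name] * features[name] for name in weights)


# Placeholder model: magnitudes are invented, signs mirror the effects in the text.
PLACEHOLDER_INTERCEPT = 100.0
PLACEHOLDER_WEIGHTS = {
    "Rank": -1.0,                  # negative: a better (lower) rank raises tuition
    "ClassesUnder20": -1.0,        # negative
    "AlumniGivingRate": -1.0,      # negative
    "BoardCost": 1.0,              # positive
    "AvgFinAid": 1.0,              # positive
    "SizeFullTimeStudents": -1.0,  # negative
}

# Hypothetical school in arbitrary units.
school = {
    "Rank": 10.0,
    "ClassesUnder20": 5.0,
    "AlumniGivingRate": 5.0,
    "BoardCost": 30.0,
    "AvgFinAid": 20.0,
    "SizeFullTimeStudents": 10.0,
}
baseline = predict_tuition(PLACEHOLDER_INTERCEPT, PLACEHOLDER_WEIGHTS, school)

# The same school after moving up five places in the ranking (10 -> 5).
better_ranked = dict(school, Rank=5.0)
improved = predict_tuition(PLACEHOLDER_INTERCEPT, PLACEHOLDER_WEIGHTS, better_ranked)

print(baseline, improved)
```

Because the weight on Rank is negative, lowering the rank number increases the prediction, which is how a negative coefficient corresponds to "better rankings make tuition more expensive".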

Graphic 1: Tuition (color) vs. Rank (size)
Treemap: Better ranking, higher tuition for colleges

Smaller boxes indicate better rankings. The darker the color, the more expensive the school. In the Urban segment, the concentration of better-ranked colleges is noticeably more expensive. When a college improves its ranking, it has more leverage in the market to increase its tuition. When its tuition increases, the college can increase its financial aid as well, and both increases make the school look better and more prestigious.

Graphic 2: Tuition vs. Rank on the Size of the school

This is a different representation of the treemap above. In addition, a larger dot indicates a larger school. From this depiction, there seems to be no large school with tuition above $30,000. This is probably due to the differences between public and private schools in terms of ranking. State schools tend to be cheaper and larger, and they dominate the $20,000 to $30,000 range.

Graphic 3: Average Financial Aid (color) vs. Acceptance Rate (size)

Larger boxes indicate a higher acceptance rate. Darker boxes indicate better financial assistance. In this case, the graphic makes it apparent that there is a correlation between the two attributes: more selective schools give out better financial aid packages. In turn, more selective schools are likely to be better ranked, with higher tuition. The regression model also tells us that higher tuition goes with higher financial aid. Together, the regression model and the graphics above tie together the relationships among the attributes and how a college sets its tuition.

Conclusion

From our analysis, we learned that a better-ranked, more selective school is not simply a more expensive one. Many attributes factor into the cost of tuition. Our analysis also shows the natural relationship among the costs of attending college:

Tuition Cost + Board = Total Cost
Total Cost - Financial Aid = Net Cost

Here, we can see that schools with a high Total Cost compensate by offering better financial assistance. We would advise high school seniors not to be afraid to apply to better-ranked schools that look expensive and prestigious, since these may turn out to have a lower Net Cost.
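The two cost relationships above can be worked through with a small example. The figures below are hypothetical and chosen only to show how a school with the higher sticker price can end up with the lower net cost; the function names are likewise illustrative.

```python
def total_cost(tuition, board):
    """Total Cost = Tuition Cost + Board."""
    return tuition + board


def net_cost(tuition, board, financial_aid):
    """Net Cost = Total Cost - Financial Aid."""
    return total_cost(tuition, board) - financial_aid


# Two hypothetical schools (all amounts in dollars, invented for illustration).
school_a = {"tuition": 38000, "board": 11000, "financial_aid": 30000}
school_b = {"tuition": 24000, "board": 9000, "financial_aid": 10000}

total_a = total_cost(school_a["tuition"], school_a["board"])
total_b = total_cost(school_b["tuition"], school_b["board"])
net_a = net_cost(**school_a)
net_b = net_cost(**school_b)

print("School A: total", total_a, "net", net_a)
print("School B: total", total_b, "net", net_b)
```

In this made-up case School A costs $16,000 more in total but $4,000 less after aid, which is the situation the advice to high school seniors has in mind.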

We also conclude that colleges operate on economies of scale. Schools with greater economies of scale can reduce their average cost of operations and allocate the surplus to financial aid for students.

Extensions

To increase the accuracy of our model, we acknowledge that we could have further categorized our data into private and public four-year colleges and private and public two-year colleges. Public colleges tend to charge less in overall tuition, and this factor may change our variables.

Possible extensions to the Python modules:

1. We could make the calculator more customizable by calculating the tuition based on other criteria
   a. E.g., the user wants the predicted tuition for colleges with a specific class size
   b. The calculator could then return all colleges, with their predicted tuition costs, that fall within the specified class size
2. We could turn this into a web page calculator
   a. Have Python emit part of the text as HTML
   b. Use forms, either a drop-down to select a specific college or fields to input search criteria
   c. Select the necessary attribute information based on the user input
   d. Code in the model to calculate the tuition cost
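The first extension could start from a query like the one sketched below. It is a rough illustration under several assumptions: it reuses the collegedata schema from Appendix A, runs against an in-memory database filled with invented rows, stores the share of small classes as a fraction, and returns the stored Tuition column rather than a model prediction. The function name is hypothetical.

```python
import sqlite3


def colleges_by_small_class_share(conn, low, high):
    """Return (CollegeName, Tuition) pairs whose Classes20 value lies in [low, high]."""
    sql = ("SELECT CollegeName, Tuition FROM collegedata "
           "WHERE Classes20 BETWEEN ? AND ? ORDER BY Tuition")
    return conn.execute(sql, (low, high)).fetchall()


# In-memory database with the same columns as the table in Appendix A.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE collegedata (
        CollegeName text, Rank int, Classes20 numeric, AlumniRate numeric,
        Tuition numeric, Board numeric, AvgAid numeric, SizeFT numeric
    )
""")

# Invented rows for illustration; these are not values from the dataset.
rows = [
    ("Example College A", 1, 0.75, 0.40, 38000, 11000, 30000, 5000),
    ("Example College B", 2, 0.55, 0.20, 24000, 9000, 10000, 20000),
    ("Example College C", 3, 0.30, 0.10, 20000, 8000, 8000, 30000),
]
conn.executemany("INSERT INTO collegedata VALUES (?,?,?,?,?,?,?,?)", rows)
conn.commit()

matches = colleges_by_small_class_share(conn, 0.5, 0.8)
for name, tuition in matches:
    print(name, tuition)
```

Passing the bounds as query parameters, as the existing modules already do for inserts, keeps user input out of the SQL string itself.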

Appendices

Appendix A: createcollegedatadb.py

# CS105 project: Predicting College Tuition
# by: Jeanette Chu & Khiem Tran
# name: createcollegedatadb.py
# description: create the college data database and insert all data into the table
#

# import necessary modules
import sqlite3 as db

# create/connect to the database
filename = "collegedata.db"
conn = db.connect(filename)
cursor = conn.cursor()

# create the table
try:
    cursor.execute("""
        CREATE TABLE collegedata (
            CollegeName text,
            Rank int,
            Classes20 numeric,
            AlumniRate numeric,
            Tuition numeric,
            Board numeric,
            AvgAid numeric,
            SizeFT numeric
        );
    """)
except db.OperationalError:
    # the table already exists from an earlier run
    pass

# commit changes and close cursor
conn.commit()
cursor.close()

# open file to read in
readcsv = open('collegetuitiondataset-forpythoncalculator.csv', 'r')
for record in readcsv:
    # drop the trailing newline, then split the row at the commas
    record = record.strip().split(",")
    # insert the values
    sql = """INSERT INTO collegedata VALUES (?,?,?,?,?,?,?,?)"""
    parameters = (record[0], record[1], record[2], record[3],
                  record[4], record[5], record[6], record[7])
    cursor = conn.cursor()
    cursor.execute(sql, parameters)
    conn.commit()
    cursor.close()
readcsv.close()

Appendix B: tuitioncalculator.py

# CS105 project: Predicting College Tuition
# by: Jeanette Chu & Khiem Tran
# name: tuitioncalculator.py
# description: calculates the tuition based on selected model
#

# import necessary modules
import sqlite3 as db
import math

###############################################################################
# to format dollars
def dollarformat(total):
    # Converting number to string, cut out cents to work with dollars only
    total = float(total)
    total = "%.2f" % total
    cents = total[-3:]
    total = total[0:-3]
    # Format with dollar sign and commas
    if len(total) <= 3:
        numform = "$" + total + cents
    else:
        n = len(total)
        numform = ""
        while n > 3:
            last = total[-3:]
            numform = "," + last + numform
            n -= 3
            total = total[0:-3]
        numform = "$" + total + numform + cents
    return numform

###############################################################################
# to format numbers
def numformat(number):
    # Converting number to string
    number = str(number)
    # Format with commas; short numbers are returned unchanged
    numform = number
    if len(number) > 3:
        n = len(number)
        numform = ""
        while n > 3:
            last = number[-3:]
            numform = "," + last + numform
            n -= 3
            number = number[0:-3]
        numform = number + numform
    return numform

###############################################################################
# to format percent
def percentformat(decimal):
    decimal = decimal * 100
    decimal = "%.2f" % decimal
    decimal = str(decimal) + "%"
    return decimal

###############################################################################
def main():
    # connect to the database and create handler
    filename = "collegedata.db"
    conn = db.connect(filename)
    cursor = conn.cursor()

    # prompt for college name
    collegename = raw_input("enter name of college: ")

    # check for results from the table
    sql = ("SELECT * FROM collegedata WHERE CollegeName=?")
    cursor.execute(sql, [collegename])
    count = 0

    # for row in results:
    for row in cursor:
        rank = row[1]
        classes20 = row[2]
        alumnirate = row[3]
        board = row[5]
        avgaid = row[6]
        sizeft = row[7]
        count = count + 1

    if count == 0:
        # prompt user for college info
        rank = input("enter the school rank: ")
        classes20 = input("enter % of class sizes under 20 (0 if N/A): ")
        alumnirate = input("enter alumni contribution rate: ")
        board = input("enter room & board expenses: ")
        avgaid = input("enter average financial aid package: ")
        sizeft = input("enter size of full-time students: ")

    # apply model equation
    # (the numeric coefficients are missing from this listing)
    calctuition = [coefficient] * rank + [coefficient] * classes20 + [coefficient] * alumnirate + [coefficient] * board + [coefficient] * avgaid + [coefficient] * float(sizeft)

    if count == 0:
        # update the table to include entered data
        # (only for colleges that were not already in the table)
        sql = ("INSERT INTO CollegeData VALUES (?,?,?,?,?,?,?,?)")
        parameters = (collegename, rank, classes20, alumnirate,
                      calctuition, board, avgaid, sizeft)
        cursor.execute(sql, parameters)

    # commit changes
    conn.commit()
    cursor.close()

    # output results to user
    print "The predicted cost for %s is: %s\n" % (collegename, dollarformat(calctuition))
    print "*" * 50

    # more results hit enter
    print "To see the factors affecting the cost hit enter:"
    enter = raw_input(">")
    print """
    %s
    School Rank: %d
    %% of Classes < 20 Students: %s
    Alumni Contribution Rate: %s
    Room & Board Expenses: %s
    Average Financial Aid Package: %s
    Size of Full-time Students: %s
    """ % (collegename, rank, percentformat(classes20),
           percentformat(alumnirate), dollarformat(board),
           dollarformat(avgaid), numformat(sizeft))

if __name__ == "__main__":
    main()

Data Mining III: Numeric Estimation

Data Mining III: Numeric Estimation Data Mining III: Numeric Estimation Computer Science 105 Boston University David G. Sullivan, Ph.D. Review: Numeric Estimation Numeric estimation is like classification learning. it involves learning a

More information

Polynomial Neural Network Discovery Client User Guide

Polynomial Neural Network Discovery Client User Guide Polynomial Neural Network Discovery Client User Guide Version 1.3 Table of contents Table of contents...2 1. Introduction...3 1.1 Overview...3 1.2 PNN algorithm principles...3 1.3 Additional criteria...3

More information

Easily Identify Your Best Customers

Easily Identify Your Best Customers IBM SPSS Statistics Easily Identify Your Best Customers Use IBM SPSS predictive analytics software to gain insight from your customer database Contents: 1 Introduction 2 Exploring customer data Where do

More information

Car Insurance. Prvák, Tomi, Havri

Car Insurance. Prvák, Tomi, Havri Car Insurance Prvák, Tomi, Havri Sumo report - expectations Sumo report - reality Bc. Jan Tomášek Deeper look into data set Column approach Reminder What the hell is this competition about??? Attributes

More information

Studying Auto Insurance Data

Studying Auto Insurance Data Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.

More information

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

More information

Module 5: Statistical Analysis

Module 5: Statistical Analysis Module 5: Statistical Analysis To answer more complex questions using your data, or in statistical terms, to test your hypothesis, you need to use more advanced statistical tests. This module reviews the

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

1. How can I see the representation allowance in IT for position classes 46-54?

1. How can I see the representation allowance in IT for position classes 46-54? PayMonitor Secrets Revealed 1. How can I see the representation allowance in IT for position classes 46-54? - Customise - Report templates - Standard detail report modify - Select fields: Select only annual

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

Applying Customer Attitudinal Segmentation to Improve Marketing Campaigns Wenhong Wang, Deluxe Corporation Mark Antiel, Deluxe Corporation

Applying Customer Attitudinal Segmentation to Improve Marketing Campaigns Wenhong Wang, Deluxe Corporation Mark Antiel, Deluxe Corporation Applying Customer Attitudinal Segmentation to Improve Marketing Campaigns Wenhong Wang, Deluxe Corporation Mark Antiel, Deluxe Corporation ABSTRACT Customer segmentation is fundamental for successful marketing

More information

LAGUARDIA COMMUNITY COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF MATHEMATICS, ENGINEERING, AND COMPUTER SCIENCE

LAGUARDIA COMMUNITY COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF MATHEMATICS, ENGINEERING, AND COMPUTER SCIENCE LAGUARDIA COMMUNITY COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF MATHEMATICS, ENGINEERING, AND COMPUTER SCIENCE MAT 119 STATISTICS AND ELEMENTARY ALGEBRA 5 Lecture Hours, 2 Lab Hours, 3 Credits Pre-

More information

Time Clock Import Setup & Use

Time Clock Import Setup & Use Time Clock Import Setup & Use Document # Product Module Category CenterPoint Payroll Processes (How To) This document outlines how to setup and use of the Time Clock Import within CenterPoint Payroll.

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Data Mining Applications in Higher Education

Data Mining Applications in Higher Education Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2

More information

TI-Inspire manual 1. Instructions. Ti-Inspire for statistics. General Introduction

TI-Inspire manual 1. Instructions. Ti-Inspire for statistics. General Introduction TI-Inspire manual 1 General Introduction Instructions Ti-Inspire for statistics TI-Inspire manual 2 TI-Inspire manual 3 Press the On, Off button to go to Home page TI-Inspire manual 4 Use the to navigate

More information

Data Mining Techniques Chapter 6: Decision Trees

Data Mining Techniques Chapter 6: Decision Trees Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................

More information

Pearson's Correlation Tests

Pearson's Correlation Tests Chapter 800 Pearson's Correlation Tests Introduction The correlation coefficient, ρ (rho), is a popular statistic for describing the strength of the relationship between two variables. The correlation

More information

Classification of Titanic Passenger Data and Chances of Surviving the Disaster Data Mining with Weka and Kaggle Competition Data

Classification of Titanic Passenger Data and Chances of Surviving the Disaster Data Mining with Weka and Kaggle Competition Data Proceedings of Student-Faculty Research Day, CSIS, Pace University, May 2 nd, 2014 Classification of Titanic Passenger Data and Chances of Surviving the Disaster Data Mining with Weka and Kaggle Competition

More information

Data Mining Lab 5: Introduction to Neural Networks

Data Mining Lab 5: Introduction to Neural Networks Data Mining Lab 5: Introduction to Neural Networks 1 Introduction In this lab we are going to have a look at some very basic neural networks on a new data set which relates various covariates about cheese

More information

Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller

Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller Agenda Introduktion till Prediktiva modeller Beslutsträd Beslutsträd och andra prediktiva modeller Mathias Lanner Sas Institute Pruning Regressioner Neurala Nätverk Utvärdering av modeller 2 Predictive

More information

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery Index Contents Page No. 1. Introduction 1 1.1 Related Research 2 1.2 Objective of Research Work 3 1.3 Why Data Mining is Important 3 1.4 Research Methodology 4 1.5 Research Hypothesis 4 1.6 Scope 5 2.

More information

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

More information

NATIONAL CENTER FOR EDUCATION STATISTICS. Integrated Postsecondary Education Data System (IPEDS) IPEDS Data Center User Manual

NATIONAL CENTER FOR EDUCATION STATISTICS. Integrated Postsecondary Education Data System (IPEDS) IPEDS Data Center User Manual ASDFASFDAS NATIONAL CENTER FOR EDUCATION STATISTICS Integrated Postsecondary Education Data System (IPEDS) IPEDS Data Center User Manual ASDFASFDAS INTEGRATED POSTSECONDARY EDUCATION DATA SYSTEM (IPEDS)

More information

Microsoft Azure Machine learning Algorithms

Microsoft Azure Machine learning Algorithms Microsoft Azure Machine learning Algorithms Tomaž KAŠTRUN @tomaz_tsql Tomaz.kastrun@gmail.com http://tomaztsql.wordpress.com Our Sponsors Speaker info https://tomaztsql.wordpress.com Agenda Focus on explanation

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

More information

MS Access: Advanced Tables and Queries. Lesson Notes Author: Pamela Schmidt

MS Access: Advanced Tables and Queries. Lesson Notes Author: Pamela Schmidt Lesson Notes Author: Pamela Schmidt Tables Text Fields (Default) Text or combinations of text and numbers, as well as numbers that don't require calculations, such as phone numbers. or the length set by

More information

CALCULATIONS & STATISTICS

CALCULATIONS & STATISTICS CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents

More information

Beating the NCAA Football Point Spread

Beating the NCAA Football Point Spread Beating the NCAA Football Point Spread Brian Liu Mathematical & Computational Sciences Stanford University Patrick Lai Computer Science Department Stanford University December 10, 2010 1 Introduction Over

More information

A Property & Casualty Insurance Predictive Modeling Process in SAS

A Property & Casualty Insurance Predictive Modeling Process in SAS Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing

More information

Prediction of Stock Performance Using Analytical Techniques

Prediction of Stock Performance Using Analytical Techniques 136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

More information

Simple Predictive Analytics Curtis Seare

Simple Predictive Analytics Curtis Seare Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

More information

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.

More information

Beating the MLB Moneyline

Beating the MLB Moneyline Beating the MLB Moneyline Leland Chen llxchen@stanford.edu Andrew He andu@stanford.edu 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series

More information

Advanced analytics at your hands
