College Tuition: Data mining and analysis
CS105 College Tuition: Data mining and analysis. By Jeanette Chu & Khiem Tran, 4/28/2010
Introduction

College tuition increases steadily every year. According to the college pricing trends report released by the College Board, the average increase in tuition and fees at four-year colleges is around 6 percent. Roughly 15 percent of students who attend four-year colleges have experienced an increase of 15 percent or more in tuition pricing. How do colleges determine their financial worth? We hypothesized that several factors contribute to this large figure, which is paid with loans that numerous students and parents pledge 20 years of their lives to paying off. The main contributors include the following attributes:

Attribute: Reasoning
Size of the Student Body: The students supply the revenue that keeps the college running.
Size of Faculty and Staff: The college needs to pay salaries; a larger staff needs a larger budget.
Location of the school: The college needs to account for the cost of maintenance.

Dataset Description

The entire table was compiled using data from the U.S. News and World Report College Rankings of National Colleges and the complete University Guide found towards the back of the survey. The university guide included the setting, room and board expenses, and financial aid packages, all of which we believed would affect tuition costs.

Attribute, with its Type (where recorded), Description, and Hypothesis:

Rank
  Description: Based on the survey's overall college scoring.
  Hypothesis: Higher-ranked schools may charge more for the prestige and reputation.
Average Freshman Retention
  Description: % of the freshmen that continue at that college.
  Hypothesis: A higher retention rate would mean more students to pay the tuition. This might lower the tuition per student.
Percent of Class Sizes under 20
  Description: % of classes with fewer than 20 students.
  Hypothesis: Fewer students in the classroom might mean higher tuition, since smaller classes may require more teachers per student.
Percent of Class Sizes over 50
  Description: % of classes with more than 50 students.
  Hypothesis: More students may decrease tuition costs since there are more students paying.
Percent of Full-time Faculty
  Description: % of faculty that work full time.
  Hypothesis: Tuition may be higher if the percentage is higher, so that the college can pay salaries.
Freshmen in top 10% of HS
  Description: % of freshmen that graduated in the top 10% of their high schools.
  Hypothesis: Recruiting more of the smarter students may lead to an increase in tuition, for an overall perception of a better school.
College Acceptance Rate
  Description: % of applicants who are accepted into the college.
  Hypothesis: A higher acceptance rate may lead to an increase in tuition (supply increases, demand increases).
Alumni Contribution Rate
  Description: % of alumni contributing to the college.
  Hypothesis: A greater contribution from alumni may lead to a decrease in tuition.
Room & Board
  Description: Room and board expenses at the college.
  Hypothesis: A greater cost of room and board may lead to an increase in tuition (cost of living)?
Percent with Financial Need
  Description: % of students with determined financial assistance.
  Hypothesis: If more students are determined to need financial assistance, tuition may increase.
Average Aid Package
  Description: Average financial aid package awarded to students.
  Hypothesis: A larger aid package may lead to higher tuition costs, raking in more revenue from students.
Setting
  Type: Text
  Description: Location of the college: rural, suburban, urban.
  Hypothesis: The cost of living varies; the cost of land, building, and maintenance may be higher in urban areas than in rural areas.
Size of Full-time Students
  Description: Size of the full-time student population.
  Hypothesis: More full-time students may lead to a decrease in tuition, since more revenue is coming in from current students.

Data preparation

Formatting the data

After eliminating attributes, we briefly swept through the data and corrected minor issues such as apostrophes and spelling mistakes. We formatted the data to exclude symbols such as percent and dollar signs, and made sure to eliminate unique identifiers, in this case the individual college names. We chose to run numeric estimation, which requires numeric inputs to deliver numeric outputs. Therefore, to use setting, we had to code it with numbers: 3 = urban, 2 = suburban, 1 = rural.
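The formatting step described above can be sketched in a few lines of Python. This is only an illustration, not the authors' actual cleaning procedure: the function names are hypothetical, and stripping thousands separators (commas) is an added assumption; only the removal of percent and dollar signs and the 1/2/3 setting codes come from the text.

```python
# Hypothetical helpers for illustration; names and sample values are assumptions.
SETTING_CODES = {"rural": 1, "suburban": 2, "urban": 3}  # coding given in the text


def clean_number(raw):
    """Strip '$', '%' and ',' (the comma is an assumption), then convert to float."""
    cleaned = raw.strip()
    for symbol in ("$", "%", ","):
        cleaned = cleaned.replace(symbol, "")
    return float(cleaned)


def encode_setting(setting):
    """Map a setting label to its numeric code, ignoring case and padding."""
    return SETTING_CODES[setting.strip().lower()]


print(clean_number("$12,500"))  # 12500.0
print(clean_number("45%"))      # 45.0
print(encode_setting("Urban"))  # 3
```

Note that coding the setting as 1, 2, 3 imposes an order (rural < suburban < urban) on the models, which is a modeling choice rather than a neutral relabeling.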
Before running the tests, we randomized our main data (133 instances) and split it into 1/3 test data and 2/3 training data.

We eliminated the following attributes: Overall Score, Peer Score, Predicted Retention Rate, Actual Retention Rate. We felt that these four did not have a direct relationship with the attribute Tuition. We also wanted to use attributes that were more accessible and readily available to students and users.

Creating the Database

createcollegedatadb.py: This program uses SQL within Python to create the college data database, which contains a table with the attributes. It also parses the comma-separated values (CSV) file to insert the values into the correct columns. This is the database that our tuition calculator pulls data from, so it only contains attributes relevant to the main model that we chose. (See Appendix A for full code.)

1. Connected to the database (essentially creating the database file)
2. Created a handler to execute queries
   a. Executed the following query to create the table:

      CREATE TABLE collegedata (
          CollegeName text, Rank int, Classes20 numeric, AlumniRate numeric,
          Tuition numeric, Board numeric, AvgAid numeric, SizeFT numeric
      )

3. Connected to the CSV file containing the data
4. Read in each row of the data, splitting the string by the commas
   a. Executed the following query in Python to insert the values:

      for record in readcsv:
          record = record.strip().split(",")
          sql = "INSERT INTO collegedata VALUES (?,?,?,?,?,?,?,?)"
          parameters = (record[0], record[1], record[2], record[3],
                        record[4], record[5], record[6], record[7])
          cursor.execute(sql, parameters)

tuitioncalculator.py: This program projects tuition costs for any college, given the data inputs used in the model. The calculator is based on the Linear Regression model. (See Appendix B for full code.)

1. Connects to the database
2. Prompts the user to enter a college name
   a. If the college appears in the table, the tuition is automatically calculated using the model
   b. If the college does not, the user is prompted to enter the attributes, and these are added to the table
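The randomized split described under Data preparation (133 instances, one third for testing and two thirds for training) could be sketched as follows. This is a minimal illustration rather than the procedure the authors ran: the function name and the fixed seed are assumptions, and integers stand in for the real instances.

```python
import random


def split_train_test(instances, test_fraction=1 / 3, seed=0):
    """Shuffle a copy of the instances and split off a test portion.

    The seed is an assumption, used only to make the shuffle repeatable.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 133 stand-in instances, as in the report's dataset size
train, test = split_train_test(range(133))
print(len(train), len(test))  # 89 44
```

Because 133 is not divisible by three, the split is only approximately 2/3 and 1/3; rounding gives 89 training and 44 test instances here.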
Data Analysis

The goal of this project is to understand the weights and relationships of the variables used by U.S. News in determining the ranking of the Top National Universities, and their effect on tuition. Using data from the U.S. News college ranking, we ran five different numeric models and derived the equations and regression tree that estimate tuition based on the given attributes. We used ten-fold cross-validation on all data runs. The following table gives the results of the models in Weka:

Approaches | Test Data Correlation Coefficient | Training Data Correlation Coefficient
Linear Regression
M5P
LeastMedSqd
DecisionTable
SMOreg

The following models were used for analysis (descriptions from the Weka documentation):

Linear Regression Model: Class for using linear regression for prediction. Uses the Akaike criterion for model selection, and is able to deal with weighted instances. This model gives us better correlation and yields a simple regression equation that is much more useful. We selected this approach for further analysis.

M5 Pruned Model Tree: Splits the parameter space into areas (subspaces) and builds a linear regression model in each of them. This model follows a decision tree approach, but uses linear regression. In this case, the tree breaks down into nodes at many levels. Even though it yields good correlation, this model isn't practical in comparison to the Linear Regression.

Least Median Squared: Implements a least median squared linear regression, using the existing Weka LinearRegression class to form predictions. The basis of the algorithm is robust regression and outlier detection.

DecisionTable: Class for building and using a simple decision table majority classifier. The result came up with 10 rules for the data. There is a loss of accuracy on the training data. We believe this model is too simple for the data and its many attributes.

SMOreg: Sequential minimal optimization algorithm for training a support vector regression using polynomial or RBF kernels. This implementation globally replaces all missing values and transforms nominal attributes into binary ones. This model has the best correlation. However, the algorithm normalizes all attributes.

Results
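As background for the correlation coefficients used to compare the models, here is a short sketch of how such a figure can be computed from actual and predicted tuition values. It assumes the statistic is the standard Pearson correlation; the function name and the sample numbers are illustrative and are not taken from the project's data.

```python
import math


def correlation(actual, predicted):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    # Covariance and variances, each left unnormalized (the 1/n factors cancel)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Made-up tuition figures, for illustration only
actual = [20000, 25000, 31000, 38000]
predicted = [21500, 24000, 30000, 39500]
print(round(correlation(actual, predicted), 3))
```

A value near 1 means the predictions rise and fall with the actual tuition; it does not by itself say how large the prediction errors are in dollars.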
We selected the Linear Regression model as our main model because of its correlation, simplicity, and practicality. The following results and analysis show the relationships between the attributes and their effect on Tuition. The regression equation gives us a lot of insight into the data.

Tuition = sum of the following terms

Term | Effect on Tuition | Analysis
... * Rank | Negative | Better rankings (lower number) make tuition more expensive
... * %ClassesUnder20 | Negative | Schools with smaller classrooms tend to be cheaper
... * AlumniGivingRate | Negative | Schools where alumni donate a lot of money tend to be cheaper
... * BoardCost | Positive | Higher boarding cost means higher tuition cost as well
... * AvgFinAid | Positive | A higher financial aid package means tuition will be higher as well
... * SizeFullTimeStudents | Negative | More students decrease tuition

From this table, we can see the weights of these 6 attributes in determining tuition. They give insights about colleges that one might not otherwise think about. Better-ranked schools have higher tuition, but they also give higher financial aid. The linear regression reveals relationships that are not obvious from the raw data.
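To make the mechanics of such a linear model concrete, the sketch below evaluates a weighted sum over the six attributes. The weights and intercept are placeholders invented for illustration, chosen only so that their signs match the effects reported above; they are not the coefficients fitted in Weka, and the attribute names are hypothetical.

```python
# PLACEHOLDER weights: invented for illustration, matching only the reported
# signs. They are NOT the coefficients of the fitted regression model.
PLACEHOLDER_WEIGHTS = {
    "rank": -50.0,              # negative: a better (lower) rank raises tuition
    "classes_under_20": -10.0,  # negative
    "alumni_rate": -20.0,       # negative
    "board": 1.0,               # positive
    "avg_aid": 0.5,             # positive
    "size_full_time": -0.1,     # negative
}
PLACEHOLDER_INTERCEPT = 10000.0  # also invented


def predict_tuition(attributes, weights=PLACEHOLDER_WEIGHTS,
                    intercept=PLACEHOLDER_INTERCEPT):
    """Return intercept + sum(weight * attribute value) over all attributes."""
    return intercept + sum(weights[name] * attributes[name] for name in weights)


school = {"rank": 10, "classes_under_20": 50, "alumni_rate": 20,
          "board": 10000, "avg_aid": 20000, "size_full_time": 5000}
worse_ranked = dict(school, rank=50)

# With a negative weight on rank, the better-ranked school is predicted to cost more
print(predict_tuition(school) > predict_tuition(worse_ranked))  # True
```

The comparison at the end shows how to read a sign: holding everything else fixed, moving from rank 50 to rank 10 increases the predicted tuition because the rank weight is negative.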
Graphic 1: Tuition (color) vs. Rank (size)

Treemap: Better ranking, higher tuition for colleges. Smaller boxes indicate better rankings, and darker colors indicate more expensive schools. In the Urban segment, the concentration of better-ranked colleges is noticeably more expensive. So when a college improves its ranking, it has more leverage in the market to increase its tuition. When its tuition increases, the college can increase its financial aid as well, and both increases make the school look better and more prestigious.

Graphic 2: Tuition vs. Rank on the Size of the school

This is a different representation of the treemap above. In addition, a larger dot indicates a larger school. From this depiction, there seems to be no large school with tuition above $30,000. This is probably due to the differences between public and private schools in terms of ranking. State schools tend to be cheaper and larger, and they dominate the $20-30 thousand range.

Graphic 3: Average Financial Aid (color) vs. Acceptance Rate (size)

Larger boxes indicate a higher acceptance rate. Darker boxes indicate better financial assistance. In this case, the graphic makes it apparent that there is a correlation between the two attributes: more selective schools give out better financial aid packages. In turn, more selective schools are likely to be better ranked, with higher tuition. The regression model also tells us that higher tuition goes with higher financial aid. The regression model and the graphics above together show the relationships among the attributes and how a college sets its tuition.

Conclusion

From our analysis, we learned that better-ranked and more selective schools are not simply more expensive. Many attributes factor into the cost of tuition. Our analysis also shows the natural relationship of costs in attending college:

Tuition Cost + Board = Total Cost
Total Cost - Financial Aid = Net Cost

Here, we can see that schools with a high Total Cost compensate by offering better financial assistance. We would advise high school seniors not to be afraid to apply to better-ranked schools that look expensive and prestigious, since they may turn out to have a lower Net Cost.
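The two cost relationships in the conclusion can be written as a small function. The figures below are made up purely to illustrate the point about Net Cost and do not describe any real school; the function name is hypothetical.

```python
def net_cost(tuition, board, financial_aid):
    """Net Cost = (Tuition Cost + Board) - Financial Aid."""
    total_cost = tuition + board
    return total_cost - financial_aid


# Made-up figures: an expensive school with generous aid vs. a cheaper one
print(net_cost(tuition=40000, board=12000, financial_aid=30000))  # 22000
print(net_cost(tuition=25000, board=10000, financial_aid=8000))   # 27000
```

In this invented example the school with the higher sticker price ends up with the lower Net Cost, which is the situation the advice to high school seniors has in mind.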
We also conclude that colleges operate on economies of scale. Schools with greater economies of scale can reduce their average cost of operations and allocate that surplus to financial aid for students.

Extensions

To increase the accuracy of our model, we acknowledge that we could have further categorized our data into private and public 4-year colleges, and private and public 2-year colleges. Public colleges tend to charge less for overall tuition, and this factor may change our variables.

Possible extensions to the Python modules:

1. We could make the calculator more customizable by calculating the tuition based on other criteria
   a. E.g., the user wants the predicted tuition for colleges with a specific class size
   b. The calculator could then return all colleges, with predicted tuition costs, that fall within the specified class size
2. We could turn this into a web page calculator
   a. Have Python emit part of the text as HTML
   b. Use forms, either a drop-down to select a specific college or input fields for search criteria
   c. Select the necessary attribute information based on the user input
   d. Code in the model to calculate the tuition cost
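The first extension, returning colleges that fall within a given class-size range, might look roughly like the sketch below. It reuses the collegedata table layout from the report but runs against an in-memory database with invented rows; the function name is hypothetical, and it returns the stored Tuition column rather than recomputing a prediction.

```python
import sqlite3


def colleges_by_class_size(conn, min_pct, max_pct):
    """Return (CollegeName, Tuition) rows whose Classes20 lies in the range."""
    sql = ("SELECT CollegeName, Tuition FROM collegedata "
           "WHERE Classes20 BETWEEN ? AND ? ORDER BY CollegeName")
    return conn.execute(sql, (min_pct, max_pct)).fetchall()


# In-memory database with made-up rows, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE collegedata (
        CollegeName text, Rank int, Classes20 numeric, AlumniRate numeric,
        Tuition numeric, Board numeric, AvgAid numeric, SizeFT numeric
    )
""")
rows = [
    ("College A", 1, 70, 40, 35000, 11000, 30000, 6000),
    ("College B", 2, 30, 15, 22000, 9000, 12000, 25000),
    ("College C", 3, 55, 25, 28000, 10000, 20000, 9000),
]
conn.executemany("INSERT INTO collegedata VALUES (?,?,?,?,?,?,?,?)", rows)

# Colleges where 50% to 80% of classes have fewer than 20 students
for name, tuition in colleges_by_class_size(conn, 50, 80):
    print(name, tuition)
```

Keeping the range as query parameters (the two `?` placeholders) follows the same parameterized style the report's own scripts use for inserts.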
Appendices

Appendix A: createcollegedatadb.py

    # CS105 project: Predicting College Tuition
    # by: Jeanette Chu & Khiem Tran
    # name: createcollegedatadb.py
    # description: create the college data database and insert all data
    #              into the table

    # import necessary modules
    import sqlite3 as db

    # create/connect to the database
    filename = "collegedata.db"
    conn = db.connect(filename)
    cursor = conn.cursor()

    # create the table (skipped if it already exists)
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS collegedata (
            CollegeName text,
            Rank int,
            Classes20 numeric,
            AlumniRate numeric,
            Tuition numeric,
            Board numeric,
            AvgAid numeric,
            SizeFT numeric
        );
    """)
    conn.commit()

    # open file to read in
    sql = "INSERT INTO collegedata VALUES (?,?,?,?,?,?,?,?)"
    with open('collegetuitiondataset-forpythoncalculator.csv', 'r') as readcsv:
        for record in readcsv:
            # drop the trailing newline, then split the row on commas
            record = record.strip().split(",")
            # insert the values
            parameters = (record[0], record[1], record[2], record[3],
                          record[4], record[5], record[6], record[7])
            cursor.execute(sql, parameters)

    # commit changes and close cursor and connection
    conn.commit()
    cursor.close()
    conn.close()
Appendix B: tuitioncalculator.py

    # CS105 project: Predicting College Tuition
    # by: Jeanette Chu & Khiem Tran
    # name: tuitioncalculator.py
    # description: calculates the tuition based on selected model

    # import necessary modules
    import sqlite3 as db

    ###########################################################################
    # to format dollars
    def dollarformat(total):
        # Convert number to string and split off the cents
        total = "%.2f" % float(total)
        cents = total[-3:]
        total = total[0:-3]
        # Format with dollar sign and commas
        if len(total) <= 3:
            numform = "$" + total + cents
        else:
            n = len(total)
            numform = ""
            while n > 3:
                last = total[-3:]
                numform = "," + last + numform
                n -= 3
                total = total[0:-3]
            numform = "$" + total + numform + cents
        return numform

    ###########################################################################
    # to format numbers
    def numformat(number):
        # Convert number to a string of whole digits
        number = str(int(float(number)))
        # Insert commas between groups of three digits
        n = len(number)
        numform = ""
        while n > 3:
            last = number[-3:]
            numform = "," + last + numform
            n -= 3
            number = number[0:-3]
        return number + numform

    ###########################################################################
    # to format percent
    def percentformat(decimal):
        decimal = decimal * 100
        decimal = "%.2f" % decimal
        return decimal + "%"

    ###########################################################################
    def main():
        # connect to the database and create handler
        filename = "collegedata.db"
        conn = db.connect(filename)
        cursor = conn.cursor()

        # prompt for college name
        collegename = input("enter name of college: ")

        # check for results from the table
        sql = "SELECT * FROM collegedata WHERE CollegeName=?"
        cursor.execute(sql, [collegename])
        count = 0
        for row in cursor:
            rank = row[1]
            classes20 = row[2]
            alumnirate = row[3]
            board = row[5]
            avgaid = row[6]
            sizeft = row[7]
            count = count + 1

        if count == 0:
            # prompt user for college info
            rank = int(input("enter the school rank: "))
            classes20 = float(input("enter % of class sizes under 20 (0 if N/A): "))
            alumnirate = float(input("enter alumni contribution rate: "))
            board = float(input("enter room & board expenses: "))
            avgaid = float(input("enter average financial aid package: "))
            sizeft = int(input("enter size of full-time students: "))

        # apply model equation (each "..." stands for a model coefficient)
        calctuition = (... * rank + ... * classes20 + ... * alumnirate
                       + ... * board + ... * avgaid + ... * float(sizeft))

        if count == 0:
            # update the table to include entered data
            sql = "INSERT INTO collegedata VALUES (?,?,?,?,?,?,?,?)"
            parameters = (collegename, rank, classes20, alumnirate,
                          calctuition, board, avgaid, sizeft)
            cursor.execute(sql, parameters)
            # commit changes
            conn.commit()
        cursor.close()
        conn.close()

        # output results to user
        print("The predicted cost for %s is: %s\n"
              % (collegename, dollarformat(calctuition)))
        print("*" * 50)

        # more results: hit enter
        print("To see the factors affecting the cost hit enter:")
        input(">")
        print("""
    %s
    School Rank: %d
    %% of Classes < 20 Students: %s
    Alumni Contribution Rate: %s
    Room & Board Expenses: %s
    Average Financial Aid Package: %s
    Size of Full-time Students: %s
    """ % (collegename, rank, percentformat(classes20),
           percentformat(alumnirate), dollarformat(board),
           dollarformat(avgaid), numformat(sizeft)))

    if __name__ == "__main__":
        main()
Data Mining III: Numeric Estimation
Data Mining III: Numeric Estimation Computer Science 105 Boston University David G. Sullivan, Ph.D. Review: Numeric Estimation Numeric estimation is like classification learning. it involves learning a
More informationPolynomial Neural Network Discovery Client User Guide
Polynomial Neural Network Discovery Client User Guide Version 1.3 Table of contents Table of contents...2 1. Introduction...3 1.1 Overview...3 1.2 PNN algorithm principles...3 1.3 Additional criteria...3
More informationEasily Identify Your Best Customers
IBM SPSS Statistics Easily Identify Your Best Customers Use IBM SPSS predictive analytics software to gain insight from your customer database Contents: 1 Introduction 2 Exploring customer data Where do
More informationCar Insurance. Prvák, Tomi, Havri
Car Insurance Prvák, Tomi, Havri Sumo report - expectations Sumo report - reality Bc. Jan Tomášek Deeper look into data set Column approach Reminder What the hell is this competition about??? Attributes
More informationStudying Auto Insurance Data
Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.
More informationStatistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees
Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.
More informationModule 5: Statistical Analysis
Module 5: Statistical Analysis To answer more complex questions using your data, or in statistical terms, to test your hypothesis, you need to use more advanced statistical tests. This module reviews the
More informationIBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
More information1. How can I see the representation allowance in IT for position classes 46-54?
PayMonitor Secrets Revealed 1. How can I see the representation allowance in IT for position classes 46-54? - Customise - Report templates - Standard detail report modify - Select fields: Select only annual
More informationDecision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
More informationApplying Customer Attitudinal Segmentation to Improve Marketing Campaigns Wenhong Wang, Deluxe Corporation Mark Antiel, Deluxe Corporation
Applying Customer Attitudinal Segmentation to Improve Marketing Campaigns Wenhong Wang, Deluxe Corporation Mark Antiel, Deluxe Corporation ABSTRACT Customer segmentation is fundamental for successful marketing
More informationLAGUARDIA COMMUNITY COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF MATHEMATICS, ENGINEERING, AND COMPUTER SCIENCE
LAGUARDIA COMMUNITY COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF MATHEMATICS, ENGINEERING, AND COMPUTER SCIENCE MAT 119 STATISTICS AND ELEMENTARY ALGEBRA 5 Lecture Hours, 2 Lab Hours, 3 Credits Pre-
More informationTime Clock Import Setup & Use
Time Clock Import Setup & Use Document # Product Module Category CenterPoint Payroll Processes (How To) This document outlines how to setup and use of the Time Clock Import within CenterPoint Payroll.
More informationData Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
More informationData Mining Applications in Higher Education
Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2
More informationTI-Inspire manual 1. Instructions. Ti-Inspire for statistics. General Introduction
TI-Inspire manual 1 General Introduction Instructions Ti-Inspire for statistics TI-Inspire manual 2 TI-Inspire manual 3 Press the On, Off button to go to Home page TI-Inspire manual 4 Use the to navigate
More informationData Mining Techniques Chapter 6: Decision Trees
Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................
More informationPearson's Correlation Tests
Chapter 800 Pearson's Correlation Tests Introduction The correlation coefficient, ρ (rho), is a popular statistic for describing the strength of the relationship between two variables. The correlation
More informationClassification of Titanic Passenger Data and Chances of Surviving the Disaster Data Mining with Weka and Kaggle Competition Data
Proceedings of Student-Faculty Research Day, CSIS, Pace University, May 2 nd, 2014 Classification of Titanic Passenger Data and Chances of Surviving the Disaster Data Mining with Weka and Kaggle Competition
More informationData Mining Lab 5: Introduction to Neural Networks
Data Mining Lab 5: Introduction to Neural Networks 1 Introduction In this lab we are going to have a look at some very basic neural networks on a new data set which relates various covariates about cheese
More informationAgenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller
Agenda Introduktion till Prediktiva modeller Beslutsträd Beslutsträd och andra prediktiva modeller Mathias Lanner Sas Institute Pruning Regressioner Neurala Nätverk Utvärdering av modeller 2 Predictive
More informationIndex Contents Page No. Introduction . Data Mining & Knowledge Discovery
Index Contents Page No. 1. Introduction 1 1.1 Related Research 2 1.2 Objective of Research Work 3 1.3 Why Data Mining is Important 3 1.4 Research Methodology 4 1.5 Research Hypothesis 4 1.6 Scope 5 2.
More informationBIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
More informationNATIONAL CENTER FOR EDUCATION STATISTICS. Integrated Postsecondary Education Data System (IPEDS) IPEDS Data Center User Manual
ASDFASFDAS NATIONAL CENTER FOR EDUCATION STATISTICS Integrated Postsecondary Education Data System (IPEDS) IPEDS Data Center User Manual ASDFASFDAS INTEGRATED POSTSECONDARY EDUCATION DATA SYSTEM (IPEDS)
More informationMicrosoft Azure Machine learning Algorithms
Microsoft Azure Machine learning Algorithms Tomaž KAŠTRUN @tomaz_tsql Tomaz.kastrun@gmail.com http://tomaztsql.wordpress.com Our Sponsors Speaker info https://tomaztsql.wordpress.com Agenda Focus on explanation
More informationIBM SPSS Direct Marketing 22
IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release
More informationAdditional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
More informationMS Access: Advanced Tables and Queries. Lesson Notes Author: Pamela Schmidt
Lesson Notes Author: Pamela Schmidt Tables Text Fields (Default) Text or combinations of text and numbers, as well as numbers that don't require calculations, such as phone numbers. or the length set by
More informationCALCULATIONS & STATISTICS
CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents
More informationBeating the NCAA Football Point Spread
Beating the NCAA Football Point Spread Brian Liu Mathematical & Computational Sciences Stanford University Patrick Lai Computer Science Department Stanford University December 10, 2010 1 Introduction Over
More informationA Property & Casualty Insurance Predictive Modeling Process in SAS
Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing
More informationPrediction of Stock Performance Using Analytical Techniques
136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University
More informationSimple Predictive Analytics Curtis Seare
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
More informationKnowledge Discovery and Data Mining. Structured vs. Non-Structured Data
Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.
More informationBeating the MLB Moneyline
Beating the MLB Moneyline Leland Chen llxchen@stanford.edu Andrew He andu@stanford.edu 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series
More informationAdvanced analytics at your hands
2.3 Advanced analytics at your hands Neural Designer is the most powerful predictive analytics software. It uses innovative neural networks techniques to provide data scientists with results in a way previously
More informationEngineering Problem Solving and Excel. EGN 1006 Introduction to Engineering
Engineering Problem Solving and Excel EGN 1006 Introduction to Engineering Mathematical Solution Procedures Commonly Used in Engineering Analysis Data Analysis Techniques (Statistics) Curve Fitting techniques
More informationData quality in Accounting Information Systems
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
More informationU.S. News & World Report 2015 Best Colleges Rankings Summary Report
U.S. News & World Report 2015 Best Colleges ings Summary Report U.S. News World Report releases the Best Colleges ings each fall. These undergraduate rankings are used by students to help them decide where
More informationIssues in Information Systems Volume 16, Issue IV, pp. 30-36, 2015
DATA MINING ANALYSIS AND PREDICTIONS OF REAL ESTATE PRICES Victor Gan, Seattle University, gany@seattleu.edu Vaishali Agarwal, Seattle University, agarwal1@seattleu.edu Ben Kim, Seattle University, bkim@taseattleu.edu
More informationClustering on Large Numeric Data Sets Using Hierarchical Approach Birch
Global Journal of Computer Science and Technology Software & Data Engineering Volume 12 Issue 12 Version 1.0 Year 2012 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global
More informationUniversité de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr
Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr WEKA Gallirallus Zeland) australis : Endemic bird (New Characteristics Waikato university Weka is a collection
More informationLucky vs. Unlucky Teams in Sports
Lucky vs. Unlucky Teams in Sports Introduction Assuming gambling odds give true probabilities, one can classify a team as having been lucky or unlucky so far. Do results of matches between lucky and unlucky
More informationData Mining Methods: Applications for Institutional Research