Data Mining: A Magic Technology for College Recruitment

Tongshan Chang, Ed.D.
Principal Administrative Analyst
Admissions Research and Evaluation
The University of California Office of the President
[email protected]

Abstract

This paper introduces a case study using data mining techniques to assist higher education institutions in achieving enrollment goals. The introduction covers the general life cycle of a data mining project: business understanding, data understanding, data preparation, modeling, assessment, and deployment. The model fit statistics show that the data mining techniques used (neural network, decision tree, logistic regression, and ensemble models) are successful in this project, with the ensemble and neural network models performing best. The study concludes that data mining is an effective technology for college recruitment, and for institutional research and analysis more broadly.

Data mining, defined as the process of sampling, exploring, modifying, modeling, and assessing large amounts of data to uncover previously unknown patterns (SAS, 2005, p. 1-3), has long been used widely in areas such as science, engineering, business, banking, and even combating terrorism. Its success and effectiveness in discovering actionable information from large sets of data are well established. Recently, there has been an increasing trend of applying data mining techniques in institutional research and analysis, including, but not limited to, college admissions yield prediction (Chang, 2006), retention and graduation prediction (Herzog, 2006; Sujitparapitaya, 2006; Bailey, 2006), time-to-degree analysis (Eykamp, 2006; Herzog, 2006), enrollment management (Aksenova, Zhang, & Lu, 2006; Luan, 2006), course scheduling and online course offerings (Luan, 2006; Dai, Yeh, & Lu, 2007), student performance assessment (Dede & Clarke, 2007; Heathcote & Dawson, 2005; Minaei-Bidgoli, 2004; Ogor, 2007), and survey analysis (Yu et al., 2007). These applications have shown the great success of various mining approaches in extracting information from enormous higher education data sets. The findings help institutions of higher education make more effective decisions to improve the quality of instruction and services. These studies also provide evidence that data mining techniques have many advantages over conventional analytical approaches in predicting and scoring the behaviors of individual students (Luan & Zhao, 2006). However, agreement about the significance of its practical application, and the evidence of its advantages, still needs to be developed further, both theoretically and practically. This paper, therefore, introduces a data mining project about college recruitment.
With evidence showing that attaining some level of education beyond high school is critical for individuals to compete successfully in such a challenging global economy and to experience a middle-class lifestyle (Kelly, 2005), undergraduate admissions has become more challenging at almost all American higher education institutions, whether selective or non-selective, two-year or four-year. Kroc and Hanson (2003) pointed out that the process of student recruitment begins with two important questions: Who does the institution want to educate? And who is available? Obviously, how to effectively target prospective students who are qualified for admission, and how to successfully recruit them, is one of the major issues admissions staff have to address.

This study describes the data mining procedures the University of California (UC) used in admissions prediction to assist its campuses in achieving enrollment goals. The study is organized as follows: UC's admissions process and criteria, for business understanding, which is generally considered the first step of the life cycle of data mining modeling; data description; data preparation; modeling; model assessment; deployment; and conclusions. The project used SAS Enterprise Miner, so the discussions and figures included in this paper are based on the outputs from this software package.

UC's Admissions Process and Criteria

UC is a public research institution with 10 campuses, and the admissions process and criteria for resident and non-resident applicants are quite different. Since this case study only includes resident freshman students, only the admissions process and criteria for in-state applicants are introduced in the following sections. UC uses a process called comprehensive review to evaluate applications. The process consists of two steps: eligibility calculation and selection (The University of California, 2008a). There are three paths to eligibility for resident freshmen.

Eligibility in the Statewide Context. Most of UC's applicants take this path to enter the University. The minimum requirements for this path include 15 yearlong academic courses completed in high school and a combination of GPA and standardized test scores. UC converts an applicant's individual standardized test scores to university scores. If an applicant submitted scores from the SAT Reasoning Test and two SAT Subject Tests, his or her total university score is calculated by adding all five university scores together. If an applicant submitted ACT test scores, his or her total university score is calculated by adding the ACT Math, ACT Reading, and ACT Science scores together and multiplying the total by 0.667, then adding the ACT English with Writing score and the two highest SAT Subject Test scores. Note that no matter whether an applicant submitted SAT Reasoning Test scores or ACT test scores, two SAT Subject Test scores are required. Eligibility is then determined based on the combination of the high school GPA and the total university score.
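To make the arithmetic concrete, the conversion can be sketched in a few lines of code. The sketch below is illustrative only: the project did not publish code, the function and field names are hypothetical, and all component scores are assumed to be already converted to UC's university-score scale.

    def total_university_score(test_type, scores, sat_subject_scores):
        # Two SAT Subject Test scores are required on either path; take the
        # two highest (assumed already on the university-score scale).
        top_two = sum(sorted(sat_subject_scores, reverse=True)[:2])
        if test_type == "SAT":
            # SAT path: the three SAT Reasoning section scores plus the two
            # Subject Test scores, i.e., five university scores in total.
            return sum(scores) + top_two
        # ACT path: (Math + Reading + Science) * 0.667, plus English with
        # Writing, plus the two highest SAT Subject Test scores.
        math, reading, science, english_writing = scores
        return (math + reading + science) * 0.667 + english_writing + top_two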
Eligibility in the Local Context (ELC). The ELC program recognizes students' individual accomplishments in terms of the opportunities offered by their high schools. To be eligible through ELC, an applicant must rank in the top 4 percent of his or her graduating class at a high school participating in the ELC program, as determined by the University. UC identifies the top 4 percent of students on the basis of GPA in UC-approved coursework completed in the 10th and 11th grades. The applicant must also complete the equivalent of 11 yearlong courses of UC's subject requirements by the end of the junior year and maintain a minimum GPA of 3.0.

Eligibility by Examination Alone. UC also allows applicants to qualify for admission by examination scores alone. To qualify in this way, an applicant must achieve a minimum total university score of 410 and also earn a minimum university score of 63 on each individual test of the ACT or SAT Reasoning Test and the two highest SAT Subject Tests. The method used to calculate the university score is the same as described previously.

Selection. A student who meets the minimum requirements for any of these three paths is considered eligible for UC admission, and UC guarantees such applicants a place on one of its campuses. However, because many campuses receive applications from more eligible students than they have space for, meeting the minimum requirements may not be enough to gain admission to the campus of the applicant's choice. When a campus has to choose among qualified students, it applies standards that are more demanding than the minimum requirements. Using a process called comprehensive review (The University of California, 2008b), admissions officers look beyond the required test scores and grades to evaluate applicants' academic achievements in light of the opportunities available to them and the capacity each student demonstrates to contribute to the intellectual life of the campus. If an eligible student is not admitted to the campus to which he or she applied, UC puts the student in the Referral Pool and redirects his or her application to the campuses the student did not apply to. Currently, two campuses accept students from the Referral Pool to increase their enrollments.

Problem Statement. The problem is that UC does not know which eligible applicants will not be admitted to the campus they applied to until April. From the enrollment perspective, at that point it may be too late for these two campuses to distribute their information to these students and persuade them to enroll. It may also be too late for students admitted to these two campuses from the Referral Pool to have enough time to consider whether to accept the offer, because they may have already received offers from other institutions. In order to increase enrollments from the Referral Pool, UC launched a program called the Early Referral Pool. Under this program, the two campuses send a letter in February to applicants who have a high probability of being included in the Referral Pool, letting them know that they may receive an admissions offer from these campuses and asking them to send feedback as to whether they would consider such an offer. The two campuses then admit students who send back positive feedback, based on comprehensive review of their applications. The Early Referral Pool, therefore, includes those who are most likely to be included in the Referral Pool and are interested in an admissions offer from these two campuses. The question is: To whom do these two campuses send a letter? In other words, what is the probability that each eligible student will be included in the Referral Pool?

The Purpose of the Project

The purpose of the project is to estimate the probability that an eligible applicant will be included in UC's Referral Pool. In other words, the project is to predict who may not be admitted to the campus of their choice even though they are eligible for admission to the University systemwide. The University's two campuses accepting students from the Referral Pool then use this information to make admissions offers through the Early Referral Pool program to achieve their enrollment goals.
Data Description

Training Data. The base data used to build the predictive models were extracted from UC's admissions database, an integrated student data system including application, admission, enrollment, and student progress information that is updated periodically. In other words, the data files are created repeatedly with updated information throughout the academic year on a regular basis, without overwriting previous versions. This feature provides the opportunity to compare data at the same time point across years. For example, the target population includes students from the January application data for 2008, so for compatibility it is more reasonable to use training data extracted in January 2007 rather than the final version created in December 2007, even though the latter may include more complete information. The training data contain 45,393 applicants who were eligible for admission to the University in the fall term of 2007 through any one of the three paths described previously.

Target Population. The target population includes 48,356 applicants for the fall term of 2008 who fully satisfied the University's admission criteria for at least one of the three paths to eligibility for state residents. The data were extracted from the database built from the application information in January 2008.

Variables. The data elements in the two data sets are consistent, including applicants' academic information and socioeconomic backgrounds. Table 1 shows the variables used for modeling. The outcome is Referral Pool, which indicates whether or not an applicant was in the Referral Pool. This variable was added to the training data from the final admissions file: as mentioned above, the University does not know who will be referred to the two campuses until all campuses make their final admissions decisions, so this information is only available in the final admissions file. The predictors include Ethnicity, First Language, Campus Applied, Parents' Highest Education Level, Family Income, Home Location, Discipline, Outreach Program, School Performance, High School GPA, and University Score (Table 1). Ethnicity comprises the categories African-American, American-Indian, Chicano or Latino, Asian, White, and Other or Declined to State. First Language indicates whether an applicant's first language is English only, English and another language, or another language only; if missing, it was categorized as Unknown. A series of variables, one for each campus, indicates whether an applicant applied to the corresponding campus; each was coded as 1 (applied) or 0 (not applied). Some students apply to multiple campuses, so a single dummy variable does not work in this case. Parents' Highest Education Level refers to the highest education an applicant's father or mother received; it consists of the groups High School or Less, Two-Year College Graduate, Four-Year College Graduate and Postgraduate Study, and Unknown. Family Income is a continuous variable. Home Location was generated from applicants' permanent addresses reported on their applications and was classified into five locations in terms of the geographical and economic characteristics of the state. Discipline includes Engineering, Science, Social Science, Humanities, and Other or Not Stated; the categories were generated based on the CIP (Classification of Instructional Programs) code. Outreach Program indicates whether or not an applicant participated in any outreach program offered by the University before applying; it is a dichotomous variable, and if no information was available it was coded as 0, indicating that the student did not participate in any outreach program. School Performance represents the state rank of the school from which an applicant graduated. High School GPA is the weighted, capped GPA, calculated from the courses generally required by the University and limited honors courses such as IB (International Baccalaureate) courses, AP (Advanced Placement) courses, college-level courses, and so on. University Score was calculated using SAT or ACT test scores. It is important to note that UC does not use applicants' socioeconomic backgrounds in the admissions decision process in any way. However, the variables selected have a significant correlation with applicants' admissions status.
Table 1: Predictors Included in the Modeling

Variable                         | Data Type | Description
Ethnicity                        | C         | 7 categories
First Language                   | C         | 3 categories: English Only, English and Another Language, and Another Language
Campus Applied to                | C         | 7 variables, one for each campus
Parents' Highest Education Level | C         | 5 categories: HS or Less, 2-Year College, 4-Year College and Post-Ed. Study, Missing
Family Income                    | N         |
Home Location                    | C         | 5 categories based on geographic region
Discipline                       | C         | 5 categories based on the CIP code
Outreach Programs                | C         | Participation in any outreach program
API Ranking                      | C         | The state school ranking
High School GPA                  | N         | Weighted, capped GPA
University Score                 | N         | Converted SAT or ACT score

Note: C = Categorical; N = Numerical.

Data Preparation

Eligibility Calculation. Based on the eligibility requirements described above, the eligibility status of each student was determined first. The methods used for both years (training and target) are the same.

Imputation of Missing Values. In today's society, more and more data are collected for analytical purposes, but it is very rare for a data set to have no missing values, an issue that constantly troubles data analysts. Certainly, from the perspective of data accuracy, one option is simply to discard all records with missing values. Unfortunately, the missingness may itself reflect characteristics of the data; if it is related to the target or the inputs, dropping those records may reduce the data drastically and thus distort the analysis (SAS, 2005). The alternative is to impute missing values using a reasonable algorithm. If a variable with missing values is categorical, the missing values can be treated as a separate category. If it is numerical, SAS Enterprise Miner provides 11 imputation methods, such as mean, median, mid-range, tree, and tree surrogate. Whether or not to impute missing values also depends on the data mining methods selected. Decision trees handle missing values directly: they inherently accommodate missing values in numeric or categorical inputs by splitting observations with missing values into their own branch, an approach more appropriate than discarding records with missing values or imputing them (Berry & Linoff, 2004). Logistic regression and neural network models, unfortunately, ignore all observations with missing values on the target or any input variable. These models are rarely used alone in data mining; usually several are built and compared on performance to select the best-fitting model, and it is more appropriate to compare models built on the same set of observations (SAS, 2005). In addition, if missing values are not imputed during the modeling process, observations with missing values cannot be scored with the score code built from the models. This case study includes three numerical variables, as described before, and uses decision tree, logistic regression, and neural network models to develop the initial models, so the missing values had to be imputed. Several methods (median, tree, and mean) were applied to impute the missing values of the three continuous variables (GPA, parent income, and university score); mean produced the best prediction in this project, according to the misclassification rate on the test data with both the neural network and regression models.
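As an illustration of the imputation step (the project used SAS Enterprise Miner's Impute node; the Python sketch below, with hypothetical column names, is a rough stand-in):

    import pandas as pd

    def impute_numeric(df, cols, strategy="mean"):
        # Fill missing values in numeric predictors; mean gave the lowest
        # test misclassification here, but median (or a tree-based method)
        # is a drop-in alternative.
        out = df.copy()
        for col in cols:
            fill = out[col].mean() if strategy == "mean" else out[col].median()
            out[col] = out[col].fillna(fill)
        return out

    # Categorical predictors instead receive a separate level, e.g.:
    # df["first_language"] = df["first_language"].fillna("Unknown")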
Data Transformation. Skewed distributions and outliers cause problems for any data mining technique that uses the values arithmetically (Berry & Linoff, 2004). Again, some mining methods, such as decision trees, are not very sensitive to a skewed distribution, but others are. SAS Enterprise Miner provides 14 data transformation methods, such as logarithm, square root, and exponential. In this case study, of the three continuous variables (GPA, university score, and parent income), parent income is highly skewed. A few methods were tried, and the square root produced the expected distribution.
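A sketch of this check-and-transform step (illustrative only; the 1.0 skewness threshold is an arbitrary choice, not a value from the project):

    import numpy as np
    from scipy.stats import skew

    def sqrt_if_skewed(series, threshold=1.0):
        # Apply a square-root transform when a variable is heavily skewed,
        # as was done for parent income in this project.
        return np.sqrt(series) if abs(skew(series.dropna())) > threshold else series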
Model Description and Modeling Process

Data Partition. SAS Enterprise Miner includes a Data Partition node (Figure 1), which divides the input data set into training, validation, and test data sets according to the user's choice of percentage allocations and partitioning method (simple random, cluster, stratified, or stratified random). A training data set consisting of preclassified observations is used to build an initial model, while the validation set is used to monitor and adjust the initial model to make it more general (Berry & Linoff, 2004). The tuning process optimizes the selected model based on the validation data (SAS, 2005). In other words, selecting a model by its performance on the training data set usually leads to an overfitting (high-variance) problem; selecting it by its performance on the validation data set produces a trade-off between bias and variance. The data in the test set are treated as new data used to gauge the likely effectiveness of the model when applied to unseen data (Berry & Linoff, 2004). The test data set may not be necessary if it is enough to know that the prediction model will likely generalize well.

A variety of sampling methods are available in SAS Enterprise Miner to select observations from the entire data set for data partition, such as random and stratified random sampling. As described previously, the data used for this project include many categorical variables, and the subpopulations vary considerably by ethnicity, parental income, and so on, so the stratified random method was used to split the original data into three sets. The training data set includes 18,158 observations, accounting for 40% of the data; the validation and test sets contain 13,617 and 11,620 observations, respectively, each accounting for approximately 30%. The proportion of events (students included in the Referral Pool) is about 15% of the observations in each data set.

Figure 1: Data Mining Process Flow Diagram
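A minimal sketch of such a 40/30/30 stratified partition (X and y are assumed to be the predictor frame and the Referral Pool indicator; the project also stratified on key categorical inputs, whereas this simplification stratifies on the target alone):

    from sklearn.model_selection import train_test_split

    def partition_40_30_30(X, y, seed=42):
        # Stratifying on y keeps the event rate (about 15% here) equal
        # across the training, validation, and test sets.
        X_train, X_rest, y_train, y_rest = train_test_split(
            X, y, train_size=0.40, stratify=y, random_state=seed)
        X_valid, X_test, y_valid, y_test = train_test_split(
            X_rest, y_rest, train_size=0.50, stratify=y_rest, random_state=seed)
        return X_train, X_valid, X_test, y_train, y_valid, y_test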
Model Selection. As described previously, this data mining project builds a predictive model starting with historical data in which the values of the target variable are already known; the modeling task is to discover rules that explain those known values. This is therefore an example of directed data mining, also called supervised learning (Berry & Linoff, 2004). In undirected data mining, called unsupervised learning, there is no target variable, and the purpose of the modeling is to explore overall patterns that are not tied to any one variable. In terms of the purpose and characteristics of this project, three modeling techniques were selected: logistic regression, neural network, and decision tree. In addition, an ensemble model was created from the three models developed with these methods.

Logistic Regression. The Regression node (Figure 1) in SAS Enterprise Miner can run linear or logistic regression models, depending on the measurement level of the target variable. Since the target variable in this project is dichotomous (whether or not an applicant will be included in the Referral Pool), logistic regression was chosen. Logistic regression in data mining is no different from the logistic regression long used in statistical analysis. The predictions are probabilities, transformed with the logit transformation in this study; probit and complementary log-log transformations are also available in the Regression node. There are likewise three choices for model selection: forward, backward, and stepwise. By default, none of them is selected, in which case all inputs are used to fit the model. If any of these three selection methods is used, a set of selection criteria is available for choosing the final model; since this project used the default, no selection criteria were needed. Figure 2 shows the logistic regression properties and the corresponding parameters set for this project.
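For illustration, an equivalent all-inputs logistic regression in Python (a sketch, not the project's SAS node):

    from sklearn.linear_model import LogisticRegression

    def fit_logit(X_train, y_train):
        # All inputs enter the model: no forward/backward/stepwise
        # selection, matching the Regression node default used here.
        model = LogisticRegression(max_iter=1000)
        return model.fit(X_train, y_train)

    # Usage: p = fit_logit(X_train, y_train).predict_proba(X_valid)[:, 1]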
Neural Network. A neural network is an information-processing paradigm with properties similar to those of biological neurons. It is composed of a large number of highly interconnected processing elements (neurons) arranged in three layers: input, hidden, and output. The activity of the input units represents the raw information fed into the network. The activity of each hidden unit is determined by the activities of the input units and the weights on the connections between the input and hidden units. The behavior of the output units depends on the activity of the hidden units and the weights between the hidden and output units. The network is trained by multiplying each input by a weight, initially random; the products are summed with a constant, and the sum forms the argument of a hyperbolic tangent function. The number of tanh functions equals the number of hidden units, so the complexity of the fitted function is related to the number of hidden units. A minimum of three hidden units is required before substantial differences from second-order polynomial regression models are possible (SAS, 2004a). This allows the neural network method to capture almost arbitrarily nonlinear relationships between the predictors and the outcome (Bhadeshia, 1999). The weights are systematically adjusted until a best-fit description of the output as a function of the inputs is obtained. Figure 3 shows the neural network properties along with the corresponding parameters set for this study. Among them, the number of hidden units, the model selection criterion, and the training technique are the most important. The number of hidden units is 3, the default, which may or may not be the best number. The misclassification statistic was selected as the model selection criterion. The training technique is also the default, which selects the best technique based on the weights applied at execution.

Decision Tree. A decision tree is a structure that divides a large collection of records into successively smaller sets by applying a sequence of simple decision rules. With each successive division, the members of the resulting sets become more and more similar to one another. A record enters the tree at the root node, which applies a test to determine which child node the record will encounter next. This process is repeated until the record arrives at a leaf node. All the records that end up at a given leaf are classified the same way, so there is a unique path from the root to each leaf, and this path is an expression of the rule used to classify the records. For this project, the criterion used to evaluate candidate splitting rules and search for the best one is the default, ProbChisq (Figure 4), which evaluates candidates based on the p-value of the Pearson chi-square statistic for the target versus the branch node. Other methods, such as Entropy, Gini, and Variance, are also available in SAS Enterprise Miner. Another important property is Leaf Size, which specifies the minimum number of training observations allowed in a leaf node; it was set to 5 in this project.
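A rough Python counterpart of these two models (a sketch under stated assumptions: one hidden layer of 3 tanh units mirrors the Neural Network node default, and min_samples_leaf=5 mirrors the Leaf Size property; scikit-learn offers no ProbChisq split criterion, so its default Gini impurity stands in):

    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier

    def fit_nn_and_tree(X_train, y_train):
        # Small tanh network and chi-square-style tree, approximated.
        nn = MLPClassifier(hidden_layer_sizes=(3,), activation="tanh",
                           max_iter=2000, random_state=0)
        tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0)
        return nn.fit(X_train, y_train), tree.fit(X_train, y_train)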
Ensemble. The Ensemble node creates a new model by combining the posterior probabilities (for a categorical target) or the predicted values (for an interval target) from multiple models. There are several approaches to creating an ensemble model. One common approach is to build different models using the same modeling technique but different samples of the training data, and then combine them. Another common method is to use multiple data mining techniques, such as a decision tree, logistic regression, and a neural network, to build different models on the same training data set and then combine those models into an ensemble.

Figure 2: Logistic Regression Model Property and Corresponding Parameter

Figure 3: Neural Network Model Property and Corresponding Parameter

Several methods are available in the Ensemble node in SAS Enterprise Miner for combining models. The average method takes the average of the posterior probabilities for categorical targets, or of the predicted values for interval targets, from the contributing models. The maximum method uses the maximum of the posterior probabilities or predicted values from the different models. If the target is categorical, voting is also an alternative; it calculates the posterior probabilities either by average (averaging the posterior probabilities from the models that predict the same target event) or by proportion (recalculating the posterior probability for a target value based on the proportion of individual models predicting the same event level, e.g., 2/3 for a three-model ensemble in which two models predict one event level and the third predicts another). As indicated in Figure 5, the method used to calculate the predicted values and posterior probabilities in this project is average, the default. The posterior-probability method for voting is also set to average, but that setting had no effect here: it applies only when the combination method is set to voting rather than average.

Figure 4: Decision Tree Model Property and Corresponding Parameter
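The default average combination is simple to express in code; a minimal sketch, reusing the hypothetical fitted models from the earlier sketches:

    import numpy as np

    def ensemble_average(models, X):
        # The Ensemble node's default "average" method: the mean of the
        # posterior probabilities from the contributing models.
        return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

    # Usage: p_ens = ensemble_average([logit_model, nn_model, tree_model], X_test)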
Figure 5: Ensemble Model Property and Corresponding Parameter

Model Assessment

When building predictive models, the best model often varies according to the criteria used for evaluation: one criterion might suggest that the neural network is best, while another may favor a different model. SAS Enterprise Miner provides several approaches to assess and compare models built with the same data, such as fit statistics, the receiver operating characteristic (ROC) chart, and the score rankings overlay. Figure 6 shows the ROC curves, which provide comparative measures of the predictive accuracy of the four models on the test data set. In the ROC chart, the vertical axis is the sensitivity, which measures the accuracy of predicting events (true positives divided by total actual positives), while the horizontal axis represents one minus the specificity, which reflects the accuracy of predicting nonevents (true negatives divided by total actual negatives). The more the ROC curve bulges toward the upper-left corner, the better the performance of the model (SAS, 2005). The area under the curve, ranging from 0.50 for the poorest model to 1.00 for a perfect model, indicates the model's ability to predict the occurrence of events or nonevents. The chart shows that the ensemble, neural network, and logistic regression models are almost identical, with a very good fit on all three data sets, compared with the decision tree model.

Figure 6: Receiver Operating Characteristic (ROC) Charts
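These quantities can be computed directly from actual outcomes and predicted probabilities; a sketch:

    from sklearn.metrics import roc_auc_score, roc_curve

    def roc_summary(y_true, p_hat):
        # The ROC curve plots sensitivity (tpr) against 1 - specificity
        # (fpr); the area under it runs from 0.50 (uninformative) to 1.00
        # (perfect).
        fpr, tpr, thresholds = roc_curve(y_true, p_hat)
        return roc_auc_score(y_true, p_hat), fpr, tpr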
The score rankings overlay is another way to compare models. It includes a variety of measures: lift, cumulative lift, gain, percent response, cumulative percent response, percent captured response, and cumulative percent captured response. Figure 7 shows the cumulative percent captured response. First, the observations were sorted by predicted probability of inclusion in the Referral Pool, from high to low, and then divided into deciles. The cumulative percent captured response was calculated by summing the events (here, students predicted to be included in the Referral Pool) captured through each decile and dividing by the total number of events. For example, in Figure 7, the first two deciles, representing 20% of the entire population, captured approximately 80% of the events under the ensemble, neural network, or logistic regression model, and about 65% under the tree model. Again, this measure shows that the ensemble, neural network, and logistic regression models fit very similarly, all performing better than the tree model.

Figure 7: Score Rankings Overlay: Cumulative % Captured Responses

Table 2 shows selected fit statistics. The ensemble model has the lowest misclassification rates on all three data sets, followed by the neural network model and the logistic regression model; the decision tree model performs the worst at classifying the target respondents. The average squared error statistics show that the ensemble model does not improve on the neural network model, and again the decision tree model has the highest average squared error on all three data sets.

Table 2: Selected Fit Statistics

Fit Statistic and Data Set         | Ensemble | Neural Network | Logistic Regression | Decision Tree
Misclassification Rate: Training   |          |                |                     |
Misclassification Rate: Validation |          |                |                     |
Misclassification Rate: Test       |          |                |                     |
Average Squared Error: Training    |          |                |                     |
Average Squared Error: Validation  |          |                |                     |
Average Squared Error: Test        |          |                |                     |
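The cumulative percent captured response behind Figure 7 can be reproduced with a short function; a sketch (array inputs assumed):

    import numpy as np

    def cumulative_pct_captured(y_true, p_hat, n_bins=10):
        # Sort by score (high to low), split into deciles, and accumulate
        # the share of actual events captured at each depth.
        order = np.argsort(-np.asarray(p_hat))
        y_sorted = np.asarray(y_true)[order]
        bins = np.array_split(y_sorted, n_bins)
        return np.cumsum([b.sum() for b in bins]) / y_sorted.sum()

    # For a well-fitting model here, the value at the second decile should
    # be near the roughly 80% captured at a 20% depth reported above.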
The final model can be selected based on any of the methods described above, in light of the business purpose. The main purpose of this project was to predict whether or not an applicant would be included in the Referral Pool, so the misclassification rate is the more appropriate criterion for selecting the best-fitting model. Consequently, the winning model is the ensemble, which was used to score the target population.

In addition, it is also interesting to know how each factor contributed to the model. SAS Enterprise Miner provides a variety of methods for investigating the importance of each variable. Figure 8 shows part of the decision tree map for the training data. The map indicates that high school GPA is the most prominent variable in the tree model: if a student's GPA is 3.6 or higher, the probability of being in the Referral Pool is only 7%; otherwise, the probability is 37%. A thorough investigation of the tree map shows how important each factor is in the model.

Figure 8: A Partial Map of the Decision Tree

In Figure 9, each bar represents the absolute coefficient for a term in the logistic regression model. Again, the effects plot shows that high school GPA is the most prominent effect in the model: the higher a student's GPA, the less likely the student is to be in the Referral Pool. The first bar is the constant in the equation.

Figure 9: Logistic Regression Effects Plot
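The effects-plot ranking can be approximated from a fitted logistic model; a sketch (hypothetical names, and comparing raw coefficients this way assumes inputs on similar or standardized scales):

    import numpy as np

    def ranked_effects(logit_model, feature_names):
        # Rank model terms by absolute coefficient, the quantity shown in
        # the effects plot of Figure 9.
        coefs = logit_model.coef_.ravel()
        order = np.argsort(-np.abs(coefs))
        return [(feature_names[i], float(coefs[i])) for i in order]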
Scoring, Deployment, and Results

The probability that each student in the target population would be included in the Referral Pool was calculated using the estimates of the ensemble model. A list of students predicted to have a probability of 40% or above was sent to the two campuses, which then sent a letter to these students after a comprehensive review of their academic performance. Based on the feedback from students, the two campuses made admissions offers to 1,099 students.

In the fall term, after admits submitted a statement of intention to register, a set of comparisons was conducted to examine the accuracy of the model. In terms of the total number of those predicted to be included in the Referral Pool, the accuracy rate is about 93%. This rate was calculated by dividing the sum of all target students' estimated probabilities of being in the Referral Pool by the total number of those actually included in the Referral Pool. As for individual students, Table 3 compares the predicted and actual referral pools by predicted probability range. For example, at the cut point of 40%, the cut point used for deployment, the predicted referral pool included 8,209 students, of whom 5,518 were actually in the referral pool, for an accuracy rate of about 67%. The predicted referral pool accounted for 17% of the entire population, and the students targeted by the ensemble model represented about 65% of the real referral pool, almost the same as the rate predicted on the test data set, as indicated in Figure 7.

Table 3: A Comparison of Predicted and Actual Referral Pool

Predicted Probability | Predicted Referral Pool | Actual Referral Pool | Cumulative Accuracy Rate | Predicted Pool as Cumulative % of Total Population | Actual Pool as Cumulative % of the Entire Referral Pool
90% or Above |       |       |       | 0.1%  | 0.6%
80% or Above |       |       |       | 0.7%  | 3.3%
70% or Above | 2,732 |       |       | 5.6%  | 23.9%
60% or Above | 4,986 |       |       | 10.3% | 42.0%
50% or Above | 6,659 |       |       | 13.8% | 54.3%
40% or Above | 8,209 | 5,518 | 67.2% | 17.0% | 65.2%

Table 4 shows a comparative analysis of the yield rates between the Early Referral Pool and the traditional Referral Pool in 2008, and between 2008 and previous years. The yield rate for admits offered through the Early Referral Pool program was about 22%, compared to 6.4% for the traditional Referral Pool students to whom the two campuses made offers in April. The overall yield rate of the Referral Pool in 2008 was 8.3%, up by more than 1.5 percentage points from previous years. The enrollment from the Referral Pool in 2008 also accounted for about 13% of total enrollment, an increase of more than a quarter over the average rate of the three previous years.

Table 4: A Comparative Analysis of the Results between 2008 and Previous Years

                                      | Prior Year 1 | Prior Year 2 | Prior Year 3 | 2008 Total | 2008 Early Referral Pool | 2008 Traditional Referral Pool
Actual Referral Pool                  | 6,170        | 6,090        | 6,923        | 9,300      | 1,099 | 8,201
SIRs from Actual Referral Pool        |              |              |              |            |       |
Referral Pool Yield Rate              | 6.4%         | 6.5%         | 6.7%         | 8.3%       | 21.9% | 6.4%
Total SIRs from All Admits            | 3,691        | 4,006        | 4,412        | 5,770      |       |
Referral Pool SIRs as % of Total SIRs | 10.6%        | 9.9%         | 10.5%        | 13.1%      |       |

Note: SIR stands for Statement of Intention to Register.
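For reference, the per-cut-point quantities in Table 3 can be reproduced with a short function; a sketch, assuming numeric arrays of actual outcomes and ensemble scores:

    import numpy as np

    def cut_point_row(y_true, p_hat, cut):
        # One row of Table 3 for a given probability cut point.
        y = np.asarray(y_true)
        flagged = np.asarray(p_hat) >= cut
        hits = int(np.sum(flagged & (y == 1)))
        return {
            "predicted_pool": int(flagged.sum()),
            "actually_referred": hits,
            "accuracy_rate": hits / flagged.sum(),
            "pct_of_population": float(flagged.mean()),
            "pct_of_referral_pool": hits / int((y == 1).sum()),
        }

    # e.g. cut_point_row(y, p_ens, 0.40) for the deployment cut point.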
Summary and Conclusions

This paper introduces a real project using data mining techniques to assist higher education institutions in achieving enrollment goals. The introduction covers the six phases of the general life cycle of a data mining project: business understanding, data understanding, data preparation, modeling, assessment, and deployment. As examined above, all of these phases are critical to a successful predictive model. The sequence of the six phases is not rigid; it is often necessary to move back and forth between phases.

The model fit statistics show that the data mining techniques used in this project (neural network, decision tree, logistic regression, and ensemble models) are all effective, with misclassification rates around 10%, though the ensemble and neural network methods are better. From this case study, it may not be proper to conclude, as some researchers have, that all data mining models are better than the traditional logistic regression model: the model fit statistics are very close to each other, and the logistic regression model even performs better than the decision tree model. It can be concluded, however, that with multiple modeling approaches and assessment methods available, data mining offers more flexibility. Furthermore, the results provide evidence that data mining is an effective technology for college recruitment; it can help higher education institutions manage enrollment more effectively.

Finally, there are things that might have been done to improve the model fit. For example, the models developed in this project used many variables. Although these variables have a significant association with admissions status, some of them are highly correlated with each other, and minimizing input redundancy might have improved the predictive models. Particularly for neural networks, choosing the correct inputs has a significant influence on network complexity and flexibility. Variable clustering, principal components, tree models, regression, and enhanced weight-of-evidence methods are available and effective for this purpose in SAS Enterprise Miner (SAS, 2004b). It is also worth noting that there may be more efficient ways to handle the missing-value problem; for example, the income predictor, with many missing records, might have worked better had it been categorized with a missing category instead of having missing values imputed by the mean. In addition, due to time constraints, this project used the default values for many modeling properties, and resetting them could improve model performance. For example, the number of hidden units for this project is 3, which may or may not be the best number; the best choice generally depends on the number of training cases, the amount of noise, and the complexity of the function. According to the SAS Enterprise Miner documentation, it is impossible to know in advance how much noise the target values contain or how complex the function is; hence, it is generally impossible to determine the best number of hidden units without training numerous networks with different numbers of hidden units and estimating the generalization error of each.
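Such a search is straightforward to script; a sketch of the hidden-unit sweep the conclusion recommends (candidate sizes are arbitrary illustrations):

    from sklearn.neural_network import MLPClassifier

    def sweep_hidden_units(X_train, y_train, X_valid, y_valid,
                           sizes=(2, 3, 5, 8, 13)):
        # Train one network per hidden-unit count and compare validation
        # misclassification rates rather than accepting the default of 3.
        rates = {}
        for h in sizes:
            nn = MLPClassifier(hidden_layer_sizes=(h,), activation="tanh",
                               max_iter=2000, random_state=0)
            rates[h] = 1.0 - nn.fit(X_train, y_train).score(X_valid, y_valid)
        return rates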
References

Aksenova, S. S., Zhang, D., & Lu, M. (2006). Enrollment prediction through data mining. Proceedings of the 2006 IEEE International Conference on Information Reuse and Integration. Retrieved on 01/13/2009.

Bailey, B. (2006). Let the data talk: Developing models to explain IPEDS graduation rates. In J. Luan & C. Zhao (Eds.), New Directions for Institutional Research, no. 131. San Francisco: Jossey-Bass.

Berry, M. J. A., & Linoff, G. S. (2004). Data Mining Techniques for Marketing, Sales, and Customer Relationship Management. Indianapolis: Wiley Publishing.

Bhadeshia, H. K. D. H. (1999). Neural networks in materials science. ISIJ International, 39(10).

Chang, L. (2006). Applying data mining to predict college admissions yield: A case study. In J. Luan & C. Zhao (Eds.), New Directions for Institutional Research, no. 131. San Francisco: Jossey-Bass.

Chen, C. (2008). An integrated enrollment forecast model. IR Applications, Volume 15.

Dai, C., Yeh, C., & Lu, C. (2007). Applying data mining technology to analyzing user behavior in course websites. Presented at the 3rd IASTED International Conference on Advances in Computer Science and Technology, Phuket, Thailand.

Dede, C., & Clarke, J. (2007). Data mining as an emerging means of assessing student learning. Presented at the EDUCAUSE Annual Conference.

Eykamp, P. (2006). Using data mining to explore which students use placement to reduce time to degree. In J. Luan & C. Zhao (Eds.), New Directions for Institutional Research, no. 131. San Francisco: Jossey-Bass.

Heathcote, L., & Dawson, S. (2005). Data mining for evaluation, benchmarking, and reflective practice in a LMS. Presented at E-Learn 2005: World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education, Vancouver, Canada.

Herzog, S. (2006). Estimating student retention and degree-completion time: Decision trees and neural networks vis-à-vis regression. In J. Luan & C. Zhao (Eds.), New Directions for Institutional Research, no. 131. San Francisco: Jossey-Bass.

Kelly, P. J. (2005). As America becomes more diverse: The impact of state higher education inequality. National Center for Higher Education Management Systems.

Luan, J. (2006). Using Academic Behavior Index (AB-Index) to develop a learner typology for managing enrollment and course offerings: A data mining approach. IR Applications, Volume 10.

Luan, J., & Zhao, C. (2006). Practicing data mining for enrollment management and beyond. In J. Luan & C. Zhao (Eds.), New Directions for Institutional Research, no. 131. San Francisco: Jossey-Bass.

Minaei-Bidgoli, B. (2004). Data mining for a web-based educational system. Ph.D. dissertation, Michigan State University.

Ogor, E. N. (2007). Student academic performance monitoring and evaluation using data mining techniques. Presented at the Electronics, Robotics, and Automotive Mechanics Conference, 2007.

SAS. (2004a). Predictive Modeling Using SAS Enterprise Miner 5.1: Instructor-based training. Cary, NC: SAS Institute Inc.

SAS. (2004b). Advanced Predictive Modeling Using SAS Enterprise Miner 5.1: Instructor-based training. Cary, NC: SAS Institute Inc.

SAS. (2005). Applying Data Mining Techniques Using SAS Enterprise Miner. Cary, NC: SAS Institute Inc.

Sujitparapitaya, S. (2006). Considering student mobility in retention outcomes. In J. Luan & C. Zhao (Eds.), New Directions for Institutional Research, no. 131. San Francisco: Jossey-Bass.

The University of California. (2008a). Path to Admission. Retrieved on 01/13/2009.

The University of California. (2008b). Comprehensive Review Factors for Freshman Applicants. Retrieved on 01/13/2009, _app.html.

Yu, C. H., Jannasch-Pennell, A., DiGangi, S., Kim, C., & Andrew, S. (2007). A data visualization and data mining approach to response and non-response analysis in survey research. Practical Assessment, Research & Evaluation, 12(19).
Learning outcomes. Knowledge and understanding. Competence and skills
Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges
Advanced analytics at your hands
2.3 Advanced analytics at your hands Neural Designer is the most powerful predictive analytics software. It uses innovative neural networks techniques to provide data scientists with results in a way previously
STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
Better credit models benefit us all
Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis
Potential Value of Data Mining for Customer Relationship Marketing in the Banking Industry
Advances in Natural and Applied Sciences, 3(1): 73-78, 2009 ISSN 1995-0772 2009, American Eurasian Network for Scientific Information This is a refereed journal and all articles are professionally screened
Opening Slide. Michael Treviño. Director of Undergraduate Admissions University of California System June 2016
Opening Slide Michael Treviño Director of Undergraduate Admissions University of California System June 2016 UC s Commitment to Diversity Mindful of its mission as a public institution, the University
Modeling to improve the customer unit target selection for inspections of Commercial Losses in Brazilian Electric Sector - The case CEMIG
Paper 3406-2015 Modeling to improve the customer unit target selection for inspections of Commercial Losses in Brazilian Electric Sector - The case CEMIG Sérgio Henrique Rodrigues Ribeiro, CEMIG; Iguatinan
Big Data Analytics. Benchmarking SAS, R, and Mahout. Allison J. Ames, Ralph Abbey, Wayne Thompson. SAS Institute Inc., Cary, NC
Technical Paper (Last Revised On: May 6, 2013) Big Data Analytics Benchmarking SAS, R, and Mahout Allison J. Ames, Ralph Abbey, Wayne Thompson SAS Institute Inc., Cary, NC Accurate and Simple Analysis
WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat
Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise
Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.
Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing
Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com
SPSS-SA Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Training Brochure 2009 TABLE OF CONTENTS 1 SPSS TRAINING COURSES FOCUSING
PharmaSUG2011 Paper HS03
PharmaSUG2011 Paper HS03 Using SAS Predictive Modeling to Investigate the Asthma s Patient Future Hospitalization Risk Yehia H. Khalil, University of Louisville, Louisville, KY, US ABSTRACT The focus of
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,
Benchmarking of different classes of models used for credit scoring
Benchmarking of different classes of models used for credit scoring We use this competition as an opportunity to compare the performance of different classes of predictive models. In particular we want
IBM SPSS Neural Networks 22
IBM SPSS Neural Networks 22 Note Before using this information and the product it supports, read the information in Notices on page 21. Product Information This edition applies to version 22, release 0,
Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing
www.ijcsi.org 198 Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing Lilian Sing oei 1 and Jiayang Wang 2 1 School of Information Science and Engineering, Central South University
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Applying to the University of California
Applying to the University of California 2014-2015 Presented by the WHS Counseling Department University of California (UC) Overview Nine undergraduate campuses state-wide: Berkeley Davis Irvine Los Angeles
Financial Trading System using Combination of Textual and Numerical Data
Financial Trading System using Combination of Textual and Numerical Data Shital N. Dange Computer Science Department, Walchand Institute of Rajesh V. Argiddi Assistant Prof. Computer Science Department,
DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING
DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING ABSTRACT The objective was to predict whether an offender would commit a traffic offence involving death, using decision tree analysis. Four
USING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION. Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA
USING LOGISTIC REGRESSION TO PREDICT CUSTOMER RETENTION Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA Logistic regression is an increasingly popular statistical technique
The Importance of Community College Honors Programs
6 This chapter examines relationships between the presence of honors programs at community colleges and institutional, curricular, and student body characteristics. Furthermore, the author relates his
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
Decision Trees What Are They?
Decision Trees What Are They? Introduction...1 Using Decision Trees with Other Modeling Approaches...5 Why Are Decision Trees So Useful?...8 Level of Measurement... 11 Introduction Decision trees are a
The Use of Open Source Is Growing. So Why Do Organizations Still Turn to SAS?
Conclusions Paper The Use of Open Source Is Growing. So Why Do Organizations Still Turn to SAS? Insights from a presentation at the 2014 Hadoop Summit Featuring Brian Garrett, Principal Solutions Architect
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
PAKDD 2006 Data Mining Competition
PAKDD 2006 Data Mining Competition Date Submitted: February 28 th, 2006 SAS Enterprise Miner, Release 4.3 Team Members Bhuvanendran, Aswin Bommi Narasimha, Sankeerth Reddy Jain, Amit Rangwala, Zenab Table
M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1. 15.7 Analytics and Data Mining 1
M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1 15.7 Analytics and Data Mining 15.7 Analytics and Data Mining 1 Section 1.5 noted that advances in computing processing during the past 40 years have
Chapter 12 Discovering New Knowledge Data Mining
Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics
Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This
USING PREDICTIVE ANALYTICS TO UNDERSTAND HOUSING ENROLLMENTS
USING PREDICTIVE ANALYTICS TO UNDERSTAND HOUSING ENROLLMENTS Heather Kelly, Ed.D., University of Delaware Karen DeMonte, M.Ed., University of Delaware Darlena Jones, Ph.D., EBI MAP-Works Predictive Analytics:
Nagarjuna College Of
Nagarjuna College Of Information Technology (Bachelor in Information Management) TRIBHUVAN UNIVERSITY Project Report on World s successful data mining and data warehousing projects Submitted By: Submitted
ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS
DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.
