PAKDD 2006 Data Mining Competition


Date Submitted: February 28th, 2006
Software: SAS Enterprise Miner, Release 4.3
Team Members: Bhuvanendran, Aswin; Bommi Narasimha, Sankeerth Reddy; Jain, Amit; Rangwala, Zenab

Table of Contents
o Executive Summary
o CRISP-DM Process
   I. Business Understanding
   II. Data Understanding
   III. Data Preparation
   IV. Modeling (variable selection methods: Regression, Decision Tree, Variable Selection node)
   V. Assessment (Lift Charts, Sensitivity, Variable Selection)
   VI. Evaluation

EXECUTIVE SUMMARY

An Asian telecommunications operator has successfully launched a third-generation (3G) mobile telecommunications network. The company has already collected information about its customers' usage patterns as well as demographic information, and wants to use this data to identify which customers are likely to switch from the current 2G network to the new 3G network. The training data set provided to us contained 251 variables with usage statistics for 20,000 customers, each labeled as a 2G or 3G patron.

The main goal of this data mining project was to analyze the given data and understand the factors that increase the likelihood of a customer moving to 3G. To achieve this, various models were built and compared against each other to find the model that is most sensitive to consumers moving from 2G to 3G.

Around 80% of the effort was concentrated on pre-modeling exercises: recoding data and fixing type mismatches, understanding automatically rejected variables and rejecting further irrelevant ones based on business understanding, cleaning and replacing missing values, and transforming some of the data. A noteworthy aspect of the data was the occupation code, which was missing for 63% of the records (Table 4). We guessed that this could be due to data coding errors or to occupation codes that were too specific. Since we do not consider ourselves experts in this domain, we were very cautious in rejecting variables and left the remaining variables to be narrowed down by the variable selection models (see the Variable Selection Phase).

The three variable selection methods, chosen based on team experience, were linear regression, a decision tree (Gini index) and the Variable Selection node. Using linear regression as the variable selection method, models such as logistic regression, a decision tree and a multilayer perceptron were built and compared against each other.

We concluded that the decision tree model using the chi-square criterion with a three-way split was the most significant predictor of the true positives (3G predicted as 3G), correctly predicting them 79.17% of the time (Table 6), compared to the other tree models.

Similarly, using the decision tree as the variable selection method, logistic regression, multilayer perceptron and decision tree models were built and compared against each other. Once again the decision tree model (chi-square test, three-way split) performed best at predicting 3G as 3G, although with a slightly higher misclassification rate of 22.33% (Table 7) compared to 21.33% (Table 6) over the test data set.

Finally, the Variable Selection node was used in the same way to select inputs. Among the three models built (logistic regression, multilayer perceptron and decision tree), logistic regression performed better than the others, with a clearly better overall misclassification rate of 21.17% and a 78.82% rate of predicting 3G customers correctly (Table 8).

Furthermore, for each variable selection method, the three models were combined into an ensemble model. The idea was to check whether the combined models could produce a better result than what was already achieved. After building the ensemble models, lift charts were used to compare each of the models. Combining the models did not have much effect: the ensemble did not perform any better than the individual models (Table 9), so it was rejected. The ensemble model and the three best models were then compared against each other, and the logistic regression model built on the Variable Selection node performed better than all the other models created thus far.

Since the main aim of this modeling process was to find the models' ability to predict the actual 3G customers, sensitivity was a major selection factor between models, alongside lift charts used to compare model performance. The logistic regression model yields a sensitivity of 78.61% and performed relatively better than the other models up to the 40th percentile, and was therefore chosen as the best model to predict the customers who will move from the 2G to the 3G network. Across all deciles, logistic regression built on the Variable Selection node captures the largest percentage of responses, with 66.04% at the 40th percentile (Table 12).

It is quite clear from the model that the age of the handset dictates the probability that the customer will not shift to the 3G network (Figure 23): older handsets lower the likelihood. This could be because such customers are more familiar with the features of the handset and are not willing to use, or have no use for, the new features. Similar results are evident for the age of the customer. The more a customer downloads or plays games on the phone, the more likely he or she is to move to 3G. Different sub-plans and different handsets also increase or decrease the likelihood of a customer moving to 3G.

The ideal 3G customer profile from the selected model would be:
o a young customer,
o who pays an average monthly bill of around $500 (hence affluent),
o has a handset not more than 4 years old,
o has not received too many retention campaigns,
o downloads around 22,500 Kb a month, and
o has a phone manufactured by code 49 with sub-plan 2101, or by code 1 with sub-plan 2108.

It is probable that the selected sub-plans offer 3G-related services.

CRISP-DM Modeling Process

I. Business Understanding

An Asian telecom operator which has successfully launched a third generation (3G) mobile telecommunications network would like to use existing customer usage and demographic data to identify which customers are likely to switch to its 3G network. Carriers are likely to embrace 3G because their former cash cow, basic cellular, is becoming a low-cost commodity, and wireless services will be an increasingly important revenue stream for the company. The motivation for building 3G networks, according to McMahan of TI, is "to differentiate themselves and to have another revenue source." 3G phones offer superior voice quality and enhanced broadband and data connection services (such as streaming audio and video) of up to 2 Mbps to cater to the high-technology needs of the consumer.

Data Mining Goal

There are three main goals of this data mining project:
o Analyze the given data and understand the factors that play a role in increasing the likelihood of a customer moving to 3G.
o Build a model that is most sensitive to consumers moving to 3G.
o Score the dataset with this model and predict the unknown target variable (CUST_TYPE = 2G/3G).

Software Used

We used SAS Enterprise Miner, Release 4.3, for this data mining project.

Preliminary Project Plan

1. Build a predictive model to predict the likelihood of a customer moving to 3G services.
   i. First we understand and explore the data.
   ii. Change data types according to business understanding.

   iii. Reject unwanted variables by means of logical interpretation of the dataset.
   iv. Sample and balance the data (if required) to obtain a better prediction model.
   v. Partition the data into training, validation and test samples.
   vi. Analyze missing values and determine appropriate replacement strategies.
   vii. Observe the distribution of the data and implement suitable transformation strategies.
   viii. Build various predictive models to effectively and accurately capture the 3G consumer profile.
   ix. Compare the results from the models built and select one.
   x. Explain the business implications.

II. Data Understanding

Data Collection

The team will use the company data set, composed of fields available at the individual customer level and provided by the Pacific-Asia Conference on Knowledge Discovery and Data Mining. An original sample dataset of 20,000 2G network customers and 4,000 3G network customers was provided, with 251 data fields. The target categorical variable is CUSTOMER_TYPE (2G/3G). A 3G customer is defined as a customer who has a 3G Subscriber Identity Module (SIM) card and is currently using a 3G network compatible mobile phone. Three-quarters of the dataset (15K 2G, 3K 3G) had the target field available and was used for training/testing.

Describe and Explore Data

The data set mostly contains telephone usage statistics for each customer. It also contains details specific to the customer's relationship with the company, in terms of the customer loyalty program and contract details.

The data set also contains a variable appended to it based on the customer's choice of service (2G/3G). This variable (CUST_TYPE) is the target variable to be predicted. To facilitate SAS, we coded it as 0 for 2G and 1 for 3G.

We further analyzed the input data of 15K customer records by attaching a Distribution Explorer node in SAS and identified the key features listed below (an illustrative code sketch follows the list):
o The customers have handsets from 23 different manufacturers, span 40 nationalities (coded from 36 to 998) and are categorized into 15 different occupation types, of which 63% have missing values.
o The customers can choose from 66 different sub-plans and have the flexibility of paying by 5 different methods.
o The company offers a maximum contract of 2 years (730 days).
o The company has an ongoing loyalty program.
o The data has records for 128 unique serial numbers.
o It is evident that the higher the customer's usage, in terms of minutes and extra features, the higher the probability of moving to 3G handsets. But the proportion of people in this high-end bracket is very low.
o Customers purchasing new handsets now are more likely to go for the 3G network.
o Most popular handset maker: coded 7 (Figure 1: frequency count for handset maker).

o Most popular handset models: the three most frequent model codes (Figure 2: frequency count for handset models).
o Customers very rarely make use of their loyalty points, but those who do tend to lean towards the 3G network (Figure 3: loyalty points usage grouped by customer type).
o CS (cash) is the most favored payment method for both 2G and 3G customers (Figure 4: payment methods grouped by customer type).
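The same kind of frequency exploration done with the Distribution Explorer node can be reproduced outside Enterprise Miner. The following is a minimal pandas sketch, not the node itself; the file name is an assumption, while the column names are taken from the report.

```python
import pandas as pd

# Assumed file name for the competition training extract.
df = pd.read_csv("pakdd2006_train.csv")

# Frequency of handset manufacturers (analogue of Figure 1).
print(df["HS_MANUFACTURER"].value_counts().head(10))

# Payment method split by customer type (analogue of Figure 4).
print(pd.crosstab(df["PAY_METD"], df["CUSTOMER_TYPE"], normalize="columns"))
```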

Recoding and Data Type Adjustment

For ease of use in SAS, we recoded the following variables (see the code sketch below):
o The CUSTOMER_TYPE variable as 0 for 2G customers and 1 for 3G; the new variable was named CUSTOMER_TYPE1.
o Created a new variable PAYMENT_CHG_FLAG, which indicates whether a customer has ever changed his payment option plan in the past.

By default, due to data misinterpretation or missing values in the data, SAS had wrongly assumed the data types of some variables. An initial analysis required changing the data types assumed by SAS to those dictated by data understanding.

Table 1: Variables for which data types were changed (variable: SAS coding -> changed to; reason; observation)
o NATIONALITY: interval -> nominal; >10 numeric codes (40 codes).
o CUSTOMER_CLASS: ordinal -> nominal; one class is not greater than the other.
o SUBPLAN: interval -> nominal; >10 numeric codes (66 codes).
o SUBPLAN_PREVIOUS: interval -> nominal; >10 numeric codes.
o NUM_SUSP_TEL: ordinal -> interval; <10 numeric values (max = 8); most people have no more than 8 suspended lines in the dataset.
o NUM_DELINQ_TEL: ordinal -> interval; <10 numeric values (max = 5); most people have no more than 5 delinquent lines in the dataset.
o PAY_METD_CHG: ordinal -> interval; <10 numeric values; most customers have not changed their payment method more than 10 times.
o HS_CHANGE: unary (rejected) -> interval; all values are 0; either the customer hasn't changed their handset in six months or hasn't notified the company about it.
o HS_MANUFACTURER: interval -> nominal; manufacturer codes are assumed to be numbers.
o HS_MODEL: interval -> nominal; model numbers are assumed to be numbers.
o TELE_CHANGE_FLAG: ordinal (rejected) -> interval; 99.97% are 0's, values range from 0 to 2; almost all customers haven't changed their number in 6 months.
o TOT_PAST_DELINQ: ordinal -> interval; <10 numeric values, range 0-4.
o TOT_PAST_OVERDUE: ordinal -> interval; <10 numeric values, range 0-6.
o TOT_PAST_TOS: ordinal -> interval; <10 numeric values, range 0-1, most records are 0; most customers have not changed their normal status to temporary on suspension.
o TOT_TOS_DAYS: rejected -> numeric; >98% are 0; most customers never remained on suspension during the past 6 months.
o TOT_PAST_DEMAND: rejected (unary) -> numeric; all values are 0.
o TOT_PAST_REVPAY: ordinal -> interval; 97.3% are 0's.
o REVPAY_PREV_CD: ordinal -> interval; 97.3% are 0's.
o VAS_VMN_FLAG: rejected -> nominal; 99.8% are 0's; most customers do not use their own voice mail notification.
o VAS_INFOSRV: unary (rejected) -> numeric; all values are 0.
o VAS_GPRS_FLAG: binary -> nominal; range 0, 1, 2, but 0% are 2's; no one in the dataset uses the GPRS Flexi plan.
o DELINQ_FREQ: ordinal -> interval; <10 numeric values.
o OD_FREQ: binary -> interval; only 2 unique values.
o REVPAY_FREQ: ordinal -> interval; <10 numeric values.
o AVG_VAS_QI: ordinal -> interval; >10 numeric values.
o AVG_VAS_QP: ordinal -> interval; >10 numeric values.
o AVG_VAS_QTXT: ordinal -> interval; >10 numeric values.

On further analyzing the data using a Distribution Explorer node, we discovered the following trends:
o As usage of games on phones increases, customers are more likely to move towards 3G (Figure 5(a)).
o Most 3G customers do not make more calls at peak time (which is highly expensive).
o A higher proportion of 2G customers default than 3G customers (this could be because the sample has 83.33% 2G customers).
o Most 3G customers prefer to pay by cash and are not likely to change this method.
o Most 3G customers have fairly new handset models.
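A minimal sketch of the recoding and type adjustment described above, written in Python/pandas rather than Enterprise Miner; the file name and the rule used to derive PAYMENT_CHG_FLAG are assumptions, since the report only names the new variable.

```python
import pandas as pd

# Assumed file name; column names are taken from the report.
df = pd.read_csv("pakdd2006_train.csv")

# Target recoded for modeling: 2G -> 0, 3G -> 1 (new variable CUSTOMER_TYPE1).
df["CUSTOMER_TYPE1"] = (df["CUSTOMER_TYPE"] == "3G").astype(int)

# PAYMENT_CHG_FLAG: whether the customer has ever changed payment method.
# Assumption: derived from PAY_METD_CHG > 0; the report does not state the rule.
df["PAYMENT_CHG_FLAG"] = (df["PAY_METD_CHG"].fillna(0) > 0).astype(int)

# Numeric codes that are really labels are treated as nominal rather than interval.
for col in ["NATIONALITY", "SUBPLAN", "SUBPLAN_PREVIOUS", "HS_MANUFACTURER", "HS_MODEL"]:
    df[col] = df[col].astype("category")
```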

Figure 5(a): AVG_VAS_GAMES; Figure 5(b): AVG_PK_RATIO; Figure 5(c): BLACK_LIST_FLAG; Figure 5(d): HS_AGE

It is to be noted that the data set is not complete but contains a sufficiently large number of missing values, which is of prime importance (explained further later).

III. Data Preparation

Sampling

The observations above were made on the full dataset and serve as a data understanding tool, but they are not directly usable for prediction because the distribution of the target is heavily biased towards 2G users: 83.33% 2G versus 16.67% 3G.

Figure 6(a): Distribution of CUSTOMER_TYPE; Figure 6(b): Distribution of CUSTOMER_TYPE (after sampling)

We observed that with such a biased sample it was not possible to model the data well, as a model would learn to predict 2G rather than 3G customers. Such unbalanced data is undesirable for any model to run on. With this understanding we decided to sample the data to pick a balanced sample for training, validation and testing. To obtain the balanced sample we desired, we included all 3,000 of the 3G customers in the 15K dataset together with an equal number of randomly chosen 2G customers, using stratified sampling (random seed: 12345). The resulting 6,000 records were used for modeling.

Data Partition

In order to build the model, we partition the modeling dataset into three parts: a training dataset, a validation dataset and a test dataset. The team partitions both the balanced and the unbalanced dataset in the proportion 60:30:10, with a fixed random seed.
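A rough Python equivalent of the balanced sampling and the 60:30:10 partition; the file name is an assumption, while the seeds follow the report (12345 for sampling, 5217 quoted later for partitioning).

```python
import pandas as pd

df = pd.read_csv("pakdd2006_train.csv")   # assumed file name
SAMPLE_SEED = 12345                       # sampling seed quoted in the report
SPLIT_SEED = 5217                         # partition seed quoted later in the report

# Keep every 3G customer and draw an equal number of 2G customers at random.
g3 = df[df["CUSTOMER_TYPE"] == "3G"]
g2 = df[df["CUSTOMER_TYPE"] == "2G"].sample(n=len(g3), random_state=SAMPLE_SEED)
balanced = pd.concat([g3, g2]).sample(frac=1, random_state=SPLIT_SEED)  # shuffle

# 60:30:10 split into training, validation and test partitions.
n = len(balanced)
train = balanced.iloc[: int(0.6 * n)]
valid = balanced.iloc[int(0.6 * n): int(0.9 * n)]
test = balanced.iloc[int(0.9 * n):]
print(len(train), len(valid), len(test))
```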

Data Selection

Rejected variables by the model

The first task in preparing the data was the removal of irrelevant variables from the data source. For this purpose we used the discretion of SAS, as it rejected most unary (non-missing) variables and also variables whose distribution was irreparably skewed (e.g. 99.78% zeros in TOT_PAST_TOS). Using these methods, 32 variables were automatically rejected by SAS.

Table 2: Variables rejected by the model (variable (datatype): reason for rejection)
o AVG_VAS_#123# (numeric): unary value, 100% 0's
o AVG_VAS_CG (numeric): unary value, 100% 0's
o AVG_VAS_IDU (numeric): unary value, 100% 0's
o AVG_VAS_IEM (numeric): unary value, 100% 0's
o AVG_VAS_IFSMS (numeric): unary value, 100% 0's
o AVG_VAS_ISMS (numeric): unary value, 100% 0's
o AVG_VAS_MILL (numeric): unary value, 100% 0's
o AVG_VAS_QP (numeric): 99.96% 0's, almost unary
o AVG_VAS_SS (numeric): unary value, 100% 0's
o AVG_VAS_WLINK (numeric): unary value, 100% 0's
o DELINQ_FREQ (numeric): unary value, 100% 0's
o STD_VAS_#123# (numeric): unary value, 100% 0's
o STD_VAS_CG (numeric): unary value, 100% 0's
o STD_VAS_IDU (numeric): unary value, 100% 0's
o STD_VAS_IEM (numeric): unary value, 100% 0's
o STD_VAS_IFSMS (numeric): unary value, 100% 0's
o STD_VAS_ISMS (numeric): unary value, 100% 0's
o STD_VAS_MILL (numeric): unary value, 100% 0's
o STD_VAS_QP (ordinal): 99.96% 0's, almost unary
o STD_VAS_SS (numeric): unary value, 100% 0's
o STD_VAS_WLINK (numeric): unary value, 100% 0's
o TELE_CHANGE_FLAG (numeric): 99.97% 0's, almost unary
o TOT_PAST_DEMAND (numeric): unary value, 100% 0's
o TOT_PAST_TOS (numeric): 99.78% 0's, almost unary
o TOT_TOS_DAYS (numeric): 99.78% 0's, almost unary
o VAS_CSMS_FLAG: unary value, 100% 0's
o VAS_DRIVE_FLAG (binary): 99.97% 0's, almost unary
o VAS_IEM_FLAG (binary): 99.93% 0's, almost unary
o VAS_INFOSRV (numeric): unary value, 100% 0's
o VAS_SN_FLAG (binary): unary value, 100% 0's
o VAS_VMN_FLAG (binary): 99.99% 0's, almost unary
o VAS_VMP_FLAG (binary): 99.99% 0's, almost unary

Rejected Variables by the Team

Working and modeling with 200+ variables was still difficult, so we decided to use our business understanding and managerial discretion to eliminate further variables we felt were redundant or irrelevant. This was a cautious process, as throwing away data could wreck the model. We eliminated the 29 variables listed below, with our reasons for doing so.

Table 3: Variables rejected by the team (variable: reason for rejection)
o AVG_BILL_VOICE: correlated to VOICED and VOICEI
o AVG_CALL: correlated to outbound and inbound calls
o AVG_CALL_FRW: rejected by business understanding; judged irrelevant
o AVG_CALL_OP: correlated to OBOP and IBOP (outbound and inbound off peak)
o AVG_CALL_PK: correlated to OBPK and IBPK (outbound and inbound peak)
o AVG_CALL_T1: rejected by business understanding; T1 minutes are retained and are more descriptive than the number of calls
o AVG_DIS_1900: rejected by business understanding; 93.81% zeros
o AVG_MINS: correlated to inbound and outbound minutes (IB, OB)
o AVG_MINS_EXTRAN: rejected by business understanding; 98.88% zeros
o AVG_MINS_INT: correlated to INTT1, INTT2 and INTT3; besides, knowing the top 3 countries is judged to be sufficient information
o AVG_MINS_OP: correlated to OBOP and IBOP (outbound and inbound off peak)
o AVG_MINS_PK: correlated to OBPK and IBPK (outbound and inbound peak)
o HS_CHANGE: unary value; all zeros
o OD_FREQ: rejected by business understanding; 97.45% zeros
o PAY_METD_CHG: rejected by business understanding; instead we created a new variable PAY_METD_CHG_FLAG to indicate if the payment method has changed, which is more descriptive
o PAY_METD_PREV: rejected by business understanding; data is provided even for those who haven't changed payment method, so it is irrelevant
o REVPAY_FREQ: rejected by business understanding; 99.83% zeros
o STD_BILL_VOICE: correlated to VOICED and VOICEI
o STD_CALL: correlated to outbound and inbound calls
o STD_CALL_FRW: rejected by business understanding; judged irrelevant
o STD_CALL_OP: correlated to OBOP and IBOP (outbound and inbound off peak)
o STD_CALL_PK: correlated to OBPK and IBPK (outbound and inbound peak)
o STD_MINS: correlated to inbound and outbound minutes (IB, OB)
o STD_MINS_EXTRAN: rejected by business understanding; 98.88% zeros
o STD_MINS_IB: rejected by business understanding; too much data about the same statistic
o STD_MINS_OB: rejected by business understanding; too much data about the same statistic
o STD_MINS_OP: correlated to OBOP and IBOP (outbound and inbound off peak)
o STD_MINS_PK: correlated to OBPK and IBPK (outbound and inbound peak)
o STD_MINS_T1: rejected by business understanding; since T1 calls are retained, the number of minutes may be eliminated

Cleaning the Data

Handling missing values is one of the most challenging tasks in data mining. Running a model that cannot handle missing values on a data set where they occur in many places weakens its predictive or clustering power, since records containing missing values are discarded and fewer records are used to build the model. On the other hand, filling or imputing missing values improperly can mislead the prediction.

Many of the variables contained null/blank values, and the dataset needed to be cleansed of them before it was ready for the application of data mining techniques. One approach is to eliminate the rows containing null values, but with such a high percentage of rows affected this approach is not feasible. We concentrated only on those variables which were kept in the model. Analyzing these variables, we noticed missing values in several of them. Most were at a tolerable percentage, but OCCUP_CD came up with a staggering 63% missing values. From our understanding of the problem, we felt that the occupation code could hold an effect on the prediction, and hence the missing values had to be handled.

Yet replacement must be done so as to harm the model the least.

Table 4: Variables with missing values (variable: % missing; replacement strategy)
o DAYS_TO_CONTRACT_EXPIRY: 5%; tree imputation
o HS_MODEL: 3%; tree imputation
o AGE: 3%; tree imputation
o TOT_DEBIT_SHARE: 2%; tree imputation
o OCCUP_CD: 63%; default constant "U"
o MARITAL_STATUS: 6%; default constant "U"
o CONTRACT_FLAG: 5%; tree imputation
o PAY_METD_PREV: 5%; tree imputation
o PAY_METD: 5%; tree imputation
o HS_MANUFACTURER: 3%; tree imputation

OCCUP_CD: The unknown values could be due to data entry errors (differences in coding schemes or plain typing errors). They could also be due to the unwillingness of customers to disclose their occupation. Another possible reason is that the 15 occupation categories are so specific that some job profiles do not fit into them. So the unknown values become very significant in this case. Also, since more than 50% of the values are missing, tree imputation is not feasible, as the model would have only 37% of the data to learn from. Hence, to harm the model the least, we resort to character replacement and code missing values as "U" (Unknown).

The other variables are missing for only 6% or less of the records and are numeric or nominal, so they can be replaced using tree imputation. Mean replacement was tried for AGE, but tree imputation gave a more accurate and less biased result. All the imputed-value indicators, stored as M_<variable name>, were included in the variable selection phase as input variables to determine whether they influence predictive accuracy.
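Enterprise Miner's tree imputation has no one-line scikit-learn equivalent; the sketch below shows the same idea under stated assumptions: a constant "U" for the heavily missing categorical codes, and a decision tree fitted on complete cases to fill AGE (the predictor list and file name are illustrative).

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("pakdd2006_train.csv")   # assumed file name

# Categorical variables with many unknowns get an explicit "U" (unknown) level.
for col in ["OCCUP_CD", "MARITAL_STATUS"]:
    df[col] = df[col].fillna("U")

# Keep a missing-value indicator before imputing, mirroring the M_<variable> flags.
df["M_AGE"] = df["AGE"].isna().astype(int)

# Rough stand-in for tree imputation: predict missing AGE values with a decision
# tree trained on complete cases (predictor list assumed for illustration).
predictors = ["LINE_TENURE", "AVG_BILL_AMT", "HS_AGE"]
known = df[df["AGE"].notna()]
tree = DecisionTreeRegressor(max_depth=5, random_state=0)
tree.fit(known[predictors].fillna(0), known["AGE"])
missing = df["AGE"].isna()
df.loc[missing, "AGE"] = tree.predict(df.loc[missing, predictors].fillna(0))
```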

Transformation of Variables

The Central Limit Theorem states that as more and more random samples are drawn from a population, the distribution of the sample averages approaches a normal distribution. To end up with a good regression model, we therefore transform the input variables to obtain fairly normal distributions. From the data inspection in the previous section, the team also found that some interval variables have skewed distributions. To use predictive techniques such as multiple regression and neural networks, the interval inputs should be distributed approximately normally, so we use the transformation tool to transform each skewed variable into a new variable whose values are distributed as close to normal as possible. In total, 75 interval variables were transformed. The results shown below (only a few of the 75 distributions are shown) are quite satisfactory, since most variables that were previously skewed are now distributed closer to normal, giving the team more confidence that the adjusted distributions will help the models predict better.

Table 5: Transformed variables (variable: transform; reason for transformation)
o LINE_TENURE: bucket (5 bins; Bin 5: >2580); not normal, right skew
o DAYS_TO_CONTRACT_EXPIRY: bucket (Bin 1: <45 days; Bin 2: 45-450 days; Bin 3: >450 days); spike at the beginning
o AVG_USAGE_DAYS: bucket (Bin 1: 1-12 days; Bin 2: 12-23 days; Bin 3: 23-29 days; Bin 4: >29 days); left skew
o AVG_PK_MINS_RATIO: exponential; left skew
o AGE: log; right skew
o AVG_BILL_AMT: maximize normality; right skew
o AVG_BILL_SMS: maximize normality; right skew
o AVG_BILL_VOICED: maximize normality; right skew
o AVG_BILL_VOICEI: maximize normality; right skew
o AVG_CALL_EXTRAN: maximize normality; right skew
o AVG_CALL_EXTRANT1: maximize normality; right skew
o AVG_CALL_FIX: maximize normality; right skew

o Maximize normality, right skew: AVG_CALL_IB, AVG_CALL_INTRAN, AVG_CALL_LOCAL, AVG_CALL_MOB, AVG_CALL_OB, AVG_CALL_T1, AVG_EXTRAN_RATIO, AVG_M2M_MINS_RATIO, AVG_MINS_FIX, AVG_MINS_FRW, AVG_MINS_IBOP, AVG_MINS_IBPK, AVG_MINS_INTRAN, AVG_MINS_LOCAL, AVG_MINS_MOB, AVG_MINS_OBOP, AVG_MINS_OBPK, AVG_NO_CALLED, AVG_NO_RECV, AVG_OP_CALL_RATIO, AVG_OP_MINS_RATIO, AVG_SPHERE, AVG_VAS_GAMES, AVG_VAS_GPRS, AVG_VAS_SR, HS_AGE, LOYALTY_POINTS, NUM_TEL, STD_BILL_AMT, STD_BILL_VOICED, STD_BILL_VOICEI, STD_CALL_EXTRAN, STD_CALL_EXTRANT1, STD_CALL_FIX, STD_CALL_IB, STD_CALL_INTRAN, STD_CALL_LOCAL, STD_CALL_MOB, STD_CALL_OB, STD_MINS_EXTRANT1, STD_MINS_FIX, STD_MINS_IBOP, STD_MINS_IBPK, STD_MINS_INTRAN, STD_MINS_LOCAL
o AVG_PK_CALL_RATIO, AVG_PK_MINS_RATIO: maximize normality; left skew
o OD_REL_SIZE: maximize normality; spike in the middle

o Maximize normality, right skew: STD_MINS_MOB, STD_MINS_OBOP, STD_MINS_OBPK, STD_NO_CALLED, STD_NO_RECV, STD_OD_AMT, STD_PAY_AMT, STD_T1_CALL_CON, STD_VAS_GAMES, STD_VAS_GBSMS, STD_VAS_GPRS, STD_VAS_GPSMS, STD_VAS_SMS, TOT_DEBIT_SHARE
o STD_BUCKET_UTIL: square root; right skew

Example transformations:
Figure 7(a): LINE_TENURE before transformation; Figure 7(b): LINE_TENURE after transformation
Figure 7(a): DAYS_TO_EXPIRY before transformation; Figure 7(b): DAYS_TO_EXPIRY after transformation

Figure 8(a): AVG_USAGE_DAYS before transformation; Figure 8(b): AVG_USAGE_DAYS after transformation
Figure 9(a): AGE before transformation; Figure 9(b): AGE after transformation
Figure 10(a): AVG_BILL_AMT before transformation; Figure 10(b): AVG_BILL_AMT after transformation

Figure 11(a): NUM_TEL before transformation; Figure 11(b): NUM_TEL after transformation

Variable Selection Phase

The 200 variables remaining after data cleaning and preparation are fed into the models. Since our team is not expert in the telecommunications field, we eliminated variables only minimally, as an error of omission is more dangerous than an error of commission; we depend on the variable selection methods to eliminate the irrelevant variables. Selecting meaningful variables is a very important task in building a good model, and there are many ways of selecting a meaningful combination. Decision tree, the Enterprise Miner Variable Selection method (chi-square) and multiple regression are the techniques the team will use for variable selection. Besides their ability to select meaningful variables, both the decision tree and multiple regression are predictive techniques in themselves, so the team will also use them to build models that predict subscribers after the variables have been selected (remark: the logistic/multiple regression selects the best combination of variables while predicting the target variable).

Variable Selection Node

To select variables to feed into the models we first chose the Variable Selection node and, from previous experience, found chi-square variable selection to be most effective. We used a chi-square value of 3.84, with 50 bins, making 6 passes with a 0.50 cut-off. Of the variables fed in, 17 were selected for modeling.
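A rough analogue of this chi-square screening outside Enterprise Miner is sketched below: bin each interval input and keep it when its chi-square statistic against the target clears 3.84. The exact statistic computed by the Variable Selection node differs, so this is only an approximation; the file name is an assumption.

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("pakdd2006_prepared.csv")   # assumed prepared (recoded) dataset
target = df["CUSTOMER_TYPE1"]                # 0 = 2G, 1 = 3G

selected = []
for col in df.select_dtypes("number").columns.drop("CUSTOMER_TYPE1"):
    if df[col].nunique(dropna=True) < 2:
        continue                              # skip constant or empty columns
    binned = pd.qcut(df[col], q=50, duplicates="drop")   # ~50 bins, as in the node
    if binned.nunique() < 2:
        continue
    stat, p, dof, _ = chi2_contingency(pd.crosstab(binned, target))
    if stat > 3.84:                           # cutoff quoted in the report
        selected.append(col)
print(selected)
```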

Variables selected (Figure 12: variables selected by the Variable Selection node):
o AVG_M2M_CALL_RATIO (transformed)
o STD_M2M_MINS_RATIO
o HS_MANUFACTURER
o HS_MODEL
o SUBPLAN
o SUBPLAN_PREVIOUS
o TOP1_INT_CD
o TOP2_INT_CD
o TOT_RETENTION_CAMP
o AVG_BILL_AMT (transformed)
o AVG_BILL_VOICED (transformed)
o AVG_CALL_FIX (transformed)
o AVG_VAS_GAMES (transformed)
o AVG_VAS_SR (transformed)
o HS_AGE (transformed)
o STD_MINS_INTRAN (transformed)
o STD_VAS_GAMES (transformed)

Decision Tree as Variable Selector

From previous experience working with decision trees, our team has found the Gini reduction method of obtaining a pure tree to be the best variable selection approach. Using a decision tree with 6 branches, 6 levels of depth and the Gini index as the model assessment measure, we obtained the following variables to be fed into the models.

Variables selected (Figure 13: variables selected by the Decision Tree node):
o HS_AGE (most important)
o SUBPLAN
o HS_MODEL
o AVG_VAS_GAMES
o STD_VAS_GAMES
o AVG_BILL_AMT
o SUBPLAN_PREVIOUS
o TOP2_INT_CD
o AVG_MINS_INTRAN
o TOP3_INT_CD

Multiple Regression

Our team used stepwise multiple regression as another variable selection method. The resulting model yielded the following variables:
o AGE: maximize normality
o AVG_VAS_QG
o AVG_BILL_AMT: maximize normality
o AVG_VAS_GAMES: maximize normality
o AVG_PK_CALL_RATIO: maximize normality
o AVG_MINS_INTRAN: maximize normality
o AVG_OP_MINS_RATIO: maximize normality
o BLACK_LIST_CNT
o COBRAND_CARD_FLAG 0
o DAYS_TO_CONTRACT_EXPIRY
o HS_AGE: maximize normality
o HS_MANUFACTURER

o LST_RETENTION_CAMP
o LUCKY_NO_FLAG 0
o STD_CALL_LOCAL: maximize normality
o STD_VAS_GPRS: maximize normality
o STD_MINS_MOB: maximize normality
o SUBPLAN
o TOT_DEBIT_SHARE: maximize normality
o VAS_AR_FLAG
o LINE_TENURE: optimal binning
o Imputed indicator for DAYS_TO_CONTRACT_EXPIRY
o Imputed indicator for NATIONALITY

IV. Modeling

Figure 14: The entire modeling process

Taking into account the number of variables and the scope of the data mining goal, the ideal outcome would be 100% accurate prediction of 3G customers. As achieving such a perfect figure is extremely difficult given the complexity of the data set, it was decided that a sensitivity of around 80% could be taken as a fairly good predictive accuracy.

The data set was first sampled to produce a balanced data set (an equal number of 2G and 3G customers, 3,000 each). The 6,000 instances of the sampled data set were then partitioned into training, validation and test sets using a random seed of 5217. Based on the data preparation, selection and transformation carried out in the previous steps, we decided to build the following models:

1. Having performed variable selection using the Linear Regression node:
   i. Logistic Regression Models: built with LOGIT, CLOGLOG and PROBIT as link functions, using the stepwise selection method with criteria set to none.
   ii. Decision Tree Models: various types of decision tree models (chi-square, entropy and Gini, with both binary and 3-way splits) were built with different numbers of observations per leaf and required for a split.
   iii. Neural Network Models: built a multilayer perceptron, using the misclassification rate as the model selection criterion with a low-noise data setting.
2. Similar models were built using the Decision Tree node for variable selection.
3. Similar models were built using the Variable Selection node for variable selection.

It is to be noted that many different types of models were built, but only the ones with acceptable results are described below.
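For orientation, a hedged scikit-learn sketch of the three model families is shown below. It is not a literal reproduction: scikit-learn's tree uses binary Gini/entropy splits rather than the chi-square three-way splits available in Enterprise Miner, and the prepared file name is assumed. The leaf and split minimums follow the settings quoted later in the report.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("pakdd2006_prepared.csv")   # assumed: numeric, imputed, transformed
X, y = df.drop(columns=["CUSTOMER_TYPE1"]), df["CUSTOMER_TYPE1"]
# Simple 90/10 hold-out for the test set; the report's actual partition is 60:30:10.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=5217)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(min_samples_leaf=5, min_samples_split=36, random_state=0),
    "multilayer perceptron": MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    misclass = 1.0 - model.score(X_test, y_test)   # misclassification rate on test data
    print(f"{name}: test misclassification rate = {misclass:.4f}")
```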

Variable selection method - Linear Regression node

Based on the various decision tree models built, it was seen that the chi-square model with a three-way split was the most significant predictor of the true positives (3G predicted as 3G) compared to the other tree models, so it was decided to eliminate the other models. Similarly, the same was done for the different types of neural network (different selection criteria and data settings) and regression models. Table 6 gives the overall performance of the three best models, while Figure 15 shows the modeling process.

Figure 15: Modeling using Linear Regression for the purpose of variable selection

Since the goal of this project is to concentrate on the accurate prediction of customers who would turn to the 3G network, it was decided that the sensitivity figures would be treated as of foremost importance.

Table 6: Comparison of the three best models (Linear Regression mode of variable selection), reporting test, training and validation misclassification rates and test sensitivity for: Chi-square Test 3-way (selected model), Multilayer Perceptron, Logistic Regression, and Ensemble (all 3 models).

As is obvious from the table above, we decided to go with the decision tree model (chi-square test, 3-way split). To build this tree model we changed the defaults to hold a minimum of 5 observations in each leaf node and to require 36 observations for each split, together with an adjusted significance level. It was observed that the chi-square decision tree model picked HS_AGE, AVG_MINS_INTRAN, AVG_VAS_GAMES, AVG_BILL_AMT, LST_RETENTION_CAMP and then VAS_AR_FLAG as the most defining characteristics of the data set, in that order of significance.

Variable selection method - Decision Tree

Similar to the way the previous models were built using the Regression node as a variable selector, the Decision Tree node was also used to perform the same function. Table 7 gives the overall performance of the three best models; the modeling process is shown graphically in Figure 16.

Figure 16: Modeling using Decision Tree for the purpose of variable selection

When constructing the decision tree model it has to be noted that the Decision Tree node can factor missing values and the distributions of the various variables into the modeling process. Hence the decision tree variable selection node was connected directly to the model nodes from the Data Partition node.

For the regression and the neural network, on the other hand, replacement and transformation play a very important part of the modeling process. The three models were run separately and also together as an ensemble model, and the results were collected.

For variable selection, the Decision Tree node was set to the Gini reduction splitting criterion with a maximum of 3 splits allowed from each node, the model assessment measure set to total leaf impurity (Gini index), a minimum of 5 observations in each leaf node, 36 observations required for each split and a depth of 8. All other settings in the node were left at their defaults.

Table 7: Comparison of the three best models (Decision tree mode of variable selection), reporting test, training and validation misclassification rates and test sensitivity for: Chi-square Test 3-way (selected model), Multilayer Perceptron, Logistic Regression, and Ensemble (all 3 models).

In this case also the chi-square model with a 3-way split performed best, with a sensitivity for true positives of 76.39%. It was observed that the model picked HS_AGE, SUBPLAN, AVG_VAS_GAMES, AVG_BILL_AMT, HS_MODEL, AVG_MINS_INTRAN and then STD_VAS_GAMES, in that order of significance.

Variable selection method - Variable Selection Node

Similar to the way the models were built using the Decision Tree node as a variable selector, the Variable Selection node was also used to perform the same function. We used the chi-square method of selection with a set cutoff. Unlike in the previous modeling process, the Variable Selection node cannot handle discrepancies in the data such as missing values, so for this modeling phase it was executed only after the replacement and transformation of the data were completed.

Figure 17 shows the modeling process. As before, the individual models as well as their combination (ensemble) were run and results collected.

Figure 17: Modeling using the Variable Selection node for the purpose of variable selection

Table 8 gives the overall performance of the three best models. Two models had the same values for both sensitivity and overall accuracy; as it is difficult to choose between them, it was decided to take both into account for further analysis.

Table 8: Comparison of the three best models (Variable selection node), reporting test, training and validation misclassification rates and test sensitivity for: Chi-square Test 3-way (selected model), Multilayer Perceptron, Logistic Regression, and Ensemble (all 3 models).

It was observed that the model picked HS_AGE, AGE, AVG_BILL_AMT, AVG_VAS_GAMES, STD_VAS_GPRS and then TOT_RETENTION_CAMP, in that order of significance. The model also picked SUBPLAN and HS_MANUFACTURER, with each type of plan affecting the chances of the customer moving from 2G to 3G in a different way: some plans meant the customer would prefer to stay with the 2G network, while others meant a possibility of change. The age of a handset affects the model in a negative manner, i.e. the longer a person has used a particular handset, the more likely he is to stay with his current network. The age of a person is also observed to have a negative effect, i.e. the older the person, the less likely it is that he or she will move over to the 3G network.

Based on the comparative sensitivities of the models, and taking anything greater than 75% to be a reasonable mark, we chose 3 models for further analysis.

Table 9: Comparison of the top 3 models built thus far (variable selection method / model), reporting test, training and validation misclassification rates and test sensitivity for: Linear Regression / Chi-square Test 3-way, Decision Tree / Chi-square Test 3-way, and Variable Selection node / Logistic Regression.

V. Assessment

The three models selected were combined to produce another ensemble model. This model was built to check whether, when combined, they were capable of producing a better result than what was already achieved. In the next section we compare these four models using lift charts.
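The Ensemble node combines the posterior probabilities of the component models; a minimal scikit-learn stand-in is soft voting over the three classifiers, sketched below. The estimator settings are illustrative, and the training data is assumed to come from the earlier modeling sketch.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# Soft voting averages the predicted class probabilities of the three models,
# which is roughly what the Enterprise Miner Ensemble node does with posteriors.
ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(min_samples_leaf=5, min_samples_split=36, random_state=0)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0)),
        ("logit", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
# ensemble.fit(X_train, y_train); 1 - ensemble.score(X_test, y_test) gives the
# misclassification rate to compare against the individual models, as in Table 9.
```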

Lift chart (% response) on Validation Data - Non-Cumulative

To assess how well the models we constructed perform with respect to the baseline model (i.e. the current business practice), we compared them against the baseline to obtain a clearer picture of whether they truly perform better and by how much. We used lift charts to gain an overall perspective; these charts help us understand the difference in prediction not only as a number but also in terms of business practice. The average percentage response (baseline model) captures 50.39% of the responses in any decile.

Table 10: Comparison of the models (Ensemble, VS -> DT, DT -> DT, Logistic) against the baseline model (50.39%): % response, non-cumulative, by percentile.

In the top 10 and 20 percentiles, the logistic model (Variable Selection node used for variable selection) performs best, capturing a 92.22% response rate. At the 30th percentile, the decision tree model (Gini decision tree node used for variable selection) edges ahead of the other three, capturing 86.83% of the response. Looking at the figures, however, it is managerially more viable to target the top 20 percentiles (this reduces the cost of mailing, if that is the action to be taken), where we capture a much greater response using the logistic regression model, than to go after the 86.83% at the 30th percentile with the decision tree model.
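The decile figures behind Tables 10-13 can be recomputed from any scored validation set; a small helper is sketched below. The scoring call in the final comment assumes the variable names from the earlier modeling sketch.

```python
import numpy as np
import pandas as pd

def lift_table(y_true, p_score, n_bins=10):
    """Decile table: non-cumulative % response, cumulative % captured response,
    and cumulative lift, computed from predicted 3G probabilities."""
    d = pd.DataFrame({"y": np.asarray(y_true), "p": np.asarray(p_score)})
    d = d.sort_values("p", ascending=False).reset_index(drop=True)
    d["decile"] = (np.arange(len(d)) * n_bins // len(d)) + 1
    base_rate = d["y"].mean()
    g = d.groupby("decile")["y"].agg(["mean", "sum", "count"])
    g["pct_response"] = 100 * g["mean"]                               # non-cumulative % response
    g["pct_captured_cum"] = 100 * g["sum"].cumsum() / d["y"].sum()    # cumulative % captured
    g["lift_cum"] = (g["sum"].cumsum() / g["count"].cumsum()) / base_rate
    return g[["pct_response", "pct_captured_cum", "lift_cum"]]

# Example: lift_table(y_test, models["logistic regression"].predict_proba(X_test)[:, 1])
```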

Legend: Ensemble = combination model; Var:Chi-3 = decision tree model with the Variable Selection node; DT:Chi-3 = decision tree model with the Gini tree for variable selection; Logistic = logistic regression model with the Variable Selection node.

Fig 18: Lift chart (% response) on Validation Data - Non-Cumulative

Thus we see the logistic regression model performs better than the other three models. We chose the Entropy-2 model as a better predictor of percentage responses, and definitely a huge improvement over the current baseline model.

Lift chart (% Captured Response) on Validation Data - Non-Cumulative

To further analyze the models we used lift charts based on non-cumulative captured response. If we were to target the top 10% alone, the logistic regression model yields the higher captured response, i.e. it captures 18.30% of the actual 3G customers in the data. In the next decile it again captures 18.30%. At the third decile the decision tree model (Decision Tree node for variable selection) performs best, capturing 17.23% of the actual 3G customers; but by this point the logistic regression model has captured nearly 52.26% of the actual 3G customers in the data, compared to 51.73% by the next best model.

Table 11: Comparison of the models (Ensemble, VS -> DT, DT -> DT, Logistic) against the baseline model (10% per decile): % captured response, non-cumulative, by percentile.

The captured response values and their graphical representation clearly indicate two things: one, that up to the 20th percentile the logistic regression model outdoes the other models; and two, that at the 55th percentile all the models perform worse than the baseline model. We can thus conclude that this chart also indicates the logistic regression model to be the best of the 4 models in terms of capturing the most actual buyers in the fewest deciles.

Legend: Ensemble = combination model; Var:Chi-3 = decision tree model with the Variable Selection node; DT:Chi-3 = decision tree model with the Gini tree for variable selection; Logistic = logistic regression model with the Variable Selection node.

Fig 19: Lift chart (% Captured response) on Validation Data - Non-Cumulative

Lift chart (% Captured Response) on Validation Data - Cumulative

Speaking cumulatively, if we were to target only the top 10% of the scored file, the logistic regression model would capture the most (18.30%) of the actual 3G customers. At 20% of the file, it captures 36.60%. If we were to choose just the top 30% of the data, it would capture 52.26% of the actual customers while having mailed much less than half of the population. At 40%, it performs optimally, capturing 66.04% of the actual 3G customers.

Table 12: Comparison of the models (Ensemble, VS -> DT, DT -> DT, Logistic): % captured response, cumulative, by percentile.

The table above as well as the graph below clearly illustrate that the logistic regression model is the best performing of the models in comparison.

Legend: Ensemble = combination model; Var:Chi-3 = decision tree model with the Variable Selection node; DT:Chi-3 = decision tree model with the Gini tree for variable selection; Logistic = logistic regression model with the Variable Selection node.

Fig 20: Lift chart (% captured response) on Validation Data - Cumulative

Lift Value on Validation Data - Cumulative

In terms of how the models perform against the baseline, we see that if we target just the top 10% or 20% of the file, the logistic regression model performs best, with a lift of 1.83: it captures 1.83 times the percentage of buyers captured by the baseline. At 30% of the file its performance drops, though not considerably, to 1.74 times; at the 30th percentile the logistic regression model is still almost twice as good as the baseline. This may look like a small number, but applied to millions of records it would make a tangible difference in profits.

Table 13: Comparison of the models (Ensemble, VS -> DT, DT -> DT, Logistic): lift value, cumulative, by percentile.

Legend: Ensemble = combination model; Var:Chi-3 = decision tree model with the Variable Selection node; DT:Chi-3 = decision tree model with the Gini tree for variable selection; Logistic = logistic regression model with the Variable Selection node.

Fig 21: Lift value on Validation Data - Cumulative

After evaluating all four models, we have concluded that the logistic regression model built with the Variable Selection node performs better than all the other models created thus far.

Sensitivity

The ability of the models to predict the actual 3G customers is of greater impact than predicting those who will not move. Hence the sensitivity of the model is one of the selection factors between models. The logistic regression model yields a sensitivity of 78.61% over the training data set (Figure 22).

Figure 22: Cross-tab plot: Logistic Regression
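Sensitivity here is the true-positive rate on the 3G class; a minimal computation from actual and predicted labels is sketched below (the example call assumes the variable names from the earlier modeling sketch).

```python
from sklearn.metrics import confusion_matrix

def sensitivity(y_true, y_pred):
    """Share of actual 3G customers (label 1) that are predicted as 3G."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tp / (tp + fn)

# Example: sensitivity(y_test, models["logistic regression"].predict(X_test))
```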

Variable Selection

Logistic regression selects 63 variables (inclusive of dummy-coded variables).

Figure 23: Effect of the variables

It is quite clear from the chart above that the older the handset, the higher the probability that the customer will not shift to the 3G network. This could be because the customer is more familiar with the features of the current handset and is not willing to use, or has no use for, the new features. Similar results are evident for the age of the customer and the sub-plan he or she has chosen. It is also seen, however, that although the sub-plan has an overall negative effect in the model, there are some plans which, when chosen by a customer, increase the chances of shifting to the 3G network. The model has a positive intercept, indicating that if nothing is known about a customer he still has some likelihood of shifting to 3G.

VI. Evaluation

We have evaluated the regression model as the more accurate predictor of 3G customers.

Quantitative Analysis

To quantify our results and justify our model selection in more managerial terms, we present the example below. We know that in general younger people prefer change, and more so in the telecom industry. For the purpose of predicting potential 3G customers, it is clear that the target customer is young, affluent, and uses more features such as games and multimedia messaging.
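The positive and negative effects in Figure 23 are logistic regression coefficients; exponentiating a coefficient gives the odds ratio for a one-unit increase in that input. The numbers below are purely illustrative, not the fitted values, which live inside the Enterprise Miner model.

```python
import numpy as np

# Illustrative coefficients only: negative signs for HS_AGE and AGE, positive for games usage.
coefs = {"HS_AGE": -0.45, "AGE": -0.02, "AVG_VAS_GAMES": 0.30}
for name, b in coefs.items():
    # An odds ratio below 1 means the odds of moving to 3G fall as the input grows.
    print(f"{name}: odds ratio per unit increase = {np.exp(b):.2f}")
```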

From a managerial point of view we try to quantify our model, taking the example of a 25-year-old customer paying on average $500 for telecom services, who downloads close to 22,500 Kb monthly and has a handset around 4 years old. This customer's likelihood of being a potential 3G customer changes if the manufacturer of the phone or the sub-plan changes.

Table 14: Probability of purchase compared to customer profiles (manufacturer code, sub-plan code, number of retention campaigns sent, probability); the key profiles are described below.

We see that if the customer has a phone manufactured by the manufacturer coded 49, uses sub-plan 2101 and is sent 3 retention campaigns, he has the highest probability of moving to the 3G service: 99.99%. This is our IDEAL customer. If we change just the manufacturer to 1, the customer now has only a 0.12% chance of changing to 3G. The same customer with a handset manufactured by code 6 has a 73.90% chance of moving to 3G, which is our mid-tier probability. If we increase the retention campaigns sent to the customer, the probability drops; hence we recommend not flooding the customer with retention campaigns. Our next ideal profile is a customer with a manufacturer 1 phone but sub-plan 2108, who has a 98.42% chance of moving to 3G.

Summary - Ideal Profile:
o Young customer, age around 25.
o Uses games and downloads around 22,500 Kb a month.
o Pays a monthly bill of $500 and has a handset around 4 years old.
o Total retention campaigns sent should be around 5, not more.
o Has a phone manufactured by code 49 with sub-plan 2101, or
o a phone manufactured by code 1 with sub-plan 2108.
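The probability swings in Table 14 follow directly from the logistic link: changing one coded input shifts the linear score z, and the probability moves through 1/(1 + e^-z). The scores below are back-solved to match the probabilities quoted above and are purely illustrative, not the fitted model.

```python
import numpy as np

def logistic(z):
    """Logistic link: converts a linear score into a probability of moving to 3G."""
    return 1.0 / (1.0 + np.exp(-z))

# Two profiles that differ only in handset manufacturer (illustrative scores).
z_ideal = 9.2    # manufacturer 49, sub-plan 2101, 3 retention campaigns
z_other = -6.7   # same customer with the manufacturer switched to code 1
print(f"ideal profile:   P(3G) = {logistic(z_ideal):.4f}")   # about 0.9999
print(f"changed handset: P(3G) = {logistic(z_other):.4f}")   # about 0.0012
```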


Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Training Brochure 2009 TABLE OF CONTENTS 1 SPSS TRAINING COURSES FOCUSING

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"

!!!#$$%&'()*+$(,%!#$%$&'()*%(+,'-*&./#-$&'(-&(0*.$#-$1(2&.3$'45 !"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"!"#"$%&#'()*+',$$-.&#',/"-0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:

More information

What is Data Mining? MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling

What is Data Mining? MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling MS4424 Data Mining & Modelling MS4424 Data Mining & Modelling Lecturer : Dr Iris Yeung Room No : P7509 Tel No : 2788 8566 Email : msiris@cityu.edu.hk 1 Aims To introduce the basic concepts of data mining

More information

Identifying and Overcoming Common Data Mining Mistakes Doug Wielenga, SAS Institute Inc., Cary, NC

Identifying and Overcoming Common Data Mining Mistakes Doug Wielenga, SAS Institute Inc., Cary, NC Identifying and Overcoming Common Data Mining Mistakes Doug Wielenga, SAS Institute Inc., Cary, NC ABSTRACT Due to the large amount of data typically involved, data mining analyses can exacerbate some

More information

not possible or was possible at a high cost for collecting the data.

not possible or was possible at a high cost for collecting the data. Data Mining and Knowledge Discovery Generating knowledge from data Knowledge Discovery Data Mining White Paper Organizations collect a vast amount of data in the process of carrying out their day-to-day

More information

IBM SPSS Direct Marketing 19

IBM SPSS Direct Marketing 19 IBM SPSS Direct Marketing 19 Note: Before using this information and the product it supports, read the general information under Notices on p. 105. This document contains proprietary information of SPSS

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

A Property and Casualty Insurance Predictive Modeling Process in SAS

A Property and Casualty Insurance Predictive Modeling Process in SAS Paper 11422-2016 A Property and Casualty Insurance Predictive Modeling Process in SAS Mei Najim, Sedgwick Claim Management Services ABSTRACT Predictive analytics is an area that has been developing rapidly

More information

DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING

DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING ABSTRACT The objective was to predict whether an offender would commit a traffic offence involving death, using decision tree analysis. Four

More information

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics. Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing

More information

Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA

Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA An Overview of SAS Enterprise Miner The following article is in regards to Enterprise Miner v.4.3 that is available in SAS v9.1.3.

More information

How To Make A Credit Risk Model For A Bank Account

How To Make A Credit Risk Model For A Bank Account TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

More information

An Overview and Evaluation of Decision Tree Methodology

An Overview and Evaluation of Decision Tree Methodology An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX terri.moore@motorola.com Carole Jesse Cargill, Inc. Wayzata, MN carole_jesse@cargill.com

More information

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in

More information

Application of SAS! Enterprise Miner in Credit Risk Analytics. Presented by Minakshi Srivastava, VP, Bank of America

Application of SAS! Enterprise Miner in Credit Risk Analytics. Presented by Minakshi Srivastava, VP, Bank of America Application of SAS! Enterprise Miner in Credit Risk Analytics Presented by Minakshi Srivastava, VP, Bank of America 1 Table of Contents Credit Risk Analytics Overview Journey from DATA to DECISIONS Exploratory

More information

Business Intelligence. Tutorial for Rapid Miner (Advanced Decision Tree and CRISP-DM Model with an example of Market Segmentation*)

Business Intelligence. Tutorial for Rapid Miner (Advanced Decision Tree and CRISP-DM Model with an example of Market Segmentation*) Business Intelligence Professor Chen NAME: Due Date: Tutorial for Rapid Miner (Advanced Decision Tree and CRISP-DM Model with an example of Market Segmentation*) Tutorial Summary Objective: Richard would

More information

Reevaluating Policy and Claims Analytics: a Case of Non-Fleet Customers In Automobile Insurance Industry

Reevaluating Policy and Claims Analytics: a Case of Non-Fleet Customers In Automobile Insurance Industry Paper 1808-2014 Reevaluating Policy and Claims Analytics: a Case of Non-Fleet Customers In Automobile Insurance Industry Kittipong Trongsawad and Jongsawas Chongwatpol NIDA Business School, National Institute

More information

Data Mining: Overview. What is Data Mining?

Data Mining: Overview. What is Data Mining? Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,

More information

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

In this presentation, you will be introduced to data mining and the relationship with meaningful use. In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine

More information

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether

More information

Modeling Customer Lifetime Value Using Survival Analysis An Application in the Telecommunications Industry

Modeling Customer Lifetime Value Using Survival Analysis An Application in the Telecommunications Industry Paper 12028 Modeling Customer Lifetime Value Using Survival Analysis An Application in the Telecommunications Industry Junxiang Lu, Ph.D. Overland Park, Kansas ABSTRACT Increasingly, companies are viewing

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This

More information

Easily Identify Your Best Customers

Easily Identify Your Best Customers IBM SPSS Statistics Easily Identify Your Best Customers Use IBM SPSS predictive analytics software to gain insight from your customer database Contents: 1 Introduction 2 Exploring customer data Where do

More information

Benchmarking of different classes of models used for credit scoring

Benchmarking of different classes of models used for credit scoring Benchmarking of different classes of models used for credit scoring We use this competition as an opportunity to compare the performance of different classes of predictive models. In particular we want

More information

Paper AA-08-2015. Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM

Paper AA-08-2015. Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM Paper AA-08-2015 Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM Delali Agbenyegah, Alliance Data Systems, Columbus, Ohio 0.0 ABSTRACT Traditional

More information

Organizing Your Approach to a Data Analysis

Organizing Your Approach to a Data Analysis Biost/Stat 578 B: Data Analysis Emerson, September 29, 2003 Handout #1 Organizing Your Approach to a Data Analysis The general theme should be to maximize thinking about the data analysis and to minimize

More information

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through

More information

Data Mining Using SAS Enterprise Miner : A Case Study Approach, Second Edition

Data Mining Using SAS Enterprise Miner : A Case Study Approach, Second Edition Data Mining Using SAS Enterprise Miner : A Case Study Approach, Second Edition The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2003. Data Mining Using SAS Enterprise

More information

11. Analysis of Case-control Studies Logistic Regression

11. Analysis of Case-control Studies Logistic Regression Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

More information

Data Mining: A Magic Technology for College Recruitment. Tongshan Chang, Ed.D.

Data Mining: A Magic Technology for College Recruitment. Tongshan Chang, Ed.D. Data Mining: A Magic Technology for College Recruitment Tongshan Chang, Ed.D. Principal Administrative Analyst Admissions Research and Evaluation The University of California Office of the President Tongshan.Chang@ucop.edu

More information

A fast, powerful data mining workbench designed for small to midsize organizations

A fast, powerful data mining workbench designed for small to midsize organizations FACT SHEET SAS Desktop Data Mining for Midsize Business A fast, powerful data mining workbench designed for small to midsize organizations What does SAS Desktop Data Mining for Midsize Business do? Business

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

Simple Predictive Analytics Curtis Seare

Simple Predictive Analytics Curtis Seare Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

More information

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc.

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc. A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc. Introduction: The Basel Capital Accord, ready for implementation in force around 2006, sets out

More information

S03-2008 The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY

S03-2008 The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY S03-2008 The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY ABSTRACT Predictive modeling includes regression, both logistic and linear,

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools Occam s razor.......................................................... 2 A look at data I.........................................................

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

Using Data Mining Techniques for Analyzing Pottery Databases

Using Data Mining Techniques for Analyzing Pottery Databases BAR-ILAN UNIVERSITY Using Data Mining Techniques for Analyzing Pottery Databases Zachi Zweig Submitted in partial fulfillment of the requirements for the Master s degree in the Martin (Szusz) Department

More information

Data Mining with SAS. Mathias Lanner mathias.lanner@swe.sas.com. Copyright 2010 SAS Institute Inc. All rights reserved.

Data Mining with SAS. Mathias Lanner mathias.lanner@swe.sas.com. Copyright 2010 SAS Institute Inc. All rights reserved. Data Mining with SAS Mathias Lanner mathias.lanner@swe.sas.com Copyright 2010 SAS Institute Inc. All rights reserved. Agenda Data mining Introduction Data mining applications Data mining techniques SEMMA

More information

Course Syllabus. Purposes of Course:

Course Syllabus. Purposes of Course: Course Syllabus Eco 5385.701 Predictive Analytics for Economists Summer 2014 TTh 6:00 8:50 pm and Sat. 12:00 2:50 pm First Day of Class: Tuesday, June 3 Last Day of Class: Tuesday, July 1 251 Maguire Building

More information

M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1. 15.7 Analytics and Data Mining 1

M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1. 15.7 Analytics and Data Mining 1 M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1 15.7 Analytics and Data Mining 15.7 Analytics and Data Mining 1 Section 1.5 noted that advances in computing processing during the past 40 years have

More information

Survey Analysis: Data Mining versus Standard Statistical Analysis for Better Analysis of Survey Responses

Survey Analysis: Data Mining versus Standard Statistical Analysis for Better Analysis of Survey Responses Survey Analysis: Data Mining versus Standard Statistical Analysis for Better Analysis of Survey Responses Salford Systems Data Mining 2006 March 27-31 2006 San Diego, CA By Dean Abbott Abbott Analytics

More information

Data Mining with SQL Server Data Tools

Data Mining with SQL Server Data Tools Data Mining with SQL Server Data Tools Data mining tasks include classification (directed/supervised) models as well as (undirected/unsupervised) models of association analysis and clustering. 1 Data Mining

More information

Modeling Lifetime Value in the Insurance Industry

Modeling Lifetime Value in the Insurance Industry Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting

More information

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product Sagarika Prusty Web Data Mining (ECT 584),Spring 2013 DePaul University,Chicago sagarikaprusty@gmail.com Keywords:

More information

Enhancing Compliance with Predictive Analytics

Enhancing Compliance with Predictive Analytics Enhancing Compliance with Predictive Analytics FTA 2007 Revenue Estimation and Research Conference Reid Linn Tennessee Department of Revenue reid.linn@state.tn.us Sifting through a Gold Mine of Tax Data

More information

The Predictive Data Mining Revolution in Scorecards:

The Predictive Data Mining Revolution in Scorecards: January 13, 2013 StatSoft White Paper The Predictive Data Mining Revolution in Scorecards: Accurate Risk Scoring via Ensemble Models Summary Predictive modeling methods, based on machine learning algorithms

More information

Grow Revenues and Reduce Risk with Powerful Analytics Software

Grow Revenues and Reduce Risk with Powerful Analytics Software Grow Revenues and Reduce Risk with Powerful Analytics Software Overview Gaining knowledge through data selection, data exploration, model creation and predictive action is the key to increasing revenues,

More information

How To Check For Differences In The One Way Anova

How To Check For Differences In The One Way Anova MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

PASW Direct Marketing 18

PASW Direct Marketing 18 i PASW Direct Marketing 18 For more information about SPSS Inc. software products, please visit our Web site at http://www.spss.com or contact SPSS Inc. 233 South Wacker Drive, 11th Floor Chicago, IL 60606-6412

More information

MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION

MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION Matthew A. Lanham & Ralph D. Badinelli Virginia Polytechnic Institute and State University Department of Business

More information

IBM SPSS Direct Marketing 20

IBM SPSS Direct Marketing 20 IBM SPSS Direct Marketing 20 Note: Before using this information and the product it supports, read the general information under Notices on p. 105. This edition applies to IBM SPSS Statistics 20 and to

More information

Developing Credit Scorecards Using Credit Scoring for SAS Enterprise Miner TM 12.1

Developing Credit Scorecards Using Credit Scoring for SAS Enterprise Miner TM 12.1 Developing Credit Scorecards Using Credit Scoring for SAS Enterprise Miner TM 12.1 SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2012. Developing

More information

Big Data Analytics. Benchmarking SAS, R, and Mahout. Allison J. Ames, Ralph Abbey, Wayne Thompson. SAS Institute Inc., Cary, NC

Big Data Analytics. Benchmarking SAS, R, and Mahout. Allison J. Ames, Ralph Abbey, Wayne Thompson. SAS Institute Inc., Cary, NC Technical Paper (Last Revised On: May 6, 2013) Big Data Analytics Benchmarking SAS, R, and Mahout Allison J. Ames, Ralph Abbey, Wayne Thompson SAS Institute Inc., Cary, NC Accurate and Simple Analysis

More information

BOR 6335 Data Mining. Course Description. Course Bibliography and Required Readings. Prerequisites

BOR 6335 Data Mining. Course Description. Course Bibliography and Required Readings. Prerequisites BOR 6335 Data Mining Course Description This course provides an overview of data mining and fundamentals of using RapidMiner and OpenOffice open access software packages to develop data mining models.

More information

Product recommendations and promotions (couponing and discounts) Cross-sell and Upsell strategies

Product recommendations and promotions (couponing and discounts) Cross-sell and Upsell strategies WHITEPAPER Today, leading companies are looking to improve business performance via faster, better decision making by applying advanced predictive modeling to their vast and growing volumes of data. Business

More information

Model Validation Techniques

Model Validation Techniques Model Validation Techniques Kevin Mahoney, FCAS kmahoney@ travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost

More information

Innovations and Value Creation in Predictive Modeling. David Cummings Vice President - Research

Innovations and Value Creation in Predictive Modeling. David Cummings Vice President - Research Innovations and Value Creation in Predictive Modeling David Cummings Vice President - Research ISO Innovative Analytics 1 Innovations and Value Creation in Predictive Modeling A look back at the past decade

More information

How Organisations Are Using Data Mining Techniques To Gain a Competitive Advantage John Spooner SAS UK

How Organisations Are Using Data Mining Techniques To Gain a Competitive Advantage John Spooner SAS UK How Organisations Are Using Data Mining Techniques To Gain a Competitive Advantage John Spooner SAS UK Agenda Analytics why now? The process around data and text mining Case Studies The Value of Information

More information

CHURN PREDICTION IN MOBILE TELECOM SYSTEM USING DATA MINING TECHNIQUES

CHURN PREDICTION IN MOBILE TELECOM SYSTEM USING DATA MINING TECHNIQUES International Journal of Scientific and Research Publications, Volume 4, Issue 4, April 2014 1 CHURN PREDICTION IN MOBILE TELECOM SYSTEM USING DATA MINING TECHNIQUES DR. M.BALASUBRAMANIAN *, M.SELVARANI

More information

Decision Trees What Are They?

Decision Trees What Are They? Decision Trees What Are They? Introduction...1 Using Decision Trees with Other Modeling Approaches...5 Why Are Decision Trees So Useful?...8 Level of Measurement... 11 Introduction Decision trees are a

More information

Data Mining for Knowledge Management. Classification

Data Mining for Knowledge Management. Classification 1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

More information

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

More information

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19 PREFACE xi 1 INTRODUCTION 1 1.1 Overview 1 1.2 Definition 1 1.3 Preparation 2 1.3.1 Overview 2 1.3.2 Accessing Tabular Data 3 1.3.3 Accessing Unstructured Data 3 1.3.4 Understanding the Variables and Observations

More information

Variable Selection in the Credit Card Industry Moez Hababou, Alec Y. Cheng, and Ray Falk, Royal Bank of Scotland, Bridgeport, CT

Variable Selection in the Credit Card Industry Moez Hababou, Alec Y. Cheng, and Ray Falk, Royal Bank of Scotland, Bridgeport, CT Variable Selection in the Credit Card Industry Moez Hababou, Alec Y. Cheng, and Ray Falk, Royal ank of Scotland, ridgeport, CT ASTRACT The credit card industry is particular in its need for a wide variety

More information

Predictive Analytics in the Public Sector: Using Data Mining to Assist Better Target Selection for Audit

Predictive Analytics in the Public Sector: Using Data Mining to Assist Better Target Selection for Audit Predictive Analytics in the Public Sector: Using Data Mining to Assist Better Target Selection for Audit Duncan Cleary Revenue Irish Tax and Customs, Ireland dcleary@revenue.ie Abstract: Revenue, the Irish

More information

Detecting Email Spam. MGS 8040, Data Mining. Audrey Gies Matt Labbe Tatiana Restrepo

Detecting Email Spam. MGS 8040, Data Mining. Audrey Gies Matt Labbe Tatiana Restrepo Detecting Email Spam MGS 8040, Data Mining Audrey Gies Matt Labbe Tatiana Restrepo 5 December 2011 INTRODUCTION This report describes a model that may be used to improve likelihood of recognizing undesirable

More information

Business Analytics Using SAS Enterprise Guide and SAS Enterprise Miner A Beginner s Guide

Business Analytics Using SAS Enterprise Guide and SAS Enterprise Miner A Beginner s Guide Business Analytics Using SAS Enterprise Guide and SAS Enterprise Miner A Beginner s Guide Olivia Parr-Rud From Business Analytics Using SAS Enterprise Guide and SAS Enterprise Miner. Full book available

More information

Predicting Customer Churn in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS

Predicting Customer Churn in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS Paper 114-27 Predicting Customer in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS Junxiang Lu, Ph.D. Sprint Communications Company Overland Park, Kansas ABSTRACT

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Introduction to Regression and Data Analysis

Introduction to Regression and Data Analysis Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

More information

Predicting earning potential on Adult Dataset

Predicting earning potential on Adult Dataset MSc in Computing, Business Intelligence and Data Mining stream. Business Intelligence and Data Mining Applications Project Report. Predicting earning potential on Adult Dataset Submitted by: xxxxxxx Supervisor:

More information