PAKDD 2006 Data Mining Competition


Date Submitted: February 28th, 2006
Software: SAS Enterprise Miner, Release 4.3
Team Members: Bhuvanendran, Aswin; Bommi Narasimha, Sankeerth Reddy; Jain, Amit; Rangwala, Zenab

Table of Contents
o Executive Summary
o CRISP-DM Process
   I. Business Understanding
   II. Data Understanding
   III. Data Preparation
   IV. Modeling (variable selection methods: Regression, Decision Tree, Variable Selection node)
   V. Assessment (Lift Charts, Sensitivity, Variable Selection)
   VI. Evaluation

EXECUTIVE SUMMARY

An Asian telecommunications operator has successfully launched a third-generation (3G) mobile telecommunications network. The company has already collected information about its customers' usage patterns as well as demographic information, and wants to use this data to identify which customers are likely to switch from the current 2G network to the new 3G network. The training data set provided to us contained 251 variables with usage statistics for 20,000 customers, each labeled as a 2G or 3G patron.

The main goal of this data mining project was to analyze the given data and understand the factors that increase the likelihood of a customer moving to 3G. To achieve this, various models were built and compared against each other to find the model that is most sensitive to consumers moving from 2G to 3G.

Around 80% of the effort was concentrated on pre-modeling exercises: recoding data and fixing type mismatches, understanding automatically rejected variables and rejecting further irrelevant ones based on business understanding, cleaning and replacing missing values, and transforming some of the data. A noteworthy aspect of the data was the occupation code, which was missing for 63% of the records (Table 4). We guessed that this could be due to data coding errors or to occupation codes that were too specific. Since we do not consider ourselves experts in this domain, we were very cautious in rejecting variables and left the remaining variables to be narrowed down by the variable selection models (see the Variable Selection Phase).

The three variable selection methods, chosen based on team experience, were linear regression, a decision tree (Gini index) and the Variable Selection node. Using linear regression as the variable selection method, models such as logistic regression, a decision tree and a multilayer perceptron were built and compared against each other.

We concluded that the decision tree model using the chi-square criterion with a three-way split was the most significant predictor of the true positives (3G predicted as 3G), correctly predicting them 79.17% of the time (Table 6), compared to the other tree models.

Similarly, using the decision tree as the variable selection method, logistic regression, multilayer perceptron and decision tree models were built and compared against each other. Once again the decision tree model (chi-square test, three-way split) performed best at predicting 3G as 3G, although with a slightly higher misclassification rate of 22.33% (Table 7) compared to 21.33% (Table 6) over the test data set.

Finally, the Variable Selection node was used in the same way to select inputs. Among the three models built (logistic regression, multilayer perceptron and decision tree), logistic regression performed better than the others, with a clearly better overall misclassification rate of 21.17% and a 78.82% rate of predicting 3G customers correctly (Table 8).

Furthermore, for each variable selection method, the three models were combined into an ensemble model. The idea was to check whether the combined models could produce a better result than what was already achieved. After building the ensemble models, lift charts were used to compare each of the models. Combining the models did not have much effect: the ensemble did not perform any better than the individual models (Table 9), so it was rejected. The ensemble model and the three best models were then compared against each other, and the logistic regression model built on the Variable Selection node performed better than all the other models created thus far.

Since the main aim of this modeling process was to find the models' ability to predict the actual 3G customers, sensitivity was a major selection factor between models, alongside lift charts used to compare model performance. The logistic regression model yields a sensitivity of 78.61% and performed relatively better than the other models up to the 40th percentile, and was therefore chosen as the best model to predict the customers who will move from the 2G to the 3G network. Across all deciles, logistic regression built on the Variable Selection node captures the largest percentage of responses, with 66.04% at the 40th percentile (Table 12).

It is quite clear from the model that the age of the handset dictates the probability that the customer will not shift to the 3G network (Figure 23): older handsets lower the likelihood. This could be because such customers are more familiar with the features of the handset and are not willing to use, or have no use for, the new features. Similar results are evident for the age of the customer. The more a customer downloads or plays games on the phone, the more likely he or she is to move to 3G. Different sub-plans and different handsets also increase or decrease the likelihood of a customer moving to 3G.

The ideal 3G customer profile from the selected model would be:
o a young customer,
o who pays an average monthly bill of around $500 (hence affluent),
o has a handset not more than 4 years old,
o has not received too many retention campaigns,
o downloads around 22,500 Kb a month, and
o has a phone manufactured by code 49 with sub-plan 2101, or by code 1 with sub-plan 2108.

It is probable that the selected sub-plans offer 3G-related services.

CRISP-DM Modeling Process

I. Business Understanding

An Asian telecom operator which has successfully launched a third generation (3G) mobile telecommunications network would like to use existing customer usage and demographic data to identify which customers are likely to switch to its 3G network. Carriers are likely to embrace 3G because their former cash cow, basic cellular, is becoming a low-cost commodity, and wireless services will be an increasingly important revenue stream for the company. The motivation for building 3G networks, according to McMahan of TI, is "to differentiate themselves and to have another revenue source." 3G phones offer superior voice quality and enhanced broadband and data connection services (such as streaming audio and video) of up to 2 Mbps to cater to the high-technology needs of the consumer.

Data Mining Goal

There are three main goals of this data mining project:
o Analyze the given data and understand the factors that play a role in increasing the likelihood of a customer moving to 3G.
o Build a model that is most sensitive to consumers moving to 3G.
o Score the dataset with this model and predict the unknown target variable (CUST_TYPE = 2G/3G).

Software Used

We used SAS Enterprise Miner, Release 4.3, for this data mining project.

Preliminary Project Plan

1. Build a predictive model to predict the likelihood of a customer moving to 3G services.
   i. First we understand and explore the data.
   ii. Change data types according to business understanding.

   iii. Reject unwanted variables by means of logical interpretation of the dataset.
   iv. Sample and balance the data (if required) to obtain a better prediction model.
   v. Partition the data into training, validation and test samples.
   vi. Analyze missing values and determine appropriate replacement strategies.
   vii. Observe the distribution of the data and implement suitable transformation strategies.
   viii. Build various predictive models to effectively and accurately capture the 3G consumer profile.
   ix. Compare the results from the models built and select one.
   x. Explain the business implications.

II. Data Understanding

Data Collection

The team will use the company data set, composed of fields available at the individual customer level and provided by the Pacific-Asia Conference on Knowledge Discovery and Data Mining. An original sample dataset of 20,000 2G network customers and 4,000 3G network customers was provided, with 251 data fields. The target categorical variable is CUSTOMER_TYPE (2G/3G). A 3G customer is defined as a customer who has a 3G Subscriber Identity Module (SIM) card and is currently using a 3G network compatible mobile phone. Three-quarters of the dataset (15K 2G, 3K 3G) had the target field available and was used for training/testing.

Describe and Explore Data

The data set mostly contains telephone usage statistics for each customer. It also contains details specific to the customer's relationship with the company, in terms of the customer loyalty program and contract details.

The data set also contains a variable appended to it based on the customer's choice of service (2G/3G). This variable (CUST_TYPE) is the target variable to be predicted. To facilitate SAS, we coded it as 0 for 2G and 1 for 3G.

We further analyzed the input data of 15K customer records by attaching a Distribution Explorer node in SAS and identified the key features listed below (an illustrative code sketch follows the list):
o The customers have handsets from 23 different manufacturers, span 40 nationalities (coded from 36 to 998) and are categorized into 15 different occupation types, of which 63% have missing values.
o The customers can choose from 66 different sub-plans and have the flexibility of paying by 5 different methods.
o The company offers a maximum contract of 2 years (730 days).
o The company has an ongoing loyalty program.
o The data has records for 128 unique serial numbers.
o It is evident that the higher the customer's usage, in terms of minutes and extra features, the higher the probability of moving to 3G handsets. But the proportion of people in this high-end bracket is very low.
o Customers purchasing new handsets now are more likely to go for the 3G network.
o Most popular handset maker: coded 7 (Figure 1: frequency count for handset maker).

o Most popular handset models: the three most frequent model codes (Figure 2: frequency count for handset models).
o Customers very rarely make use of their loyalty points, but those who do tend to lean towards the 3G network (Figure 3: loyalty points usage grouped by customer type).
o CS (cash) is the most favored payment method for both 2G and 3G customers (Figure 4: payment methods grouped by customer type).
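The same kind of frequency exploration done with the Distribution Explorer node can be reproduced outside Enterprise Miner. The following is a minimal pandas sketch, not the node itself; the file name is an assumption, while the column names are taken from the report.

```python
import pandas as pd

# Assumed file name for the competition training extract.
df = pd.read_csv("pakdd2006_train.csv")

# Frequency of handset manufacturers (analogue of Figure 1).
print(df["HS_MANUFACTURER"].value_counts().head(10))

# Payment method split by customer type (analogue of Figure 4).
print(pd.crosstab(df["PAY_METD"], df["CUSTOMER_TYPE"], normalize="columns"))
```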

Recoding and Data Type Adjustment

For ease of use in SAS, we recoded the following variables (see the code sketch below):
o The CUSTOMER_TYPE variable as 0 for 2G customers and 1 for 3G; the new variable was named CUSTOMER_TYPE1.
o Created a new variable PAYMENT_CHG_FLAG, which indicates whether a customer has ever changed his payment option plan in the past.

By default, due to data misinterpretation or missing values in the data, SAS had wrongly assumed the data types of some variables. An initial analysis required changing the data types assumed by SAS to those dictated by data understanding.

Table 1: Variables for which data types were changed (variable: SAS coding -> changed to; reason; observation)
o NATIONALITY: interval -> nominal; >10 numeric codes (40 codes).
o CUSTOMER_CLASS: ordinal -> nominal; one class is not greater than the other.
o SUBPLAN: interval -> nominal; >10 numeric codes (66 codes).
o SUBPLAN_PREVIOUS: interval -> nominal; >10 numeric codes.
o NUM_SUSP_TEL: ordinal -> interval; <10 numeric values (max = 8); most people have no more than 8 suspended lines in the dataset.
o NUM_DELINQ_TEL: ordinal -> interval; <10 numeric values (max = 5); most people have no more than 5 delinquent lines in the dataset.
o PAY_METD_CHG: ordinal -> interval; <10 numeric values; most customers have not changed their payment method more than 10 times.
o HS_CHANGE: unary (rejected) -> interval; all values are 0; either the customer hasn't changed their handset in six months or hasn't notified the company about it.
o HS_MANUFACTURER: interval -> nominal; manufacturer codes are assumed to be numbers.
o HS_MODEL: interval -> nominal; model numbers are assumed to be numbers.
o TELE_CHANGE_FLAG: ordinal (rejected) -> interval; 99.97% are 0's, values range from 0 to 2; almost all customers haven't changed their number in 6 months.
o TOT_PAST_DELINQ: ordinal -> interval; <10 numeric values, range 0-4.
o TOT_PAST_OVERDUE: ordinal -> interval; <10 numeric values, range 0-6.
o TOT_PAST_TOS: ordinal -> interval; <10 numeric values, range 0-1, most records are 0; most customers have not changed their normal status to temporary on suspension.
o TOT_TOS_DAYS: rejected -> numeric; >98% are 0; most customers never remained on suspension during the past 6 months.
o TOT_PAST_DEMAND: rejected (unary) -> numeric; all values are 0.
o TOT_PAST_REVPAY: ordinal -> interval; 97.3% are 0's.
o REVPAY_PREV_CD: ordinal -> interval; 97.3% are 0's.
o VAS_VMN_FLAG: rejected -> nominal; 99.8% are 0's; most customers do not use their own voice mail notification.
o VAS_INFOSRV: unary (rejected) -> numeric; all values are 0.
o VAS_GPRS_FLAG: binary -> nominal; range 0, 1, 2, but 0% are 2's; no one in the dataset uses the GPRS Flexi plan.
o DELINQ_FREQ: ordinal -> interval; <10 numeric values.
o OD_FREQ: binary -> interval; only 2 unique values.
o REVPAY_FREQ: ordinal -> interval; <10 numeric values.
o AVG_VAS_QI: ordinal -> interval; >10 numeric values.
o AVG_VAS_QP: ordinal -> interval; >10 numeric values.
o AVG_VAS_QTXT: ordinal -> interval; >10 numeric values.

On further analyzing the data using a Distribution Explorer node, we discovered the following trends:
o As usage of games on phones increases, customers are more likely to move towards 3G (Figure 5(a)).
o Most 3G customers do not make more calls at peak time (which is highly expensive).
o A higher proportion of 2G customers default than 3G customers (this could be because the sample has 83.33% 2G customers).
o Most 3G customers prefer to pay by cash and are not likely to change this method.
o Most 3G customers have fairly new handset models.
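A minimal sketch of the recoding and type adjustment described above, written in Python/pandas rather than Enterprise Miner; the file name and the rule used to derive PAYMENT_CHG_FLAG are assumptions, since the report only names the new variable.

```python
import pandas as pd

# Assumed file name; column names are taken from the report.
df = pd.read_csv("pakdd2006_train.csv")

# Target recoded for modeling: 2G -> 0, 3G -> 1 (new variable CUSTOMER_TYPE1).
df["CUSTOMER_TYPE1"] = (df["CUSTOMER_TYPE"] == "3G").astype(int)

# PAYMENT_CHG_FLAG: whether the customer has ever changed payment method.
# Assumption: derived from PAY_METD_CHG > 0; the report does not state the rule.
df["PAYMENT_CHG_FLAG"] = (df["PAY_METD_CHG"].fillna(0) > 0).astype(int)

# Numeric codes that are really labels are treated as nominal rather than interval.
for col in ["NATIONALITY", "SUBPLAN", "SUBPLAN_PREVIOUS", "HS_MANUFACTURER", "HS_MODEL"]:
    df[col] = df[col].astype("category")
```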

Figure 5(a): AVG_VAS_GAMES; Figure 5(b): AVG_PK_RATIO; Figure 5(c): BLACK_LIST_FLAG; Figure 5(d): HS_AGE

It is to be noted that the data set is not complete but contains a sufficiently large number of missing values, which is of prime importance (explained further later).

III. Data Preparation

Sampling

The observations above were made on the full dataset and serve as a data understanding tool, but they are not directly usable for prediction because the distribution of the target is heavily biased towards 2G users: 83.33% 2G versus 16.67% 3G.

Figure 6(a): Distribution of CUSTOMER_TYPE; Figure 6(b): Distribution of CUSTOMER_TYPE (after sampling)

We observed that with such a biased sample it was not possible to model the data well, as a model would learn to predict 2G rather than 3G customers. Such unbalanced data is undesirable for any model to run on. With this understanding we decided to sample the data to pick a balanced sample for training, validation and testing. To obtain the balanced sample we desired, we included all 3,000 of the 3G customers in the 15K dataset together with an equal number of randomly chosen 2G customers, using stratified sampling (random seed: 12345). The resulting 6,000 records were used for modeling.

Data Partition

In order to build the model, we partition the modeling dataset into three parts: a training dataset, a validation dataset and a test dataset. The team partitions both the balanced and the unbalanced dataset in the proportion 60:30:10, with a fixed random seed.
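A rough Python equivalent of the balanced sampling and the 60:30:10 partition; the file name is an assumption, while the seeds follow the report (12345 for sampling, 5217 quoted later for partitioning).

```python
import pandas as pd

df = pd.read_csv("pakdd2006_train.csv")   # assumed file name
SAMPLE_SEED = 12345                       # sampling seed quoted in the report
SPLIT_SEED = 5217                         # partition seed quoted later in the report

# Keep every 3G customer and draw an equal number of 2G customers at random.
g3 = df[df["CUSTOMER_TYPE"] == "3G"]
g2 = df[df["CUSTOMER_TYPE"] == "2G"].sample(n=len(g3), random_state=SAMPLE_SEED)
balanced = pd.concat([g3, g2]).sample(frac=1, random_state=SPLIT_SEED)  # shuffle

# 60:30:10 split into training, validation and test partitions.
n = len(balanced)
train = balanced.iloc[: int(0.6 * n)]
valid = balanced.iloc[int(0.6 * n): int(0.9 * n)]
test = balanced.iloc[int(0.9 * n):]
print(len(train), len(valid), len(test))
```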

Data Selection

Rejected variables by the model

The first task in preparing the data was the removal of irrelevant variables from the data source. For this purpose we used the discretion of SAS, as it rejected most unary (non-missing) variables and also variables whose distribution was irreparably skewed (e.g. 99.78% zeros in TOT_PAST_TOS). Using these methods, 32 variables were automatically rejected by SAS.

Table 2: Variables rejected by the model (variable (datatype): reason for rejection)
o AVG_VAS_#123# (numeric): unary value, 100% 0's
o AVG_VAS_CG (numeric): unary value, 100% 0's
o AVG_VAS_IDU (numeric): unary value, 100% 0's
o AVG_VAS_IEM (numeric): unary value, 100% 0's
o AVG_VAS_IFSMS (numeric): unary value, 100% 0's
o AVG_VAS_ISMS (numeric): unary value, 100% 0's
o AVG_VAS_MILL (numeric): unary value, 100% 0's
o AVG_VAS_QP (numeric): 99.96% 0's, almost unary
o AVG_VAS_SS (numeric): unary value, 100% 0's
o AVG_VAS_WLINK (numeric): unary value, 100% 0's
o DELINQ_FREQ (numeric): unary value, 100% 0's
o STD_VAS_#123# (numeric): unary value, 100% 0's
o STD_VAS_CG (numeric): unary value, 100% 0's
o STD_VAS_IDU (numeric): unary value, 100% 0's
o STD_VAS_IEM (numeric): unary value, 100% 0's
o STD_VAS_IFSMS (numeric): unary value, 100% 0's
o STD_VAS_ISMS (numeric): unary value, 100% 0's
o STD_VAS_MILL (numeric): unary value, 100% 0's
o STD_VAS_QP (ordinal): 99.96% 0's, almost unary
o STD_VAS_SS (numeric): unary value, 100% 0's
o STD_VAS_WLINK (numeric): unary value, 100% 0's
o TELE_CHANGE_FLAG (numeric): 99.97% 0's, almost unary
o TOT_PAST_DEMAND (numeric): unary value, 100% 0's
o TOT_PAST_TOS (numeric): 99.78% 0's, almost unary
o TOT_TOS_DAYS (numeric): 99.78% 0's, almost unary
o VAS_CSMS_FLAG: unary value, 100% 0's
o VAS_DRIVE_FLAG (binary): 99.97% 0's, almost unary
o VAS_IEM_FLAG (binary): 99.93% 0's, almost unary
o VAS_INFOSRV (numeric): unary value, 100% 0's
o VAS_SN_FLAG (binary): unary value, 100% 0's
o VAS_VMN_FLAG (binary): 99.99% 0's, almost unary
o VAS_VMP_FLAG (binary): 99.99% 0's, almost unary

Rejected Variables by the Team

Working and modeling with 200+ variables was still difficult, so we decided to use our business understanding and managerial discretion to eliminate further variables we felt were redundant or irrelevant. This was a cautious process, as throwing away data could wreck the model. We eliminated the 29 variables listed below, with our reasons for doing so.

Table 3: Variables rejected by the team (variable: reason for rejection)
o AVG_BILL_VOICE: correlated to VOICED and VOICEI
o AVG_CALL: correlated to outbound and inbound calls
o AVG_CALL_FRW: rejected by business understanding; judged irrelevant
o AVG_CALL_OP: correlated to OBOP and IBOP (outbound and inbound off peak)
o AVG_CALL_PK: correlated to OBPK and IBPK (outbound and inbound peak)
o AVG_CALL_T1: rejected by business understanding; T1 minutes are retained and are more descriptive than the number of calls
o AVG_DIS_1900: rejected by business understanding; 93.81% zeros
o AVG_MINS: correlated to inbound and outbound minutes (IB, OB)
o AVG_MINS_EXTRAN: rejected by business understanding; 98.88% zeros
o AVG_MINS_INT: correlated to INTT1, INTT2 and INTT3; besides, knowing the top 3 countries is judged to be sufficient information
o AVG_MINS_OP: correlated to OBOP and IBOP (outbound and inbound off peak)
o AVG_MINS_PK: correlated to OBPK and IBPK (outbound and inbound peak)
o HS_CHANGE: unary value; all zeros
o OD_FREQ: rejected by business understanding; 97.45% zeros
o PAY_METD_CHG: rejected by business understanding; instead we created a new variable PAY_METD_CHG_FLAG to indicate if the payment method has changed, which is more descriptive
o PAY_METD_PREV: rejected by business understanding; data is provided even for those who haven't changed payment method, so it is irrelevant
o REVPAY_FREQ: rejected by business understanding; 99.83% zeros
o STD_BILL_VOICE: correlated to VOICED and VOICEI
o STD_CALL: correlated to outbound and inbound calls
o STD_CALL_FRW: rejected by business understanding; judged irrelevant
o STD_CALL_OP: correlated to OBOP and IBOP (outbound and inbound off peak)
o STD_CALL_PK: correlated to OBPK and IBPK (outbound and inbound peak)
o STD_MINS: correlated to inbound and outbound minutes (IB, OB)
o STD_MINS_EXTRAN: rejected by business understanding; 98.88% zeros
o STD_MINS_IB: rejected by business understanding; too much data about the same statistic
o STD_MINS_OB: rejected by business understanding; too much data about the same statistic
o STD_MINS_OP: correlated to OBOP and IBOP (outbound and inbound off peak)
o STD_MINS_PK: correlated to OBPK and IBPK (outbound and inbound peak)
o STD_MINS_T1: rejected by business understanding; since T1 calls are retained, the number of minutes may be eliminated

Cleaning the Data

Handling missing values is one of the most challenging tasks in data mining. Running a model that cannot handle missing values on a data set where they occur in many places weakens its predictive or clustering power, since records containing missing values are discarded and fewer records are used to build the model. On the other hand, filling or imputing missing values improperly can mislead the prediction.

Many of the variables contained null/blank values, and the dataset needed to be cleansed of them before it was ready for the application of data mining techniques. One approach is to eliminate the rows containing null values, but with such a high percentage of rows affected this approach is not feasible. We concentrated only on those variables which were kept in the model. Analyzing these variables, we noticed missing values in several of them. Most were at a tolerable percentage, but OCCUP_CD came up with a staggering 63% missing values. From our understanding of the problem, we felt that the occupation code could hold an effect on the prediction, and hence the missing values had to be handled.

Yet replacement must be done so as to harm the model the least.

Table 4: Variables with missing values (variable: % missing; replacement strategy)
o DAYS_TO_CONTRACT_EXPIRY: 5%; tree imputation
o HS_MODEL: 3%; tree imputation
o AGE: 3%; tree imputation
o TOT_DEBIT_SHARE: 2%; tree imputation
o OCCUP_CD: 63%; default constant "U"
o MARITAL_STATUS: 6%; default constant "U"
o CONTRACT_FLAG: 5%; tree imputation
o PAY_METD_PREV: 5%; tree imputation
o PAY_METD: 5%; tree imputation
o HS_MANUFACTURER: 3%; tree imputation

OCCUP_CD: The unknown values could be due to data entry errors (differences in coding schemes or plain typing errors). They could also be due to the unwillingness of customers to disclose their occupation. Another possible reason is that the 15 occupation categories are so specific that some job profiles do not fit into them. So the unknown values become very significant in this case. Also, since more than 50% of the values are missing, tree imputation is not feasible, as the model would have only 37% of the data to learn from. Hence, to harm the model the least, we resort to character replacement and code missing values as "U" (Unknown).

The other variables are missing for only 6% or less of the records and are numeric or nominal, so they can be replaced using tree imputation. Mean replacement was tried for AGE, but tree imputation gave a more accurate and less biased result. All the imputed-value indicators, stored as M_<variable name>, were included in the variable selection phase as input variables to determine whether they influence predictive accuracy.
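Enterprise Miner's tree imputation has no one-line scikit-learn equivalent; the sketch below shows the same idea under stated assumptions: a constant "U" for the heavily missing categorical codes, and a decision tree fitted on complete cases to fill AGE (the predictor list and file name are illustrative).

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("pakdd2006_train.csv")   # assumed file name

# Categorical variables with many unknowns get an explicit "U" (unknown) level.
for col in ["OCCUP_CD", "MARITAL_STATUS"]:
    df[col] = df[col].fillna("U")

# Keep a missing-value indicator before imputing, mirroring the M_<variable> flags.
df["M_AGE"] = df["AGE"].isna().astype(int)

# Rough stand-in for tree imputation: predict missing AGE values with a decision
# tree trained on complete cases (predictor list assumed for illustration).
predictors = ["LINE_TENURE", "AVG_BILL_AMT", "HS_AGE"]
known = df[df["AGE"].notna()]
tree = DecisionTreeRegressor(max_depth=5, random_state=0)
tree.fit(known[predictors].fillna(0), known["AGE"])
missing = df["AGE"].isna()
df.loc[missing, "AGE"] = tree.predict(df.loc[missing, predictors].fillna(0))
```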

Transformation of Variables

The Central Limit Theorem states that as more and more random samples are drawn from a population, the distribution of the sample averages approaches a normal distribution. To end up with a good regression model, we therefore transform the input variables to obtain fairly normal distributions. From the data inspection in the previous section, the team also found that some interval variables have skewed distributions. To use predictive techniques such as multiple regression and neural networks, the interval inputs should be distributed approximately normally, so we use the transformation tool to transform each skewed variable into a new variable whose values are distributed as close to normal as possible. In total, 75 interval variables were transformed. The results shown below (only a few of the 75 distributions are shown) are quite satisfactory, since most variables that were previously skewed are now distributed closer to normal, giving the team more confidence that the adjusted distributions will help the models predict better.

Table 5: Transformed variables (variable: transform; reason for transformation)
o LINE_TENURE: bucket (5 bins; Bin 5: >2580); not normal, right skew
o DAYS_TO_CONTRACT_EXPIRY: bucket (Bin 1: <45 days; Bin 2: 45-450 days; Bin 3: >450 days); spike at the beginning
o AVG_USAGE_DAYS: bucket (Bin 1: 1-12 days; Bin 2: 12-23 days; Bin 3: 23-29 days; Bin 4: >29 days); left skew
o AVG_PK_MINS_RATIO: exponential; left skew
o AGE: log; right skew
o AVG_BILL_AMT: maximize normality; right skew
o AVG_BILL_SMS: maximize normality; right skew
o AVG_BILL_VOICED: maximize normality; right skew
o AVG_BILL_VOICEI: maximize normality; right skew
o AVG_CALL_EXTRAN: maximize normality; right skew
o AVG_CALL_EXTRANT1: maximize normality; right skew
o AVG_CALL_FIX: maximize normality; right skew

o Maximize normality, right skew: AVG_CALL_IB, AVG_CALL_INTRAN, AVG_CALL_LOCAL, AVG_CALL_MOB, AVG_CALL_OB, AVG_CALL_T1, AVG_EXTRAN_RATIO, AVG_M2M_MINS_RATIO, AVG_MINS_FIX, AVG_MINS_FRW, AVG_MINS_IBOP, AVG_MINS_IBPK, AVG_MINS_INTRAN, AVG_MINS_LOCAL, AVG_MINS_MOB, AVG_MINS_OBOP, AVG_MINS_OBPK, AVG_NO_CALLED, AVG_NO_RECV, AVG_OP_CALL_RATIO, AVG_OP_MINS_RATIO, AVG_SPHERE, AVG_VAS_GAMES, AVG_VAS_GPRS, AVG_VAS_SR, HS_AGE, LOYALTY_POINTS, NUM_TEL, STD_BILL_AMT, STD_BILL_VOICED, STD_BILL_VOICEI, STD_CALL_EXTRAN, STD_CALL_EXTRANT1, STD_CALL_FIX, STD_CALL_IB, STD_CALL_INTRAN, STD_CALL_LOCAL, STD_CALL_MOB, STD_CALL_OB, STD_MINS_EXTRANT1, STD_MINS_FIX, STD_MINS_IBOP, STD_MINS_IBPK, STD_MINS_INTRAN, STD_MINS_LOCAL
o AVG_PK_CALL_RATIO, AVG_PK_MINS_RATIO: maximize normality; left skew
o OD_REL_SIZE: maximize normality; spike in the middle

o Maximize normality, right skew: STD_MINS_MOB, STD_MINS_OBOP, STD_MINS_OBPK, STD_NO_CALLED, STD_NO_RECV, STD_OD_AMT, STD_PAY_AMT, STD_T1_CALL_CON, STD_VAS_GAMES, STD_VAS_GBSMS, STD_VAS_GPRS, STD_VAS_GPSMS, STD_VAS_SMS, TOT_DEBIT_SHARE
o STD_BUCKET_UTIL: square root; right skew

Example transformations:
Figure 7(a): LINE_TENURE before transformation; Figure 7(b): LINE_TENURE after transformation
Figure 7(a): DAYS_TO_EXPIRY before transformation; Figure 7(b): DAYS_TO_EXPIRY after transformation

Figure 8(a): AVG_USAGE_DAYS before transformation; Figure 8(b): AVG_USAGE_DAYS after transformation
Figure 9(a): AGE before transformation; Figure 9(b): AGE after transformation
Figure 10(a): AVG_BILL_AMT before transformation; Figure 10(b): AVG_BILL_AMT after transformation

Figure 11(a): NUM_TEL before transformation; Figure 11(b): NUM_TEL after transformation

Variable Selection Phase

The 200 variables remaining after data cleaning and preparation are fed into the models. Since our team is not expert in the telecommunications field, we eliminated variables only minimally, as an error of omission is more dangerous than an error of commission; we depend on the variable selection methods to eliminate the irrelevant variables. Selecting meaningful variables is a very important task in building a good model, and there are many ways of selecting a meaningful combination. Decision tree, the Enterprise Miner Variable Selection method (chi-square) and multiple regression are the techniques the team will use for variable selection. Besides their ability to select meaningful variables, both the decision tree and multiple regression are predictive techniques in themselves, so the team will also use them to build models that predict subscribers after the variables have been selected (remark: the logistic/multiple regression selects the best combination of variables while predicting the target variable).

Variable Selection Node

To select variables to feed into the models we first chose the Variable Selection node and, from previous experience, found chi-square variable selection to be most effective. We used a chi-square value of 3.84, with 50 bins, making 6 passes with a 0.50 cut-off. Of the variables fed in, 17 were selected for modeling.
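A rough analogue of this chi-square screening outside Enterprise Miner is sketched below: bin each interval input and keep it when its chi-square statistic against the target clears 3.84. The exact statistic computed by the Variable Selection node differs, so this is only an approximation; the file name is an assumption.

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("pakdd2006_prepared.csv")   # assumed prepared (recoded) dataset
target = df["CUSTOMER_TYPE1"]                # 0 = 2G, 1 = 3G

selected = []
for col in df.select_dtypes("number").columns.drop("CUSTOMER_TYPE1"):
    if df[col].nunique(dropna=True) < 2:
        continue                              # skip constant or empty columns
    binned = pd.qcut(df[col], q=50, duplicates="drop")   # ~50 bins, as in the node
    if binned.nunique() < 2:
        continue
    stat, p, dof, _ = chi2_contingency(pd.crosstab(binned, target))
    if stat > 3.84:                           # cutoff quoted in the report
        selected.append(col)
print(selected)
```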

Variables selected (Figure 12: variables selected by the Variable Selection node):
o AVG_M2M_CALL_RATIO (transformed)
o STD_M2M_MINS_RATIO
o HS_MANUFACTURER
o HS_MODEL
o SUBPLAN
o SUBPLAN_PREVIOUS
o TOP1_INT_CD
o TOP2_INT_CD
o TOT_RETENTION_CAMP
o AVG_BILL_AMT (transformed)
o AVG_BILL_VOICED (transformed)
o AVG_CALL_FIX (transformed)
o AVG_VAS_GAMES (transformed)
o AVG_VAS_SR (transformed)
o HS_AGE (transformed)
o STD_MINS_INTRAN (transformed)
o STD_VAS_GAMES (transformed)

Decision Tree as Variable Selector

From previous experience working with decision trees, our team has found the Gini reduction method of obtaining a pure tree to be the best variable selection approach. Using a decision tree with 6 branches, 6 levels of depth and the Gini index as the model assessment measure, we obtained the following variables to be fed into the models.

Variables selected (Figure 13: variables selected by the Decision Tree node):
o HS_AGE (most important)
o SUBPLAN
o HS_MODEL
o AVG_VAS_GAMES
o STD_VAS_GAMES
o AVG_BILL_AMT
o SUBPLAN_PREVIOUS
o TOP2_INT_CD
o AVG_MINS_INTRAN
o TOP3_INT_CD

Multiple Regression

Our team used stepwise multiple regression as another variable selection method. The resulting model yielded the following variables:
o AGE: maximize normality
o AVG_VAS_QG
o AVG_BILL_AMT: maximize normality
o AVG_VAS_GAMES: maximize normality
o AVG_PK_CALL_RATIO: maximize normality
o AVG_MINS_INTRAN: maximize normality
o AVG_OP_MINS_RATIO: maximize normality
o BLACK_LIST_CNT
o COBRAND_CARD_FLAG 0
o DAYS_TO_CONTRACT_EXPIRY
o HS_AGE: maximize normality
o HS_MANUFACTURER

o LST_RETENTION_CAMP
o LUCKY_NO_FLAG 0
o STD_CALL_LOCAL: maximize normality
o STD_VAS_GPRS: maximize normality
o STD_MINS_MOB: maximize normality
o SUBPLAN
o TOT_DEBIT_SHARE: maximize normality
o VAS_AR_FLAG
o LINE_TENURE: optimal binning
o Imputed indicator for DAYS_TO_CONTRACT_EXPIRY
o Imputed indicator for NATIONALITY

IV. Modeling

Figure 14: The entire modeling process

Taking into account the number of variables and the scope of the data mining goal, the ideal outcome would be 100% accurate prediction of 3G customers. As achieving such a perfect figure is extremely difficult given the complexity of the data set, it was decided that a sensitivity of around 80% could be taken as a fairly good predictive accuracy.

The data set was first sampled to produce a balanced data set (an equal number of 2G and 3G customers, 3,000 each). The 6,000 instances of the sampled data set were then partitioned into training, validation and test sets using a random seed of 5217. Based on the data preparation, selection and transformation carried out in the previous steps, we decided to build the following models:

1. Having performed variable selection using the Linear Regression node:
   i. Logistic Regression Models: built with LOGIT, CLOGLOG and PROBIT as link functions, using the stepwise selection method with criteria set to none.
   ii. Decision Tree Models: various types of decision tree models (chi-square, entropy and Gini, with both binary and 3-way splits) were built with different numbers of observations per leaf and required for a split.
   iii. Neural Network Models: built a multilayer perceptron, using the misclassification rate as the model selection criterion with a low-noise data setting.
2. Similar models were built using the Decision Tree node for variable selection.
3. Similar models were built using the Variable Selection node for variable selection.

It is to be noted that many different types of models were built, but only the ones with acceptable results are described below.
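For orientation, a hedged scikit-learn sketch of the three model families is shown below. It is not a literal reproduction: scikit-learn's tree uses binary Gini/entropy splits rather than the chi-square three-way splits available in Enterprise Miner, and the prepared file name is assumed. The leaf and split minimums follow the settings quoted later in the report.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("pakdd2006_prepared.csv")   # assumed: numeric, imputed, transformed
X, y = df.drop(columns=["CUSTOMER_TYPE1"]), df["CUSTOMER_TYPE1"]
# Simple 90/10 hold-out for the test set; the report's actual partition is 60:30:10.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=5217)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(min_samples_leaf=5, min_samples_split=36, random_state=0),
    "multilayer perceptron": MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    misclass = 1.0 - model.score(X_test, y_test)   # misclassification rate on test data
    print(f"{name}: test misclassification rate = {misclass:.4f}")
```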

Variable selection method - Linear Regression node

Based on the various decision tree models built, it was seen that the chi-square model with a three-way split was the most significant predictor of the true positives (3G predicted as 3G) compared to the other tree models, so it was decided to eliminate the other models. Similarly, the same was done for the different types of neural network (different selection criteria and data settings) and regression models. Table 6 gives the overall performance of the three best models, while Figure 15 shows the modeling process.

Figure 15: Modeling using Linear Regression for the purpose of variable selection

Since the goal of this project is to concentrate on the accurate prediction of customers who would turn to the 3G network, it was decided that the sensitivity figures would be treated as of foremost importance.

Table 6: Comparison of the three best models (Linear Regression mode of variable selection), reporting test, training and validation misclassification rates and test sensitivity for: Chi-square Test 3-way (selected model), Multilayer Perceptron, Logistic Regression, and Ensemble (all 3 models).

As is obvious from the table above, we decided to go with the decision tree model (chi-square test, 3-way split). To build this tree model we changed the defaults to hold a minimum of 5 observations in each leaf node and to require 36 observations for each split, together with an adjusted significance level. It was observed that the chi-square decision tree model picked HS_AGE, AVG_MINS_INTRAN, AVG_VAS_GAMES, AVG_BILL_AMT, LST_RETENTION_CAMP and then VAS_AR_FLAG as the most defining characteristics of the data set, in that order of significance.

Variable selection method - Decision Tree

Similar to the way the previous models were built using the Regression node as a variable selector, the Decision Tree node was also used to perform the same function. Table 7 gives the overall performance of the three best models; the modeling process is shown graphically in Figure 16.

Figure 16: Modeling using Decision Tree for the purpose of variable selection

When constructing the decision tree model it has to be noted that the Decision Tree node can factor missing values and the distributions of the various variables into the modeling process. Hence the decision tree variable selection node was connected directly to the model nodes from the Data Partition node.

For the regression and the neural network, on the other hand, replacement and transformation play a very important part of the modeling process. The three models were run separately and also together as an ensemble model, and the results were collected.

For variable selection, the Decision Tree node was set to the Gini reduction splitting criterion with a maximum of 3 splits allowed from each node, the model assessment measure set to total leaf impurity (Gini index), a minimum of 5 observations in each leaf node, 36 observations required for each split and a depth of 8. All other settings in the node were left at their defaults.

Table 7: Comparison of the three best models (Decision tree mode of variable selection), reporting test, training and validation misclassification rates and test sensitivity for: Chi-square Test 3-way (selected model), Multilayer Perceptron, Logistic Regression, and Ensemble (all 3 models).

In this case also the chi-square model with a 3-way split performed best, with a sensitivity for true positives of 76.39%. It was observed that the model picked HS_AGE, SUBPLAN, AVG_VAS_GAMES, AVG_BILL_AMT, HS_MODEL, AVG_MINS_INTRAN and then STD_VAS_GAMES, in that order of significance.

Variable selection method - Variable Selection Node

Similar to the way the models were built using the Decision Tree node as a variable selector, the Variable Selection node was also used to perform the same function. We used the chi-square method of selection with a set cutoff. Unlike in the previous modeling process, the Variable Selection node cannot handle discrepancies in the data such as missing values, so for this modeling phase it was executed only after the replacement and transformation of the data were completed.

Figure 17 shows the modeling process. As before, the individual models as well as their combination (ensemble) were run and results collected.

Figure 17: Modeling using the Variable Selection node for the purpose of variable selection

Table 8 gives the overall performance of the three best models. Two models had the same values for both sensitivity and overall accuracy; as it is difficult to choose between them, it was decided to take both into account for further analysis.

Table 8: Comparison of the three best models (Variable selection node), reporting test, training and validation misclassification rates and test sensitivity for: Chi-square Test 3-way (selected model), Multilayer Perceptron, Logistic Regression, and Ensemble (all 3 models).

It was observed that the model picked HS_AGE, AGE, AVG_BILL_AMT, AVG_VAS_GAMES, STD_VAS_GPRS and then TOT_RETENTION_CAMP, in that order of significance. The model also picked SUBPLAN and HS_MANUFACTURER, with each type of plan affecting the chances of the customer moving from 2G to 3G in a different way: some plans meant the customer would prefer to stay with the 2G network, while others meant a possibility of change. The age of a handset affects the model in a negative manner, i.e. the longer a person has used a particular handset, the more likely he is to stay with his current network. The age of a person is also observed to have a negative effect, i.e. the older the person, the less likely it is that he or she will move over to the 3G network.

Based on the comparative sensitivities of the models, and taking anything greater than 75% to be a reasonable mark, we chose 3 models for further analysis.

Table 9: Comparison of the top 3 models built thus far (variable selection method / model), reporting test, training and validation misclassification rates and test sensitivity for: Linear Regression / Chi-square Test 3-way, Decision Tree / Chi-square Test 3-way, and Variable Selection node / Logistic Regression.

V. Assessment

The three models selected were combined to produce another ensemble model. This model was built to check whether, when combined, they were capable of producing a better result than what was already achieved. In the next section we compare these four models using lift charts.
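The Ensemble node combines the posterior probabilities of the component models; a minimal scikit-learn stand-in is soft voting over the three classifiers, sketched below. The estimator settings are illustrative, and the training data is assumed to come from the earlier modeling sketch.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# Soft voting averages the predicted class probabilities of the three models,
# which is roughly what the Enterprise Miner Ensemble node does with posteriors.
ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(min_samples_leaf=5, min_samples_split=36, random_state=0)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0)),
        ("logit", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
# ensemble.fit(X_train, y_train); 1 - ensemble.score(X_test, y_test) gives the
# misclassification rate to compare against the individual models, as in Table 9.
```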

Lift chart (% response) on Validation Data - Non-Cumulative

To assess how well the models we constructed perform with respect to the baseline model (i.e. the current business practice), we compared them against the baseline to obtain a clearer picture of whether they truly perform better and by how much. We used lift charts to gain an overall perspective; these charts help us understand the difference in prediction not only as a number but also in terms of business practice. The average percentage response (baseline model) captures 50.39% of the responses in any decile.

Table 10: Comparison of the models (Ensemble, VS -> DT, DT -> DT, Logistic) against the baseline model (50.39%): % response, non-cumulative, by percentile.

In the top 10 and 20 percentiles, the logistic model (Variable Selection node used for variable selection) performs best, capturing a 92.22% response rate. At the 30th percentile, the decision tree model (Gini decision tree node used for variable selection) edges ahead of the other three, capturing 86.83% of the response. Looking at the figures, however, it is managerially more viable to target the top 20 percentiles (this reduces the cost of mailing, if that is the action to be taken), where we capture a much greater response using the logistic regression model, than to go after the 86.83% at the 30th percentile with the decision tree model.
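The decile figures behind Tables 10-13 can be recomputed from any scored validation set; a small helper is sketched below. The scoring call in the final comment assumes the variable names from the earlier modeling sketch.

```python
import numpy as np
import pandas as pd

def lift_table(y_true, p_score, n_bins=10):
    """Decile table: non-cumulative % response, cumulative % captured response,
    and cumulative lift, computed from predicted 3G probabilities."""
    d = pd.DataFrame({"y": np.asarray(y_true), "p": np.asarray(p_score)})
    d = d.sort_values("p", ascending=False).reset_index(drop=True)
    d["decile"] = (np.arange(len(d)) * n_bins // len(d)) + 1
    base_rate = d["y"].mean()
    g = d.groupby("decile")["y"].agg(["mean", "sum", "count"])
    g["pct_response"] = 100 * g["mean"]                               # non-cumulative % response
    g["pct_captured_cum"] = 100 * g["sum"].cumsum() / d["y"].sum()    # cumulative % captured
    g["lift_cum"] = (g["sum"].cumsum() / g["count"].cumsum()) / base_rate
    return g[["pct_response", "pct_captured_cum", "lift_cum"]]

# Example: lift_table(y_test, models["logistic regression"].predict_proba(X_test)[:, 1])
```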

Legend: Ensemble = combination model; Var:Chi-3 = decision tree model with the Variable Selection node; DT:Chi-3 = decision tree model with the Gini tree for variable selection; Logistic = logistic regression model with the Variable Selection node.

Fig 18: Lift chart (% response) on Validation Data - Non-Cumulative

Thus we see the logistic regression model performs better than the other three models. We chose the Entropy-2 model as a better predictor of percentage responses, and definitely a huge improvement over the current baseline model.

Lift chart (% Captured Response) on Validation Data - Non-Cumulative

To further analyze the models we used lift charts based on non-cumulative captured response. If we were to target the top 10% alone, the logistic regression model yields the higher captured response, i.e. it captures 18.30% of the actual 3G customers in the data. In the next decile it again captures 18.30%. At the third decile the decision tree model (Decision Tree node for variable selection) performs best, capturing 17.23% of the actual 3G customers; but by this point the logistic regression model has captured nearly 52.26% of the actual 3G customers in the data, compared to 51.73% by the next best model.

Table 11: Comparison of the models (Ensemble, VS -> DT, DT -> DT, Logistic) against the baseline model (10% per decile): % captured response, non-cumulative, by percentile.

The captured response values and their graphical representation clearly indicate two things: one, that up to the 20th percentile the logistic regression model outdoes the other models; and two, that at the 55th percentile all the models perform worse than the baseline model. We can thus conclude that this chart also indicates the logistic regression model to be the best of the 4 models in terms of capturing the most actual buyers in the fewest deciles.

Legend: Ensemble = combination model; Var:Chi-3 = decision tree model with the Variable Selection node; DT:Chi-3 = decision tree model with the Gini tree for variable selection; Logistic = logistic regression model with the Variable Selection node.

Fig 19: Lift chart (% Captured response) on Validation Data - Non-Cumulative

Lift chart (% Captured Response) on Validation Data - Cumulative

Speaking cumulatively, if we were to target only the top 10% of the scored file, the logistic regression model would capture the most (18.30%) of the actual 3G customers. At 20% of the file, it captures 36.60%. If we were to choose just the top 30% of the data, it would capture 52.26% of the actual customers while having mailed much less than half of the population. At 40%, it performs optimally, capturing 66.04% of the actual 3G customers.

Table 12: Comparison of the models (Ensemble, VS -> DT, DT -> DT, Logistic): % captured response, cumulative, by percentile.

The table above as well as the graph below clearly illustrate that the logistic regression model is the best performing of the models in comparison.

Legend: Ensemble = combination model; Var:Chi-3 = decision tree model with the Variable Selection node; DT:Chi-3 = decision tree model with the Gini tree for variable selection; Logistic = logistic regression model with the Variable Selection node.

Fig 20: Lift chart (% captured response) on Validation Data - Cumulative

Lift Value on Validation Data - Cumulative

In terms of how the models perform against the baseline, we see that if we target just the top 10% or 20% of the file, the logistic regression model performs best, with a lift of 1.83: it captures 1.83 times the percentage of buyers captured by the baseline. At 30% of the file its performance drops, though not considerably, to 1.74 times; at the 30th percentile the logistic regression model is still almost twice as good as the baseline. This may look like a small number, but applied to millions of records it would make a tangible difference in profits.

Table 13: Comparison of the models (Ensemble, VS -> DT, DT -> DT, Logistic): lift value, cumulative, by percentile.

Legend: Ensemble = combination model; Var:Chi-3 = decision tree model with the Variable Selection node; DT:Chi-3 = decision tree model with the Gini tree for variable selection; Logistic = logistic regression model with the Variable Selection node.

Fig 21: Lift value on Validation Data - Cumulative

After evaluating all four models, we have concluded that the logistic regression model built with the Variable Selection node performs better than all the other models created thus far.

Sensitivity

The ability of the models to predict the actual 3G customers is of greater impact than predicting those who will not move. Hence the sensitivity of the model is one of the selection factors between models. The logistic regression model yields a sensitivity of 78.61% over the training data set (Figure 22).

Figure 22: Cross-tab plot: Logistic Regression
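Sensitivity here is the true-positive rate on the 3G class; a minimal computation from actual and predicted labels is sketched below (the example call assumes the variable names from the earlier modeling sketch).

```python
from sklearn.metrics import confusion_matrix

def sensitivity(y_true, y_pred):
    """Share of actual 3G customers (label 1) that are predicted as 3G."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tp / (tp + fn)

# Example: sensitivity(y_test, models["logistic regression"].predict(X_test))
```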

Variable Selection

Logistic regression selects 63 variables (inclusive of dummy-coded variables).

Figure 23: Effect of the variables

It is quite clear from the chart above that the older the handset, the higher the probability that the customer will not shift to the 3G network. This could be because the customer is more familiar with the features of the current handset and is not willing to use, or has no use for, the new features. Similar results are evident for the age of the customer and the sub-plan he or she has chosen. It is also seen, however, that although the sub-plan has an overall negative effect in the model, there are some plans which, when chosen by a customer, increase the chances of shifting to the 3G network. The model has a positive intercept, indicating that if nothing is known about a customer he still has some likelihood of shifting to 3G.

VI. Evaluation

We have evaluated the regression model as the more accurate predictor of 3G customers.

Quantitative Analysis

To quantify our results and justify our model selection in more managerial terms, we present the example below. We know that in general younger people prefer change, and more so in the telecom industry. For the purpose of predicting potential 3G customers, it is clear that the target customer is young, affluent, and uses more features such as games and multimedia messaging.
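The positive and negative effects in Figure 23 are logistic regression coefficients; exponentiating a coefficient gives the odds ratio for a one-unit increase in that input. The numbers below are purely illustrative, not the fitted values, which live inside the Enterprise Miner model.

```python
import numpy as np

# Illustrative coefficients only: negative signs for HS_AGE and AGE, positive for games usage.
coefs = {"HS_AGE": -0.45, "AGE": -0.02, "AVG_VAS_GAMES": 0.30}
for name, b in coefs.items():
    # An odds ratio below 1 means the odds of moving to 3G fall as the input grows.
    print(f"{name}: odds ratio per unit increase = {np.exp(b):.2f}")
```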

From a managerial point of view we try to quantify our model, taking the example of a 25-year-old customer paying on average $500 for telecom services, who downloads close to 22,500 Kb monthly and has a handset around 4 years old. This customer's likelihood of being a potential 3G customer changes if the manufacturer of the phone or the sub-plan changes.

Table 14: Probability of purchase compared to customer profiles (manufacturer code, sub-plan code, number of retention campaigns sent, probability); the key profiles are described below.

We see that if the customer has a phone manufactured by the manufacturer coded 49, uses sub-plan 2101 and is sent 3 retention campaigns, he has the highest probability of moving to the 3G service: 99.99%. This is our IDEAL customer. If we change just the manufacturer to 1, the customer now has only a 0.12% chance of changing to 3G. The same customer with a handset manufactured by code 6 has a 73.90% chance of moving to 3G, which is our mid-tier probability. If we increase the retention campaigns sent to the customer, the probability drops; hence we recommend not flooding the customer with retention campaigns. Our next ideal profile is a customer with a manufacturer 1 phone but sub-plan 2108, who has a 98.42% chance of moving to 3G.

Summary - Ideal Profile:
o Young customer, age around 25.
o Uses games and downloads around 22,500 Kb a month.
o Pays a monthly bill of $500 and has a handset around 4 years old.
o Total retention campaigns sent should be around 5, not more.
o Has a phone manufactured by code 49 with sub-plan 2101, or
o a phone manufactured by code 1 with sub-plan 2108.
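The probability swings in Table 14 follow directly from the logistic link: changing one coded input shifts the linear score z, and the probability moves through 1/(1 + e^-z). The scores below are back-solved to match the probabilities quoted above and are purely illustrative, not the fitted model.

```python
import numpy as np

def logistic(z):
    """Logistic link: converts a linear score into a probability of moving to 3G."""
    return 1.0 / (1.0 + np.exp(-z))

# Two profiles that differ only in handset manufacturer (illustrative scores).
z_ideal = 9.2    # manufacturer 49, sub-plan 2101, 3 retention campaigns
z_other = -6.7   # same customer with the manufacturer switched to code 1
print(f"ideal profile:   P(3G) = {logistic(z_ideal):.4f}")   # about 0.9999
print(f"changed handset: P(3G) = {logistic(z_other):.4f}")   # about 0.0012
```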


Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Training Brochure 2009 TABLE OF CONTENTS 1 SPSS TRAINING COURSES FOCUSING

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"

!!!#$$%&'()*+$(,%!#$%$&'()*%(+,'-*&./#-$&'(-&(0*.$#-$1(2&.3$'45 !"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"!"#"$%&#'()*+',$$-.&#',/"-0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:

More information

What is Data Mining? MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling

What is Data Mining? MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling MS4424 Data Mining & Modelling MS4424 Data Mining & Modelling Lecturer : Dr Iris Yeung Room No : P7509 Tel No : 2788 8566 Email : msiris@cityu.edu.hk 1 Aims To introduce the basic concepts of data mining

More information

Identifying and Overcoming Common Data Mining Mistakes Doug Wielenga, SAS Institute Inc., Cary, NC

Identifying and Overcoming Common Data Mining Mistakes Doug Wielenga, SAS Institute Inc., Cary, NC Identifying and Overcoming Common Data Mining Mistakes Doug Wielenga, SAS Institute Inc., Cary, NC ABSTRACT Due to the large amount of data typically involved, data mining analyses can exacerbate some

More information

not possible or was possible at a high cost for collecting the data.

not possible or was possible at a high cost for collecting the data. Data Mining and Knowledge Discovery Generating knowledge from data Knowledge Discovery Data Mining White Paper Organizations collect a vast amount of data in the process of carrying out their day-to-day

More information

IBM SPSS Direct Marketing 19

IBM SPSS Direct Marketing 19 IBM SPSS Direct Marketing 19 Note: Before using this information and the product it supports, read the general information under Notices on p. 105. This document contains proprietary information of SPSS

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

A Property and Casualty Insurance Predictive Modeling Process in SAS

A Property and Casualty Insurance Predictive Modeling Process in SAS Paper 11422-2016 A Property and Casualty Insurance Predictive Modeling Process in SAS Mei Najim, Sedgwick Claim Management Services ABSTRACT Predictive analytics is an area that has been developing rapidly

More information

DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING

DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING ABSTRACT The objective was to predict whether an offender would commit a traffic offence involving death, using decision tree analysis. Four

More information

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics. Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing

More information

Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA

Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA An Overview of SAS Enterprise Miner The following article is in regards to Enterprise Miner v.4.3 that is available in SAS v9.1.3.

More information

How To Make A Credit Risk Model For A Bank Account

How To Make A Credit Risk Model For A Bank Account TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

More information

An Overview and Evaluation of Decision Tree Methodology

An Overview and Evaluation of Decision Tree Methodology An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX terri.moore@motorola.com Carole Jesse Cargill, Inc. Wayzata, MN carole_jesse@cargill.com

More information

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in

More information

Application of SAS! Enterprise Miner in Credit Risk Analytics. Presented by Minakshi Srivastava, VP, Bank of America

Application of SAS! Enterprise Miner in Credit Risk Analytics. Presented by Minakshi Srivastava, VP, Bank of America Application of SAS! Enterprise Miner in Credit Risk Analytics Presented by Minakshi Srivastava, VP, Bank of America 1 Table of Contents Credit Risk Analytics Overview Journey from DATA to DECISIONS Exploratory

More information

Business Intelligence. Tutorial for Rapid Miner (Advanced Decision Tree and CRISP-DM Model with an example of Market Segmentation*)

Business Intelligence. Tutorial for Rapid Miner (Advanced Decision Tree and CRISP-DM Model with an example of Market Segmentation*) Business Intelligence Professor Chen NAME: Due Date: Tutorial for Rapid Miner (Advanced Decision Tree and CRISP-DM Model with an example of Market Segmentation*) Tutorial Summary Objective: Richard would

More information

Reevaluating Policy and Claims Analytics: a Case of Non-Fleet Customers In Automobile Insurance Industry

Reevaluating Policy and Claims Analytics: a Case of Non-Fleet Customers In Automobile Insurance Industry Paper 1808-2014 Reevaluating Policy and Claims Analytics: a Case of Non-Fleet Customers In Automobile Insurance Industry Kittipong Trongsawad and Jongsawas Chongwatpol NIDA Business School, National Institute

More information

Data Mining: Overview. What is Data Mining?

Data Mining: Overview. What is Data Mining? Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,

More information

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

In this presentation, you will be introduced to data mining and the relationship with meaningful use. In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine

More information

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether

More information

Modeling Customer Lifetime Value Using Survival Analysis An Application in the Telecommunications Industry

Modeling Customer Lifetime Value Using Survival Analysis An Application in the Telecommunications Industry Paper 12028 Modeling Customer Lifetime Value Using Survival Analysis An Application in the Telecommunications Industry Junxiang Lu, Ph.D. Overland Park, Kansas ABSTRACT Increasingly, companies are viewing

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This

More information

Easily Identify Your Best Customers

Easily Identify Your Best Customers IBM SPSS Statistics Easily Identify Your Best Customers Use IBM SPSS predictive analytics software to gain insight from your customer database Contents: 1 Introduction 2 Exploring customer data Where do

More information

Benchmarking of different classes of models used for credit scoring

Benchmarking of different classes of models used for credit scoring Benchmarking of different classes of models used for credit scoring We use this competition as an opportunity to compare the performance of different classes of predictive models. In particular we want

More information

Paper AA-08-2015. Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM

Paper AA-08-2015. Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM Paper AA-08-2015 Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM Delali Agbenyegah, Alliance Data Systems, Columbus, Ohio 0.0 ABSTRACT Traditional

More information

Organizing Your Approach to a Data Analysis

Organizing Your Approach to a Data Analysis Biost/Stat 578 B: Data Analysis Emerson, September 29, 2003 Handout #1 Organizing Your Approach to a Data Analysis The general theme should be to maximize thinking about the data analysis and to minimize

More information

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through

More information

Data Mining Using SAS Enterprise Miner : A Case Study Approach, Second Edition

Data Mining Using SAS Enterprise Miner : A Case Study Approach, Second Edition Data Mining Using SAS Enterprise Miner : A Case Study Approach, Second Edition The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2003. Data Mining Using SAS Enterprise

More information

11. Analysis of Case-control Studies Logistic Regression

11. Analysis of Case-control Studies Logistic Regression Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

More information

Data Mining: A Magic Technology for College Recruitment. Tongshan Chang, Ed.D.

Data Mining: A Magic Technology for College Recruitment. Tongshan Chang, Ed.D. Data Mining: A Magic Technology for College Recruitment Tongshan Chang, Ed.D. Principal Administrative Analyst Admissions Research and Evaluation The University of California Office of the President Tongshan.Chang@ucop.edu

More information

A fast, powerful data mining workbench designed for small to midsize organizations

A fast, powerful data mining workbench designed for small to midsize organizations FACT SHEET SAS Desktop Data Mining for Midsize Business A fast, powerful data mining workbench designed for small to midsize organizations What does SAS Desktop Data Mining for Midsize Business do? Business

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

Simple Predictive Analytics Curtis Seare

Simple Predictive Analytics Curtis Seare Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

More information

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc.

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc. A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc. Introduction: The Basel Capital Accord, ready for implementation in force around 2006, sets out

More information

S03-2008 The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY

S03-2008 The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY S03-2008 The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY ABSTRACT Predictive modeling includes regression, both logistic and linear,

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools Occam s razor.......................................................... 2 A look at data I.........................................................

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

Using Data Mining Techniques for Analyzing Pottery Databases

Using Data Mining Techniques for Analyzing Pottery Databases BAR-ILAN UNIVERSITY Using Data Mining Techniques for Analyzing Pottery Databases Zachi Zweig Submitted in partial fulfillment of the requirements for the Master s degree in the Martin (Szusz) Department

More information

Data Mining with SAS. Mathias Lanner mathias.lanner@swe.sas.com. Copyright 2010 SAS Institute Inc. All rights reserved.

Data Mining with SAS. Mathias Lanner mathias.lanner@swe.sas.com. Copyright 2010 SAS Institute Inc. All rights reserved. Data Mining with SAS Mathias Lanner mathias.lanner@swe.sas.com Copyright 2010 SAS Institute Inc. All rights reserved. Agenda Data mining Introduction Data mining applications Data mining techniques SEMMA

More information

Course Syllabus. Purposes of Course:

Course Syllabus. Purposes of Course: Course Syllabus Eco 5385.701 Predictive Analytics for Economists Summer 2014 TTh 6:00 8:50 pm and Sat. 12:00 2:50 pm First Day of Class: Tuesday, June 3 Last Day of Class: Tuesday, July 1 251 Maguire Building

More information

M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1. 15.7 Analytics and Data Mining 1

M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1. 15.7 Analytics and Data Mining 1 M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1 15.7 Analytics and Data Mining 15.7 Analytics and Data Mining 1 Section 1.5 noted that advances in computing processing during the past 40 years have

More information

Survey Analysis: Data Mining versus Standard Statistical Analysis for Better Analysis of Survey Responses

Survey Analysis: Data Mining versus Standard Statistical Analysis for Better Analysis of Survey Responses Survey Analysis: Data Mining versus Standard Statistical Analysis for Better Analysis of Survey Responses Salford Systems Data Mining 2006 March 27-31 2006 San Diego, CA By Dean Abbott Abbott Analytics

More information

Data Mining with SQL Server Data Tools

Data Mining with SQL Server Data Tools Data Mining with SQL Server Data Tools Data mining tasks include classification (directed/supervised) models as well as (undirected/unsupervised) models of association analysis and clustering. 1 Data Mining

More information

Modeling Lifetime Value in the Insurance Industry

Modeling Lifetime Value in the Insurance Industry Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting

More information

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product Sagarika Prusty Web Data Mining (ECT 584),Spring 2013 DePaul University,Chicago sagarikaprusty@gmail.com Keywords:

More information

Enhancing Compliance with Predictive Analytics

Enhancing Compliance with Predictive Analytics Enhancing Compliance with Predictive Analytics FTA 2007 Revenue Estimation and Research Conference Reid Linn Tennessee Department of Revenue reid.linn@state.tn.us Sifting through a Gold Mine of Tax Data

More information

The Predictive Data Mining Revolution in Scorecards:

The Predictive Data Mining Revolution in Scorecards: January 13, 2013 StatSoft White Paper The Predictive Data Mining Revolution in Scorecards: Accurate Risk Scoring via Ensemble Models Summary Predictive modeling methods, based on machine learning algorithms

More information

Grow Revenues and Reduce Risk with Powerful Analytics Software

Grow Revenues and Reduce Risk with Powerful Analytics Software Grow Revenues and Reduce Risk with Powerful Analytics Software Overview Gaining knowledge through data selection, data exploration, model creation and predictive action is the key to increasing revenues,

More information

How To Check For Differences In The One Way Anova

How To Check For Differences In The One Way Anova MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

PASW Direct Marketing 18

PASW Direct Marketing 18 i PASW Direct Marketing 18 For more information about SPSS Inc. software products, please visit our Web site at http://www.spss.com or contact SPSS Inc. 233 South Wacker Drive, 11th Floor Chicago, IL 60606-6412

More information

MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION

MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION Matthew A. Lanham & Ralph D. Badinelli Virginia Polytechnic Institute and State University Department of Business

More information

IBM SPSS Direct Marketing 20

IBM SPSS Direct Marketing 20 IBM SPSS Direct Marketing 20 Note: Before using this information and the product it supports, read the general information under Notices on p. 105. This edition applies to IBM SPSS Statistics 20 and to

More information

Developing Credit Scorecards Using Credit Scoring for SAS Enterprise Miner TM 12.1

Developing Credit Scorecards Using Credit Scoring for SAS Enterprise Miner TM 12.1 Developing Credit Scorecards Using Credit Scoring for SAS Enterprise Miner TM 12.1 SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2012. Developing

More information

Big Data Analytics. Benchmarking SAS, R, and Mahout. Allison J. Ames, Ralph Abbey, Wayne Thompson. SAS Institute Inc., Cary, NC

Big Data Analytics. Benchmarking SAS, R, and Mahout. Allison J. Ames, Ralph Abbey, Wayne Thompson. SAS Institute Inc., Cary, NC Technical Paper (Last Revised On: May 6, 2013) Big Data Analytics Benchmarking SAS, R, and Mahout Allison J. Ames, Ralph Abbey, Wayne Thompson SAS Institute Inc., Cary, NC Accurate and Simple Analysis

More information

BOR 6335 Data Mining. Course Description. Course Bibliography and Required Readings. Prerequisites

BOR 6335 Data Mining. Course Description. Course Bibliography and Required Readings. Prerequisites BOR 6335 Data Mining Course Description This course provides an overview of data mining and fundamentals of using RapidMiner and OpenOffice open access software packages to develop data mining models.

More information

Product recommendations and promotions (couponing and discounts) Cross-sell and Upsell strategies

Product recommendations and promotions (couponing and discounts) Cross-sell and Upsell strategies WHITEPAPER Today, leading companies are looking to improve business performance via faster, better decision making by applying advanced predictive modeling to their vast and growing volumes of data. Business

More information

Model Validation Techniques

Model Validation Techniques Model Validation Techniques Kevin Mahoney, FCAS kmahoney@ travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost

More information

Innovations and Value Creation in Predictive Modeling. David Cummings Vice President - Research

Innovations and Value Creation in Predictive Modeling. David Cummings Vice President - Research Innovations and Value Creation in Predictive Modeling David Cummings Vice President - Research ISO Innovative Analytics 1 Innovations and Value Creation in Predictive Modeling A look back at the past decade

More information

How Organisations Are Using Data Mining Techniques To Gain a Competitive Advantage John Spooner SAS UK

How Organisations Are Using Data Mining Techniques To Gain a Competitive Advantage John Spooner SAS UK How Organisations Are Using Data Mining Techniques To Gain a Competitive Advantage John Spooner SAS UK Agenda Analytics why now? The process around data and text mining Case Studies The Value of Information

More information

CHURN PREDICTION IN MOBILE TELECOM SYSTEM USING DATA MINING TECHNIQUES

CHURN PREDICTION IN MOBILE TELECOM SYSTEM USING DATA MINING TECHNIQUES International Journal of Scientific and Research Publications, Volume 4, Issue 4, April 2014 1 CHURN PREDICTION IN MOBILE TELECOM SYSTEM USING DATA MINING TECHNIQUES DR. M.BALASUBRAMANIAN *, M.SELVARANI

More information

Decision Trees What Are They?

Decision Trees What Are They? Decision Trees What Are They? Introduction...1 Using Decision Trees with Other Modeling Approaches...5 Why Are Decision Trees So Useful?...8 Level of Measurement... 11 Introduction Decision trees are a

More information

Data Mining for Knowledge Management. Classification

Data Mining for Knowledge Management. Classification 1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

More information

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

More information

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19 PREFACE xi 1 INTRODUCTION 1 1.1 Overview 1 1.2 Definition 1 1.3 Preparation 2 1.3.1 Overview 2 1.3.2 Accessing Tabular Data 3 1.3.3 Accessing Unstructured Data 3 1.3.4 Understanding the Variables and Observations

More information

Variable Selection in the Credit Card Industry Moez Hababou, Alec Y. Cheng, and Ray Falk, Royal Bank of Scotland, Bridgeport, CT

Variable Selection in the Credit Card Industry Moez Hababou, Alec Y. Cheng, and Ray Falk, Royal Bank of Scotland, Bridgeport, CT Variable Selection in the Credit Card Industry Moez Hababou, Alec Y. Cheng, and Ray Falk, Royal ank of Scotland, ridgeport, CT ASTRACT The credit card industry is particular in its need for a wide variety

More information

Predictive Analytics in the Public Sector: Using Data Mining to Assist Better Target Selection for Audit

Predictive Analytics in the Public Sector: Using Data Mining to Assist Better Target Selection for Audit Predictive Analytics in the Public Sector: Using Data Mining to Assist Better Target Selection for Audit Duncan Cleary Revenue Irish Tax and Customs, Ireland dcleary@revenue.ie Abstract: Revenue, the Irish

More information

Detecting Email Spam. MGS 8040, Data Mining. Audrey Gies Matt Labbe Tatiana Restrepo

Detecting Email Spam. MGS 8040, Data Mining. Audrey Gies Matt Labbe Tatiana Restrepo Detecting Email Spam MGS 8040, Data Mining Audrey Gies Matt Labbe Tatiana Restrepo 5 December 2011 INTRODUCTION This report describes a model that may be used to improve likelihood of recognizing undesirable

More information

Business Analytics Using SAS Enterprise Guide and SAS Enterprise Miner A Beginner s Guide

Business Analytics Using SAS Enterprise Guide and SAS Enterprise Miner A Beginner s Guide Business Analytics Using SAS Enterprise Guide and SAS Enterprise Miner A Beginner s Guide Olivia Parr-Rud From Business Analytics Using SAS Enterprise Guide and SAS Enterprise Miner. Full book available

More information

Predicting Customer Churn in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS

Predicting Customer Churn in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS Paper 114-27 Predicting Customer in the Telecommunications Industry An Application of Survival Analysis Modeling Using SAS Junxiang Lu, Ph.D. Sprint Communications Company Overland Park, Kansas ABSTRACT

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Introduction to Regression and Data Analysis

Introduction to Regression and Data Analysis Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

More information

Predicting earning potential on Adult Dataset

Predicting earning potential on Adult Dataset MSc in Computing, Business Intelligence and Data Mining stream. Business Intelligence and Data Mining Applications Project Report. Predicting earning potential on Adult Dataset Submitted by: xxxxxxx Supervisor:

More information