2012 45th Hawaii International Conference on System Sciences

A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication

Namhyoung Kim, Jaewook Lee
Department of Industrial and Management Engineering, POSTECH, Pohang, South Korea
{skagud,jaewookl}@postech.ac.kr

Kyu-Hwan Jung
SK Telecom, Seoul, South Korea
Onlyou7@postech.ac.kr

Yong Seog Kim
Department of Management Information Systems, Utah State University, Logan, UT 84322, USA
yong.kim@usu.edu

Abstract

This paper explores the application of a single SVM classifier and its variants to the churner identification problem in the mobile telecommunication industry, in which the role of customer retention programs has become more important than ever due to a very competitive business environment. In particular, this study introduces a uniformly subsampled ensemble model of SVM classifiers combined with principal component analysis (PCA), not only to reduce the high dimensionality of the data but also to boost the reliability and accuracy of calibrated models on data sets with highly skewed class distributions. According to our experiments, the performance of the USE SVM model with PCA is superior to all compared models, and the number of principal components (PCs) affects the accuracy of the ensemble models.

1. Introduction

The availability of cheap hard disk space and the expansion of data collection technologies empower many companies to easily monitor and visualize customers' daily purchase and usage patterns through online transaction processing (OLTP) databases [5]. Therefore, these days, most companies have plenty of data. However, data itself is not information; data must be turned into information so that users can answer their own questions with the right information at the right time and in the right place.
In this paper, we consider an imaginary company in the mobile telecommunications industry that faces very steep competition and hence is compelled to capture, understand, and harness its customer-related data sets to seek new business opportunities with new customers and to retain current customers through improved business operations. Note that many companies in the telecommunications industry have been suffering from extremely high churning rates, i.e., between 20% and 40% of customers leave their current service provider in a given year, mainly because relatively homogeneous technologies and services drive the companies to compete on lower service charges. In such a case, the role of marketing becomes a key success factor. In particular, it is well known that it is much more profitable for a company to retain a current, loyal customer than to recruit a new one, considering increasing marketing costs. Micro or target marketing programs with tailored messages are far more cost effective than mass marketing programs through traditional channels such as TV and newspapers. Therefore, it is strongly recommended that companies in a very competitive business environment operate their own customer relationship management (CRM) systems, equipped with business intelligence and data mining tools, to identify the group of customers who are most likely to terminate their relationships with their current service providers. Note that churn identification and prevention is a critical issue because the mobile phone market has already reached a saturation point, and each company strives to attract new subscribers while retaining current profitable customers [19]. To support such an effort, in this paper we introduce one such micro-marketing tool suited for churner identification on behalf of companies in the telecommunications industry.
978-0-7695-4525-7/12 $26.00 2012 IEEE  DOI 10.1109/HICSS.2012.74

We first note that churn management should start with an accurate identification of churners, possibly coupled with detailed profiling of their demographic information and behavioral and transactional patterns. While
developing retention strategies and management practices targeted at the identified likely churners would complete a churn management system, we limit our interest to developing a new SVM ensemble model that accurately identifies possible churners from their service usage patterns collected over a certain period.

The remainder of this paper is organized as follows. Section 2 describes the original data set and the preprocessing procedure. We then introduce the Uniformly Subsampled Ensemble (USE) method in Section 3 and present experimental results in Section 4. Finally, Section 5 concludes this paper and provides suggestions for future research directions.

2. Data Description and Evaluation Metrics

2.1. Telecommunications Market Data

The data sets used in this paper are the customer records of a major wireless telecommunications company, provided by the Teradata Center for CRM at Duke University [9]. The data collection period is the second half of 2001. Active customers who had been with the company for at least 6 months were sampled. There are 171 original predictor variables and 100,000 samples. The predictors include four types of variables: demographics, such as age, location, and number and ages of children; financial variables, such as credit score and credit card ownership; product details, such as handset price and handset capabilities; and phone usage information.

To predict churn, we first have to set the criteria for churn. We classified customers who left the company within the 60 days following sampling as churners. The actual ratio of churners in a given month is approximately 1.8%, but churners in the original training data set were oversampled to 50%. In the test data set there were 51,036 observations with 924 churners, which represents the real churning rate of approximately 1.8% per month. Fig. 1 shows a plot of the training dataset with two features chosen by feature selection; the churners and non-churners are heavily overlapped.

Figure 1. Plot of training dataset

2.2. Data Preprocessing

Before applying the proposed method, we preprocessed the raw data as follows. First, we eliminated continuous variables with more than 20% missing values. Second, categorical variables with a high missing rate were also eliminated, because each categorical variable has very little predictive power in general [17]. Moreover, if categorical variables are encoded into multiple binary variables, the dimensionality increases; thus only 11 categorical variables, either indicator or countable variables, were included. Finally, we removed observations with missing values. After these preprocessing steps, we have 123 predictors: 11 categorical variables and 112 continuous variables. The training dataset has 67,181 observations with 32,862 churners, a churn rate of approximately 49%. The test set has 34,986 observations with 619 churners, a churn rate of approximately 1.8%.

2.3. Evaluation

We used the hit rate as the evaluation metric for our research. The hit rate is a popular measure for numerically evaluating the predictive power of models in the marketing field [18]. It is calculated as

    Hit rate = (Σ_{i=1}^{n} H_i) / n    (1)

where H_i is 1 if the prediction is correct and 0 otherwise, and n is the number of samples in the data set. In other words, the hit rate represents the percentage of correctly predicted churners among the churner candidates. A hit rate is associated with a target point. For example, the hit rate at a target point of x% is the hit rate when only the top x% of customers, ranked by their estimated churn probabilities, are considered for evaluation. Therefore, if we assume that we
have 10,000 observations, the hit rate at a target point of 10% is the percentage of correctly predicted churners out of the 1,000 customers who are most likely to churn. Considering hit rates with target points is important because marketing managers have to focus only on the top percentage of customers due to budget and time constraints. Thus, our target point is 30%.

3. Proposed Ensemble Method

In this section, we present the structure of our new ensemble model, the USE, and describe its unique characteristics in terms of sampling and weighting schemes. Figure 2 graphically presents the structure of the USE model. The first step in building the USE is to partition the data set into subsets, each used to train a single corresponding classifier. Once each classifier is calibrated to produce an estimated score (e.g., probability of churning) for each customer record in its partition, the USE ensemble model aggregates the scores of the classifiers and produces the final ensemble score.

Figure 2. The structure of the proposed ensemble method

3.1. Weighting methods

To generate a collective decision, we consider several ways to aggregate the predictions of the trained classification models through various weighting schemes: uniform weights, weighting by classification performance, and weighting by hit rate. The simplest scheme is the uniform weight method, which applies the same weight (= 1/M) to the predictions of all classifiers. Alternatively, the prediction of each individual classifier may be weighted by its binary classification performance or its hit rate on validation data sampled from the training data. To apply the weighting scheme based on classification performance, the classification accuracy of each classifier on the validation data is normalized so that the weights sum to 1, and the final prediction on the test data set is weighted accordingly.
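As a concrete illustration of the hit-rate metric from Section 2.3 (which also serves as a weighting signal below), here is a minimal sketch; the function name `hit_rate_at` and the toy data are our own, not part of the original study.

```python
import numpy as np

def hit_rate_at(scores, labels, target=0.30):
    """Hit rate at a target point: among the top `target` fraction of
    customers ranked by estimated churn score, the share that actually
    churned (labels: 1 = churner, 0 = non-churner)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    n = max(1, int(round(len(scores) * target)))  # size of the targeted group
    top = np.argsort(scores)[::-1][:n]            # highest-scoring customers first
    return float(labels[top].mean())

# 10 customers, target point 30% -> evaluate only the top 3 scores
scores = [0.95, 0.10, 0.80, 0.30, 0.70, 0.05, 0.20, 0.40, 0.15, 0.25]
labels = [1,    0,    1,    0,    0,    0,    0,    1,    0,    0]
print(hit_rate_at(scores, labels))  # top 3 scores hold 2 churners -> 0.666...
```

The default `target=0.30` mirrors the 30% target point chosen in this paper; a marketing budget that covers fewer contacts would simply lower it.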
In the weighting scheme based on hit rate, the hit rates at 10%, 20%, and 30% are summed to measure performance and then normalized so that the weights sum to 1. The final prediction on the test data set is weighted according to these normalized weights as follows:

    f(x) = Σ_{m=1}^{M} w_m f̂_m(x)    (4)

3.2. Bagging and Boosting vs. USE

To build an accurate ensemble model based on our proposed USE method, we divide the entire training data set into M equally sized, non-overlapping subsamples using a random sampler. Consequently, any single classifier (e.g., an SVM classifier) can be calibrated on each subsampled data set to discover hidden patterns. Finally, the predictions of all classifiers are aggregated via a weighted summation to construct the final ensemble prediction for each record in the test data. In this sense, the proposed USE method is very similar to two popular ensemble methods, namely bagging [2] and boosting [6], which have been known to perform better than single classifiers [1], [3]. For example, ensemble models based on bagging train each classifier on a training set that consists of the same number of examples randomly drawn from the original training set, with equal probability of drawing any given example. Since samples are drawn with replacement, some examples may be selected multiple times whereas others may not be selected at all. Bagging combines the predictions of multiple classifiers by voting with equal weights. In short, the major differences between bagging and the proposed USE method are whether samples are drawn with replacement and whether the sampled training set for each single classifier is as large as the original training set.
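The sampling contrast just described can be sketched as follows; `use_subsamples` and `bagging_samples` are illustrative names of our own, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

def use_subsamples(n, M):
    """USE sampling: shuffle the n training indices and split them into M
    equal-sized, non-overlapping subsets (drawn without replacement), so
    each base classifier trains on roughly n/M examples."""
    return np.array_split(rng.permutation(n), M)

def bagging_samples(n, M):
    """Bagging sampling, for contrast: each base classifier receives n
    indices drawn with replacement, so subsets overlap and each keeps
    the original training-set size."""
    return [rng.integers(0, n, size=n) for _ in range(M)]

parts = use_subsamples(1000, 25)  # 25 disjoint subsets, 40 examples each
bags = bagging_samples(1000, 25)  # 25 overlapping samples, 1000 examples each
```

Because the USE subsets are disjoint and cover every training example exactly once, each base classifier sees only n/M records, which is what reduces the CPU and memory demands noted below.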
Furthermore, our proposed USE method differs from the boosting [6] method, which produces a series of classifiers, with each training set based on the performance of the previous classifiers. Through adaptive resampling in boosting, examples that are incorrectly predicted by previous classifiers are sampled more frequently, whereas uniform subsampling without replacement is used in the USE. Overall, each classifier in the USE model is calibrated on a smaller training set than the classifiers in bagging and boosting, which requires less CPU power and main memory, yet the USE model can still reduce the expected prediction error of a single predictor. All three ensemble models (bagging, boosting, and USE) share a common characteristic: the effectiveness and improved accuracy of the ensemble come primarily from the diversity caused by resampling training examples.

While it is perfectly reasonable to calibrate a single classifier on a sampled training set without further preprocessing, we also consider a data dimension reduction method, principal component analysis (PCA). Note that PCA is a mathematical procedure that transforms a set of correlated predictors into a set of new uncorrelated variables, called principal components (PCs), that capture the maximum amount of variation in the data. Since the number of PCs is less than or equal to the number of original variables, and each PC is uncorrelated with the other PCs, PCA can be particularly useful for reducing the high dimensionality of data sets in which many input variables are correlated. Dimensionality reduction is accomplished by selecting fewer PCs than the number of original input variables, and three methods have been widely used for determining the number of PCs. The first, the "eigenvalue-one" or Kaiser-Guttman criterion [8], selects all PCs with an eigenvalue greater than 1.
The second approach is based on the scree test [4]; it selects all PCs up to a definitive break between the sorted eigenvalues. The last criterion retains components that exceed a specified proportion of the variance in the data, where the proportion is calculated as follows:

    Proportion = Eigenvalue for the component of interest / Total eigenvalues of the correlation matrix

In the actual implementation of the USE model in the present paper, we built an ensemble of SVM classifiers. The SVM classifiers are used to construct the ensemble model mainly because of their popularity among researchers, their superior performance compared with other classifiers [10], [11], and the authors' familiarity. On the other hand, our proposed USE method can be combined with any other classifier. In addition, SVM classifiers often require additional computing power and show poor performance when applied to large-scale data [7], [12], [14], [15]. Therefore, the SVM classifier is a perfect candidate for testing the effectiveness of the USE method through data subsampling when the aim is to reduce the requirement for high computing power.

4. Experimental results

In this section, we present the process and results of applying the proposed Uniformly Subsampled Ensemble SVM to the telecommunications market data. Fig. 3 shows the correlation matrix of the variables. We can see that there are high correlations among features, which supports the need to extract uncorrelated new features.

Figure 3. The correlation matrix with values higher than 0.5

We applied PCA for data dimension reduction. As discussed in Section 3, there exist several methods for selecting the optimal number of PCs; among them, we considered the three approaches that are most commonly used.
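Two of the three PC-count criteria can be computed directly from the eigenvalues of the correlation matrix; the scree test requires visual inspection and is omitted. This is a minimal sketch on synthetic data, and `num_pcs` is our own illustrative name.

```python
import numpy as np

def num_pcs(X, var_prop=0.95):
    """Count PCs by the Kaiser-Guttman criterion (eigenvalue > 1) and by
    the proportion-of-variance rule (smallest k explaining var_prop)."""
    eig = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    kaiser = int((eig > 1).sum())
    explained = np.cumsum(eig) / eig.sum()   # cumulative Proportion
    k_var = int(np.searchsorted(explained, var_prop) + 1)
    return kaiser, k_var

# synthetic data: 12 correlated columns generated from 3 latent factors
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
X = np.hstack([base + 0.1 * rng.normal(size=(500, 3)) for _ in range(4)])
kaiser, k_var = num_pcs(X)  # both criteria recover the 3 latent factors
```

On real data the criteria typically disagree, as they do in this paper (27 PCs vs. 36-48 PCs), so the final count is still a modeling choice.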
Figure 4. Plot of Eigenvalues

Fig. 4 shows the plot of eigenvalues. The numbers of PCs obtained from each approach are as follows:

- The eigenvalue-one criterion [8]: 27 PCs
- The scree test [4]: 4 PCs
- Proportion of variance accounted for: 36 PCs (90%), 48 PCs (95%)

We applied the proposed method with the above numbers of PCs and compared the resulting hit rates. The results for the different numbers of PCs are presented in Fig. 5. As shown in the graph, the hit rate at 30% is highest when 48 PCs are used, and it tends to increase as the number of PCs increases. Thus, 48 PCs were selected in this study.

Figure 5. Effect of Number of PCs

After choosing the number of PCs, the optimal number of SVMs, M, should be considered. We explored the effect of the number of classifiers on predictive accuracy while the number of PCs was fixed at 48. The training dataset was divided into M different groups by a random sampler. As shown in Fig. 6, the hit rate at 10% is highest when M is 49, i.e., 49 SVMs, but 25 SVMs show a better hit rate at 30%. Thus, our final optimal model is an ensemble of 25 SVMs with 48 PCs.

Figure 6. Effect of Number of Classifiers

We also analyzed the effect of the weighting methods. Fig. 7 presents the cumulative hit rates for the different weighting methods; "PCA" in the graph denotes the uniform weight method. As shown in the figure, the weighting methods do not greatly affect the performance. However, the uniform weight method is easy to apply and performs slightly better than the other methods, so we decided to apply the proposed method with the uniform weight method.

Figure 7. Effect of weighting methods
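The aggregation step compared above reduces to a weighted sum of the member scores, cf. Eq. (4); this sketch, with our own illustrative names and toy scores, contrasts the uniform scheme with a performance-based one.

```python
import numpy as np

def aggregate(member_scores, weights=None):
    """Ensemble score f(x) = sum_m w_m * f_m(x), as in Eq. (4).
    member_scores: (M, n) array of per-classifier churn scores.
    weights=None gives the uniform scheme w_m = 1/M; otherwise the given
    performance- or hit-rate-based weights are normalized to sum to 1."""
    S = np.asarray(member_scores, dtype=float)
    if weights is None:
        w = np.full(S.shape[0], 1.0 / S.shape[0])
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
    return w @ S

S = [[0.9, 0.2],   # classifier 1's churn scores for two customers
     [0.5, 0.4],   # classifier 2
     [0.7, 0.6]]   # classifier 3
print(np.round(aggregate(S), 2))             # uniform weights -> [0.7 0.4]
print(np.round(aggregate(S, [3, 1, 1]), 2))  # skewed weights  -> [0.78 0.32]
```

With scores this similar across classifiers, skewing the weights barely reorders the customers, which matches the observation above that the weighting method has little effect.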
The performance of the proposed method was compared with that of other classifiers. The given dataset is large-scale and highly imbalanced: only 1.8% of the observations are churners. Thus, simple conventional methods will not work properly. In previous studies, the Partial Least Squares (PLS) model and the logistic model, popular models in the marketing field, have been proposed to solve this problem [13]. We also applied an ensemble multi-SVDD (Support Vector Domain Description) model to our problem. Fig. 9 presents the hit rates of five models: USE SVM + PCA, Ensemble Multi-SVDD, the PLS model, the logistic model, and a random model. The proposed USE SVM + PCA outperformed the other methods, showing a larger performance improvement at low proportions. As mentioned before, the hit rate at a low proportion is a more important measure than that at a high proportion. This shows that the proposed USE method outperforms the conventional methods not only theoretically but also practically.

Figure 9. Comparison with other methods

To explore how much the proposed PCA and ensemble model contribute to the increase in performance over a single SVM, we compared the performances of five models: USE SVM + PCA, USE SVM, single SVM + PCA, single SVM, and a random model. In Fig. 8, the hit rates were noticeably improved in both the USE SVM and USE SVM + PCA cases. In the case of a single SVM with PCA, the performance increased only slightly compared with the results of the previous two methods.

Figure 8. Gain by PCA and Ensemble

5. Conclusions

In this paper, we proposed the Uniformly Subsampled Ensemble (USE) method for churn management and showed that the USE SVM enhances churn prediction performance. New features are extracted using PCA. We also investigated the effect of the number of classifiers and principal components and gave a guideline for selecting them. Different aggregation methods were also considered, but they did not affect the results much. The performance of the USE SVM proposed in this research is superior to that of all compared models. For further research, an ensemble of heterogeneous classifiers can be considered: in the proposed methodology, only a single type of classifier, the SVM, is used for prediction, but other heterogeneous classifiers can be calibrated. The effect of the label distribution can also be analyzed, in addition to the effects of the number of classifiers and PCs.

6. References

[1] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105-139, 1999.
[2] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[3] L. Breiman. Stacked regressions. Machine Learning, 24(1):49-64, 1996.
[4] R. B. Cattell. The scree test for the number of factors. Multivariate Behavioral Research, 1:245-276, 1966.
[5] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record, 26:65-74, March 1997.
[6] Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In Proc. of 13th Int'l Conf. on Machine Learning, pages 148-156, Bari, Italy, 1996.
[7] K.-H. Jung, D. Lee, and J. Lee. Fast support-based clustering method for large-scale problems. Pattern Recognition, 43:1975-1983, 2010.
[8] H. Kaiser. The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20:141-151, 1960.
[9] Y. Kim. Toward a successful CRM: Variable selection, sampling, and ensemble. Decision Support Systems, 41(2):542-553, 2006.
[10] D. Lee and J. Lee. Domain described support vector classifier for multi-class classification problems. Pattern Recognition, 40:41-51, 2007.
[11] D. Lee and J. Lee. Equilibrium-based support vector machine for semisupervised classification. IEEE Trans. on Neural Networks, 18(2):578-583, 2007.
[12] D. Lee and J. Lee. Dynamic dissimilarity measure for support-based clustering. IEEE Trans. on Knowledge and Data Engineering, 22(6):900-905, 2010.
[13] H. Lee, Y. Kim, Y. Lee, and H. Cho. Toward optimal churn management: A partial least square (PLS) model. In Proc. of 16th AMCIS, Paper 78, pages 1-10, 2010.
[14] J. Lee and D. Lee. An improved cluster labeling method for support vector clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(3):461-464, March 2005.
[15] J. Lee and D. Lee. Dynamic characterization of cluster structures for robust and inductive support vector clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 28(11):1869-1874, November 2006.
[16] S. Rosset, E. Neumann, U. Eick, N. Vatnik, and I. Idan. Evaluation of prediction models for marketing campaigns. In Proc. of 7th Int'l Conf. on Knowledge Discovery & Data Mining (KDD-01), pages 456-461, 2001.
[17] P. E. Rossi, R. McCulloch, and G. Allenby. The value of household information in target marketing. Marketing Science, 15(3):321-340, 1996.
[18] P. Vassiliadis, A. Simitsis, and S. Skiadopoulos. Conceptual modeling for ETL processes. In Proc. of the 5th ACM Int'l Workshop on Data Warehousing and OLAP (DOLAP '02), pages 14-21, New York, NY, USA, 2002.
[19] L. Wright. The CRM imperative: Practice vs theory in the telecommunications industry. The Journal of Database Marketing, 9:339-349, July 2002.