Data Mining Application in Direct Marketing: Identifying Hot Prospects for a Banking Product

Sagarika Prusty
Web Data Mining (ECT 584), Spring 2013
DePaul University, Chicago
sagarikaprusty@gmail.com

Keywords: Data Mining, Direct Marketing, Clustering, Naïve Bayes, Decision Tree, Unbalanced Data

Abstract: Direct marketing is a form of advertising in which businesses send promotional offers addressed directly to individual customers. The success of such a campaign is measured as the percentage of customers who respond positively. Direct marketing is increasingly used by banks, insurance companies and the retail industry, yet success rates are normally below 10%. Data mining can improve the success rate significantly by identifying the customers who are most likely to buy the product, so that companies can target their campaigns at those hot prospects alone. This leads to a significant reduction in marketing cost and an increase in RoI (Return on Investment). In this paper we apply several data mining techniques to a banking dataset and illustrate how data mining can help the bank improve its direct marketing effort.

1. Introduction

Direct marketing is practiced by businesses of all sizes, from the smallest start-up to the leaders of the Fortune 500. A well-executed direct advertising campaign can prove a positive return on investment by showing how many potential customers responded to a clear call-to-action. Direct marketing is attractive to many marketers because its results can be measured directly. For example, if a marketer sends out 1,000 solicitations by mail and 100 recipients respond to the promotion, the marketer can say with confidence that the campaign led directly to
10% direct responses. This metric is known as the 'response rate' and is one of many clearly quantifiable success metrics employed by direct marketers. In contrast, general advertising uses indirect measurements, such as awareness or engagement, since there is no direct response from a consumer. Measurement of results is a fundamental element of successful direct marketing.

Predictive modeling and other data mining techniques can help marketers improve the response rate significantly. For example, suppose a company has a marketing budget sufficient to send promotional offers to 1,000 prospective customers. The company can get a much higher return on investment by sending the offers to the 1,000 customers who are most likely to buy the product rather than to a random base of 1,000 customers. Data mining techniques can help marketers identify those top 1,000 hot prospects. Data mining tools like cluster analysis can also help marketers group their customers into different clusters or segments and then address their needs accordingly.

2. KDD Process in Data Mining

KDD stands for Knowledge Discovery in Databases and refers to the broad process of discovering useful information from datasets. KDD is often used interchangeably with data mining, but data mining is actually one part of the KDD process. It is a systematic approach and can broadly be divided into five to six steps. The process starts with an understanding of the business objective and the goals of the project. Then comes the dataset identification step, where you select the target data that needs to be analyzed. More often than not, the dataset available is in raw format and needs to be pre-processed or cleaned. This step is called data preprocessing. Data preprocessing takes the maximum amount of time in the entire KDD process and, if done properly, will make the rest of the steps easier.
Sometimes data also needs to be transformed before a particular data mining technique can be applied. Transformation would typically involve discretizing the numeric attributes, recoding some of the attributes, or oversampling or reducing the sample size of the data. Once the data is preprocessed and transformed, it is ready for data modeling. Depending on the needs of the problem, a few techniques or algorithms are shortlisted and then applied to the dataset. This is the data modeling step, in which data miners apply different techniques and look for patterns and useful information in the data. Sometimes more than one technique is applied and then the best approach is selected. The best one can either be
one of the established techniques or a hybrid approach. The approach that gives the best result is then selected as the final model. The final step is the interpretation step, in which the information discovered in the data modeling step is presented in a format that can be understood by the end user.

In this paper we will follow the KDD process. However, the data used for this paper is already preprocessed and cleaned, so minimal effort is required in the preprocessing step and the data can be used straightaway for data modeling. The KDD process can be represented through the simple diagram in Fig. 1.

Fig. 1: The KDD process (from Overview of Web Mining and E-Commerce Data Analytics, Bamshad Mobasher, DePaul University)

3. Bank Direct Marketing Data

3.1 Source of the data

The dataset used for this paper is from a direct marketing campaign of a Portuguese bank for one of its term deposit products. The primary means of the campaign was phone calls to the bank's existing customers. Often, more than one phone call was required to assess whether a customer would subscribe to the product or not. The data is available in the public domain and can be downloaded from http://hdl.handle.net/1822/14838. The full dataset was described and analyzed in: S. Moro,
R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.

3.2 Understanding the dataset

The dataset has 45,211 instances. For the purpose of my project, I randomly split the data into two parts. The first, larger dataset contains 40,689 instances and has been used for training the models. The second dataset, with far fewer instances (4,524), has been used as the test set for validating model performance.

3.3 Attribute information

The dataset is related to 17 campaigns that occurred between May 2008 and November 2010. During these phone campaigns, customers of the bank were offered a long-term deposit application with an attractive interest rate. For each contact, a large number of attributes was recorded, and the output variable was whether the customer accepted the offer or not (Yes indicating that the customer accepted the offer and No indicating a negative response). Demographic details of each customer were then added to the campaign-related data. The available dataset has already been preprocessed: rows and columns with missing values have been cleaned, and only significant attributes are present (17 in total, including the output variable). The list of attributes is as follows (N = Numeric, C = Categorical, B = Binary):

Attribute  | Description                                   | Values
age        | Age (N)                                       | Numeric
job        | Job (C)                                       | Technician, Management, Student, Maid, Retired, etc.
marital    | Marital status (C)                            | Single, Married
education  | Education (C)                                 | Primary, Secondary, Tertiary
default    | Credit in default? (B)                        | Yes or No
balance    | Average yearly balance in euros (N)           | Numeric
housing    | Has housing loan? (B)                         | Yes or No
loan       | Has personal loan? (B)                        | Yes or No
contact    | Contact communication type (C)                | Phone, Mobile, Unknown
day        | Last contact day of the month (N)             | 1, 2, 3, ..., 30, 31
month      | Last contact month of the year (C)            | Jan, Feb, Mar, ..., Nov, Dec
duration   | Last contact duration, in seconds (N)         | Numeric
campaign   | Number of contacts during the campaign (N)    | Numeric
pdays      | Number of days passed since last contact (N)  | Numeric
poutcome   | Outcome of the previous marketing campaign (B)| Yes or No

The output variable, as mentioned above, is whether the customer subscribed to the term deposit product or not (Yes or No). Of the 40,689 instances in the training dataset, only 4,763 have the output variable (y) as Yes, which translates to a 10.5% response rate. This is an unbalanced dataset, since the variable to be classified has a very skewed distribution (89.5% No cases and 10.5% Yes cases). This is an important factor that needs to be considered during data modeling.

4 Experimental Environment

The software used for data modeling in this project is Weka, an open source data mining tool that can be downloaded for free.

5 Data Modeling

The idea is to experiment with different machine learning algorithms and select the one giving the best results.

5.1 Model evaluation criteria

For model evaluation, I use three complementary approaches. The first is the F measure, a weighted average of precision and recall, given by Formula 1 below. An F measure close to 1 indicates a good model.
F Measure = 2TP / (2TP + FP + FN) ------ Formula 1

where
TP = True Positives (actually yes and classified as yes) - we want a high TP rate
FP = False Positives (actually no but classified as yes)
FN = False Negatives (actually yes but classified as no)

Lift chart

The second measure for evaluating model performance is to create a cumulative lift chart and check the lift value at a given sampling level. Lift is given by Formula 2 below. The lift chart is created by calculating the cumulative positive response (actual responses) for the hot prospects, with the data sorted by predicted response so that all predicted yes cases are on top.

Lift = response rate based on the predictive model / response rate based on random calling ------ Formula 2

A higher lift value indicates better performance.

Validation using the test dataset

The third measure is how well the model performs on the test dataset. This can be measured as the TPR of class yes in the test set, i.e. how many of the actual yes cases were classified as yes by the classifier. The higher, the better.

I experimented with four models. The results are described below.

5.2 Models

Naïve Bayes

I started with the Naïve Bayes model, an old classification method known for its simplicity and stability. I performed a 10-fold cross-validation on the 40,689 instances in the training dataset. The model had a decent accuracy level of 88% but
had a True Positive Rate (TPR) of only 0.53 (F measure of 0.508) for yes cases. This was not a very good result, but I went ahead and checked how the model performs on the test dataset (containing 4,524 cases). It failed miserably: it classified all the test cases as No. This is because the dataset is highly unbalanced and has only about 10% yes cases.

Resolving the case of unbalanced data

To resolve the issue of the unbalanced dataset, I reduced the sample size of the No cases in the training dataset: I randomly selected 4,763 No cases and deleted the rest of the No cases. The training dataset then had an equal number of Yes and No cases and was perfectly balanced.

Naïve Bayes with the modified training set

After modifying the training dataset, I once again ran the Naïve Bayes algorithm, and this time the result was much better. The summary of the stratified cross-validation is shown in Fig. 2 below. Even though the overall accuracy of the model dropped to 78%, the TPR and F measure for Yes cases improved significantly. When validated on the test cases, the model was able to classify 419 of the 526 yes cases, which translates to a TPR of 80%.

Fig. 2
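The undersampling step described above can be sketched in Python. This is a minimal illustration, assuming scikit-learn in place of the paper's Weka setup; the arrays are toy stand-ins for the bank data, and all names are hypothetical:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy stand-in for the bank data: X holds already-encoded numeric
# attributes, y the yes(1)/no(0) labels with ~10% positives.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 100 + [0] * 900)

# Undersample the majority class: keep every yes case and an
# equal-sized random subset of the no cases.
yes_idx = np.flatnonzero(y == 1)
no_idx = rng.choice(np.flatnonzero(y == 0), size=len(yes_idx), replace=False)
keep = np.concatenate([yes_idx, no_idx])

X_bal, y_bal = X[keep], y[keep]
model = GaussianNB().fit(X_bal, y_bal)   # Naive Bayes on the balanced set
print(np.bincount(y_bal))                # equal class counts after balancing
```

The same idea scales directly to the paper's 4,763-per-class split; only the input arrays change.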
Decision Tree (C4.5 algorithm)

Next, I wanted to check how a decision tree handles the unbalanced data, so I ran the J48 method (Weka's implementation of the C4.5 algorithm) on the original training set. The summary of the model is shown in Fig. 3 below. The overall accuracy of the model is 94%, with a decent TPR of 0.627, an F measure of 0.701 and an ROC area of 0.94. When tested on the test dataset, the model performed much better than the Naïve Bayes model and was able to classify 252 of the total 526 yes cases (48%). However, the result was not as satisfactory as Naïve Bayes on the modified training set.

Decision Tree (C4.5 algorithm) with the modified training set

Since the decision tree gave a better result than Naïve Bayes on the original training set, as a final trial I applied the C4.5 algorithm to the modified (balanced) training dataset. The summary result is shown in Fig. 3 below. Of the four models I experimented with, this one gave the best result. Overall accuracy stands at 90%, recall and precision for yes cases are the best of the four models, and the F measure for the yes class is 0.9. When validated on the test dataset, it was able to classify 464 of the 526 yes cases (88%).

Fig. 3
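The decision tree experiment can be sketched in the same way. Note one assumption up front: scikit-learn grows CART trees rather than C4.5/J48 trees, so this mirrors only the workflow (10-fold cross-validation, F measure on the yes class), not Weka's exact algorithm, and the data is again a toy stand-in:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data with roughly the bank dataset's ~10% positive rate
# (about 10% of a standard normal exceeds 1.28).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 1.28).astype(int)

# CART tree, not C4.5/J48, but the evaluation protocol matches:
# 10-fold cross-validation scored with the F measure.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
f1_scores = cross_val_score(tree, X, y, cv=10, scoring="f1")
print(round(f1_scores.mean(), 3))
```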
5.3 Model comparison

Since Naïve Bayes with the unbalanced training dataset didn't produce the desired result, that model has been left out of the comparison. The remaining three models are compared below:

Parameter              | Naïve Bayes (modified training set) | Decision Tree (C4.5) | Decision Tree (modified training set)
Overall accuracy       | 78.15% | 93.7% | 89.9%
Recall (class Yes)     | 0.805  | 0.627 | 0.932
Precision (class Yes)  | 0.769  | 0.794 | 0.875
F measure (class Yes)  | 0.786  | 0.701 | 0.903
ROC area               | 0.851  | 0.931 | 0.939
TPR on test set        | 80%    | 48%   | 88%

Except for the overall accuracy rate (which is in any case not a very good indicator of model performance), the decision tree with the balanced training dataset outperformed the other two models.

Cumulative lift chart

Fig. 4 below shows the cumulative lift chart for all three models. The decision tree with the modified training set gives the best result throughout, except at the lower left of the chart (sample sizes below 379), where the model trained on the unbalanced dataset gave a better response rate.

Calculating the lift value when only 30% of customers are to be called: 30% of the total instances in the test set = 0.3 * 4524 = 1357.

Lift = response rate (calling the top 30% of customers) / response rate (random calling)

Measure             | Random calling | Naïve Bayes (modified set) | Decision Tree | Decision Tree (modified set)
Number of yes cases | 145            | 413                        | 305           | 467
Response rate       | 10.6%          | 30.4%                      | 22.4%         | 34.4%
Lift                | 1              | 2.87                       | 2.11          | 3.24
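The lift figures above can be recomputed directly from the yes-case counts (a small sketch; the counts are taken from the table, and the results agree with the table's lift values up to rounding, since the table divides already-rounded percentages):

```python
# Lift at the 30% calling level, from the counts in the table above.
top_n = 1357          # 30% of the 4,524 test instances
random_hits = 145     # yes cases found by calling 1,357 customers at random

# Yes cases each model finds among its top-ranked 1,357 customers.
found = {"naive_bayes_balanced": 413, "tree": 305, "tree_balanced": 467}

for name, hits in found.items():
    response_rate = hits / top_n
    lift = response_rate / (random_hits / top_n)  # simplifies to hits / random_hits
    print(f"{name}: response {response_rate:.1%}, lift {lift:.2f}")
```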
Fig. 4

6 Clustering

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (clusters). Clustering can be informative and can be used very effectively in direct marketing: it can identify the characteristics of customers who are more likely to subscribe to or buy a product, and companies can leverage this information to customize products for different customer segments.

6.1 K-means clustering

Since the balanced dataset gave better results for the classification exercise, I used the same dataset for clustering as well. I applied the K-means clustering technique in Weka and experimented with different numbers of clusters, obtaining good results with 8 clusters. The result is shown in Fig. 5 below.
Fig. 5

6.2 Interpretation of the clustering analysis

Clusters with y = yes represent groups of customers who subscribed to the term deposit product, and those with y = no are the ones who didn't subscribe. From the result above, we can see that cluster 0, cluster 2, cluster 3 and cluster 4 represent customers who subscribed to the term deposit product offered by the bank. Reading some of the characteristics of the customers in cluster 4: they hold management jobs, have a mean age of 41.7 years, are married, have tertiary education, have an average balance of 1,952 euros, and subscribed to the term deposit product. Cluster 3 and cluster 5 look very similar, both representing married customers with admin jobs and secondary education, but cluster 3 contains customers who subscribed to the term deposit while cluster 5 represents customers who didn't. Looking closely, we see that in cluster 3 the average duration of the campaign call is 435 seconds, as against 295 seconds in cluster 5. This suggests that customers with longer call durations may enquire about the product in detail because they are interested in it. Similarly, pdays (the number of days that passed after the client was last contacted in a previous campaign) for customers in cluster 3 is 47.6, as against 12.7 in cluster 5. The low average pdays is possibly because customers who were not contacted before have a value of -1 in the data. The other clusters can be analyzed for specific characteristics in the same way.
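The clustering workflow can be sketched in Python. This is a toy illustration with scikit-learn's KMeans on a few numeric attributes only; the paper's analysis runs in Weka on the full mixed-type dataset, and the centroid values below are merely loosely inspired by the cluster 3 / cluster 5 contrast (age, balance, duration, pdays), not taken from it:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic customer groups over (age, balance, duration, pdays),
# loosely echoing the contrast between clusters 3 and 5 in the paper.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal([41.7, 1952.0, 435.0, 47.6], [5.0, 300.0, 60.0, 10.0], size=(50, 4)),
    rng.normal([39.0, 1100.0, 295.0, 12.7], [5.0, 300.0, 60.0, 10.0], size=(50, 4)),
])

# Standardize first so large-scale attributes (balance) don't dominate
# the Euclidean distances K-means relies on.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Xs)

# Per-cluster means on the original scale, analogous to the centroid
# table Weka reports.
for c in range(2):
    print(c, X[km.labels_ == c].mean(axis=0).round(1))
```

Standardizing before clustering is a design choice worth noting: without it, the euro-scale balance attribute would swamp the other dimensions.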
7 Conclusion

Using the banking dataset, we clearly saw the usefulness of data mining tools in direct marketing and how data mining techniques can help companies get a better return on investment from their marketing budget. Algorithms like Naïve Bayes and decision trees can classify customers as good or bad prospects (based on their propensity to buy a product), and clustering techniques can help in segmenting customers and identifying the characteristics or attributes of good customers. There are many other techniques, such as association rules and other classifier algorithms, that have not been discussed in this paper. Depending on the business objective, the appropriate tool needs to be selected, and different models should be tried and tested to arrive at the best performing model. That model can then be applied to the direct marketing data for the best results.

8 Citations

[Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS. Dataset: http://hdl.handle.net/1822/14838

http://en.wikipedia.org/wiki/direct_marketing

Bamshad Mobasher. Overview of Web Mining and E-Commerce Data Analytics. DePaul University.