2011 Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD)

Predicting Car Purchase Intent Using Data Mining Approach

Yap Bee Wah (1), Nor Huwaina Ismail (2)
Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, 40450 Shah Alam, Selangor, Malaysia
(1) beewah@tmsk.uitm.edu.my, (2) huwaina_ismail@yahoo.com

Simon Fong (3)
Department of Computer and Information Science, University of Macau, China
(3) CCFong@umac.mo

Abstract
Data mining involves the exploration and analysis of large databases to find patterns and valuable information that can aid in decision making. This paper illustrates the use of a data mining approach to build predictive models for predicting a customer's intent of car purchase after booking a car. Records show that customers who have booked a car have a tendency to cancel their booking. Three data mining predictive models, Logistic Regression (LR), Decision Tree (DT) and Neural Network (NN), were used to model the intent of purchase (IOP). The sample for this study has 1935 cases. The data was partitioned into training (70%) and validation (30%) samples. The performance of the three predictive models was compared on the validation accuracy rate, sensitivity and specificity. Results show that the validation accuracy rates of all three models are quite similar (LR=91.79%, CART=91.17%, NN=91.17%) while LR has the highest sensitivity (LR=87.77%, CART=85.47%, NN=85.89%). Important customer characteristics were also revealed by these models.

Keywords- logistic regression, decision tree, data mining, classification, predictive modeling

I. INTRODUCTION
Data mining is one of the stages in the overall process of Knowledge Discovery in large databases (KDD).
With the emergence of data mining software, data mining is gaining popularity among banks, telecommunication companies, insurance companies, educational institutions and business organizations as a way to gain valuable information from data that can aid in decision-making. Such organizations can use data mining to find undiscovered patterns and relationships in large databases [1-5]. The goal of data mining is to find patterns in historical data that shed light on customer purchase behaviour, needs and preferences. Such information can help organizations improve their business performance and practices, such as target marketing, sales, and customer management. The different stages in the data mining process have been described in [2], [3] and [5]. The kinds of information that can be discovered depend upon the data mining objectives and the techniques employed. Data mining techniques can be grouped into three categories: classification and prediction, cluster analysis, and association analysis. Classification and prediction techniques fall under predictive modeling. Predictive modeling is also known as supervised classification or supervised learning because the prediction model is constructed from data where the target or response variable is known. Linear Discriminant Analysis and logistic regression are two popular statistical methods for constructing predictive models [6]. However, with the emergence of data mining software such as SAS Enterprise Miner and SPSS Clementine, not only the classical methods but also newer predictive modeling and classification techniques such as decision trees, neural networks, support vector machines (SVM), and k-nearest neighbours are available for practical application to real data from various disciplines. Studies in different subject areas have compared the predictive performance of these techniques.
For example, the ability of neural network models was compared with conventional techniques such as discriminant analysis, probit analysis and logistic regression in evaluating credit risk in Egyptian banks [7]. Some of these data mining classification algorithms were compared in predicting breast cancer survival [8], while [9] used an integrated data mining methodology to predict graft survival for heart-lung transplantation patients. Reference [10] investigated the performance of the SVM approach in credit rating prediction in comparison with back propagation neural networks, while [11] reported that, compared with neural networks, genetic programming and decision tree classifiers, the SVM classifier achieved identical classification accuracy with relatively few input variables. The performance of these data mining techniques will continue to be compared in different areas of application.

The objective of this paper is to develop a predictive model to foretell a customer's intent of purchase after booking a car. This study considered and compared the predictive ability of Logistic Regression (LR), decision tree (C5.0, CHAID and CART) and Neural Network (NN) models. This paper is organized as follows. Section II describes the predictive models, reviews the literature on car purchase, and presents the selection of variables and the modeling process. The results are discussed in Section III. Finally, some concluding remarks are given in Section IV.

978-1-61284-181-6/11/$26.00 ©2011 IEEE
II. METHODOLOGY

A. Logistic Regression
Logistic regression is a popular non-linear statistical model that is widely applied in many fields. In contrast to the multiple regression model, the logistic regression model handles a binary or polytomous dependent variable. For a binary dependent variable, the event of interest is coded as 1 and the nonevent as 0. The logistic regression model is written as:

\[
\log\left[\frac{P(Y=1)}{1-P(Y=1)}\right] = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k \tag{1}
\]

Equation (1) can be solved to obtain

\[
P(Y=1) = \frac{1}{1+e^{-z}} \tag{2}
\]

where

\[
z = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k
\]

The logistic regression model enables us to calculate the probability of the event Y=1 occurring for each case. The predictors X_1, ..., X_k can be a mixture of continuous and categorical variables.

B. Decision Tree
A decision tree model consists of a set of rules for dividing a large collection of observations into smaller homogeneous groups with respect to a particular target variable. The target variable is usually categorical, and the decision tree model is used either to calculate the probability that a given record belongs to each of the target categories, or to classify the record by assigning it to the most likely category. Decision trees can also be used for a continuous target variable, although multiple linear regression models are more suitable for such a variable. Given a target variable and a set of explanatory variables, decision tree algorithms automatically determine which variables are most important, and subsequently sort the observations into the correct output category [12]. The common decision tree algorithms in data mining software are CHAID (Chi-Square Automatic Interaction Detector), CART (Classification and Regression Tree) and C5. The CART algorithm uses the Gini index as the splitting criterion for a categorical dependent variable while C5 uses entropy. Meanwhile, CHAID uses the chi-square test as the splitting criterion.
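The Gini and entropy criteria mentioned above can be illustrated with a small sketch. This is not the Clementine implementation of CART or C5; the function names and the toy labels are assumptions made for illustration, and the chi-square test used by CHAID is not covered. A candidate split is scored by the size-weighted impurity of the groups it produces, and a lower value means more homogeneous groups.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (base 2) of the class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def weighted_impurity(groups, impurity):
    """Size-weighted average impurity of the groups produced by a split."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * impurity(group) for group in groups)


# Hypothetical toy node: 4 cancellations and 4 purchases, split into two groups.
parent = ["cancel"] * 4 + ["purchase"] * 4
left = ["cancel"] * 3 + ["purchase"] * 1
right = ["cancel"] * 1 + ["purchase"] * 3

gini_before = gini(parent)
gini_after = weighted_impurity([left, right], gini)
entropy_before = entropy(parent)
entropy_after = weighted_impurity([left, right], entropy)
```

In this toy split both measures decrease after splitting, which is the sense in which a tree algorithm would prefer the split over leaving the node intact.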
These algorithms produce a tree-like structure diagram and decision rules from which important information can be extracted.

C. Artificial Neural Networks
Artificial Neural Networks (ANNs) are seen as an attractive alternative to traditional statistical methods. They are modeled after the human brain, which can be perceived as a highly connected network of neurons (called nodes in neural network terminology). Each node in a layer receives inputs from at least one node in the previous layer, combines the inputs, and generates an output to at least one node in the next layer. Generally, the independent variables comprise the input layer and the dependent variable comprises the output layer. Between the input and output layers, one or more hidden layers of nodes may exist. The multilayer perceptron (MLP) is the most widely used neural network model in data analysis. ANNs can identify and learn correlated patterns between input data sets and corresponding target values. However, ANNs have been criticized for their black box approach and interpretative difficulties. Nevertheless, they provide an alternative model to be compared with other classification techniques. After training, ANNs can be used to predict the outcome for new independent input data ([1], [4], [13], [14]).

D. Literature on Car Purchase
In building a predictive model, historical data on customers who previously purchased a car or cancelled a car booking are required. Reference [15] conducted a study on one thousand recent buyers of a new car. Among those, seventeen percent considered only the brand of their previous car before purchasing another car.
The factors that influence the consideration of a single brand are satisfaction with the previous car and dealer, socio-demographic variables (being old, with a lower education and lower income), low perceived risk, and a number of product-specific elements (owning only one car, not owning a foreign car, staying in the same product segment, having driven only 30,000 kilometers with the previous car and having owned ten cars in the past). To predict purchase behavior from stated intentions, [16] proposed a unified model and applied it to a survey of 2000 randomly selected households. For the automobile data, the purchase intention is defined as 1 if the consumer intends to purchase (or actually purchases) an automobile within 12 months, and as 0 if the consumer does not intend to purchase (or does not actually purchase) an automobile within 12 months. They considered variables such as occupation and education level of the household head, type of residence, income, number of cars and age of the cars currently in the household. According to [17], current owners of cars are more likely to repurchase the brands they currently own when they are asked intent questions. In addition, the purchase behavior of current car owners is more consistent with their brand attitudes. First-time car buyers, on the other hand, are more likely to purchase brands that have large market shares. Reference [18] presented a model which produces simultaneous forecasts of car holding, new car purchase and scrappage. All are sensitive to changes in income and in prices or car costs. The basic theoretical foundation of their model is the assumption that a potential car holder is a person between 18 and 75 years old. A car holder here means a person holding a registered car. Car holding is largely determined by income, people's expectations and car cost components. The evolution of car holding is sensitive to economic circumstances.
Nevertheless, new car purchase is much more sensitive to economic circumstances than car holding is. Affordability, rather than attitudes and purchase intentions, is also an important predictor of purchase. That is why income is an important variable in economics and is examined extensively. Total family income (TFI) is used to segment markets, profile consumers, and provide explanations for changes in purchasing patterns [19]. Reference [20] examined the impact of gender on
car buyer satisfaction and found that the attitudes of male and female consumers toward car purchasing showed a clear difference. The price of a car is important for both male and female buyers, but for different reasons. For male buyers, paying a higher price for a car means that they can have higher expectations and impress others more, while for female buyers a higher price is more important in assuring them that their car will perform as it should. Women are becoming an increasing force in the car buyer market. Their pattern of car buying differs from that of men. Women tend to buy lower-priced cars, and are strongest in the compact and subcompact segments. Hence, many car companies aim some of their advertising specifically at women. In a forecasting model of car ownership in Sweden, income is reported as an important predictor of car ownership. Income is growing faster among women than men, with 2 per cent growth in income for women and constant income for men. Male car ownership is forecast to grow by only 3 per cent to the year 2010, while female car ownership is forecast to grow by 70 per cent. Thus, female car ownership is suggested to be the strategic factor for the future development of motorization [21]. In a study on households' intention to replace the old car, the replacement intention has a positive relationship with the quality of the new car and a negative relationship with the perception of the old car. In other words, a household's intent to replace its old car is based on the total number of miles driven, the age of the car and the anticipated number of repairs [22].

E. Selection of Variables
Based on the literature review and the availability of data from the car dealer company, the variables in the dataset are described in Table 1.

TABLE I.
DESCRIPTION OF VARIABLES

Name                      Role    Type         Description
Intent of Purchase (IOP)  Target  Binary       Credit card application (sic). 0: Purchase, 1: Cancel
age                       Input   Continuous   Age in years
Income group              Input   Categorical  Monthly income. 0: <2000, 1: 2000-4000, 2: 4000-6000, 3: 6000-8000, 4: 8000-10000, 5: >10000
Car status                Input   Categorical  Status of this car. 1: Additional car, 2: Replacement car, 3: First car
gender                    Input   Binary       Applicant is 1: Male, 2: Female
LOU                       Input   Categorical  House (1: No, 2: Yes)
Car_Price                 Input   Categorical  Price of car (RM). 1: 40000-60000, 2: 60000-80000, 3: 80000-100,000, 4: >100,000
Book_fee                  Input   Categorical  Booking fee (RM). 1: <200, 2: 200-300, 3: 300-500, 4: 500-1000
Down_pay                  Input   Categorical  Down payment (RM). 1: 0, 2: 500-25000, 3: 25000-50000, 4: 50000-75000, 5: 75000-100,000, 6: >100,000
Loan_amt                  Input   Binary       Loan amount (RM). 1: 15000-50000, 2: 50000-100,000, 3: >100,000, 4: 0
Model type                Input   Categorical  Twelve model types

F. Modeling using Clementine
The sample data was first partitioned into a training sample (70%) and a validation sample (30%). The training sample is used to build the models, while the validation sample is used to validate them. Fig. 1 depicts the data modeling process using SPSS Clementine.

Fig. 1. Data Mining Process Flow Diagram

The pentagon-shaped nodes show the construction of the models using logistic regression, decision trees (CART) and neural network. The diamond-shaped nodes show the outputs of the respective models. For the logistic regression model, four selection methods (ENTER, STEPWISE, FORWARDS, BACKWARDS) were compared using the Analysis and Evaluation nodes. For decision trees, the C5.0, CHAID and CART models were generated and compared.
Then, the three predictive models (stepwise logistic regression, CART and neural network) are connected to the Analysis node, which computes the accuracy rates, while the Evaluation node produces the lift charts.
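The 70/30 partition described in this section can be sketched outside Clementine as a seeded random split. This is a minimal illustration under stated assumptions: the function name `partition`, the seed, and the use of integer identifiers in place of the customer records are all hypothetical, and the Clementine partition step may assign cases differently, so the exact sample sizes here need not match those of the study.

```python
import random


def partition(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it into training and validation parts.

    The seed is an arbitrary illustrative choice that makes the split repeatable.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Placeholder identifiers standing in for the 1935 cases of the study.
cases = list(range(1935))
train, valid = partition(cases)
```

Every case ends up in exactly one of the two samples, so models fitted on `train` are evaluated on records they have not seen.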
III. RESULTS
In this section the results of the predictive models are presented.

A. Logistic Regression Results
For the Enter method, all variables are significant predictors except for gender. Meanwhile, the Forward, Backward and Stepwise methods selected the same significant predictors. Table 3 summarizes the logistic regression results using the Enter and Stepwise selection methods. Based on the results in Table 2, the Enter and Stepwise models achieved the same validation accuracy rate (91.79%). However, the Stepwise model has a higher validation sensitivity (87.77%).

TABLE II. ACCURACY RATE

Model     Sample      Accuracy rate  Sensitivity  Specificity
Enter     Training    0.9208         0.8985       0.9327
Enter     Validation  0.9179         0.8734       0.9466
Stepwise  Training    0.9177         0.8962       0.9292
Stepwise  Validation  0.9179         0.8777       0.9438

The results in Table 3 show that those without an LOU, those with low income (< RM2000) and those with a low booking fee are more likely to cancel their booking. Cancellation is also more likely for those who are purchasing a first car. Further crosstabulation results revealed that cancellation was higher for models 9 and 4. Meanwhile, model 12 has the lowest cancellation rate.

TABLE III. STEPWISE LOGISTIC REGRESSION RESULTS

Variable            B (Enter)    B (Stepwise)
Constant            -4.246**     -5.262**
Age                 -.046**      -.046**
Gender = F          .12
LOU = N             7.162**      7.189**
Booking_Fee = 1     -3.79**      -3.783**
Booking_Fee = 2     -3.618**     -3.595**
Booking_Fee = 3     -.257        -.208
Car_Price = 1       -.41
Car_Price = 2       -.79
Car_Price = 3       -1.057
Income_Group = 1    3.377*       3.449*
Income_Group = 2    3.384**      3.444**
Income_Group = 3    2.293**      2.343**
Income_Group = 4    1.738**      1.772**
Income_Group = 5    1.735**      1.791**
Model_Type = 1      2.938**      3.043**
Model_Type = 2      3.402*       4.332**
Model_Type = 3      -.082        .479
Model_Type = 4      5.747**      5.959**
Model_Type = 5      4.32**       4.929**
Model_Type = 6      .477         .626
Model_Type = 7      2.862**      2.748**
Model_Type = 8      4.738**      5.691**
Model_Type = 9      4.161**      4.329**
Model_Type = 10     5.324**      6.283**
Model_Type = 11     2.373        2.937**
Car_Status = 1      -.134        -.133
Car_Status = 2      -2.72**      -2.729**
Chi-Square          1185.134**   1182.89**
-2LL                485.222      487.466
Nagelkerke R-Sq     0.824        0.823

B. Decision Tree Model Results
A decision tree is easy to understand and can be easily converted to a set of rules. Moreover, decision trees can classify both categorical and numerical data and require no a priori assumptions about the data. Because of these advantages, the decision tree approach is extensively used for both classification and prediction purposes. The CART model finds four variables to be influential on the intent of purchase (LOU, booking fee, model type and car status). The decision tree rules are listed in Table 4, while Fig. 2 shows the CART model.

TABLE IV. CART RULES

CANCEL
  replacement car. Car Model: 2, 4, 6, 8, 9, 10 or 11.
  replacement car. Car Model: 1, 3, 5, 7 or 12. Booking fees are RM200-RM300, RM300-RM500 or RM500-RM1000.
  first car or additional car. Income groups are <RM2000, RM2000-RM4000 or RM4000-RM6000.

PURCHASE
  Customers have letter of undertaking (LOU).
  replacement car. Car Model: 1, 3, 5, 7 or 12. Booking fee is <RM200.
  first car or additional car. Income groups are RM6000-RM8000, RM8000-RM10,000 or >RM10,000.
  Ages of customers are more than 43 years old. Car Model: 1, 5, 11 or 12.

Table 5 displays the sensitivity, specificity and classification accuracy for each decision tree model. The sensitivity is the true positive rate (the percentage of customers who cancelled their booking that are predicted correctly) while the specificity is the true negative rate (the percentage of customers who purchased that are predicted correctly). The performances of all three models are quite similar. CART produces simple rules and hence was chosen to be compared with the LR and NN models.
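The three rates defined above can be computed directly from actual and predicted class labels. The following is a minimal sketch rather than the Clementine Analysis node: the function name and the toy label vectors are assumptions for illustration, with 1 standing for a cancelled booking (the positive class) and 0 for a purchase, matching the coding of the target variable.

```python
def classification_rates(actual, predicted, positive=1):
    """Return (accuracy, sensitivity, specificity) for binary labels.

    Sensitivity is the share of actual positives predicted as positive;
    specificity is the share of actual negatives predicted as negative.
    """
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity


# Hypothetical toy example: 4 cancellations (1) and 6 purchases (0).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
accuracy, sensitivity, specificity = classification_rates(actual, predicted)
```

In the toy data three of the four cancellations and five of the six purchases are classified correctly, so a model can have a high accuracy while its sensitivity and specificity differ, which is why all three rates are reported.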
TABLE V. ACCURACY RATE, SENSITIVITY AND SPECIFICITY

Model  Sample      Accuracy rate  Sensitivity  Specificity
C5.0   Training    0.9273         0.9058       0.9389
C5.0   Validation  0.9017         0.8632       0.9262
CHAID  Training    0.9056         0.8972       0.9101
CHAID  Validation  0.9000         0.8846       0.9098
CART   Training    0.9139         0.8865       0.9286
CART   Validation  0.9117         0.8547       0.9481

C. Neural Network Model
The neural network (NN) model has 34 neurons in the input layer, 3 neurons in the hidden layer and 2 neurons in the output layer. Table 6 shows the importance of the input variables in descending order. The top five most important input variables are: letter of undertaking, income group, model type, car status and car price. The estimated accuracy rate of the neural network model is 90.79%. This is based on the correct classification rate in the training sample.

TABLE VI. RELATIVE IMPORTANCE OF INPUT VARIABLES

Input variable         Importance value
Letter of Undertaking  0.531
Income Group           0.139
Model Type             0.0760
Car Status             0.075
Car Price              0.07
Booking Fee            0.063
Age                    0.035
Gender                 0.009

D. Model Comparisons
The LR, CART and NN models were compared to determine the best model. The accuracy rates for the training and validation samples are given in Table 7. The predictive accuracy of all three models is quite comparable, with the logistic regression model having a slightly higher sensitivity.

TABLE VII. ACCURACY RATE

Model                Sample      Accuracy rate  Sensitivity  Specificity
Logistic Regression  Training    0.9177         0.8962       0.9292
Logistic Regression  Validation  0.9179         0.8777       0.9438
CART                 Training    0.9139         0.8865       0.9286
CART                 Validation  0.9117         0.8547       0.9481
Neural Network       Training    0.9079         0.8737       0.9263
Neural Network       Validation  0.9117         0.8589       0.9454

IV. CONCLUSION
There has been a rapid growth of data mining in business applications and in social and medical research. Logistic regression is the most popular statistical model to predict the probability of an event happening.
With the emergence of data mining, nontraditional statistical methods such as neural networks, support vector machines and decision trees are gaining popularity in the search for a good predictive model. Data mining usually involves modeling large volumes of data, and the focus is on the practical importance of the information or knowledge gained from the models. This study illustrated the construction and evaluation of three predictive models (logistic regression, decision tree and neural network) to predict the intent of purchase of a car. Results revealed that no model outperformed the others, but important characteristics of customers were obtained from the logistic regression and CART models. Work is in progress to cover other classification techniques such as SVM and Bayesian classification. The performance of predictive models depends on the data structure, data quality and variable selection. With the availability of data mining software, data mining models are easy to construct and apply in business. However, a successful data mining project requires the involvement of experts in data mining, subject area experts and people in the business organization.

REFERENCES
[1] M. J. A. Berry and G. S. Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support. New York: John Wiley & Sons, 2004.
[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2006.
[3] A. Feelders, H. Daniels and M. Holsheimer, Methodological and practical aspects of data mining. Information & Management, 271-281, 2000.
[4] P. Giudici, Applied Data Mining for Business and Industry. John Wiley & Sons, 2003.
[5] K. J. Cios and L. A. Kurgan, Trends in data mining and knowledge discovery. Advanced Information and Knowledge Processing, 1-26, 2005.
[6] D. J. Hand and W. E. Henley, Statistical classification methods in consumer credit scoring: A review.
Journal of the Royal Statistical Society: Series A (Statistics in Society), 160(3), 523-541, 1997.
[7] H. Abdou, J. Pointon and A. El-Masry, Neural nets versus conventional techniques in credit scoring in Egyptian banking. Expert Systems with Applications, 35, 1275-1292, 2008.
[8] A. Endo, T. Shibata and H. Tanaka, Comparisons of seven algorithms to predict breast cancer survival. Biomedical Soft Computing and Human Sciences, 13(2), 11-16, 2008.
[9] A. Oztekin, D. Delen and Z. Kong, Predicting the graft survival for heart-lung transplantation patients: An integrated data mining methodology. International Journal of Medical Informatics, 78(12), e84-e96, 2009.
[10] Z. Huang, H. Chen, C-J. Hsu, W-H. Chen and S. Wu, Credit rating analysis with support vector machines and neural networks: a market comparative study. Decision Support Systems, 37, 543-558, 2004.
[11] C-L. Huang, M-C. Chen and C-J. Wang, Credit scoring with a data mining approach based on support vector machines. Expert Systems with Applications, 37, 847-856, 2007.
[12] D. Olson and Y. Shi, Introduction to Business Data Mining. McGraw Hill International Edition, 2006.
[13] J. D. Olden and D. A. Jackson, Illuminating the "black box": a randomization approach for understanding variable contributions in artificial neural networks. Ecological Modelling, 154, 135-150, 2002.
[14] H. C. Koh and C. K. Low, Going concern prediction using data mining techniques. Managerial Auditing Journal, 19(3), 462-476, 2004.
[15] E. Lapersonne, G. Laurent and J-J. Le Goff, Consideration sets of size one: An empirical investigation of automobile purchases. International Journal of Research in Marketing, 12, 55-66, 1995.
[16] B. Sun and V. G. Morwitz, Stated intentions and purchase behavior: A unified model. International Journal of Research in Marketing, 27(4), 356-366, 2010.
[17] G. J. Fitzsimons and V. G. Morwitz, The effect of measuring intent on brand-level purchase behavior. Journal of Consumer Research, 23, 1-11, 1996.
[18] F. Jorgensen and T. Wentzel-Larsen, Forecasting car holding, scrappage and new car purchase in Norway. Journal of Transport Economics and Policy, 24(2), 139-156, 1990.
[19] A. S. Notani, Perceptions of affordability: Their role in predicting purchase intent and purchase. Journal of Economic Psychology, 18, 525-54, 1995.
[20] L. Moutinho, F. Davies and B. Curry, The impact of gender on car buyer satisfaction and loyalty. Journal of Retailing and Consumer Services, 3(3), 135-144, 1996.
[21] J. O. Jansson, Car demand modeling and forecasting: a new approach. Journal of Transport Economics and Policy, 23(2), 125-140, 1989.
[22] A. Marell, P. Davidsson, T. Garling and T. Laitila, Direct and indirect effects on households' intentions to replace the old car. Journal of Retailing and Consumer Services, 11, 1-8, 2004.