European Journal of Operational Research

European Journal of Operational Research 214 (2011)

Decision Support

A probability-mapping algorithm for calibrating the posterior probabilities: A direct marketing application

Kristof Coussement a,*, Wouter Buckinx b

a IESEG School of Management, Catholic University of Lille (LEM, UMR CNRS 8179), Department of Marketing, 3 Rue de la Digue, F Lille, France
b Python Predictions, Avenue R. Van den Driessche 9, B-1150 Brussels, Belgium

Article history: Received 3 October 2010; Accepted 16 May 2011; Available online 23 May 2011

Keywords: Data mining; Decision support systems; Direct marketing; Response modeling; Calibration

Abstract: Calibration refers to the adjustment of the posterior probabilities output by a classification algorithm towards the true prior probability distribution of the target classes. This adjustment is necessary to account for the difference in prior distributions between the training set and the test set. This article proposes a new calibration method, called the probability-mapping approach. Two types of mapping are proposed: linear and non-linear probability mapping. These new calibration techniques are applied to 9 real-life direct marketing datasets. The newly-proposed techniques are compared with the original, non-calibrated posterior probabilities and with the adjusted posterior probabilities obtained using the rescaling algorithm of Saerens et al. (2002). The results indicate that marketing researchers should calibrate the posterior probabilities obtained from the classifier. Moreover, it is shown that a simple rescaling algorithm is not a sufficient solution, because the results suggest applying the newly-proposed non-linear probability-mapping approach for the best calibration performance.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction
Due to recent developments in IT infrastructure and the ever-increasing trust placed in complex computer systems, analysts are showing an increasing interest in classification modeling in a variety of disciplines such as credit scoring (Martens et al., 2010; Paleologo et al., 2010), medicine (Conforti and Guido, 2010), text classification (Bosio and Righini, 2007), SMEs fund management (Kim and Sohn, 2010), revenue management (Morales and Wang, 2010), and so on. The same interests are shared by the direct marketing community. Direct marketing analysts have an increasing interest in building prediction models that assign a probability of response to each and every individual customer in the database (Lamb et al., 1994). The task of classification is made even more interesting by the fact that current marketing environments store incredible amounts of customer information at a very low cost, including socio-demographics, transactional buying behavior, attitudinal data, etc. (Naik et al., 2000), while at the same time there has been a tremendous increase in academic interest in direct marketing applications (e.g. Allenby et al., 1999; Baumgartner and Hruschka, 2005; Hruschka, 2010; Lee et al., 2010; Piersma and Jonker, 2004). Response models are therefore defined as classification models that attempt to discriminate between responders and non-responders for a certain company mailing.

* Corresponding author. E-mail addresses: [email protected] (K. Coussement), [email protected] (W. Buckinx).

In the past, purely statistical methods like logistic regression, discriminant analysis and naive Bayes models have been proposed to discriminate between responders and non-responders in a direct marketing context (Baesens et al., 2002; Bult, 1993; Deichmann et al., 2002).
Although these techniques may be very effective, they make a stringent assumption about the underlying relationship between the independent variables and the dependent or response variable. In response to this, more advanced data mining algorithms like decision tree-generating techniques, artificial neural networks and support vector machines have been applied (Baesens et al., 2002; Bose and Chen, 2009; Crone et al., 2006; Haughton and Oulabi, 1997; Zahavi and Levin, 1997). All these binary classification models are used for two reasons. First, researchers rely on them to obtain robust parameter estimates of the independent variables by modeling the probability of response as a function of the independent variables. Second, these models are used to obtain consistent predicted probabilities of response, which are then used (i) to rank the customers based on their responsiveness to the campaign, (ii) to optimize the overall campaign strategy by offering the customer the product with the highest response probability over the different response models, and (iii) for the discrimination task of the response event itself, where one classifies customers into responders and non-responders. For (ii) and (iii), the absolute size of the posterior response probabilities is crucial. This study focuses on the process of obtaining correct response probabilities, where calibrating the posterior probabilities could have a positive impact on the optimization of the overall campaign strategy and the efficiency of the discrimination task.

In practice, a classification model is built on a training set, i.e. a set of customers for whom both the independent variables and the dependent variable are present. In order to correctly measure the discrimination power of the trained classifier, the classification model is applied to a group of customers who have not been used for training, called the scoring or test set. The purpose is to obtain robust and consistent predictions for the response probability of these unseen customers. As one wants to divide the customers into responders and non-responders, a judicious classification based on the posterior response probabilities of the customers is needed. In other words, customers with a response probability exceeding a certain threshold will be classified as responders and vice versa. However, it often happens that a classifier is trained on a dataset that does not reflect the true prior probabilities of the target classes in the real-life population. This may have serious negative consequences for the discrimination performance because the posterior probabilities do not reflect the true probability of response. This phenomenon occurs in a direct marketing context as well, where the prior probabilities of the training set and the (out-of-sample) test set differ significantly. More specifically, the training set consists of customers who were preselected by an earlier response model as customers with a high response probability, while the test set does not impose any restrictions on the customer profiles in the database. In such a case, a large discrepancy exists between the response distributions of the training set and the test set. The incidence, which is the percentage of responders in a data set, is much higher in the training set than in the out-of-sample test set.
This inconsistency has a negative effect on the discrimination performance on the test set, especially because the classifier's decision to classify customers into responders or non-responders is based on setting a threshold on the raw posterior probabilities of class membership. For instance, when a classifier is trained on a dataset with a higher incidence than the one in the test set, the posterior probabilities on the test set are inflated. Thus, making a classification decision based on the absolute value of the posterior probabilities may significantly harm the discrimination performance. Moreover, optimizing the campaign strategy by offering the product with the highest response probability to the customer becomes useless because the response probabilities for different products for a particular customer are not comparable. This paper focuses on how researchers can adjust the posterior probabilities based on the true prior distribution of the response variable. This process of adjustment is called calibration. This paper proposes a new methodology to calibrate the posterior probabilities of the test set with the real-world situation, a process called probability-mapping. It maps the posterior response probabilities obtained from the classifier onto the prior distribution of real response. The new probability-mapping approaches, using generalized linear models and non-parametric generalized additive models, are compared with the original, non-calibrated posterior probabilities and with the probabilities calibrated using the rescaling methodology of Saerens et al. (2002). This paper is structured as follows: Section 2 describes the methodological framework, while Section 3 explores the different calibration approaches (rescaling approaches and probability-mapping approaches). Section 4 explains the characteristics of the empirical validation, while Section 5 explores the results.
Section 6 gives managerial recommendations, and finally Section 7 concludes this paper.

2. Methodological framework

Fig. 1 shows the methodological framework for the different calibration methods applied in this study. Define a training set TRAIN_M = {(x_i, y_i)}_{i=1..m} consisting of m customers. Each customer (x_i, y_i) is a combination of an input vector x_i representing the independent variables and a dependent variable y_i ∈ {0, 1} corresponding to whether or not the customer responded to a certain mailing. TRAIN_M consists of all customers who were selected by a previous response model, thus received a direct mailing to buy the product, and were therefore flagged as customers with a high response probability. During the training phase, a classifier C maps the input vector space onto the binary response variable using the training set observations. The trained classifier C is then applied to the test set TEST_N = {x_i}_{i=1..n} consisting of n customers, and for every customer in TEST_N a response probability P_org is obtained. The purpose of this paper is to adjust the posterior probabilities P_org to the real response distribution because the training sample TRAIN_M is not representative of TEST_N, which corresponds to the true population. Therefore, for every observation x_i in TEST_N, the real response is collected and summarized in REAL_N = {y_i}_{i=1..n} with y_i ∈ {0, 1} corresponding to whether or not the customer spontaneously bought that particular product in a time window without direct mailing actions. The real response represents a response of pure interest in the product. In other words, REAL_N is used to represent the true prior probabilities. The purpose of the calibration phase is to adjust P_org, the non-calibrated posterior probabilities of TEST_N, so that they truly represent the probability of response. With the aim of methodologically benchmarking the different calibration methods, a k-fold cross-validation is applied.
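Such a cross-validated split can be sketched in a few lines of Python. This is an illustrative sketch with our own function and parameter names (not the authors' code); scikit-learn's KFold offers the same functionality.

```python
import numpy as np

def kfold_indices(n, k=10, seed=42):
    """Randomly split n observations into k roughly equal folds.

    Returns a list of (train_idx, test_idx) pairs: each fold in turn is
    scored while the remaining k - 1 folds train the calibration model.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                 # random shuffle of 0..n-1
    folds = np.array_split(idx, k)           # k roughly equal parts
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
            for i in range(k)]
```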
In a k-fold cross-validation, the dataset is randomly split into k equal parts, each of which is used in turn during the scoring phase, while the other k − 1 parts are used for training the calibration model. Note that TEST_kN (REAL_kN) represents the kth fold of TEST_N (REAL_N), while P_k,org represents the non-calibrated posterior probabilities of TEST_kN.

3. Calibration approaches

Two types of calibration methods are applied: (i) the rescaling algorithm of Saerens et al. (2002) and (ii) the newly-proposed probability-mapping approaches. The former rescales P_k,org, the posterior probabilities of TEST_kN, taking into account the real incidence of REAL_kN (Saerens et al., 2002), while the latter adjusts the posterior probabilities of TEST_kN by mapping them onto the real responses of REAL_kN.

3.1. Rescaling algorithm (SAERENS)

This section explains the methodology of Saerens et al. (2002). The starting point of this calibration approach is Bayes' rule: the posterior probabilities of response depend in a non-linear way on the prior probability distribution of the target classes. The prior probability distribution of the target class is defined as the incidence of the target class, or in this setting the percentage of responders in the dataset. Therefore, a change in the prior probability distribution of the target classes changes the posterior response probabilities of the classification model. Saerens et al. (2002) describe a process that adjusts the posterior probabilities of response output by the classifier to the new prior probability distribution of the target classes using a predefined rescaling formula. In detail, the calibrated posterior probabilities of response for the customers in the test set of fold k are obtained by weighting the non-calibrated posterior probabilities, P_k,org, by the ratio of the response incidence of REAL_kN, i.e.
the new prior probability distribution, to the response incidence in the training set, i.e. the old prior probability distribution. The denominator is a scaling factor to make sure that the calibrated posterior probabilities sum up to one.
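The ratio-weighting just described can be sketched in Python as follows. This is an illustrative implementation of the rescaling step under our own naming, not the authors' code.

```python
import numpy as np

def saerens_rescale(p_org, prior_new, prior_old):
    """Rescale posteriors of the positive class towards new priors.

    p_org: non-calibrated posterior probabilities from the classifier.
    prior_new / prior_old: new and old prior probabilities of response,
    i.e. the incidences of REAL_kN and the training set respectively.
    The denominator renormalizes so the two class probabilities sum to one.
    """
    p_org = np.asarray(p_org, dtype=float)
    w1 = prior_new / prior_old                        # weight for class c1 (responders)
    w0 = (1.0 - prior_new) / (1.0 - prior_old)        # weight for class c0
    num = w1 * p_org
    return num / (w0 * (1.0 - p_org) + num)
```

Note that when the old and new priors coincide, the weights are both 1 and the posteriors are returned unchanged.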

[Fig. 1. Methodological framework.]

In summary,

P_k,new = { [P_k(c1)/P_kt(c1)] · P_k,org } / { [P_k(c0)/P_kt(c0)] · (1 − P_k,org) + [P_k(c1)/P_kt(c1)] · P_k,org }   (1)

with P_k,new representing the calibrated posterior response probabilities in fold k, and P_k(c_i) and P_kt(c_i) the new and old prior probabilities for class i, with i ∈ {0, 1}. A data set NEW_kN is obtained which contains P_k,new, the calibrated posterior probabilities for the test data of TEST_kN.

3.2. Probability-mapping approaches

The purpose of the probability-mapping approaches is to map P_k,org, the old posterior probabilities of TEST_kN, onto the real response probabilities of REAL_kN. As such, one is able to build a classification model that maps the non-calibrated probabilities onto the real response probabilities. This model is then used to calibrate the old probabilities into corrected probabilities of response. However, the real probability distribution of the target classes is not directly available from REAL_kN, which only contains the real responses y_i ∈ {0, 1} on an individual customer level. In order to convert these individual real responses in REAL_kN into a real response probability distribution, a number of bins b are constructed. The incidence of response is calculated per bin and equals the percentage of real response; this incidence is used as an approximation of the real probability of response per bin. In practice, both TEST_kN and REAL_kN are split into b bins using the equal-frequency binning approach based on the posterior probabilities of TEST_kN. TEST_kb (REAL_kb) represents the bth bin in the kth fold of TEST_kN (REAL_kN respectively).
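The equal-frequency binning step can be sketched as follows (illustrative Python with our own function names, not the authors' code): customers are sorted on their non-calibrated posterior, cut into equally sized bins, and per bin the average posterior and the real response incidence are computed.

```python
import numpy as np

def bin_probabilities(p_org, y_real, n_bins=200):
    """Equal-frequency binning of test-set posteriors.

    Returns, per bin, the average non-calibrated posterior (P_kb,org)
    and the real response incidence (P_kb,real); the latter serves as a
    proxy for the true probability of response in that bin.
    """
    p_org = np.asarray(p_org, dtype=float)
    y_real = np.asarray(y_real, dtype=float)
    order = np.argsort(p_org)                        # sort customers on posterior
    splits_p = np.array_split(p_org[order], n_bins)  # equal-frequency bins
    splits_y = np.array_split(y_real[order], n_bins)
    p_kb_org = np.array([s.mean() for s in splits_p])
    p_kb_real = np.array([s.mean() for s in splits_y])
    return p_kb_org, p_kb_real
```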
TEST_kb and REAL_kb logically contain identical observations, while P_kb,org is the average non-calibrated posterior probability of the bth bin in TEST_kN and P_kb,real is the percentage of real responders in the bth bin of REAL_kN. P_kb,real serves as a proxy for the true prior probability. In order to formalize the relationship between the average posterior probabilities of TEST_kN and the approximate real probabilities obtained from REAL_kN, a formal mapping is obtained using the binned training set of fold k by

P_kb,real = f_k(P_kb,org)   (2)

with f_k being the classifier that maps the non-calibrated posterior probabilities onto the real probabilities in fold k. After the classifier f_k is built, it is applied to the unseen test data of TEST_kN to obtain the new posterior probabilities, P_k,new, for every individual in the test data set of the kth fold. A new data set NEW_kN is obtained which contains P_k,new, the calibrated posterior probabilities. There are several possibilities for f_k, a function that links the estimated, non-calibrated probabilities of TEST_kb to the approximated real probabilities of REAL_kb. This study uses one linear probability-mapping approach based on generalized linear models (Section 3.2.1) and three non-linear approaches: one based on generalized linear models with log-transformed non-calibrated probabilities (Section 3.2.2) and two approaches based on generalized additive models (Sections 3.2.3 and 3.2.4).

3.2.1. Generalized linear model (GLM)

Given y_i as the dependent variable with y_i ∈ [0, 1] representing P_kb,real, the averaged true prior probabilities from REAL_kb, and x_i equal to P_kb,org, the averaged posterior probabilities of TEST_kb, a generalized linear model with logit link function is employed to model f_k(x_i) ∈ [0, 1].
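A simplified sketch of such a mapping is given below. Note the hedges: instead of the full maximum-likelihood GLM fit, the intercept and slope are estimated here by least squares in the log-odds space, which is only an approximation of the paper's approach; the `log_transform` flag and the `eps` clipping guard are our own additions (the flag gives the log-transformed variant discussed next).

```python
import numpy as np

def inv_logit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_probability_map(p_kb_org, p_kb_real, log_transform=False, eps=1e-6):
    """Fit a linear (or log-transformed) probability mapping on the bins.

    Fits logit(P_kb,real) ~ alpha_k + beta_k * x by least squares, with
    x = P_kb,org, or x = log(P_kb,org) when log_transform is True.
    Returns a function mapping raw posteriors to calibrated ones.
    """
    x = np.asarray(p_kb_org, dtype=float)
    if log_transform:
        x = np.log(x)
    y_clipped = np.clip(p_kb_real, eps, 1.0 - eps)   # keep logit finite
    y = np.log(y_clipped / (1.0 - y_clipped))
    beta, alpha = np.polyfit(x, y, 1)                # slope, intercept
    def f_k(p):
        z = np.log(p) if log_transform else np.asarray(p, dtype=float)
        return inv_logit(alpha + beta * z)
    return f_k
```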
Moreover, it assumes that the relationship between P_kb,org and P_kb,real is linear in the log-odds via

logit{y_i} = log[ y_i / (1 − y_i) ] = α_k + β_ki · x_i   (3)

or

y_i ≈ f_k(x_i) = logit⁻¹(α_k + β_ki · x_i)   (4)

with α_k as the intercept and β_ki · x_i as the predictor. The parameters α_k and β_ki are estimated using maximum likelihood (Tabachnick and Fidell, 1996).

3.2.2. Generalized linear model with log transformation (LOG)

Another approach is to log-transform x_i in Eqs. (3) and (4), because as such one captures the non-linearity in the log-odds space between y_i (P_kb,real, the true prior probabilities from REAL_kb) and x_i (P_kb,org, the posterior probabilities of TEST_kb).

3.2.3. Generalized additive models

An attractive alternative to standard generalized linear models is generalized additive models (Hastie and Tibshirani, 1986, 1987, 1990). Generalized additive models relax the linearity constraint and apply a non-parametric non-linear fit to the data. In other words, the data themselves decide on the functional form

between the independent variable and the dependent variable. Define y_i as the dependent variable with y_i ∈ [0, 1] representing P_kb,real, the approximated true prior probabilities from REAL_kb, and x_i equal to P_kb,org, the posterior probabilities of TEST_kb. To model f_k(x_i) ∈ [0, 1], generalized additive models with logit link function are employed. Methodologically, generalized additive models generalize the generalized linear model principle by replacing the linear predictor β_ki · x_i in Eq. (4) with an additive component:

y_i ≈ f_k(x_i) = logit⁻¹(α_k + s_ki(x_i))   (5)

with s_ki(x_i) as a smooth function. This study uses penalized regression splines s_ki(x_i) to estimate the non-parametric trend for the dependency of y_i on x_i (Wahba, 1990; Green and Silverman, 1994). These smooth functions use a large number of knots, leading to a model quite insensitive to the knot locations, while the penalty term is used to avoid the danger of over-fitting that would otherwise accompany the use of many knots. The complexity of the model is controlled by a parameter λ, which is inversely related to the degrees of freedom (df). If λ is small (i.e. the df are large), a very complex model that closely matches the data is employed. When λ is large (i.e. the df are small), a smooth model is considered. In order to optimize the generalized additive model, the fitting amounts to penalized likelihood maximization by penalized iteratively reweighted least squares (Wood, 2000, 2004, 2008).

3.2.4. Generalized additive models with monotonicity constraint

Due to the fact that generalized additive models produce a non-linear relationship between the independent variable P_kb,org and the dependent variable P_kb,real, the original ranking of the posterior probabilities of TEST_kN and its calibrated version may change.
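One simple way to restore a non-decreasing mapping over the bins is a running maximum over the predictions sorted by their original posterior; this is a minimal numpy sketch (our own naming) with the same effect as the rule-based post-estimation correction described below.

```python
import numpy as np

def enforce_monotone(p_kb_org, preds):
    """Non-decreasing correction of bin-level calibrated predictions.

    Sort bins by their original posterior P_kb,org and replace each
    prediction by the running maximum: wherever bin X+1 would receive a
    lower value than bin X, it is raised to bin X's value, so the
    original ranking of the bins is preserved.
    """
    preds = np.asarray(preds, dtype=float)
    order = np.argsort(p_kb_org)                   # bins in ranking order
    fixed = np.maximum.accumulate(preds[order])    # running max = non-decreasing
    out = np.empty_like(fixed)
    out[order] = fixed                             # back to original bin order
    return out
```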
However, marketing analysts could argue that the mapping from TRAIN_M onto TEST_N and the corresponding ranking of the customers in TEST_N (and respectively TEST_kN) given by the initial classifier C should be conserved. Therefore, a non-decreasing monotonicity constraint on the generalized additive model predictions is introduced to retain the original ranking of the customers. Inspired by rule-set creation advances in the post-learning phase (e.g. pedagogical rule-based extraction techniques as employed in Martens et al. (2007)), a rule set is produced on the training set of fold k in the post-estimation phase of the generalized additive models to obtain a non-decreasing monotone function f′_k. This ensures that the initial ranking of P_kb,org is maintained in the corresponding predictions P_kb,real of fold k. Practically, the training set is sorted by P_kb,org. Afterwards, the rule-based algorithm detects all non-decreasing monotonic inconsistencies in the prediction values f_k(P_kb,org) on the training set. For instance, suppose that the prediction value for bin X + 1 is lower than the prediction value for bin X; then the rule-based algorithm adds a rule to the rule base to change the prediction value of bin X + 1 to the larger prediction value of bin X. In the end, the generalized additive model and the rule base describe a non-decreasing monotone generalized additive model based function f′_k with the following characteristic (Denlinger, 2010):

if P_kb,org,X ≤ P_kb,org,X+1 then f′_k(P_kb,org,X) ≤ f′_k(P_kb,org,X+1)   (7)

with P_kb,org,X and P_kb,org,X+1 the original non-calibrated posterior probabilities for bins X and X + 1 in the training data set, and f′_k(P_kb,org,X) and f′_k(P_kb,org,X+1) the calibrated posterior probabilities in fold k for bins X and X + 1.

4. Empirical validation

The calibration methods are employed on a test bed of 9 real-life direct marketing datasets provided by a large European financial institution.
Each of these datasets corresponds to a typical financial product. Table 1 shows the characteristics of the response datasets. With the aim of methodologically comparing the different algorithms, a 10-fold cross-validation is applied. Furthermore, the classifier C that links TRAIN_M and TEST_N and outputs P_org is a logistic regression with forward variable selection, as it is a robust and well-known classification technique in the marketing environment (Neslin et al., 2006). Moreover, the calibration approaches based on generalized additive models use different levels of degrees of freedom (df) representing the non-linearity of the model; the higher the df, the higher the non-linearity. On the one hand, the df are set manually by the researcher (user-specified), while on the other hand the df are estimated automatically in correspondence with the shape of the response function (automatic). This study opts to manually set the df equal to {3, 4, 5} (resulting in GAMdf and GAMdf MONO). This df range is inspired by the recommendations and applications in Hastie and Tibshirani (1990) and Hastie et al. (2001), which use a relatively small number of df to account for different levels of non-linearity. Additionally, the generalized cross-validation procedure (GCV) is employed to automatically select the ideal number of df, resulting in GAMgcv and GAMgcv MONO (Gu and Wahba, 1991; Wood, 2000; Wood, 2004). The number of bins b for TEST_kN and REAL_kN is set to 200. Furthermore, P_org, the non-calibrated posterior probabilities of TEST_N, are used as a benchmark (ORIGINAL). The different algorithms are compared on an individual customer level using the log-likelihood (LL):

LL = ln( ∏_{i=1..N} p(x_i)^{y_i} · [1 − p(x_i)]^{1−y_i} ) = Σ_{i=1..N} { y_i · ln[p(x_i)] + (1 − y_i) · ln[1 − p(x_i)] }   (8)

with N the number of customers, p(x_i) equal to P_k,new, the calibrated posterior response probability, and y_i the real response variable with y_i ∈ {0, 1}.
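The LL criterion above is straightforward to compute; here is an illustrative Python version (the clipping constant `eps` is our own guard against taking log of 0 or 1):

```python
import numpy as np

def log_likelihood(y_real, p_new, eps=1e-12):
    """Log-likelihood of the real responses under the calibrated
    probabilities: the higher LL (closer to zero), the better the
    calibration to the true response distribution."""
    p = np.clip(np.asarray(p_new, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y_real, dtype=float)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```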
The LL is a well-known metric in (direct) marketing to evaluate the performance of an algorithm (e.g. Baumgartner and Hruschka, 2005). The higher the LL, the better the posterior probabilities are calibrated to the true response distribution. Moreover, the non-parametric Friedman test (Demšar, 2006; Friedman, 1937, 1940) with the Bonferroni-Dunn post-hoc test (Dunn, 1961) is used to test whether the different approaches differ significantly from the best performing algorithm.

5. Results

Table 2 presents the 10-fold cross-validated log-likelihood values for the different datasets and algorithms. Three panels (a, b, c) represent the various levels of the user-selected degrees of freedom for the generalized additive model mappings. For each dataset, the best performing algorithm in terms of log-likelihood is put in italics. Moreover, the average ranking (AR) per algorithm over the different datasets is given; the lower the ranking, the better the algorithm. The best performing algorithm is underlined and set in bold, while the algorithms that are not significantly different from the best one at the 5% significance level are only set in bold. The algorithms are split into four categories: the original, non-calibrated posterior probabilities (ORIGINAL), the rescaling methodology (SAERENS), the linear probability-mapping approach (GLM) and the non-linear probability-mapping approaches (LOG, GAMdf, GAMdf MONO, GAMgcv and GAMgcv MONO). Table 2 reveals that calibrating the posterior probabilities has a beneficial impact when a discrepancy exists between the true prior probabilities of the training set and the test set: ORIGINAL always performs worse than the other calibration approaches. Comparing the rescaling approach (SAERENS) with the best performing calibration approaches, one concludes that SAERENS

[Table 1. Dataset characteristics: per dataset, the number of variables used by C and the number of customers and percentage of responders in TRAIN_M and TEST_N.]

[Table 2. The 10-fold cross-validated log-likelihood values. Panel a: overview with GAM3 & GAM3 MONO; Panel b: overview with GAM4 & GAM4 MONO; Panel c: overview with GAM5 & GAM5 MONO. AR = average ranking.]

always performs significantly less well than the non-linear probability-mapping approaches, while SAERENS performs better than the linear probability-mapping approach (GLM). These results show that the analyst had better shift towards a non-linear probability-mapping approach, despite the fact that SAERENS is an easy and workable solution to the calibration problem. Contrasting the various probability-mapping approaches, Table 2 discloses that the non-linear calibration approaches (LOG, GAMdf, GAMdf MONO, GAMgcv and GAMgcv MONO) are always among the best performing algorithms. The linear mapping approach (GLM) is never significantly competitive with its non-linear counterparts. However, the generalized linear model with log transformation (LOG) is competitive with the more advanced GAM approaches (GAMdf, GAMdf MONO, GAMgcv and GAMgcv MONO). Within the non-linear calibration setting, one concludes that GAMgcv MONO always performs best, followed by the other non-linear calibration approaches.
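As an illustration, the average ranks and the Friedman test over per-dataset log-likelihoods can be computed with SciPy. This is a sketch under our own naming; the Bonferroni-Dunn post-hoc comparison against the best algorithm is not reproduced here.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def compare_algorithms(ll_matrix):
    """Friedman test over per-dataset log-likelihoods.

    ll_matrix: rows = datasets, columns = algorithms. Returns the average
    rank per algorithm (rank 1 = highest LL on a dataset, so lower average
    rank = better) and the Friedman test p-value.
    """
    ll = np.asarray(ll_matrix, dtype=float)
    # Rank within each dataset; negate so the highest LL gets rank 1.
    ranks = np.apply_along_axis(lambda r: rankdata(-r), 1, ll)
    avg_rank = ranks.mean(axis=0)
    stat, pvalue = friedmanchisquare(*ll.T)   # one sample per algorithm
    return avg_rank, pvalue
```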
Table 3 contains the performance measures for all generalized additive model approaches (GAMdf, GAMdf MONO, GAMgcv and GAMgcv MONO), for all levels of degrees of freedom. On a

dataset level, the best performing algorithm is put in italics. Furthermore, the average ranking (AR) for each algorithm is given, and the best performing algorithm (i.e. the one with the lowest ranking) is underlined and set in bold, while the ones that are not significantly different from the best at the 5% significance level are simply put in bold. Table 3 reveals that GAM5 MONO is the best performing algorithm among the GAM and GAM MONO approaches, closely followed by GAMgcv MONO. Table 3 shows a better performance trend for the GAM approaches when the number of df is increased: GAM3 performs less well than GAM4, while GAM4 performs less well than GAM5. Furthermore, it is clear that including the monotonicity constraint has a beneficial impact on the calibration performance of the GAM approaches. The average ranking of the GAM approaches including the monotonicity constraint is always better than that of their original GAM counterparts (i.e. GAMdf versus GAMdf MONO and GAMgcv versus GAMgcv MONO). Moreover, the automatic smoothness parameter selection procedure proves its beneficial impact. For the non-monotonicity models, GAMgcv always has a better ranking than the GAMdf approaches. For the monotonicity models, GAMgcv MONO always performs better than GAM3 MONO and GAM4 MONO, while GAMgcv MONO is very competitive with GAM5 MONO.

[Table 3. The 10-fold cross-validated log-likelihood values for the GAM and GAM MONO calibration models. AR = average ranking.]

6. Discussion

The results suggest that marketing analysts should calibrate the posterior probabilities when the training set does not represent the true prior distribution. In general, calibrating the posterior probabilities is more beneficial than using the non-calibrated posterior probabilities.
Moreover, it is shown that a simple rescaling algorithm (SAERENS) that takes into account the ratio of the old and the new priors is not sufficient to solve the calibration problem: SAERENS always performs significantly worse than the more complex non-linear probability-mapping approaches. Furthermore, marketing researchers should not apply the linear probability-mapping approach in this specific setting. Indeed, among the different probability-mapping approaches, it has been shown that the non-linear approaches are preferable to the linear mappings. The LOG approach is competitive with the more complex GAM-based calibration approaches, and because it is based on the common generalized linear model framework, LOG could be seen as a first, workable approach. However, if one wants to optimize the calibration performance, the GAM-based approaches are preferable. Moreover, one concludes that using the automatic smoothing parameter selection procedure and imposing a monotonicity constraint on the GAM method are the preferred options for optimizing calibration performance.

7. Conclusion

Direct marketing receives considerable attention these days, in academia as well as in business, due to a serious drop in the cost of IT equipment and the ever-increasing usage of response models in a variety of business settings. In a direct marketing context, a discrepancy sometimes exists between the prior distributions of the training set and the scoring set, which is problematic. This may happen because the training set consists entirely of customers previously selected by a response model, and thus contains a higher percentage of responders. Applying a classification model built on this training set to the complete set of customers will harm the estimation of the response probabilities.
Thoroughly adjusting the posterior probabilities to the real response probability distribution will improve the classification performance. This study reveals that the non-linear probability-mapping approaches are among the best performing algorithms, and their usage is highly recommended in a day-to-day business setting for the following reasons. Firstly, the non-linear probability-mapping approaches deliver a better performance than the other calibration algorithms included in this research paper, so the calibrated probabilities better reflect the true probabilities of response. Secondly, it is possible to visualize the relationship between P_kb,org and P_kb,real, which gives managers a better, visual understanding of the calibration process for a particular setting. For instance, the further the calibration curve lies from the 45° line (i.e. the line where P_kb,org = P_kb,real and no calibration is necessary), the higher the added value of sending a leaflet, because the incidence in TRAIN_M is higher than in REAL_N. Finally, the underlying techniques like generalized linear models and generalized additive models are easily implementable in today's business environment due to the availability of these classifiers in traditional software packages like SAS and R. While we are confident that our study adds significant value to the literature, valuable directions for future research are identified. Besides the probability-mapping approaches, which map P_kb,org onto P_kb,real, an extensive research project could be dedicated to investigating the impact of integrated calibration approaches, i.e. methods that integrate the calibration process into the initial training phase of classifier C in order to come up with a new classifier C′ which directly outputs calibrated probabilities.
For instance, a workable integrated calibration approach could be a two-stage Bayesian logistic regression that directly outputs calibrated posterior probabilities. To obtain this integrated Bayesian calibration model, the following procedure is proposed. Under the assumption that the commonly-used prior distribution for b_ki is multivariate Gaussian, i.e. p(b_ki) ~ N(b_0, Σ_0), an empirical Bayesian approach could be used to specify the values of b_0 and Σ_0 by fitting a Bayesian logistic regression to TRAIN_kM using non-informative priors. The resulting posterior mean vector and variance-covariance matrix of this initial model could then be used as the values of b_0 and Σ_0 for a second Bayesian logistic regression on REAL_kN. The resulting integrated Bayesian logistic regression C′ would directly output adapted, calibrated posterior probabilities.¹ Furthermore, the probability-mapping approaches are validated here in a direct marketing setting; future research efforts could investigate their external validity in other operational research settings.

¹ Nevertheless, this approach is not tested in the current version of the paper for confidentiality reasons.

Acknowledgements

The authors would like to thank the anonymous company for freely distributing the datasets. We would like to thank our friendly reviewers and the journal reviewers for their fruitful comments on earlier versions of this paper, and the editor, Jesus Artalejo, for guiding this paper through the reviewing process.

References

Allenby, G.M., Leone, R.P., Jen, L.C. A dynamic model of purchase timing with application to direct marketing. Journal of the American Statistical Association 94.
Baesens, B., Viaene, S., Van den Poel, D., Vanthienen, J., Dedene, G. Bayesian neural network learning for repeat purchase modeling in direct marketing. European Journal of Operational Research 138.
Baumgartner, B., Hruschka, H. Allocation of catalogs to collective customers based on semiparametric response models. European Journal of Operational Research 162.
Bose, I., Chen, X. Quantitative models for direct marketing: A review from systems perspective. European Journal of Operational Research 195.
Bosio, S., Righini, G. Computational approaches to a combinatorial optimization problem arising from text classification. Computers and Operations Research 34.
Bult, J.R. Semiparametric versus parametric classification models: An application to direct marketing. Journal of Marketing Research 30.
Conforti, D., Guido, R. Kernel based support vector machine via semidefinite programming: Application to medical diagnosis. Computers and Operations Research 37.
Crone, S.F., Lessmann, S., Stahlbock, R. The impact of preprocessing on data mining: An evaluation of classifier sensitivity in direct marketing. European Journal of Operational Research 173.
Deichmann, J., Eshghi, A., Haughton, D., Sayek, S., Teebagy, N. Application of multiple adaptive regression splines (MARS) in direct response modeling. Journal of Interactive Marketing 16.
Demšar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7.
Denlinger, C.G. Elements of Real Analysis. Jones and Bartlett Publishers.
Dunn, O.J. Multiple comparisons among means. Journal of the American Statistical Association 56.
Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32.
Friedman, M. A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics 11.
Green, P.J., Silverman, B.W. Nonparametric Regression and Generalized Linear Models. Chapman and Hall/CRC Press.
Gu, C., Wahba, G. Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method. SIAM Journal of Scientific and Statistical Computing 12.
Hastie, T., Tibshirani, R. Generalized additive models. Statistical Science 1.
Hastie, T., Tibshirani, R. Generalized additive models: Some applications. Journal of the American Statistical Association 82.
Hastie, T., Tibshirani, R. Generalized Additive Models. Chapman and Hall, London.
Hastie, T., Tibshirani, R., Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, New York.
Haughton, D., Oulabi, S. Direct marketing modeling with CART and CHAID. Journal of Direct Marketing 11.
Hruschka, H. Considering endogeneity for optimal catalog allocation in direct marketing. European Journal of Operational Research 206.
Kim, H.S., Sohn, S.Y. Support vector machines for default prediction of SMEs based on technology credit. European Journal of Operational Research 201.
Lamb, C.W., Hair, J.F., McDaniel, C. Principles of Marketing, second ed. South-Western Publishing Co., Cincinnati.
Lee, H.J., Shin, H., Hwang, S.S., Cho, S., MacLachlan, D. Semi-supervised response modeling. Journal of Interactive Marketing 24.
Martens, D., Baesens, B., Van Gestel, T., Vanthienen, J. Comprehensible credit scoring models using rule extraction from support vector machines. European Journal of Operational Research 183.
Martens, D., Van Gestel, T., De Backer, M., Haesen, R., Vanthienen, J., Baesens, B. Credit rating prediction using Ant Colony Optimization. Journal of the Operational Research Society 61.
Morales, D.R., Wang, J.B. Forecasting cancellation rates for services booking revenue management using data mining. European Journal of Operational Research 202.
Naik, P.A., Hagerty, M.R., Tsai, C.L. A new dimension reduction approach for data-rich marketing environments: Sliced inverse regression. Journal of Marketing Research 37.
Neslin, S.A., Gupta, S., Kamakura, W., Lu, J.X., Mason, C.H. Defection detection: Measuring and understanding the predictive accuracy of customer churn models. Journal of Marketing Research 43.
Paleologo, G., Elisseeff, A., Antonini, G. Subagging for credit scoring models. European Journal of Operational Research 201.
Piersma, N., Jonker, J.J. Determining the optimal direct mailing frequency. European Journal of Operational Research 158.
Saerens, M., Latinne, P., Decaestecker, C. Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation 14.
Tabachnick, B.G., Fidell, L.S. Using Multivariate Statistics. Harper Collins Publishers, New York.
Wahba, G. Spline Models for Observational Data. Society for Industrial and Applied Mathematics (SIAM), Capital City Press, Montpelier, Vermont.
Wood, S.N. Modelling and smoothing parameter estimation with multiple quadratic penalties. Journal of the Royal Statistical Society B 62.
Wood, S.N. Stable and efficient multiple smoothing parameter estimation for generalized additive models. Journal of the American Statistical Association 99.
Wood, S.N. Fast stable direct fitting and smoothness selection for generalized additive models. Journal of the Royal Statistical Society B 70.
Zahavi, J., Levin, N. Applying neural computing to target marketing. Journal of Direct Marketing 11.


More information

On the effect of data set size on bias and variance in classification learning

On the effect of data set size on bias and variance in classification learning On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent

More information

Smoothing and Non-Parametric Regression

Smoothing and Non-Parametric Regression Smoothing and Non-Parametric Regression Germán Rodríguez [email protected] Spring, 2001 Objective: to estimate the effects of covariates X on a response y nonparametrically, letting the data suggest

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression

11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c11 2013/9/9 page 221 le-tex 221 11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial

More information

Using reporting and data mining techniques to improve knowledge of subscribers; applications to customer profiling and fraud management

Using reporting and data mining techniques to improve knowledge of subscribers; applications to customer profiling and fraud management Using reporting and data mining techniques to improve knowledge of subscribers; applications to customer profiling and fraud management Paper Jean-Louis Amat Abstract One of the main issues of operators

More information

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Xavier Conort [email protected] Motivation Location matters! Observed value at one location is

More information

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing [email protected]

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing [email protected] IN SPSS SESSION 2, WE HAVE LEARNT: Elementary Data Analysis Group Comparison & One-way

More information

Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression

Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression Saikat Maitra and Jun Yan Abstract: Dimension reduction is one of the major tasks for multivariate

More information

Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

More information

Studying Auto Insurance Data

Studying Auto Insurance Data Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.

More information

European Journal of Operational Research

European Journal of Operational Research European Journal of Operational Research 197 (2009) 402 411 Contents lists available at ScienceDirect European Journal of Operational Research journal homepage: www.elsevier.com/locate/ejor Interfaces

More information

Machine Learning in Spam Filtering

Machine Learning in Spam Filtering Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov [email protected] Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.

More information

Handling attrition and non-response in longitudinal data

Handling attrition and non-response in longitudinal data Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical

More information

Predicting Student Performance by Using Data Mining Methods for Classification

Predicting Student Performance by Using Data Mining Methods for Classification BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 13, No 1 Sofia 2013 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.2478/cait-2013-0006 Predicting Student Performance

More information

Model selection in R featuring the lasso. Chris Franck LISA Short Course March 26, 2013

Model selection in R featuring the lasso. Chris Franck LISA Short Course March 26, 2013 Model selection in R featuring the lasso Chris Franck LISA Short Course March 26, 2013 Goals Overview of LISA Classic data example: prostate data (Stamey et. al) Brief review of regression and model selection.

More information

Efficiency in Software Development Projects

Efficiency in Software Development Projects Efficiency in Software Development Projects Aneesh Chinubhai Dharmsinh Desai University [email protected] Abstract A number of different factors are thought to influence the efficiency of the software

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Bootstrapping Big Data

Bootstrapping Big Data Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu

More information