Improving Direct Marketing Profitability with Neural Networks

Volume 29, No. 5, September 2011

Zaiyong Tang
Salem State University
Salem, MA 01970

ABSTRACT
Data mining in direct marketing aims at identifying the most promising customers to whom targeted advertising should be sent. Traditionally, statistical models are used to make such a selection. The success of statistical models depends on the validity of certain assumptions about the data distribution. Artificial intelligence inspired models, such as genetic algorithms and neural networks, do not need those assumptions. In this paper, we test neural networks with real-world direct marketing data. Neural networks are used for performance maximization at various mailing depths. Compared with statistical models, such as logistic regression and ordinary least squares regression, the neural network models provide more balanced outcomes with respect to the two performance measures: the potential revenue and the churn likelihood of a customer. Given the overall objective of identifying the churners with the most revenue potential, neural network models outperform the statistical models by a significant margin.

General Terms
Direct marketing, linear regression, artificial neural networks, direct response modeling.

Keywords
Neural networks, Data Mining, Direct Marketing, profit modeling.

1. INTRODUCTION
Data mining aims to identify patterns or relationships that are of interest or value to the data owners. With the speed of data creation today, it is not surprising that data mining techniques have attracted considerable interest in both business and academia. Lyman and Varian estimated that the current annual growth rate of unique data is between 1 and 2 exabytes, or "roughly 250 megabytes for every man, woman, and child on earth" [1]. The idea of extracting information from these large masses of data is indeed appealing considering the commercial, industrial, and economic potential. Data mining in direct marketing seeks techniques that maximize returns from direct-mailing solicitations.
Pollay and Mittal [2] studied the multiple dimensions of direct marketing advertising. Although consumer perception of direct marketing advertising has not always been enthusiastic, direct response marketing is widely used in the industry. Statistics from the Direct Marketing Association show that an estimated $166.5 billion was spent on direct marketing in the US in 2006. In 2007, direct response advertising accounted for more than half of all US advertising expenditures [3].

One of the key tasks in direct marketing advertising is to identify the most promising individuals to solicit. Due to time and budgetary constraints, it is generally not feasible to target the entire customer segment. Thus direct marketing models are built to maximize potential returns by targeting certain groups of customers or potential customers. The identification of target audiences for specific marketing promotions involves detailed analyses of the customer database to seek out the individuals most likely to respond and generate profits.

Various direct marketing models can be built using attributes characterizing potential responders to marketing promotions. Statistical techniques such as discriminant analysis, least squares regression and logistic models are commonly used [4]. Bult and Wansbeek used statistical regression for optimal selection of target mailing [5]. Haughton and Oulabi modeled direct marketing with CART and CHAID [6]. Zahavi and Levin applied neural networks to target marketing and compared the performance of neural networks with statistical approaches [7, 8]. Ha, Cho, and MacLachlan used neural networks for response modeling [9]. Baesens et al. applied Bayesian neural networks to direct marketing [10]. Kaefer et al. deployed neural network models to improve the timing of direct marketing activities [11]. Lee and Shih applied neural network models to identify profitable customers [12]. Torres, Hervás, and García used a hybrid approach that combines logistic regression and neural networks for classification problems [13].
While Zahavi and Levin showed that neural networks did not do better than statistical methods, Bentz and Merunka found that neural networks outperformed multinomial logistic regression [14].

Typically, the developed models are used to score individuals in a customer file such that higher scores indicate greater mailing preference [15]. The model scores are then used to rank individuals, and the final mailing list is determined through mailing-cost and budgetary considerations. Response models use discriminant analysis to classify individuals as responders and non-responders, with model scores pertaining to an individual's response likelihood. An alternate objective is to identify individuals with the highest response frequency in previous mailings, or those that have generated the most revenue in earlier purchases. Here the dependent variable becomes continuous, and regression models are often used. When customer data contain information pertaining to the profits and costs associated with individuals, an attractive modeling criterion is to identify individuals such that the overall profit from a mailing, considering promotional costs and purchase revenues, is maximized.

Given resource limitations, direct marketing models are used to target a fraction of the individuals in the customer file. The proportion of the selected best individuals to be targeted is referred to as the mailing depth or depth-of-file. Suppose the budget allows mailing to 5000 customers out of a total of 20,000 in the customer database. Obviously, we want to select the most promising 5000 individuals. In this case, the best 25% of individuals, as ranked by the model, make up the 25% depth-of-file. Once a model is built, various depth-of-file mailing strategies can be deployed. Because the individuals are ranked in the customer file, the smaller the mailing depth, the larger the improvement over a randomly selected customer list of the same size.

Although statistical techniques such as linear regression and logistic regression are commonly used in direct marketing analysis, those techniques have potential problems. For example, assumptions inherent in many commonly used statistical techniques may not be valid, as the model building typically relies on data collected with low response rates.

In this paper, we consider a direct incorporation of customer value together with the mailing depth in model development. We present a neural network based modeling approach that takes advantage of the robust, nonlinear modeling capability of neural networks. The main objective is to study the performance of neural network models in comparison to traditional statistical models. The following section discusses the performance analysis of direct marketing models. Section 3 gives a brief introduction to the neural network models used in our study. Section 4 presents the proposed direct marketing modeling approach. Experimental results are provided in Section 5, followed by a discussion of future research issues in Section 6 and the conclusion in Section 7.
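The depth-of-file selection described above can be sketched in a few lines of Python. This is only an illustration: the function name `select_mailing_list` and the toy scores are hypothetical, and the sketch assumes that a higher model score means a more promising customer.

```python
def select_mailing_list(scores, depth):
    """Return the indices of the top `depth` fraction of customers by score.

    `scores` holds one model score per customer (higher = more promising);
    `depth` is the depth-of-file as a fraction, e.g. 0.25 for 25%.
    """
    if not 0 < depth <= 1:
        raise ValueError("depth must be in the interval (0, 1]")
    n_select = int(round(len(scores) * depth))
    # Rank customer indices from the highest to the lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_select]


# Toy example: 8 customers, 25% depth-of-file selects the 2 best-scored ones.
scores = [0.10, 0.80, 0.35, 0.60, 0.05, 0.90, 0.20, 0.45]
mailing_list = select_mailing_list(scores, depth=0.25)
print(mailing_list)  # [5, 1]
```

With 20,000 customers and a depth of 0.25, the same call returns 5000 indices, matching the budget example in the text.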
2. DECILE ANALYSIS
Given that direct marketing models are used to identify a subset of the total customers expected to maximize returns from a mailing solicitation, model performance is assessed at different mailing depths. Typically a decile analysis is used to examine model performance [16]. In a decile analysis, individuals are separated into 10 equal groups based on their ranking or respective model scores, with higher scores indicating better performance. Table 1 shows a typical decile analysis where the performance objective is profit maximization from a mailing. The first row, or top decile, indicates performance for the best 10% of individuals as identified by the model. The cumulative lift at a specific depth of file provides a measure of improvement over a random mailing, and is calculated as:

\[
\text{Cumulative Lift}_{\text{decile}} = \frac{\text{Cumulative average profit}_{\text{decile}}}{\text{Overall average profit}} \times 100
\]

Thus, in Table 1, a cumulative lift of 255 in the top decile indicates that the model in question is expected to provide a mailing profit that is 2.55 times the profit expected from a random mailing to 10% of the customers. Similarly, if 20% of the customers are to be mailed, the model is expected to yield 2.16 times the profit of a random mailing to 20% of the customers. The cumulative lift at the bottom decile is always 100 and corresponds to a mailing to the entire customer list. An ideal model should exhibit decreasing performance from the top through the bottom deciles. As indicated in the table, the overall average customer profit is $2.25. However, the average profit for the top 10% of the customers is $5.75, while the bottom 10% of the customers have an average profit of only $0.84.

Table 1: Illustrative Decile Analysis

Decile  Number  Total Profit ($)  Average Profit ($)  Cumulative Average Profit ($)  Cumulative Profit Lift (%)
1       500      2873.80          5.75                5.75                           255
2       500      1990.32          3.98                4.86                           216
3       500      1732.25          3.46                4.40                           195
4       500      1231.55          2.46                3.91                           174
5       500       885.30          1.77                3.49                           155
6       500       627.10          1.25                3.11                           138
7       500       513.35          1.03                2.82                           125
8       500       504.18          1.01                2.59                           115
9       500       480.62          0.96                2.41                           107
10      500       420.78          0.84                2.25                           100
Total   5000    11259.25          2.25

3. NEURAL NETWORKS
Artificial neural networks are a broad class of computational models that have sparked wide interest in recent years [17, 18, 19]. In contrast to conventional centralized, sequential processing, neural networks consist of massively connected simple processing units, which are analogous to the neurons in the biological brain. Through elementary local interactions (such as excitatory and inhibitory ones) among these simple processing units, sophisticated global behaviors emerge, resembling the high-level recognition processes of humans [20, 21]. By virtue of their inherently parallel and distributed processing, neural networks have been shown to be able to perform tasks that are extremely difficult for conventional von Neumann machines, but are easy for humans. Neural networks have been used as an alternative approach to traditional optimization and statistical analysis, and have found successful applications in systems control, pattern recognition, classification, discriminant analysis, financial markets, and forecasting [19].

Many neural network paradigms have been developed during the last two decades. One of the most widely used neural network models is the feedforward neural network, where neurons are arranged in layers [13]. Besides an input layer and an output layer, there are one or more hidden layers between the input and the output layer. Figure 1 gives a typical fully connected two-layer feedforward neural network (by convention, the input layer does not count) with N input nodes, H hidden nodes, and M output nodes. It is common to refer to such a network as an NxHxM network. The arrows represent the forward direction.
Full connection means that each input node is connected to every hidden node, and each hidden node is connected to every output node. Note that it is possible to build neural network models with partial connections. Small networks (with a small number of nodes and/or a small number of connections) are generally preferred when the model needs to be able to generalize outside the sample data [22]. The input to the neural network is $X = \{x_i \mid i = 1, 2, \ldots, N\}$ and the output is $Y = \{y_i \mid i = 1, 2, \ldots, M\}$.
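To make the NxHxM structure concrete, the following is a minimal sketch of a forward pass through a fully connected two-layer network in plain Python. The sigmoid activation, the bias terms, the weight range, and the names `init_network` and `forward` are all assumptions made for illustration; the paper does not specify them.

```python
import math
import random


def sigmoid(z):
    """Logistic activation (an assumed choice, not stated in the paper)."""
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_inputs, n_hidden, n_outputs, seed=0):
    """Create random weights for a fully connected N x H x M network."""
    rng = random.Random(seed)

    def layer(n_in, n_out):
        # One weight row per receiving node; the last entry is a bias weight.
        return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                for _ in range(n_out)]

    return {"hidden": layer(n_inputs, n_hidden),
            "output": layer(n_hidden, n_outputs)}


def forward(network, x):
    """Propagate input vector x forward; return hidden and output activations."""
    x_b = list(x) + [1.0]  # append a constant 1.0 for the bias weight
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x_b)))
              for row in network["hidden"]]
    h_b = hidden + [1.0]
    output = [sigmoid(sum(w * v for w, v in zip(row, h_b)))
              for row in network["output"]]
    return hidden, output


# A 4x8x2 network: four inputs, eight hidden nodes, two outputs.
net = init_network(4, 8, 2)
hidden, y = forward(net, [0.2, 0.5, 0.1, 0.9])
print(len(hidden), len(y))  # 8 2
```

Every input feeds every hidden node and every hidden node feeds every output node, which is what full connection means here; a partially connected network would simply omit some of these weights.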

Fig 1. A feedforward network (input layer, hidden layer with nodes 1 to H, output layer with nodes 1 to M)

A feedforward neural network is used by first training it with known examples (X, T), where X are the inputs and T are the target values. Training a neural network means modifying the weights on the links (the connection strengths) such that the network learns the underlying pattern(s) from the training examples. A widely used training algorithm for feedforward neural networks is the backpropagation algorithm. Backpropagation is essentially a gradient descent based algorithm that minimizes an error function, typically the sum of squared differences between the network outputs and the target values:

\[
E = \sum_{i} \sum_{j} \left( y_{ij} - t_{ij} \right)^{2}, \qquad i = 1, 2, \ldots, P; \; j = 1, 2, \ldots, M
\]

where P is the number of sample (x, t) pairs and M is the number of output nodes. The output error is propagated back through the network, and the weights are modified to reduce the output error. When the error reaches a predetermined minimum, we say the training is done. A trained neural network can be used to retrieve the input-output relationship of the training examples. More importantly, it can generalize from the limited training examples. In other words, a trained neural network can predict the target value given a new set of input data. For complete coverage of the backpropagation training algorithm and many of its variations, the reader is referred to Fine [23].

4. DATA AND MODELING
A real-world application is studied in this paper. The problem considered is that of a cellular-phone provider seeking to identify potential high-value churners so that they can be targeted with an appropriate intervention program. The specific objective is to identify high-value churners amongst new installs within the first year of service.
Two dependent variables correspond to the two important measures of the objective: (1) a binary Churn variable indicating whether a customer churned (value 1) or not (value 0) within the first four months; and (2) a continuous variable measuring the revenue ($) associated with the customer. The predictor variables considered pertain to standard measures used in the cellular industry. The four predictor variables used in this study are peak minutes-of-use, off-peak minutes-of-use, average charges, and payment information. The data were obtained after the usual variable transformation and reduction.

Cumulative lifts at the specified depths-of-file serve as the performance measure. As discussed in Section 2, cumulative lifts at specific depths of file provide a measure of improvement over a random mailing. For instance, a lift of 300 at the 10% depth of file indicates that the model in question is expected to provide a performance that is three times that expected from a random mailing to 10% of the list. Two cumulative lifts are used to gauge the performance levels resulting from the two dependent variables. Churn-Lift at the desired decile shows the relative performance of the model in identifying churners. Revenue lift, denoted as $-Lift, at the desired decile indicates the model performance in identifying high-value customers without regard for their churn likelihood. Note that a high Churn-Lift does not necessarily correspond to a high value of $-Lift. A model that does well on both performance measures is preferred. The maximization of the expected revenue that can be saved through identification of high-value churners is the overall modeling objective.

Churn-Lift and $-Lift are estimated at a specific decile as follows. Let $R_d$ and $C_d$ be the cumulative total revenue and the cumulative total number of churners, respectively, at the decile; let $R$ be the total revenue for the entire data and $C$ the total number of churners in the entire data. Then, if $N$ denotes the overall total number of customers and $N_d$ is the total number of customers up to the decile level, the cumulative churn and revenue lifts are:

\[
\text{Churn-Lift} = \frac{C_d / N_d}{C / N}
\qquad \text{and} \qquad
\text{\$-Lift} = \frac{R_d / N_d}{R / N}
\]

The expected revenue saved through identifying the churners up to the depth-of-file is given by the product of the average churn per customer and the average revenue per customer:

\[
\frac{C_d}{N_d} \cdot \frac{R_d}{N_d} = \left( \text{Churn-Lift} \times \text{\$-Lift} \right) \left( \frac{R}{N} \cdot \frac{C}{N} \right)
\]

The product of the Churn-Lift and $-Lift values provides a measure for comparing the performance of models, as it gives the cumulative lift on the expected revenue saved:

\[
\text{Churn-Lift} \times \text{\$-Lift} = \frac{\left( C_d / N_d \right) \left( R_d / N_d \right)}{\left( C / N \right) \left( R / N \right)}
\]

Feedforward neural networks are used to model the relationship between the independent variables and the dependent variables (churn and revenue). In theory, a feedforward neural network with a single hidden layer is sufficient to approximate any continuous function [24]. Empirical evidence shows that more than one hidden layer in the neural network models does not noticeably improve the performance. We have therefore used neural networks with 4x8x1 and 4x8x2 structures. That is, there are four input nodes (corresponding to the four input variables), eight hidden nodes, and one or two output nodes.
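The lift formulas above can be sketched as a short Python function. The name `cumulative_lifts` and the toy data are hypothetical. The sketch also assumes that lifts are reported as percentages (100 = random mailing) and that the product of lifts is rescaled by 100, a convention inferred from the values reported in Tables 2 and 3 rather than stated in the text.

```python
def cumulative_lifts(scores, churned, revenue, depth):
    """Return (Churn-Lift, $-Lift, product of lifts) at a depth-of-file.

    scores  : model score per customer (higher = targeted first)
    churned : 1 if the customer churned, else 0
    revenue : revenue ($) associated with the customer
    depth   : depth-of-file as a fraction, e.g. 0.2 for 20%
    """
    n_total = len(scores)                      # N
    n_depth = int(round(n_total * depth))      # N_d
    top = sorted(range(n_total), key=lambda i: scores[i], reverse=True)[:n_depth]
    c_d = sum(churned[i] for i in top)         # C_d: churners up to the depth
    r_d = sum(revenue[i] for i in top)         # R_d: revenue up to the depth
    c_total, r_total = sum(churned), sum(revenue)
    churn_lift = 100.0 * (c_d / n_depth) / (c_total / n_total)
    dollar_lift = 100.0 * (r_d / n_depth) / (r_total / n_total)
    # Assumed convention: rescale so a random mailing gives a product of 100.
    product = churn_lift * dollar_lift / 100.0
    return churn_lift, dollar_lift, product


# Toy data: 10 customers already listed in descending score order.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
churned = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0]
revenue = [80, 60, 50, 40, 30, 20, 10, 5, 3, 2]

cl, dl, prod = cumulative_lifts(scores, churned, revenue, depth=0.2)
print(f"{cl:.1f} {dl:.1f} {prod:.1f}")  # 250.0 233.3 583.3
```

At a depth of 1.0 all three values equal 100, which mirrors the remark in Section 2 that the cumulative lift for a mailing to the entire list is always 100.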

The numbers of input and output nodes depend on the data attributes, while the selection of the number of hidden nodes is often based on rules of thumb. Since we are using fully connected feedforward networks, the number of weights W depends on the number of hidden nodes. The general guideline in selecting the number of hidden nodes is to construct a neural network that is just large enough to solve the problem at hand. Too few weights may render the model incapable of solving the problem, while too many weights tend to reduce the model's generalization ability [22].

After a few trial runs, the neural network training parameters were selected as follows: number of training epochs = 1000, learning rate = 0.5, momentum = 0.7. Neural networks are initialized with random weights. Each set of experiments is carried out 10 times with different initial weights, and the average results are reported.

A sample of 50,000 customer records was used for model building and testing. This sample was divided into equal training and test sets of 25,000 observations each. The training set was used to build the models. No cross validation was used during training. All reported results are based on the test data.

A logistic regression model for churn and an ordinary least squares regression model for revenue give us the baseline performances for the two objectives. While these models are expected to perform well on their respective single objectives, they may not provide effective solutions for the overall objective, i.e., maximization of the expected revenue that can be saved through targeted marketing to the high-value churners.

5. EXPERIMENT RESULTS
We tested the models at four different depths-of-file: 10%, 20%, 30% and 70%. Table 2 shows the Churn-Lift and $-Lift values from three neural network models. Model one uses the binary churn variable as the training target, with network structure 4x8x1. Revenue is not used in Model one.
Model two is similarly constructed, but it uses the continuous revenue variable as the training target while the churn variable is omitted. Model three combines the two dependent variables as the training targets, with a network structure of 4x8x2.

All three neural network models show significant improvement across the various depths-of-file compared with the expected performance from random sampling. The result is encouraging considering that the neural network models used are relatively simple. We have not conducted a comprehensive search for optimal neural network structures. Not surprisingly, Model one gives the largest Churn-Lift, as identifying the churners is the objective of this model. Model two aims to maximize the $-Lift. Churn-Lift is not considered by this model, hence its poor Churn-Lift results. When the two performance measures are combined, as in the case of Model three, more balanced results are achieved, and the overall performance also improves.

Table 2. Neural network performance results

Neural Network  Measure           10% depth  20% depth  30% depth  70% depth
Model 1         Churn-Lift        365.5      343.9      291.1      138
                $-Lift            211.2      152.2      128.1      86.4
                Product of lifts  771.9      523.4      372.9      119.2
Model 2         Churn-Lift        106.1      101.9      95.2       86.8
                $-Lift            361.4      271.1      222.3      136.2
                Product of lifts  383.4      276.2      211.5      118.2
Model 3         Churn-Lift        253.0      185.2      149.9      93.6
                $-Lift            314.1      290.1      270.1      138.1
                Product of lifts  794.8      537.2      404.9      129.3

Table 3. Performance comparison: Neural network vs. Regression

Model                 Performance       10% depth  20% depth  30% depth  70% depth
Best model            Churn-Lift        253.0      185.2      149.9      93.6
                      $-Lift            314.1      290.1      270.1      138.1
                      Product of lifts  794.8      537.2      404.9      129.3
Logistic Regression   Churn-Lift        447.1      403.4      296.0      137.8
                      $-Lift            111.8      72.6       57.4       66.7
                      Product of lifts  499.8      292.7      170.0      91.9
OLS Regression        Churn-Lift        116.2      108.1      99.7       91.8
                      $-Lift            360.5      271.7      223.2      136.2
                      Product of lifts  418.8      293.7      222.5      125.1
Improvement over Logistic               59.0%      83.5%      138.2%     40.7%
Improvement over OLS                    89.8%      82.9%      82.0%      3.3%
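The improvement rows of Table 3 can be recomputed from the product-of-lifts rows, as in the short Python check below. The dictionary layout and the function name `improvement` are hypothetical, and the numbers are the tabulated values as read from Table 3, so small differences from the printed percentages are to be expected from rounding.

```python
# Product of lifts at the 10%, 20%, 30% and 70% depths-of-file (Table 3).
product_of_lifts = {
    "neural network (Model 3)": [794.8, 537.2, 404.9, 129.3],
    "logistic regression": [499.8, 292.7, 170.0, 91.9],
    "OLS regression": [418.8, 293.7, 222.5, 125.1],
}


def improvement(best, baseline):
    """Percentage improvement of `best` over `baseline` at each depth."""
    return [round(100.0 * (b / s - 1.0), 1) for b, s in zip(best, baseline)]


best = product_of_lifts["neural network (Model 3)"]
over_logistic = improvement(best, product_of_lifts["logistic regression"])
over_ols = improvement(best, product_of_lifts["OLS regression"])
print(over_logistic)
print(over_ols)

# Each product of lifts is itself Churn-Lift x $-Lift / 100, for example
# Model 3 at 30% depth: 149.9 x 270.1 / 100, roughly 404.9.
model3_30 = 149.9 * 270.1 / 100.0
print(round(model3_30, 1))
```

The recomputed improvements agree with the tabulated ones to within about 0.1 percentage point; the largest gap is at the 70% depth against OLS, where rounding of the tabulated products matters most.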

In terms of the overall performance measure, the product of lifts, Models 1 and 3 are significantly better than Model 2. This indicates that high-revenue generating customers do not necessarily have high churn rates. Model 1 suggests that churners may contribute to a relatively large revenue loss. Since Model 3 provides the highest overall performance, it should be the model of choice for this particular application.

Table 3 gives performance comparisons between Model 3, our choice of neural network model, and the traditional statistical approaches, namely, the logistic model and the least squares regression model. Table 3 shows clearly that the neural network model outperforms the logistic regression and OLS models. In particular, when the depth-of-file is limited to the top 30 percent, the neural network gives considerably better overall performance. Note that both of the regression models suffer from skewed performance, as the logistic model overlooks the $-Lift while the OLS model overlooks the Churn-Lift. It is also noteworthy that the product of lifts generated by the neural network decreases by a significant amount when the depth-of-file goes from 10 percent to 30 percent. However, the neural network model still performs significantly better than the comparative models.

6. DISCUSSION
Feedforward neural networks are considered a general class of robust non-linear models. While linear models are widely used in real-world applications, most real-world problems nevertheless exhibit non-linear relationships between the independent and dependent variables. Neural networks enable us to design nonlinear systems that are able to deal with complex problems without a priori knowledge of the input-output relationship. Because of their powerful modeling capability and relative ease of use, neural networks have found wide use in various pattern recognition applications [23].
Linear regression models use linear functions to fit the data, based on the assumptions that the relationship between the dependent variable Y and the independent variables X is linear, that the values of Y are statistically independent of one another, and that the distribution of possible values of Y for any X values is normal with equal variances. Those assumptions may not hold true for all data sets. In contrast to the statistical models, neural networks make no such assumptions about the data; hence they can be applied to a wider range of problems. Furthermore, by changing the neural network structure and the activation functions of the processing elements (nodes), we can use neural networks to approximate classification and regression models.

In the current application, we use neural networks to model the input-output relationship of the sample data. This input-output relationship is then applied to the test data to predict the revenue potential and churn likelihood of a customer. The current neural network model does not directly incorporate performance maximization at a given depth-of-file. Future research may consider modifying the standard neural network learning algorithm to explicitly seek performance maximization with the specified mailing depth as an input. This would enable the decision maker to build optimal performance models geared towards specific depth-of-file requirements.

Building the best neural network for an application is still more of an art than a science. Zahavi and Levin [7] reported that neural networks did not outperform logistic regression. They suggested two possible reasons for their results. One is that neural networks may overfit the training data. The other is that neural network models are typically built by a trial and error approach. Further experiments exploring the use of other neural network models, such as modular neural networks, networks with weight decay, and multiple objective models, may lead to improved performance.
More efficient neural network learning algorithms may also be used to improve the training efficiency. Techniques such as cross-validation can be used to increase the generalization ability of the trained neural network model.

7. CONCLUSION
We have applied one of the most popular neural network models, namely, the feedforward neural network, to performance maximization at desired mailing depths in direct marketing in the cellular phone industry. The neural network based predictive model identifies the most promising individuals given a specified mailing depth. Compared with statistical models, such as logistic regression and ordinary least squares regression, the neural network models provide more balanced outcomes regarding the two predicted measures, namely, the potential revenue and the churn likelihood of a customer. In terms of the overall objective, i.e., identifying the churners with the most revenue potential, neural network models outperform the statistical models by a significant margin. The neural network models perform particularly well at low depth-of-file target levels.

8. ACKNOWLEDGMENTS
My thanks to Sid Bhattacharyya for providing the data and help with the direct response modeling and decile analysis.

9. REFERENCES
[1] Lyman, Peter and Hal R. Varian, 2000. How Much Information? Research report, School of Information Management and Systems, University of California at Berkeley.
[2] Pollay, R.W. and B. Mittal, 1993. Here's the beef: factors, determinants and segments of consumer criticism of advertising, Journal of Marketing, 57, 99-114.
[3] DMA 2007 annual report: Working to keep every channel open and economically viable for all marketers. http://web.mac.com/asyracuse/site/corporate_clips_files/annualreport.pdf. Retrieved March 2008.
[4] Hand, D.J. 1981. Discrimination and Classification, John Wiley and Sons, New York, NY.
[5] Bult, J.R. and T.J. Wansbeek, 1995. Optimal selection for direct mail, Marketing Science, 14, 378-394.
[6] Haughton, D. and S. Oulabi. 1997. Direct marketing modeling with CART and CHAID, Journal of Direct Marketing, 11(4), 42-52.
[7] Zahavi, J. and Levin, N. 1997. Issues and problems in applying neural computing to target marketing. Journal of Direct Marketing, 11(4), 63-75.
[8] Zahavi, J. and Levin, N. 1997. Applying neural computing to target marketing, Journal of Direct Marketing, 11(1), 5-22.

[9] Ha, K., Cho, S. and MacLachlan, D. 2005. Response models based on bagging neural networks. Journal of Interactive Marketing, 19(1), 17-30.
[10] Baesens, B., S. Viaene, D. van den Poel, J. Vanthienen, G. Dedene. 2002. Bayesian neural network learning for repeat purchase modelling in direct marketing. European Journal of Operational Research, 138(1), 191-211.
[11] Kaefer, Frederick, Heilman, Carrie M. and Ramenofsky, Samuel D. 2005. A Neural Network Application to Consumer Classification to Improve the Timing of Direct Marketing Activities. Computers and Operations Research, 32(10), 2595-2615.
[12] Lee, Wan-I, Bih-Yaw Shih. 2009. Application of neural networks to recognize profitable customers for dental services marketing - a case of dental clinics in Taiwan. Expert Systems with Applications, 36(1), 199-208.
[13] Torres, Mercedes, Cesar Hervás, Carlos García, 2009. Multinomial logistic regression and product unit neural network models: Application of a new hybrid methodology for solving a classification problem in the livestock sector, Expert Systems with Applications: An International Journal, 36(10), 12225-12235.
[14] Bentz, Y. and Merunka, D. 2000. "Neural Networks and the Multinomial Logit for Brand Choice Modeling: A Hybrid Approach", Journal of Forecasting, Vol. 19(3), 177-200.
[15] Bhattacharyya, S. 1999. "Direct Marketing Performance Modeling using Genetic Algorithms", INFORMS Journal of Computing, 11(3), 248-257.
[16] David Shepard Associates, 1995. The New Direct Marketing: How to Implement a Profit-Driven Database Marketing Strategy, 2nd Edition, Irwin Publishing.
[17] Levine, D. S. and M. Aparicio (editors), 1994. Neural Networks for Knowledge Representation and Inference, MIT Press.
[18] Principe, J. C., N. R. Euliano and W. C. Lefebvre. 2000. Neural and Adaptive Systems: Fundamentals Through Simulations, John Wiley & Sons, New York.
[19] Refenes, A. P. (editor), 1995. Neural networks in the capital markets, John Wiley & Sons, West Sussex, England.
[20] Rumelhart, D. E., James L. McClelland, and the PDP Research Group, 1986. Parallel Distributed Processing - Explorations in the Microstructure of Cognition, Volume I: Foundations, The MIT Press.
[21] Rumelhart, D.E., G.E. Hinton and R.J. Williams, 1986. Learning Internal Representations by Error Propagation, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume I: Foundations, D.E. Rumelhart and J.L. McClelland (eds.), MIT Press, Cambridge, MA.
[22] Bartlett, P. 1998. The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network, IEEE Transactions on Information Theory, 44, 141-166.
[23] Fine, T. L. 1999. Feedforward Neural Network Methodology, Springer-Verlag, New York.
[24] Hornik, K. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4, 251-257.