Customer Lifetime Value Measurement using Machine Learning Techniques

Tarun Rathi
Mathematics and Computing, Department of Mathematics
Indian Institute of Technology (IIT), Kharagpur - 721302
08MA2027@iitkgp.ac.in

Project guide: Dr. V Ravi
Associate Professor, IDRBT
Institute of Development and Research in Banking Technology (IDRBT)
Road No. 1, Castle Hills, Masab Tank, Hyderabad 500057
http://www.idrbt.ac.in/

July 8, 2011
Certificate

Date: July 8, 2011

This is to certify that the project report entitled "Customer Lifetime Value Measurement using Machine Learning Techniques", submitted by Mr. TARUN RATHI, 3rd year student in the Department of Mathematics, enrolled in its 5-year integrated MSc course of Mathematics and Computing, Indian Institute of Technology, Kharagpur, is a record of bona fide work carried out by him under my guidance during the period May 6, 2011 to July 8, 2011 at the Institute for Development and Research in Banking Technology (IDRBT), Hyderabad. The project work is a research study, which has been successfully completed as per the set objectives. I observed Mr. TARUN RATHI to be sincere, hardworking and having the capability and aptitude for independent research work. I wish him every success in his life.

Dr. V Ravi
Associate Professor, IDRBT
Supervisor
Declaration by the candidate

I declare that the summer internship project report entitled "Customer Lifetime Value Measurement using Machine Learning Techniques" is my own work conducted under the supervision of Dr. V Ravi at the Institute of Development and Research in Banking Technology, Hyderabad. I have put in 64 days of attendance with my supervisor at IDRBT and was awarded a project fellowship. I further declare that, to the best of my knowledge, the report does not contain any part of any work which has been submitted for the award of any degree either by this institute or by any other university without proper citation.

Tarun Rathi
III yr. Undergraduate Student
Department of Mathematics
IIT Kharagpur
July 8, 2011
Acknowledgement

I would like to thank Mr. B. Sambamurthy, Director of IDRBT, for giving me this opportunity. I gratefully acknowledge the guidance of Dr. V. Ravi, who helped me sort out all the problems in concept clarification, and without whose support the project would not have reached its present state. I would also like to thank Mr. Naveen Nekuri for his guidance and sincere help in understanding important concepts and in the development of the WNN software.

Tarun Rathi
III yr. Undergraduate Student
Department of Mathematics
IIT Kharagpur
July 8, 2011
Abstract: Customer Lifetime Value (CLV) is an important metric in relationship marketing approaches. There have always been traditional techniques like Recency, Frequency and Monetary Value (RFM), Past Customer Value (PCV) and Share-of-Wallet (SOW) for segregating customers into good or bad, but these are not adequate, as they only segment customers based on their past contribution. CLV, on the other hand, calculates the future value of a customer over his or her entire lifetime, which means it takes into account the prospect of a bad customer becoming good in future and hence profitable for a company or organisation. In this paper, we review the various models and different techniques used in the measurement of CLV. Towards the end we make a comparison of various machine learning techniques like Classification and Regression Trees (CART), Support Vector Machines (SVM), SVM using SMO, Additive Regression, K-Star Method, Multilayer Perceptron (MLP) and Wavelet Neural Network (WNN) for the calculation of CLV.

Keywords: Customer lifetime value (CLV), RFM, Share-of-Wallet (SOW), Past Customer Value (PCV), machine learning techniques, data mining, Support Vector Machines, Sequential Minimal Optimization (SMO), Additive Regression, K-Star Method, Artificial Neural Networks (ANN), Multilayer Perceptron (MLP), Wavelet Neural Network (WNN).
Contents

Certificate
Declaration by the candidate
Acknowledgement
Abstract
1. Introduction
2. Literature Review
   2.1 Aggregate Approach
   2.2 Individual Approach
   2.3 Models and Techniques to calculate CLV
       2.3.1 RFM Models
       2.3.2 Computer Science and Stochastic Models
       2.3.3 Growth/Diffusion Models
       2.3.4 Econometric Models
       2.3.5 Some other Modelling Approaches
3. Estimating Future Customer Value using Machine Learning Techniques
   3.1 Data Description
   3.2 Models and Software Used
       3.2.1 SVM
       3.2.2 Additive Regression and K-Star
       3.2.3 MLP
       3.2.4 WNN
       3.2.5 CART
4. Results and Comparison of Models
5. Conclusion and Directions of future research
References
1. Introduction:

Customer Lifetime Value has become a very important metric in Customer Relationship Management. Firms are increasingly relying on CLV to manage and measure their business. CLV is a disaggregate metric that can be used to identify customers who are likely to be profitable in future and hence to allocate resources accordingly (Kumar and Reinartz, 2006). Besides, the CLV of current and future customers is also a good measure of the overall value of a firm (Gupta, Lehmann and Stuart, 2004). There have been other measures as well which are fairly good indicators of customer loyalty, like Recency, Frequency and Monetary Value (RFM), Past Customer Value (PCV) and Share-of-Wallet (SOW). The customers who are more recent and have a high frequency and total monetary contribution are said to be the best customers in this approach. However, it is possible that a star customer of today may not be the same tomorrow. Malthouse and Blattberg (2005) have given examples of customers who are good at a certain point but may not remain so later, and of a bad customer turning good because of a change of job. Past Customer Value (PCV), on the other hand, calculates the total previous contribution of a customer adjusted for the time value of money. Again, PCV does not take into account the possibility of a customer being active in future (V. Kumar, 2007). Share-of-Wallet is another metric of customer loyalty which takes into account the brand preference of a customer. It measures the amount that a customer spends on a particular brand against other brands. However, it is not always possible to get the details of a customer's spending on other brands, which makes the calculation of SOW a difficult task. A common disadvantage which these models share is the inability to look forward, and hence they do not consider the prospect of a customer being active in future. The calculation of the probability of a customer being active in future is a very important part of CLV calculation, which differentiates CLV from these traditional metrics of customer loyalty. It is very important for a firm to know whether a customer will continue his relationship with it in the future or not. CLV helps firms understand the future behaviour of a customer and thus enables them to allocate their resources accordingly. Customer Lifetime Value is defined as the present value of all future profits obtained from a customer over his or her entire lifetime of relationship with the firm (Berger and Nasr, 1998). A very basic model to calculate the CLV of a customer is (V. Kumar, 2007):

CLV_{i} = \sum_{t=1}^{T} \frac{CM_{i,t}}{(1+d)^{t}}

where i is the customer index, t is the time index, T is the number of time periods considered for estimating CLV, CM_{i,t} is the contribution margin of customer i in period t, and d is the discount rate.
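To make the discounting in this basic model concrete, the short Python sketch below computes the CLV of one customer from a vector of projected per-period contribution margins. The margin values and the per-period discount rate are illustrative assumptions, not figures from the dataset used later in this report.

    # A minimal sketch of the basic CLV formula CLV_i = sum_t CM_it / (1 + d)^t.
    # The contribution margins and discount rate below are made-up illustrative values.

    def clv(contribution_margins, discount_rate):
        """Present value of a stream of projected contribution margins.

        contribution_margins: projected margin for periods t = 1..T
        discount_rate: per-period discount rate d
        """
        return sum(
            cm / (1.0 + discount_rate) ** t
            for t, cm in enumerate(contribution_margins, start=1)
        )

    if __name__ == "__main__":
        projected_margins = [1500.0, 1800.0, 2100.0, 2400.0]  # hypothetical CM_it per period
        d = 0.075                                             # hypothetical per-period discount rate
        print(round(clv(projected_margins, d), 2))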
There are various models to calculate the CLV of a customer or a cohort of customers, depending on the amount of data available and the type of company. V. Kumar (2007) has described an individual-level approach and an aggregate-level approach to calculate CLV. He has linked CLV to Customer Equity (CE), which is nothing but the average CLV of a cohort of customers. Dwyer (1997) has used a customer migration model to take into account the repeat purchase behaviour of customers. Various behaviour-based models like logit models and multivariate probit models have also been used (Donkers, Verhoef and de Jong, 2007), and models which take into account the relationship between various components of CLV, like customer acquisition and retention, are also used (Thomas, 2001). We will present some of the most used models to calculate CLV in the later part of the paper. Besides this, there are various techniques that are used to calculate CLV or the parameters needed to calculate CLV. Aeron, Kumar and Janakiraman (2010) have presented various parameters that may be useful in the calculation of CLV, which include acquisition rate, retention rate, add-on selling rate, purchase probability, purchase amount, discount rate, referral rate and cost factor. However, all of these parameters may not be required in a single model. Various researchers have used different techniques to calculate these parameters. Hansotia and Wang (1997) used logistic regression, Malthouse and Blattberg (2005) used linear regression for predicting future cash flows, Benoit and Van den Poel (2009) used quantile regression, and Haenlein et al. (2007) used CART and a Markov chain model to calculate CLV. An overview of various data mining techniques used to calculate the parameters of CLV has been compiled by Aeron, Kumar and Janakiraman (2010). Besides this, many researchers also use models like Pareto/NBD, BG/NBD, MBG/NBD, CBG/NBD, probit, tobit, ARIMA, support vector machines, Kohonen networks etc. to calculate CLV. Malthouse (2009) presents a list of the methods used by academicians and researchers who participated in the Lifetime Value and Customer Equity Modelling Competition. Most of the above mentioned models are used either to calculate the variables used to predict CLV or to find a relationship between them. In our research, we have used several non-linear techniques, namely Classification and Regression Trees (CART), Support Vector Machines (SVM), SVM using SMO, Additive Regression, the K-Star method, Multilayer Perceptron (MLP) and Wavelet Neural Network (WNN), to calculate CLV; these techniques capture the relationships among the variables which act as inputs in the prediction of CLV. Further, we also make a comparison of these techniques to find the best fitted model for the dataset we used. Later on we draw conclusions and discuss the areas of future research.

2. Literature Review:

Before going into the details of the various models of CLV, let us first have a look at the approaches designed for calculating CLV. These can broadly be classified in two ways: a) Aggregate Approach b) Individual Approach

2.1 Aggregate Approach: This approach revolves around calculating the Customer Equity (CE) of a firm. Customer Equity is nothing but the average CLV of a cohort of customers. Various researchers have devised different ways to calculate the CE of a firm. Gupta, Lehmann and Stuart
(2004) have calculated CE by summing up the CLV of all the customers and taking its average. Berger and Nasr (1998) calculated the average CLV from the lifetime value of a customer segment. They also took into account the rate of retention and the average acquisition cost per customer:

Avg. CLV = GC \cdot \sum_{t=1}^{n} \frac{r^{t}}{(1+d)^{t}} - A

Here, GC = expected gross contribution margin per customer per period, r = rate of retention, d = discount rate and A = average acquisition cost per customer. Kumar and Reinartz (2006) gave a formula for calculating the retention rate for a customer segment as follows:

Retention rate (%) = (Number of customers in the cohort buying in period t / Number of customers in the cohort buying in period t-1) x 100

Projecting the retention rate:

r_{t} = r_{c}\,(1 - e^{-rt})

Here, r_t = predicted retention rate for a given period of time in future, r_c = maximum attainable retention rate, given by the firm, and r = coefficient of retention, calculated as

r = (1/t)\,[\ln(r_{c}) - \ln(r_{c} - r_{t})]

This model is good enough for calculating the CLV of a segment of customers over a small period of time; however, the fluctuation of the retention rate and the gross contribution margin needs to be taken care of while projecting CLV for longer periods. Taking this into account, they proposed another model in which the profit function over time, \pi(t), can be calculated separately. This model is given as:

CLV = \sum_{t=1}^{T} \pi(t)\,\frac{r^{t}}{(1+d)^{t}}

where \pi(t) is the profit function over time.

Blattberg, Getz and Thomas (2001) calculated average CLV or CE as the sum of the return on acquisition, the return on retention and the return on add-on selling across the entire customer base. They summarized the formula as:

CE(t) = \sum_{i=1}^{I} \Big[ N_{i,t}\,\alpha_{i,t}\,(S_{i,t} - c_{i,t}) - N_{i,t}\,B_{i,a,t} \Big] + \sum_{i=1}^{I} N_{i,t}\,\alpha_{i,t} \sum_{k=1}^{\infty} \Big( \prod_{j=1}^{k} \rho_{i,t+j} \Big) \big(S_{i,t+k} - c_{i,t+k} - B_{i,r,t+k} - B_{i,AO,t+k}\big) \Big(\frac{1}{1+d}\Big)^{k}

where,
CE(t) is the customer equity value for customers acquired at time t,
N_{i,t} is the number of potential customers at time t for segment i,
\alpha_{i,t} is the acquisition probability at time t for segment i,
\rho_{i,t} is the retention probability at time t for a customer in segment i,
B_{i,a,t} is the marketing cost per prospect (N) for acquiring customers for segment i,
B_{i,r,t} is the marketing cost in time period t for retained customers of segment i,
B_{i,AO,t} is the marketing cost in time period t for add-on selling for segment i,
d is the discount rate,
S_{i,t} is the sales of the products/services offered by the firm at time t for segment i,
c_{i,t} is the cost of goods at time t for segment i,
I is the number of segments, i is the segment designation and t_0 is the initial time period.

Rust, Lemon and Zeithaml (2004) used a CLV model in which they considered the case where a customer switches between different brands. However, in using this model, one needs to have a customer base which provides information about previous brands purchased, the probability of purchasing different brands, etc. Here the CLV of customer i to brand j is given as:

CLV_{ij} = \sum_{t=1}^{T_{i}} (1 + d_{j})^{-t/f_{i}}\, v_{ijt}\, \pi_{ijt}\, B_{ijt}

where,
T_i is the number of purchases customer i makes during the specified time period,
d_j is firm j's discount rate,
f_i is the average number of purchases customer i makes in a unit time (e.g. per year),
v_{ijt} is customer i's expected purchase volume of brand j in purchase t,
\pi_{ijt} is the expected contribution margin per unit of brand j from customer i in purchase t, and
B_{ijt} is the probability that customer i buys brand j in purchase t.
The Customer Equity (CE) of firm j is then calculated as the mean CLV of all customers across all firms multiplied by the total number of customers in the market across all brands.

2.2 Individual Approach: In this approach, CLV is calculated for an individual customer as the sum of the cumulated cash flows of a customer over his or her entire lifetime, discounted using the WACC (weighted average cost of capital) (Kumar and George, 2007). The CLV in this case depends on the activity of the customer, or his expected number of purchases during the prediction time period, and also his expected contribution margin. The basic formula for CLV in this approach is:

CLV_{i} = \sum_{t=1}^{T} \frac{GC_{i,t}}{(1+d)^{t}}

where GC_{i,t} is the gross contribution margin for customer i in period t and d is the discount rate. This approach brings to light the need for calculating the probability of a customer being active, or P(active). There are various ways to calculate P(active). V. Kumar (2007) has calculated P(active) as:

P(active) = (T / N)^{n}

where n is the number of purchases in the observation period, T is the time elapsed between acquisition and the most recent purchase, and N is the time elapsed between acquisition and the period for which P(active) needs to be calculated. This model, however, is quite trivial. Several researchers have used statistically advanced methods to calculate P(active) or the expected frequency of purchase. Most of them have also taken into account other factors like channel communication, recency of purchase, customer characteristics, switching costs, first contribution margin etc. to make the predictions more accurate. Venkatesan and Kumar (2004), in their approach to calculate CLV, predicted the customer's purchase frequency based on past purchases. The CLV function in this case is represented as:

CLV_{i} = \sum_{y=1}^{T_{i}} \frac{CM_{i,y}}{(1+r)^{y/frequency_{i}}} - \sum_{l=1}^{n} \frac{\sum_{m} c_{i,m,l}\, x_{i,m,l}}{(1+r)^{l}}
where,
CLV_i is the lifetime value of customer i,
CM_{i,y} is the contribution margin from customer i in purchase occasion y,
r is the discount rate,
c_{i,m,l} is the unit marketing cost for customer i in channel m in year l,
x_{i,m,l} is the number of contacts to customer i in channel m in year l,
frequency_i is the predicted purchase frequency for customer i,
n is the number of years to forecast, and
T_i is the predicted number of purchases made by customer i until the end of the planning period.

Besides this, there have been various other models and techniques which calculate P(active) or the expected frequency of purchase, including Pareto/NBD, BG/NBD, MBG/NBD, CBG/NBD, probit, tobit, the generalized gamma distribution, the log-normal distribution etc. Various researchers and academicians who participated in the 2008 DMEF CLV Modelling Competition have used some of these models to calculate CLV. We will come to know more about these in the next part of the paper, when we study the various models and techniques used by researchers to calculate the parameters of CLV or CLV itself.

As we have seen, there are various aggregate and disaggregate approaches to calculate CLV. The obvious question which one comes across is which model to use. Kumar and George (2007) have given a detailed comparison of these models. They observed that an aggregate approach performs poorly in terms of time to implement and expected benefits, while a disaggregate approach has higher data requirements and more metrics to track. They also concluded that model selection should depend on the requirements of the firm and which criteria it gives more importance to in comparison to others. For example, one firm may consider the cost involved as an important factor, while another may consider expected profits as a major factor. Kumar and George (2007) have also proposed an integrated or hybrid approach to calculate CLV. In this approach, depending on the details available about a customer, an appropriate approach is adopted. If the firm's transaction data and firm-customer interaction data are available, then the individual approach of Venkatesan and Kumar (2004) is adopted. If this data is not available, but segment-level data is available, then the Blattberg, Getz and Thomas (2001) approach is adopted. If size-of-wallet information of customers is not available, but survey data is available, then the Rust, Lemon and Zeithaml (2004) approach is adopted.
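As an illustration of the simple P(active) heuristic of section 2.2, the short Python sketch below evaluates (T/N)^n for a hypothetical customer; the purchase count and time values are invented for the example and do not come from the dataset used later in this report.

    def p_active(n_purchases, t_last_purchase, n_horizon):
        """Simple P(active) heuristic: (T / N) ** n (V. Kumar, 2007).

        n_purchases: number of purchases in the observation period (n)
        t_last_purchase: time from acquisition to the most recent purchase (T)
        n_horizon: time from acquisition to the period of interest (N)
        """
        return (t_last_purchase / n_horizon) ** n_purchases

    # Hypothetical customer: 4 purchases, last purchase 10 months after
    # acquisition, evaluated 14 months after acquisition.
    print(round(p_active(4, 10.0, 14.0), 3))  # ~0.26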
2.3 Models and Techniques to calculate CLV: There are various models to calculate CLV. Most of them calculate the parameters needed to measure CLV using different models and then combine these as a new method to calculate CLV. For example, Fader, Hardie and Lee (2005) captured recency and frequency in one model to calculate the expected number of purchases and built another model to calculate the monetary value. Reinartz, Thomas and Kumar (2005) captured customer acquisition and retention simultaneously. Gupta et al. (2006) have given a good review on modelling CLV. We will use some of their modelling categories in this paper, with more examples and discussion.

2.3.1 RFM Models: RFM models have been in use in direct marketing for more than 30 years. These types of models are most common in industry because of their ease of use. They are based on three levels of information from customers, i.e. their recency, frequency and monetary contribution. Fader, Hardie and Lee (2005) have shown that RFM variables can be used to build a CLV model and that the RFM variables are sufficient statistics for their CLV model. We now present, in brief, two RFM-based models used to determine CLV.

Weighted RFM Model: Khajvand and Tarokh (2010) have presented this model for estimating customer future value based on data given by an Iranian bank. In this model they took the raw data from the bank and calculated the recency, frequency and monetary value of each customer. Using clustering techniques like K-means clustering, they segmented the data into various groups and calculated the CLV for each cluster using the following formula:

CLV = w_{R}\,R + w_{F}\,F + w_{M}\,M

where w_R, w_F and w_M are the weights of recency, frequency and monetary value, obtained by the AHP method based on expert opinion. The key limitation of this modelling approach is that it is a scoring model rather than a CLV model. It divides customers into various segments and then calculates a score for each segment; it does not actually provide a dollar value for each customer. To overcome this, Khajvand and Tarokh (2010) proposed a multiplicative seasonal ARIMA (Auto-Regressive Integrated Moving Average) method to calculate CLV, which is a time series prediction method. The multiplicative seasonal ARIMA(p,d,q)x(P,D,Q)s model, where

p = order of the autoregressive process
d = order of the differencing operator
q = order of the moving average process
P = order of the seasonal autoregressive process
D = order of the seasonal differencing operator
Q = order of the seasonal moving average process

can be represented by:

\phi_{p}(B)\,\Phi_{P}(B^{s})\,\nabla^{d}\,\nabla_{s}^{D}\,x_{t} = \theta_{q}(B)\,\Theta_{Q}(B^{s})\,\varepsilon_{t}

where \phi_{p}(B) is the autoregressive operator, \theta_{q}(B) is the moving average operator, \nabla^{d} is the d-fold differencing operator which is used to change a non-stationary time series into a stationary one, \Theta_{Q}(B^{s}) is the seasonal moving average operator, \nabla_{s}^{D} is the D-fold seasonal differencing operator, B is the backshift operator and \varepsilon_{t} is the error term. The main limitation of this model was that it predicted the future value of customers in the next interval only, due to lack of data.

RFM and CLV using iso-value curves: Fader, Hardie and Lee (2005) proposed this model to calculate CLV. They showed that no information other than the RFM characteristics is required to formulate this model. Further, they used the "lost for good" approach to formulate this model, which means that customers who leave the relationship with a firm never come back. It is also assumed that M is independent of R and F. This suggests that the value per transaction can be factored out and we can forecast the flow of future transactions. We can then rescale this number of discounted expected transactions (DET) by a monetary value (a multiplier) to yield a dollar number for each customer. The model is formulated as:

CLV = margin x revenue/transaction x DET

The calculation of DET is the most important part of this model. Fader, Hardie and Lee (2005) first calculated DET for a customer with observed behaviour (X = x, t_x, T) as:

DET = \sum_{t} \frac{E[x_{t}]}{(1+d)^{t}}
Here, the numerator is the expected number of transactions in period t and d is the discount rate. However, according to Blattberg, Getz and Thomas (2001), this calculation of CLV has the following problems: a) we do not know the time horizon for projecting the sales, b) it is unclear what time periods to measure, and c) the expression ignores the specific timing of transactions. Hence Fader, Hardie and Lee used the Pareto/NBD model with a continuous-time formulation instead of a discrete-time formulation to compute DET (and thus CLV) over an infinite time horizon. The DET is thus calculated as:

DET(d \mid r, \alpha, s, \beta, X = x, t_{x}, T) = \frac{\alpha^{r}\,\beta^{s}\,d^{\,s-1}\,\Gamma(r+x+1)\,\Psi(s, s; d(\beta+T))}{\Gamma(r)\,(\alpha+T)^{r+x+1}\,L(r, \alpha, s, \beta \mid X = x, t_{x}, T)}

where r, \alpha, s, \beta are the Pareto/NBD parameters, \Psi(.) is the confluent hypergeometric function of the second kind, and L(.) is the Pareto/NBD likelihood function. They then added a general model of monetary value to obtain a dollar value of CLV, assuming that a customer's individual transaction values vary around his or her average transaction value. After checking various distributions, they found that the gamma distribution best fitted their data and hence calculated the expected average transaction value for a customer with an average spend of \bar{m}_{x} across x transactions as:

E(M \mid p, q, \gamma, \bar{m}_{x}, x) = \frac{\gamma\,p + \bar{m}_{x}\,p\,x}{p\,x + q - 1}

where p, q and \gamma are the parameters of the gamma spend model. This monetary value, multiplied by DET, gave the CLV of a customer. Following this, various graphs, also called iso-value curves, were drawn to identify customers with different purchase histories but similar CLVs, such as CLV vs. frequency, CLV vs. recency and CLV vs. frequency and recency. The key limitation of this model is that it is based on a non-contractual purchase setting, and it is not immediately clear which distributions should be used for transaction incidence and transaction size.

2.3.2 Computer Science and Stochastic Models: These types of models are primarily based on data mining, machine learning, non-parametric statistics and other approaches that emphasize predictive ability. They include neural network models, projection-pursuit models, decision tree models and spline-based models (Generalized Additive Models (GAM), Classification and Regression Trees (CART), Support Vector Machines (SVM), etc.). Various researchers have used these techniques to calculate CLV. Haenlein et al. (2007) have used a model based on CART and first-order Markov chains to calculate CLV. They had data from a retail bank. First, they determined the various profitability drivers as predictor variables, together with a target variable, in a CART analysis to build a regression tree. This tree helped them cluster the customer base into a set of homogeneous subgroups. They used these subgroups as discrete states and estimated a transition matrix
which describes movements between them, using Markov chains. To estimate the corresponding transition probabilities, they determined the state each customer belonged to at the beginning and end of a predefined time interval T, using the decision rules resulting from the CART analysis. In the final step, the CLV of each customer group was determined as the discounted sum of state-dependent contribution margins, weighted with their corresponding transition probabilities:

CLV_{i} = \sum_{t=0}^{T} \frac{p_{t}\,CM_{i}}{(1+d)^{t}}

where p_{t} is the probability of transition from one state to another, CM_{i} is the contribution margin for customer i and d is the discount rate. Finally, a study of the CLVs of each customer segment was made to design marketing strategies for each segment. This model, however, has some limitations too. It was assumed that client behaviour follows a first-order Markov process, which does not take into account the behaviour of early periods, treating it as insignificant. It was also assumed that the transition matrix is stable and constant over time, which seems inappropriate for long-term forecasts, and the possibility of brand switching in customer behaviour is not taken into account.

Malthouse and Blattberg (2005) have used linear regression to calculate CLV. The CLV in this case is related to the predictor variables x_{i} through some regression function f as

g(CLV_{i}) = f(x_{i}) + \varepsilon_{i}

where the \varepsilon_{i} are independent random variables with mean 0 and error variance V(\varepsilon_{i}) = \sigma^{2}, and the invertible function g is a variance-stabilizing transformation. We can consider various regression models for this function: a) linear regression with variance-stabilizing transformations estimated with ordinary least squares, b) linear regression estimated with iteratively re-weighted least squares (IRLS), and c) a feedforward neural network estimated using S-Plus version 6.0.2. Methods like k-fold cross-validation are used to check the correctness of the analysis.

Benoit and Van den Poel (2009) have used quantile regression instead of linear regression to calculate CLV. It extends the mean regression model to conditional quantiles of the response variable, such as the median. It provides insights into the effects of the covariates on the conditional CLV distribution that may be missed by the least squares method. In predicting the top x-percent of customers, the quantile regression method is a better
method than the linear regression method. The smaller the top segment of interest, the better the estimate of predictive performance. Besides these, other data mining techniques like Decision Trees (DT), Artificial Neural Networks (ANN), Genetic Algorithms (GA), fuzzy logic and Support Vector Machines (SVM) are also in use, but mostly to calculate CLV metrics like customer churn, acquisition rate, customer targeting etc. Among DTs the most common are C4.5, CHAID, CART and SLIQ. ANN have also been used to capture non-linear patterns in data; they can be used for both classification and regression purposes depending on the activation function. Malthouse and Blattberg (2005) used ANN to predict future cash flows. Aeron, Kumar and Janakiraman (2010) have mentioned different approaches for using ANN. The first is the generalized stacking approach used by Hu and Tsoukalas (2003), where an ensemble method is used. The data is first divided into three groups: the first group has all situational variables, the second has all demographic variables and the third has both situational and demographic variables. The other is the hybrid GA/ANN approach of Kim and Street (2004) for customer targeting, where the GA searches the exponential space of features and passes one subset of features to the ANN. The ANN extracts predictive information from each subset and learns the patterns. Once it has learnt the patterns, it is evaluated on a data set and returns metrics to the GA. ANN, too, are not without limitations. They cannot handle too many variables, so various other algorithms like GA, PCA (Principal Component Analysis) and logistic regression are used for selecting the variables to input to the ANN. There is no set rule for finding ANN parameters; selection of these parameters is a research area in itself. Besides all this, initial weights are decided randomly in ANN, which can make it take longer to reach the desired solution.

Genetic Algorithms (GA) are more suitable for optimization problems, as they achieve a global optimum with quick convergence, especially for high-dimensional problems. GA have seen varied applications among CLV parameters, like multi-objective optimization (using the Genetic-Pareto algorithm), churn prediction, customer targeting, cross selling and feature selection. GA is either used to predict these parameters or to optimize the parameter selection of other techniques like ANN. Besides GA, fuzzy logic and Support Vector Machines also find applications in predicting churn and loyalty indices. There are many other techniques and models like GAM (Generalized Additive Models), MARS (Multivariate Adaptive Regression Splines), Support Vector Machines (SVM) etc. which are used to predict or optimize the various parameters for CLV like churn rate, logit and hazard functions, classification etc. Churn rate in itself is a very vast area of CRM which can be used as a parameter in the prediction of CLV and many other related models. There have been many worldwide competitions and tournaments in which academics and practitioners combine different models to get the best possible results. These approaches remain little known in the marketing literature and have a lot of scope for further research. The 2008 DMEF CLV Competition was one such competition in which various researchers and academicians came together to compete in the three tasks of that competition. Malthouse
(2009) has made a compilation of the various models which were presented in that competition.

2.3.3 Growth/Diffusion Models: These types of models focus on calculating the CLV of current and future customers. Forecasting the acquisition of future customers can be done in two ways. The first approach uses disaggregate customer data and builds models that predict the probability of acquiring a particular customer (Thomas, Blattberg and Fox, 2004). The other approach is to use aggregate data and use diffusion or growth models to predict the number of customers a firm is likely to acquire in the future (Gupta, Lehmann and Stuart, 2004). The expression for forecasting the number of new customers at time t is:

n(t) = \frac{\alpha\,\gamma\,e^{-(\beta + \gamma t)}}{\big(1 + e^{-(\beta + \gamma t)}\big)^{2}}

where \alpha, \beta and \gamma are parameters of the customer growth curve. Using this, they estimated the CE of a firm as:

CE = \sum_{k=0}^{\infty} n_{k} \sum_{t=k}^{\infty} m\,\frac{r^{t-k}}{(1+i)^{t-k}}\,\frac{1}{(1+i)^{k}} - \sum_{k=0}^{\infty} \frac{n_{k}\,c}{(1+i)^{k}}

where n_k is the number of newly acquired customers for cohort k, m is the margin, r is the retention rate, i is the discount rate, and c is the acquisition cost per customer. Diffusion models can also be used to assess the value of a lost customer. For example, a bank which has recently adopted a new technology will have some customers who are reluctant to accept that change and will be lost. If the relative proportions of lost customers across the segments are \pi_{k}, then the value of an average lost customer is the proportion-weighted average of the segment values, \sum_{k} \pi_{k}\,CLV_{k}.

2.3.4 Econometric Models: Gupta et al. (2006) have given a good review of these types of models. We present the same in brief in this paper, with the example of a right-censored tobit model by Hansotia and Wang (1997). Econometric models study customer acquisition, retention and expansion (cross selling or margin) and combine them to calculate CLV. Customer acquisition and customer retention are the key inputs for such models. Various models relate customer acquisition and retention and come up with new models to calculate CLV, for example the right-censored tobit model for CLV (Hansotia and Wang,
1997). It has also been shown by some researchers (Thomas, 2001) that ignoring the link between customer acquisition and retention may cause a 6-50% variation in these models; for example, if we spend less money on acquisition, the customers might walk away soon. Retention models are broadly classified into two main categories: a) the first considers the "lost for good" approach and uses hazard models to predict the probability of customer defection, b) the second considers the "always a share" approach and typically uses Markov models. Hazard models are used to predict the probability of customer defection. They again are of two types: a) Accelerated Failure Time (AFT) models (Kalbfleisch and Prentice, 1980) and b) Proportional Hazard (PH) models (Levinthal and Fichman, 1988). AFT models are of the form:

\ln(t_{j}) = \beta' X_{j} + \sigma\,\mu_{j}

where t_j is the purchase duration for customer j and X_j are covariates. Different specifications of \sigma and \mu_{j} lead to different models such as the Weibull or generalized gamma model. PH models specify the hazard rate \lambda(t_{j}) in terms of a baseline hazard \lambda_{0} and the covariates X_{j} as:

\lambda(t_{j}; X_{j}) = \lambda_{0}(t_{j})\,\exp(\beta' X_{j})

We get different models like the exponential, Weibull, Gompertz etc. for different specifications. Hansotia and Wang (1997) used a right-censored tobit model to calculate the lifetime value of customers, or LTV as it was called then. It is a regression model with right-censored observations and can be estimated by the method of maximum likelihood. The present value of a customer's revenue (PVR) for the qth customer receiving package j was modelled as:

PVR_{q} = \beta_{j}'\,x_{q} + \varepsilon_{q}

where x_{q} is the (K+1)-dimensional column vector of profile variables for the qth customer and \varepsilon_{q} is a normally distributed error term. The equation may also be estimated using the LIFEREG procedure in SAS. The likelihood function, which is the probability of observing the sample values, was given by:

L = \prod_{i} \big[f(y_{i})\big]^{S_{i}}\,\big[1 - F(y_{i})\big]^{1 - S_{i}}

where S = 1 if observation i is uncensored and 0 otherwise, and f and F denote the density and distribution functions of the response. Besides the four types of models presented in this paper, Gupta et al. (2006) have also mentioned probability models; in our review, however, these have been covered under computer science and stochastic models. Gupta et al. (2006) have
made a few assumptions in their review of probability models, for example that the probability of a customer being alive can be characterized by various probability distributions. They have also taken into account the heterogeneity in dropout rates across customers. Various combinations of these assumptions result in models like Pareto/NBD, beta-binomial/beta-geometric (BG/BB), Markov models etc. Other than that, Gupta et al. (2006) have also mentioned persistence models, which have been used in some CLV contexts to study the impact of advertising, discounting and product quality on customer equity (Yoo and Hanssens, 2005) and to examine differences in CLV resulting from different customer acquisition methods (Villanueva, Yoo, and Hanssens, 2006).

2.3.5 Some other Modelling Approaches: Donkers et al. (2007) have also made a review of various CLV modelling approaches with respect to the insurance industry. These include a status quo model, a Tobit-II model, univariate and multivariate choice models and duration models. They grouped these models into two types. The first are relationship-level models, which focus on relationship length and total profit, and build directly on the definition of CLV as given by Berger and Nasr (1998):

CLV_{i} = \sum_{t=1}^{T} \frac{Profit_{i,t}}{(1+d)^{t}}

where d is a predefined discount rate and Profit_{i,t}, for a multiservice industry, is defined as:

Profit_{i,t} = \sum_{j=1}^{J} Serv_{i,j,t} \times Usage_{i,j,t} \times Margin_{j,t}

where J is the number of different services sold, Serv_{i,j,t} is a dummy indicating whether customer i purchases service j at time t, Usage_{i,j,t} is the amount of the service purchased, and Margin_{j,t} is the average profit margin for service j. The second are service-level models, which disaggregate a customer's profit into the contribution per service. The CLV predictions are then obtained by predicting purchase behaviour at the service level and combining the results to calculate CLV. An overview of the models presented by Donkers et al. (2007) is given below:
An overview of the relationship-level models:

Status Quo Model - assumes that profit simply remains constant over time.
Profit Regression Model - aims at predicting a customer's annual profit contribution.
Retention Models - based on segmenting over RFM.
Probit Model - based on customer-specific retention probabilities.
Bagging Model - also based on customer-specific retention probabilities.
Duration Model - focused on the customer's relationship duration.
Tobit-II Model - separates the effect of customer defection on profitability.

An overview of the service-level models: These types of models are explained through a choice model approach and a duration model approach. The choice model approach has as its dependent variable the decision to purchase a service or not. The duration model approach focuses on the duration of an existing relationship; it only models the ending of a period and not the starting of a new one.

The next part of the paper presents the machine learning approach we have used to calculate the future value of customers. A sample dataset from Microsoft Access 2000, the Northwind Traders database, is adopted to demonstrate our approach. We have used Classification and Regression Trees (CART), Support Vector Machines (SVM), SVM using SMO, Additive
Regression, K-Star Method, Multilayer Perceptron (MLP) and Wavelet Neural Network (WNN) to calculate the future value of customers. In the later part of the paper, we make a comparison of these models and suggest the best model to calculate CLV. We end the paper with results and a discussion of future developments in the area of CLV measurement.

3. Estimating Future Customer Value using Machine Learning Techniques: There are various data mining techniques which are used in the field of classification and regression. The choice of technique depends on the type of data available. In our case, we have used regression techniques to determine the future value of customers in the next prediction period. In the past, several researchers have used these techniques to determine the metrics of CLV, depending on the type of model and approach they have used. Hansotia and Wang (1997) have used CART and CHAID for customer acquisition, Kim and Street (2004) have used ANN for customer targeting, and Au et al. (2003) used Genetic Algorithms (GA) for predicting customer churn. However, using these techniques to directly predict a customer's future value, and hence CLV, has not been done so far. Most of the previous approaches to measuring CLV have used two or more models to calculate either CLV or the relationship between the various parameters used to determine CLV. The approach which we have adopted tries to eliminate this process and lets the learning technique itself capture the relationship between the input variables and their weightage in calculating CLV.

3.1 Data Description: A sample database of Microsoft Access 2000, the Northwind Traders database, is adopted to calculate the CLV of customers. The database contains 89 customers with a purchase period of two years, from 1 July 1994 to 30 June 1996. We have divided this time frame into four equal half-years and calculated the frequency of purchase and the total monetary contribution in July-December 1994, January-June 1995, July-December 1995 and January-June 1996. Further, we kept the observation period from July 1994 till December 1995 and made a prediction of the expected contribution in the next period, i.e. January-June 1996. The total number of variables used is 7, of which 6 are input or predictor variables and the remaining one, i.e. the contribution margin in January-June 1996, is the target variable. The entire dataset is then divided into two parts: a) training and b) testing. We used 65 samples for training and the remaining 24 for testing purposes.
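For readers who wish to reproduce this kind of feature construction, the Python sketch below derives the predictor variables of Table 1 and the target from a flat transaction table and makes the 65/24 train-test split. The file name and column names (customer_id, order_date, amount) are hypothetical, since the original work built the features directly from the Northwind Traders tables.

    import pandas as pd

    # Hypothetical flat export of the Northwind transactions: one row per order,
    # with columns customer_id, order_date and amount (file name is assumed).
    orders = pd.read_csv("northwind_orders.csv", parse_dates=["order_date"])

    halves = {                       # the four half-year windows used in the study
        "CM_july_dec94": ("1994-07-01", "1994-12-31"),
        "CM_jan_june95": ("1995-01-01", "1995-06-30"),
        "CM_july_dec95": ("1995-07-01", "1995-12-31"),
        "output":        ("1996-01-01", "1996-06-30"),   # target period
    }

    features = pd.DataFrame(index=orders["customer_id"].unique())
    for name, (start, end) in halves.items():
        in_window = orders[orders["order_date"].between(start, end)]
        features[name] = in_window.groupby("customer_id")["amount"].sum()

    obs = orders[orders["order_date"] <= "1995-12-31"]            # observation window
    features["total_frequency"] = obs.groupby("customer_id").size()
    last = obs.groupby("customer_id")["order_date"].max()         # most recent purchase
    # recency score: July 1994 = 1, ..., December 1995 = 18
    features["recency_dec95"] = (last.dt.year - 1994) * 12 + last.dt.month - 6
    features["total_duration"] = 18                               # fixed 18-month window
    features = features.fillna(0.0)

    train, test = features.iloc[:65], features.iloc[65:]          # 65 train / 24 test
    print(train.shape, test.shape)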
Table 1: Description of variables

Type of variable | Variable name   | Variable description
Input variable   | Recency-dec95   | Recency as a score, with July 1994 counted as 1 and December 1995 as 18
Input variable   | total frequency | The total number of purchases between July 1994 and December 1995
Input variable   | Total duration  | The total duration of the observation period, i.e. from July 1994 till December 1995
Input variable   | CM_july-dec94   | The contribution margin in the period July-December 1994
Input variable   | CM_jan-june95   | The contribution margin in the period January-June 1995
Input variable   | CM_july-dec95   | The contribution margin in the period July-December 1995
Target variable  | output          | The contribution margin in the period January-June 1996

3.2 Models and Software used: Knime 2.0.0, Salford Predictive Miner (SPM), NeuroShell 2 (Release 4.0) and software developed by Chauhan et al. (2009) at IDRBT, Hyderabad, for classification problems using DEWNN are used for the analysis. In Knime, we have used Support Vector Machines (SVM), SVM using SMO, Additive Regression and the K-Star method as learners on the training dataset and the Weka predictor for prediction on the testing dataset. In Salford Predictive Miner (SPM), we used CART to train on the dataset and applied the rules obtained from the training dataset to the testing dataset for prediction. The software developed at IDRBT, Hyderabad, was used to train the data using the Wavelet Neural Network (WNN) and to apply the learned parameters to the test data, and NeuroShell was used for the MLP. A brief description of the techniques used for prediction of the target variable is given below.

3.2.1 SVM: The SVM is a powerful learning algorithm based on recent advances in statistical learning theory (Vapnik, 1998). SVMs are learning systems that use a hypothesis space of linear functions in a high-dimensional space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory (Cristianini & Shawe-Taylor, 2000). SVMs have recently become one of the popular tools for machine learning and data mining and can perform both classification and regression. SVM uses a linear model to implement non-linear class boundaries by mapping input vectors non-linearly into a high-dimensional feature space using kernels. The training examples that
are closest to the maximum margin hyperplane are called support vectors. All other training examples are irrelevant for defining the binary class boundaries. The support vectors are then used to construct an optimal linear separating hyperplane (in the case of pattern recognition) or a linear regression function (in the case of regression) in this feature space. The support vectors are conventionally determined by solving a quadratic programming (QP) problem. SVMs have the following advantages: (i) they are able to generalize well even if trained with a small number of examples and (ii) they do not assume prior knowledge of the probability distribution of the underlying dataset. SVM is simple enough to be analyzed mathematically. In fact, SVM may serve as a sound alternative combining the advantages of conventional statistical methods that are more theory-driven and easy to analyze and machine learning methods that are more data-driven, distribution-free and robust. Recently, SVMs have been used in financial applications such as credit rating, time series prediction and insurance claim fraud detection (Vinaykumar et al., 2008). In our research, we used two SVM learner models for predictive purposes. First we used the SVM Regression model as the learner function and then used the Weka predictor to get the results. We found the correlation coefficient to be 0.8889 and the root relative squared error to be 48.03%. In the case of SMO (the sequential minimal optimization algorithm) for training a support vector regression model, we replaced the learner function with the SMOreg function. This implementation globally replaces all missing values and transforms nominal attributes into binary ones. It also normalizes all attributes by default. Here we found the correlation coefficient to be 0.8884 and the root relative squared error to be 47.98%.

3.2.2 Additive Regression and K-star: Additive Regression is another classifier used in Weka that enhances the performance of a regression base classifier. Each iteration fits a model to the residuals left by the classifier on the previous iteration. Prediction is accomplished by adding the predictions of each classifier. Reducing the shrinkage (learning rate) parameter helps prevent overfitting and has a smoothing effect, but increases the learning time. K-star, on the other hand, is an instance-based classifier: the class of a test instance is based upon the classes of those training instances similar to it, as determined by some similarity function. It differs from other instance-based learners in that it uses an entropy-based distance function. These techniques are used in the same way as the SVM Regression and SMO Regression learners with the Weka predictor. For Additive Regression, we found the correlation coefficient to be 0.895, the root mean squared error 3062.19 and the root relative squared error 44.36%. For K-star, we found the correlation coefficient to be 0.9102, the root mean squared error 3203.57 and the root relative squared error 46.41%.
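The study ran these learners through Knime's Weka nodes. Purely for illustration, the sketch below shows a roughly equivalent experiment in Python with scikit-learn, with SVR as the SVM regressor and gradient boosting standing in for Weka's Additive Regression; the library choice, model settings, metric helper and synthetic data are our own assumptions, not the configuration used in this report.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    def root_relative_squared_error(y_true, y_pred):
        # RRSE: model error relative to simply predicting the mean of y_true.
        num = np.sum((y_true - y_pred) ** 2)
        den = np.sum((y_true - y_true.mean()) ** 2)
        return 100.0 * np.sqrt(num / den)

    # Synthetic stand-in for the 89-customer dataset: 6 predictors, 1 target.
    rng = np.random.default_rng(0)
    X = rng.gamma(shape=2.0, scale=1500.0, size=(89, 6))
    y = 0.6 * X[:, 5] + 0.3 * X[:, 4] + rng.normal(0.0, 500.0, size=89)
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=65, random_state=1)

    models = {
        "SVR (RBF kernel)": make_pipeline(StandardScaler(), SVR(C=1000.0, epsilon=10.0)),
        "Gradient boosting (additive regression analogue)":
            GradientBoostingRegressor(learning_rate=0.1, n_estimators=200),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, round(root_relative_squared_error(y_test, model.predict(X_test)), 1))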
3.2.3 MLP: The Multilayer Perceptron (MLP) is one of the most common neural network structures, as it is simple and effective, and has found a home in a wide assortment of machine learning applications. An MLP starts as a network of nodes arranged in three layers - the input, hidden, and output layers. The input and output layers serve as nodes to buffer input and output for the model, respectively, and the hidden layer serves to provide a means for input relations to be represented in the output. Before any data is passed to the network, the weights of the nodes are random, which has the effect of making the network much like a newborn's brain: developed but without knowledge. MLPs are feed-forward neural networks trained with the standard back-propagation algorithm. They are supervised networks, so they require a desired response to be trained. They learn how to transform input data into a desired response, so they are widely used for pattern classification and prediction. A multilayer perceptron is made up of several layers of neurons, with each layer fully connected to the next one. With one or two hidden layers, they can approximate virtually any input-output map. They have been shown to yield accurate predictions in difficult problems (Rumelhart, Hinton, & Williams, 1986, chap. 8). In our research, we used NeuroShell 2 (Release 4.0) to determine the results. For learning purposes we set the learning rate to 0.5, the momentum rate to 0.1 and the scale function to linear [-1,1] to get the best results. We found the root relative squared error to be 43.8%, which was the least among all the methods used, as we will see later.

3.2.4 WNN: The word wavelet is due to Grossmann et al. (1984). Wavelets are a class of functions used to localize a given function in both space and scaling (http://mathworld.wolfram.com/wavelet.html). They have advantages over traditional Fourier methods in analyzing physical situations where the signal contains discontinuities and sharp spikes. Wavelets were developed independently in the fields of mathematics, quantum physics, electrical engineering and seismic geology. Interchanges between these fields during the last few years have led to many new wavelet applications such as image compression, radar and earthquake prediction. A family of wavelets can be constructed from a function \psi(x), known as the mother wavelet, which is confined to a finite interval. Daughter wavelets \psi^{a,b}(x) are then formed by translation (b) and dilation (a). Wavelets are especially useful for compressing image data. An individual wavelet is defined by

\psi^{a,b}(x) = |a|^{-1/2}\,\psi\left(\frac{x-b}{a}\right)

In the case of non-uniformly distributed training data, an efficient way of solving the learning problem is by learning at multiple resolutions. Wavelets, in addition to forming an orthogonal basis, are capable of explicitly representing the behaviour of a function at various
resolutions of the input variables. Consequently, a wavelet network is first trained to learn the mapping at the coarsest resolution level. In subsequent stages, the network is trained to incorporate elements of the mapping at higher and higher resolutions. Such hierarchical, multi-resolution learning has many attractive features for solving engineering problems, resulting in a more meaningful interpretation of the resulting mapping and more efficient training and adaptation of the network compared to conventional methods. The wavelet theory provides useful guidelines for the construction and initialization of networks and, consequently, the training times are significantly reduced (http://www.ncl.ac.uk/pat/neural-networks.html). Wavelet networks employ activation functions that are dilated and translated versions of a single function defined on the d-dimensional input space, where d is the input dimension (Zhang, 1997). This function, called the mother wavelet, is localized both in the space and frequency domains (Becerra, Galvao and Abou-Seads, 2005). The wavelet neural network (WNN) was proposed as a universal tool for functional approximation, which shows surprising effectiveness in solving the conventional problem of poor convergence or even divergence encountered in other kinds of neural networks; it can dramatically increase convergence speed (Zhang et al., 2001). The WNN consists of three layers, namely the input layer, the hidden layer and the output layer. Each layer is fully connected to the nodes in the next layer. The numbers of input and output nodes depend on the number of inputs and outputs present in the problem. The number of hidden nodes is a user-defined parameter depending on the problem. WNN is implemented here with the Gaussian wavelet function. The original training algorithm for a WNN is as follows (Zhang et al., 2001):

1) Specify the number of hidden nodes required. Randomly initialize the dilation and translation parameters and the weights of the connections between the input and hidden layers and between the hidden and output layers.

2) The output value for sample k, k = 1, 2, ..., np, is computed as follows:

V_{k} = \sum_{j=1}^{nhn} W_{j}\, f\left( \frac{\sum_{i=1}^{nin} w_{ij}\, x_{i}^{k} - b_{j}}{a_{j}} \right)    (1)

where nin is the number of input nodes, nhn is the number of hidden nodes and np is the number of samples. In (1), when f(t) is taken as the Morlet mother wavelet it has the following form:

f(t) = \cos(1.75\,t)\,\exp(-t^{2}/2)    (2)
And when f(t) is taken as the Gaussian wavelet it becomes

f(t) = \exp(-t^{2})    (3)

3) Reduce the prediction error by updating W_{j}, w_{ij}, a_{j} and b_{j} (see formulas (4)-(7)). Thus, in training the WNN, the gradient descent algorithm with a momentum term is employed:

\Delta W_{j}(t+1) = -\eta\,\frac{\partial E}{\partial W_{j}} + \alpha\,\Delta W_{j}(t)    (4)

\Delta w_{ij}(t+1) = -\eta\,\frac{\partial E}{\partial w_{ij}} + \alpha\,\Delta w_{ij}(t)    (5)

\Delta a_{j}(t+1) = -\eta\,\frac{\partial E}{\partial a_{j}} + \alpha\,\Delta a_{j}(t)    (6)

\Delta b_{j}(t+1) = -\eta\,\frac{\partial E}{\partial b_{j}} + \alpha\,\Delta b_{j}(t)    (7)

where the error function can be taken as

E = \frac{1}{2} \sum_{k=1}^{np} \left( \frac{V_{k} - \hat{V}_{k}}{\hat{V}_{k}} \right)^{2}    (8)

Here \hat{V}_{k} denotes the desired output for sample k, and \eta and \alpha are the learning and momentum rates respectively.

4) Return to step 2; the process is continued until E satisfies the given error criterion, and the training of the WNN is then complete.

Some problems exist in the original WNN, such as slow convergence, entrapment in local minima and oscillation (Pan et al., 2008). Variants such as the differential evolution trained wavelet neural network (DEWNN) have been proposed to resolve these problems. In our research, we used software made by Chauhan et al. (2009) for DEWNN. The software was initially made for classification purposes; we changed the software code from classification to regression and used it for our problem. We set the weight factor to 0.95, the convergence criterion to 0.00001, the crossover factor to 0.95, the population size to 60, the number of hidden nodes to 20, the maximum weight to 102 and the minimum weight to -102 to find the optimum solution. We found the test-set normalized root mean square error to be 0.928441 and the root relative squared error to be 111.2%, which was the highest amongst all the results.
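To make equations (1)-(8) concrete, the following Python sketch implements a single-output wavelet network with the Gaussian wavelet and one plain gradient-descent update with momentum. It is a minimal illustration of the formulas above, not the DEWNN software actually used in this study; the layer sizes, learning rate and synthetic data are assumptions, and only the output-weight update of equation (4) is shown, the updates (5)-(7) being analogous.

    import numpy as np

    rng = np.random.default_rng(0)
    nin, nhn, n_samples = 6, 8, 65            # input nodes, hidden nodes, samples

    # Step 1: random initialisation of the weights, dilations a_j and translations b_j.
    W = rng.normal(size=nhn)                  # hidden-to-output weights W_j
    w = rng.normal(size=(nhn, nin))           # input-to-hidden weights w_ij
    a = np.ones(nhn)                          # dilation parameters a_j
    b = rng.normal(size=nhn)                  # translation parameters b_j

    def f(t):
        return np.exp(-t ** 2)                # Gaussian wavelet, equation (3)

    def forward(X):
        # Equation (1): V_k = sum_j W_j * f((sum_i w_ij x_i^k - b_j) / a_j)
        T = (X @ w.T - b) / a                 # shape (n_samples, nhn)
        return f(T) @ W, T

    X = rng.normal(size=(n_samples, nin))     # synthetic inputs (illustration only)
    target = rng.uniform(1.0, 2.0, size=n_samples)   # synthetic desired outputs V_hat

    eta, alpha = 0.01, 0.5                    # learning rate eta and momentum rate alpha
    delta_W_prev = np.zeros_like(W)           # previous update, for the momentum term

    V, T = forward(X)
    err = V - target
    E = 0.5 * np.sum((err / target) ** 2)     # error function, equation (8)

    # Equation (4): update the output weights W_j; the updates for w_ij, a_j and b_j
    # in equations (5)-(7) follow the same pattern with their own gradients.
    grad_W = f(T).T @ (err / target ** 2)     # dE/dW_j
    delta_W = -eta * grad_W + alpha * delta_W_prev
    W = W + delta_W
    print("E before update:", round(E, 4))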
3.2.5 CART: Decision trees form an integral part of machine learning, an important subdiscipline of artificial intelligence. Almost all decision tree algorithms are used for solving classification problems; however, algorithms like CART solve regression problems also. Decision tree algorithms induce a binary tree on given training data, resulting in a set of if-then rules. These rules can be used to solve the classification or regression problem. CART (http://www.salford-systems.com) is a robust, easy-to-use decision tree tool that automatically sifts large, complex databases, searching for and isolating significant patterns and relationships. CART uses recursive partitioning, a combination of exhaustive searches and intensive testing techniques, to identify useful tree structures in the data. This discovered knowledge is then used to generate a decision tree, resulting in reliable, easy-to-grasp predictive models in the form of if-then rules. CART is powerful because it can deal with incomplete data and multiple types of features (floats, enumerated sets) both in the input features and the predicted features, and the trees it produces contain rules which are humanly readable. Decision trees contain a binary question (with a yes/no answer) about some feature at each node in the tree. The leaves of the tree contain the best prediction based on the training data. Decision lists are a reduced form of this, where the answer to each question leads directly to a leaf node. A tree's leaf node may be a single member of some class, a probability density function (over some discrete class), a predicted mean value for a continuous feature, or a Gaussian (mean and standard deviation for a continuous value). The key elements of a CART analysis are a set of rules for: (i) splitting each node in the tree, (ii) deciding when the tree is complete, and (iii) assigning each terminal node a class outcome (or predicted value, for regression). In our research, we used Salford Predictive Miner (SPM) to apply CART for prediction purposes. We trained the model using least absolute deviation on the training data. We found that the root mean squared error was 3367.53 and the total number of nodes was 5; however, on growing the tree from 5 to 6 nodes, we found better results. The root mean squared error changed to 3107.13 and the root relative squared error to 45.38%, which is very close to MLP. Figure 1 shows the plot of relative error vs. the number of nodes. We see that we got the optimum results on growing the tree from 5 nodes to 6 nodes.

Figure 1: CART: Plot of relative error vs. number of nodes
Figure 2: CART: Plot of percent error vs. terminal nodes
It was also seen from the results that, when the optimum number of nodes was kept at 5, 19 out of 24 customers were put in node 1, 4 in node 3 and 1 in node 6. We also found that the root mean squared error was 2892.6 for the 19 customers in node 1, which is better than the overall error. However, the overall increase in error was caused by misclassification, or a high error rate, in splitting customers into node 4 and node 6. In the case of growing the optimum number of nodes to 6, we found that 14 customers were split into node 1, 5 into node 2, 4 into node 4 and 1 into node 6. The RMSE in node 1 was 1846.89, which was much lower than the total RMSE of 3107.13. One obvious conclusion one can draw is that CART is more useful than the other methods for prediction because of its rules, which give companies the flexibility to decide which customer to put in which node and also to choose the optimum number of nodes for their analysis.

Figure 3: CART: Tree details showing the splitting rules at each node

A summary of the rules is given as:

1. if (CM_JULY_DEC95 <= 2278.66 && CM_JAN_JUNE95 <= 3534.06) then y = 1511.64
2. if (CM_JULY_DEC95 <= 2278.66 && CM_JAN_JUNE95 > 3534.06 && CM_JAN_JUNE95 <= 12252.1) then y = 5932.26
3. if (CM_JAN_JUNE95 <= 12252.1 && CM_JULY_DEC95 > 2278.66 && CM_JULY_DEC95 <= 2464.75) then y = 24996
4. if (CM_JAN_JUNE95 <= 12252.1 && CM_JULY_DEC95 > 2464.75 && TOTAL_FREQUENCY <= 14) then y = 6350.25
5. if (CM_JAN_JUNE95 <= 12252.1 && CM_JULY_DEC95 > 2464.75 && TOTAL_FREQUENCY > 14) then y = 19044.4
6. if (CM_JAN_JUNE95 > 12252.1) then y = 38126.7

where y is the predicted median contribution margin for the terminal node.
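The rule set above can be applied directly as code; the short Python function below is a literal transcription of rules 1-6, with the thresholds exactly as reported, so a firm could score a new customer from the three inputs the rules use. Only the function wrapper and the example call are our own.

    def cart_predict(cm_july_dec95, cm_jan_june95, total_frequency):
        """Predicted Jan-June 1996 contribution margin from the CART rules above."""
        if cm_jan_june95 > 12252.1:                                   # rule 6
            return 38126.7
        if cm_july_dec95 <= 2278.66:
            if cm_jan_june95 <= 3534.06:                              # rule 1
                return 1511.64
            return 5932.26                                            # rule 2
        if cm_july_dec95 <= 2464.75:                                  # rule 3
            return 24996.0
        if total_frequency <= 14:                                     # rule 4
            return 6350.25
        return 19044.4                                                # rule 5

    # Hypothetical customer: moderate recent spend, 9 purchases in the window.
    print(cart_predict(cm_july_dec95=3000.0, cm_jan_june95=5000.0, total_frequency=9))
    # -> 6350.25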
4. Results and Comparison of Models: We have used various machine learning techniques to calculate the future value of 24 customers from a sample of 89 customers: SVM, WNN, Additive Regression and the K-star method in Knime using the Weka predictor, CART in SPM and MLP in NeuroShell. We found that MLP gave the least error amongst all these models, but we find CART to be more useful, as it is more helpful in taking decisions through its splitting rules and also predicts more accurately for a greater section of the test sample by splitting the sample into various nodes. Companies can make better decisions with the help of these rules and the segmentation technique in CART. A detailed summary of the final results of the competing models is given in Table 2. One limitation of our study is that we have predicted the future value of only the next time period. Besides this, the error percentage is relatively high because of the small dataset we have. We believe that these models will be able to perform better in the case of a large dataset with more input variables, including customer demographics, customer behaviour etc.

Table 2: Comparison of Competing Models

Model         | Correlation coefficient | Root mean squared error | Mean absolute error | Root relative squared error
SVMreg        | 0.8889                  | 3315.25                 | 2513.03             | 48.0%
SMOreg        | 0.8884                  | 3311.98                 | 2499.48             | 47.9%
Additive Reg. | 0.8950                  | 3062.19                 | 2203.76             | 44.3%
K-star        | 0.9102                  | 3203.57                 | 2233.21             | 46.4%
MLP           | NA                      | 2986.77                 | 2107.10             | 43.8%
CART          | NA                      | 3107.13                 | 2343.82             | 45.3%

Figure 4: Graph of root relative squared error (%) vs. model (MLP, Additive Reg., CART, K-Star, SMOreg, SVMreg)
5. Conclusion and Directions of Future Research:
In this paper we have presented a review of various approaches and modelling techniques for determining Customer Lifetime Value. We have also covered the traditional techniques used to calculate customer loyalty and found that CLV is a better metric than these measures. The most common approaches for measuring CLV are the aggregate approach and the individual approach, and the choice between them depends on the type of data available and the kind of result a firm wants. Further, we have reviewed various modelling techniques for determining CLV, including RFM models, computer science and stochastic models, econometric models, diffusion models, and relationship-level and service-level models. The techniques most frequently applied to estimate CLV parameters, or the relationships between them, include Pareto/NBD models, decision trees, artificial neural networks, genetic algorithms and support vector machines.
We have also presented a study of measuring CLV by means of various machine learning techniques, with emphasis on capturing the non-linear pattern in the data, which was available for a set of 89 customers with a two-year transaction history. We have used Classification and Regression Trees (CART), Support Vector Machines (SVM), SVM using SMO, Additive Regression, the K-Star method, Multilayer Perceptron (MLP) and Wavelet Neural Network (WNN) to calculate the future value of 24 customers. Although MLP gives the best result among these models, we would still recommend using CART to calculate CLV, as it segments the customers into nodes and predicts more precisely for a larger segment of the test customers. Besides, the splitting rules help a firm understand why a customer falls into a particular segment and hence derive more profit from him or her.
The main limitation of our study has been that the future value of customers is projected only up to the next period, mainly due to the limitation of the dataset we had; this also resulted in fairly high error rates even for the best models. These limitations can be overcome by using datasets that give more information about customer behaviour, demographics, etc. A larger dataset would also allow better predictions, as the training parameters can be estimated more reliably. For better estimation on small datasets, techniques such as k-fold cross-validation, which we have not covered, can be taken up as an area of future research. We have also not given much emphasis to feature selection or to the relationships between the input variables used to calculate CLV. Producing better results with an integrated approach on this dataset is again an area of future research.
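As a pointer for the cross-validation direction mentioned above, the fragment below sketches how k-fold cross-validation could be set up on a small customer dataset such as ours; the estimator, file name and column names are illustrative assumptions, since cross-validation was not performed in the present study.

    # Sketch of 5-fold cross-validation as a possible future extension.
    import pandas as pd
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.svm import SVR

    data = pd.read_csv("customers.csv")                          # hypothetical file name
    X = data[["CM_JAN_JUNE95", "CM_JULY_DEC95", "TOTAL_FREQUENCY"]]
    y = data["FUTURE_VALUE"]

    # With 89 customers, 5 folds keep roughly 18 customers in each test fold.
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    scores = cross_val_score(SVR(kernel="rbf"), X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    print("RMSE per fold:", -scores)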