Wallet estimation models
Saharon Rosset, Claudia Perlich, Bianca Zadrozny, Srujana Merugu, Sholom Weiss, Rick Lawrence
Predictive Modeling Group, IBM T. J. Watson Research Center, Yorktown Heights, NY

Abstract

The wallet of a customer is defined as the total amount this customer can spend in a specific product category. This is a vital piece of information for planning marketing and sales efforts. We discuss the important problem of customer wallet estimation, emphasizing the use of predictive modeling technologies to generate useful estimates while minimizing reliance on primary research. We suggest several customer wallet definitions and corresponding evaluation approaches. Our main contribution is in presenting several new predictive modeling approaches which allow us to estimate customer wallets, despite the fact that these are typically unobserved. We present empirical results on the success of these modeling approaches, using a dataset of IBM customers.

1 Introduction

The total amount of money a customer can spend on a certain product category is a vital piece of information for planning and managing sales and marketing efforts. This amount is usually referred to as the customer's wallet (also called opportunity) for this product category. There are many possible uses for wallet estimates, including straightforward targeting of sales force and marketing actions towards large-wallet customers and prospects. In a more sophisticated sales and marketing environment, the combination of classical propensity models for a particular product category with knowledge of the wallet for the same category can direct the sales efforts: it allows a company to market not only to customers or potential customers with a large wallet, but also to those with a high probability of buying specific products in the category. By combining the customer wallet estimates with data on how much they spend with a particular seller, we can calculate the share-of-wallet that the seller
has of each customer for a given product category. This information allows the seller to target customers based on their growth potential, a combination of total wallet and share-of-wallet. The classical approach of targeting customers that have historically generated large amounts of revenue for the company (known as lifetime value modeling, see e.g. [9]) does not give enough importance to customers with a large wallet but small share-of-wallet, which are the ones with presumably the highest potential for revenue growth. Share-of-wallet is also important for detecting partial defection or silent attrition, which occurs when customers increase their spending in a given category without increasing the amount purchased from a particular company [7].

In certain industries, customer wallets can be easily obtained from public data. For example, in the credit card industry, the card-issuing companies can calculate the wallet size and respective share-of-wallet using credit records from the three major credit bureaus [2]. For most industries, however, no public wallet information is available at the customer level. In this case, there are two approaches used in practice for obtaining wallet estimates:

1. Top-down: starts from a public aggregate estimate for the overall industry opportunity in a given country and splits this estimate across the individual customers using heuristics based on customer characteristics. For example, if the customers are companies, the overall opportunity could be divided among the companies proportionally to their number of employees.

2. Bottom-up: estimates the wallet directly at the customer level, using heuristics or predictive models based on customer information. A common approach is to obtain actual wallet information for a random subset of customers/prospects through primary research. A model is then developed based on this data to predict the wallet for the other customers/prospects.

Although customer wallet and share-of-wallet have been recognized as important customer value metrics in the marketing and services literature for a number of years [7, 8, 3, 4], there is not much published work on actual wallet modeling. The few references available are limited to commercial white papers from marketing analytics consulting companies [2] and a very recent thesis proposal [1]. The white paper from Epsilon [2] describes at a very high level the methodology that they use to estimate wallets for both customers and prospects in a given product category. Epsilon uses a bottom-up approach where a survey provides the self-reported category wallet for a sample of customers. The self-reported wallet is used to develop a multivariate linear regression that can be applied to the whole market or customer base. They do not describe which variables are used in the model and do not provide experimental results. In [1], the author proposes a Multivariate Latent Factor Model that a bank can use to impute a customer's holdings of financial products outside the bank, based on a combination of internally available data and survey information about customers' holdings with competitors.
In this paper, we address the issue of predicting the wallet for IBM customers that purchase Information Technology (IT) products such as servers, software, and services. Our models are bottom-up, with explanatory features drawn from historical transaction data as well as firmographic data taken from external sources like Dun & Bradstreet. We take a predictive modeling approach and attempt to minimize the dependence of our models on primary research data.

In Section 2, we address the exact meaning of wallet, surveying several different definitions and their implications. We discuss evaluation of wallet models in Section 3; this is challenging because the wallet is not observed in the larger customer population. In Section 4 we introduce our novel modeling approaches and demonstrate their results on a data set of IBM customers.

2 Definitions of customer wallet

The first question we need to address is: what exactly is meant by a customer's wallet? We discuss this in the context of IBM as a seller of IT products to a large collection of customers for whom we wish to estimate the wallet. We have considered three nested definitions:

1. The total spending by this customer in the relevant area. In IBM's case, this would correspond to the total IT spending by the customer. We denote this wallet definition by TOTAL.

2. The total attainable (or served) opportunity for the customer. In IBM's case, this would correspond to the total spend by the customer in IT areas covered by IBM's products and services. While IBM serves all areas of IT spending (software, hardware and services), its products do not necessarily cover all needs of companies in each of these areas. Thus, the served opportunity is smaller than the total IT spending. We denote this definition by SERVED.

3. The realistically attainable wallet, as defined by what the best customers spend. This may be different from SERVED, because it may just not be practically realistic to get individual customers to buy every single served product from IBM. Defining the best customers is key to correctly applying this definition. In what follows, we define best customers in a relative sense, as ones who are spending as much as we could hope them to, given a stochastic model of spending. Thus, a good customer is one whose spending is at a high percentile of its spending distribution. We describe below some approaches that allow us to build models that predict such a high percentile and, perhaps more importantly, evaluate models with regard to the goal of predicting percentiles of individual spending distributions. We denote this definition by REALISTIC.
In principle, the three definitions should line up as REALISTIC ≤ SERVED ≤ TOTAL. An important caveat is that all three definitions could be affected by marketing actions. Thus, a company could theoretically be convinced by diligent marketing activity to buy an IT product it was not planning to spend any money on. This could affect the value of all three wallet definitions. For the rest of this paper, we ignore the possible effect of such marketing actions, and essentially assume that our marketing actions can affect only wallet share, not the actual wallet. Our current challenge is to model these fixed wallet values.

We concentrate our interest on REALISTIC and SERVED, which we view as the two more operational definitions. The modeling approaches we discuss below are for the REALISTIC wallet, but we are also developing an approach to modeling SERVED. We usually know the total company sales (revenue) of potential customers, and the total amount of historical sales made by IBM to these customers (see Section 4 for details). In principle, the relation IBM SALES ≤ WALLET ≤ COMPANY REVENUE should hold for every company and all three measures (for the REALISTIC definition of wallet, we actually expect IBM SALES > REALISTIC WALLET for a small percentage of companies).

3 Evaluation approaches

A key issue in any wallet estimation problem is: how can we evaluate the performance of candidate models? Since wallet is usually an unobservable quantity, it is critical to have a clear idea how model performance can be measured, at least approximately, to give us an idea of future performance.

Evaluation using survey values. The first and most obvious approach is to obtain actual values for customer wallets, typically through a survey, where customers (or potential customers) are called up and asked about their total IT spend for the previous year. We have at our disposal one such survey, encompassing 2000 IBM customers or potential customers. However, the attempts by us, as well as other groups, to use this survey for model evaluation have not been particularly successful. We attribute this to three main reasons:

1. Of our three wallet definitions, these self-reported wallets correspond to TOTAL, which is the least relevant definition for marketing purposes.

2. Companies have no obligation to carefully consider their answers to such surveys. The exact IT spend may be difficult to calculate, and thus even companies that do not intend to mis-report their wallet may give vastly erroneous numbers.

3. Getting surveys which represent unbiased samples of the customer universe is often extremely difficult, due to sampling issues, response bias issues, etc.
The prohibitive cost of surveys compounds these difficulties, and increases our motivation to design methodologies that do not depend on primary research.

Evaluation of high-level indicators. The second common approach is to evaluate wallet models by comparing high-level summaries of model predictions to known numbers, such as industry-level IT spending or total spending with IBM, and thus evaluating the high-level performance of models. Such approaches include:

- Comparing the total wallet in an industry to a known industry opportunity based on econometric models;
- Comparing the total wallet to the total spending with IBM for large segments of companies, checking against some reasonable numbers;
- Comparing the total wallet in a segment to the total sales of companies in that segment.

These evaluation approaches suffer from lack of specificity, as they do not evaluate the performance of the models on individual companies, but rather concentrate on comparing aggregated measures. A more specific approach in the same spirit is based on the statistics of ordering between model predictions and known quantities such as COMPANY REVENUE and IBM SALES. If many of the wallet predictions are smaller than IBM SALES for the same companies, or bigger than COMPANY REVENUE, this is evidence of the inadequacy of the model being evaluated.
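As an illustration, the following sketch (our own, not from the paper; the array names and numbers are hypothetical) computes how often a set of wallet predictions violates the sanity ordering IBM SALES ≤ WALLET ≤ COMPANY REVENUE:

```python
import numpy as np

def ordering_violations(wallet_pred, ibm_sales, company_revenue):
    """Fractions of firms whose predicted wallet falls outside the
    sanity bounds IBM_SALES <= WALLET <= COMPANY_REVENUE."""
    below = np.mean(wallet_pred < ibm_sales)        # wallet below own IBM sales
    above = np.mean(wallet_pred > company_revenue)  # wallet above total revenue
    return below, above

# Toy example with made-up numbers for three firms.
wallet = np.array([120.0, 40.0, 900.0])
sales = np.array([100.0, 50.0, 300.0])
revenue = np.array([1000.0, 400.0, 800.0])
print(ordering_violations(wallet, sales, revenue))  # (0.33..., 0.33...)
```

Large values of either fraction would flag the model as inadequate under this criterion.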
[Figure 1: Quantile loss functions for various quantiles. Axes: residual (x) vs. L_p loss (y), with curves for p = 0.2, p = 0.5 (absolute loss) and p = 0.8.]

Quantile loss based evaluation. A different approach is to search for evaluation loss functions which evaluate the performance of the model in a way that is directly related to the goal of wallet estimation. Despite the fact that the wallet is unobservable, this evaluation turns out to be possible when the REALISTIC definition of wallet is used, using the quantile regression loss function [6]: given an observed IBM SALES number y and a predicted REALISTIC wallet ŷ, we define the quantile loss function for the pth quantile to be:

    L_p(y, ŷ) =  p (y − ŷ)         if y ≥ ŷ
                 (1 − p) (ŷ − y)    otherwise          (1)

In Figure 1, we plot the quantile loss function for p ∈ {0.2, 0.5, 0.8}. With p = 0.5 this is just absolute error loss. Expected quantile loss is minimized by correctly predicting the conditional pth quantile of the response distribution. That is, if we fix a prediction point x, and define c_p(x) to be the pth quantile of the conditional distribution of y given x,

    P(y ≤ c_p(x) | x) = p,

then the loss function is optimized by correctly predicting c_p(x):

    arg min_c E(L_p(y, c) | x) = c_p(x).

With p = 0.5, the expected absolute loss is minimized by predicting the median, while with p = 0.8 we are in fact evaluating a model's ability to correctly predict the 80th percentile of the distribution of y given x, which is exactly the definition of the REALISTIC wallet! Thus, we have a loss function which allows us to directly evaluate model performance in estimating REALISTIC wallets. We will use this loss function L_p in the next section, both for model training and for model evaluation.
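For concreteness, here is a minimal implementation of this loss (our own sketch; the paper does not give code), together with a numerical check that, among constant predictions, the empirical pth quantile of a sample minimizes the average loss:

```python
import numpy as np

def quantile_loss(y, y_hat, p=0.9):
    """Average quantile loss L_p of equation (1)."""
    resid = y - y_hat
    return np.mean(np.where(resid >= 0, p * resid, (p - 1) * resid))

# Check: on a long-tailed synthetic "sales" sample, the 0.9 quantile gives
# the smallest average L_0.9 among several candidate constant predictions.
rng = np.random.default_rng(0)
y = rng.lognormal(mean=10.0, sigma=2.0, size=100_000)
for q in [0.5, 0.8, 0.9, 0.95]:
    c = np.quantile(y, q)
    print(q, quantile_loss(y, c, p=0.9))  # minimized at q = 0.9
```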
4 Modeling approaches

The data we use for modeling is comprised of two main components:

1. Firmographics. We derive this data from Dun & Bradstreet, which carries information on all businesses in the US and around the world. The data includes:
- Company size information: revenue and number of employees, as well as some information on the dynamics of change in these figures in recent years.
- Various levels of industrial classification: industrial sector, industry, sub-industry, etc.
- Company location: city, state, country.
- Company structure descriptors: legal status, location in corporate hierarchy (headquarters/subsidiary), etc.

2. Relationship variables. The IBM relationship variables consist of detailed information about historical purchases made by each IBM customer in the different product categories, as well as other interactions between IBM and its customers.

Our data set contains several tens of thousands of IBM customers (we omit the exact number for confidentiality reasons) who have had business with IBM in 2002 and 2003, together with their firmographics and their IBM relationship variables for those years, including their total spending with IBM. Our goal for this modeling exercise is to predict their 2004 wallet (although the models will eventually be applied to predict future-year wallets, of course). Since we have numerous examples, we leave out 20,000 examples for testing our models, and use the rest as data for training.

Throughout our analysis we transform all monetary variables, in particular the 2004 IBM SALES used as response, to the log scale, due to the fact that these numbers have very long-tailed distributions, which implies that evaluation would otherwise be dominated by the few largest companies in the data set. It has often been observed that monetary variables have exponential-type distributions and behave much better when log-transformed (cf. the often-cited Pareto rule).

We discuss here in detail two novel modeling approaches and their performance on this dataset. These approaches aim to model the REALISTIC wallet by utilizing existing machine learning approaches, in this case nearest neighbor modeling and quantile regression. We are also working on a third approach, which adapts a regression model to estimate either the SERVED or REALISTIC wallet by relying on some conditional independence assumptions. Details on this modeling approach will be given in future papers.

The REALISTIC wallet definition suggests that some subset of good customers already spend close to their total wallet with IBM. Under this assumption, we have a subset of observations for which the wallet is known and which could be used as a training sample to build a predictive model. It is, however, impossible to decide a priori which subset of firms are those good customers that spent their total wallet with IBM. Consider for instance the naive rule of selecting the top p% of firms with the largest absolute value of IBM SALES. This approach would simply identify the biggest firms. The problem is that the absolute value of the wallet is dependent on firmographics such as size and industry, as well as potentially on the company's historical relationship with IBM. Without knowing the effect of these variables on the wallet, it is impossible to identify these good customers.

K-nearest neighbor approach. A reasonable approach towards achieving this relativity effect is through k-nearest neighbor (k-NN) approaches, since the good customers would have higher IBM SALES than similar companies. We compare each company to its immediate neighbors in the (historical IBM relationship, firmographics) space, and consider the neighbors with the highest IBM SALES to be good customers who are close to spending their whole IT wallet with IBM. The main firmographic variables we use in our k-NN implementation are industry and company size (as indicated by the D&B employee and/or revenue numbers). We also use the previous year's (in this case, 2003) total IBM SALES. The wallet of a firm A is estimated by first ranking all other firms based on their similarity to A, selecting a neighborhood of the k firms that are most similar to A, and finally ranking those k firms by their IBM SALES. Following the initial argument that the best customers spent all their attainable wallet with IBM, we can predict a wallet as a statistic on the ranked IBM SALES of these k firms. A case can be made for taking the MAXIMUM as the appropriate statistic; however, this measure is very volatile and tends to result in overly optimistic predictions. A more practical approach is to take the REALISTIC definition of wallet and apply it directly, by taking the pth quantile of these neighbors as our estimate for firm A.
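A minimal sketch of this quantile nearest neighbor strategy follows (our own code, assuming a pandas DataFrame with hypothetical columns industry, sales_2003 and sales_2004; the paper's actual implementation is not published). For simplicity it scores each firm against peers in the same frame, rather than against a separate training set as in the paper:

```python
import numpy as np
import pandas as pd

def knn_wallet(df, k=50, p=0.9):
    """Quantile k-NN wallet estimates. A firm's neighbors are the same-industry
    firms with the closest log(2003 sales); the prediction is the p-th quantile
    of the neighbors' 2004 sales. Column names are illustrative."""
    industry = df["industry"].to_numpy()
    sales03 = df["sales_2003"].to_numpy(dtype=float)
    sales04 = df["sales_2004"].to_numpy(dtype=float)
    log03 = np.log1p(sales03)            # log scale, as used in the paper
    preds = np.empty(len(df))
    for i in range(len(df)):
        mask = industry == industry[i]
        mask[i] = False                  # a firm is not its own neighbor
        cand = np.flatnonzero(mask)
        if cand.size == 0:               # no same-industry peers at all:
            preds[i] = sales03[i]        # fall back to the naive last-year rule
            continue
        nearest = cand[np.argsort(np.abs(log03[cand] - log03[i]))[:k]]
        preds[i] = np.quantile(sales04[nearest], p)
    return preds

# Usage, assuming df holds one row per firm:
# df["wallet_2004_pred"] = knn_wallet(df, k=50, p=0.9)
```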
We have implemented this strategy and evaluated different neighborhood sizes and different variables for the similarity definition. By far the best neighborhood definition we found was using log(2003 IBM SALES) for the distance while grouping on the industry. In other words, A's neighbors are the companies in the same industry with the closest IBM SALES in 2003. In our experiments we used the p = 0.9 quantile of the 2004 IBM SALES in a company's neighborhood in the training data as the prediction of its 2004 wallet. We experimented with various neighborhood sizes, ranging from 10 to 400. Our evaluation on the test samples uses the quantile loss L_0.9 of equation (1) on the log scale. Figure 2 shows the mean loss incurred by these models on the test set observations (the solid line marked k-NN). A naive baseline model to compare these results to is simply predicting the 2003 IBM SALES as the 2004 wallet. Figure 2 also shows the quantile regression results discussed below. The model using 50 neighbors is clearly the best, and it achieves about 25% lower error on the holdout data than the naive model.

It is also interesting to investigate the residuals of these models. Figure 3 (left panel) shows a sample of 1000 test-set predictions of the model with k = 50 neighbors, plotted against the 2004 IBM SALES. We see that most of the predictions are higher than the sales, as we would expect from a wallet prediction model. In fact, the use of p = 0.9 in the quantile loss implies we would expect about 90% of predictions to be higher than the actual number. Two sets of observations stand out in this scatter plot:

- The ones with no 2004 IBM SALES, which form the vertical column of points on the extreme left. Despite these customers buying nothing from IBM in 2004, our model still predicts a non-zero wallet for all of them, for some even a substantial wallet (as high as $163K, after exponentiation). This is completely consistent with the concept of the REALISTIC wallet, which models what the customer could spend, as opposed to actual spending.

- Four companies that get a prediction of 0. These companies have no 2003 IBM SALES, and consequently neither do many of their neighbors in the training data.
In these four neighborhoods, it happens that the 90th percentile of 2004 IBM SALES among the neighbors was also 0, and thus so was the prediction. It should be noted that these four companies are not the only ones with no 2003 IBM SALES; the other such companies did get non-zero predictions, because their neighborhoods contained companies that did buy from IBM in 2004.

[Figure 2: Quantile loss of k-NN and quantile regression models in predicting the 2004 wallet, plotted against neighborhood size. Also shown are 95% confidence bars on the performance.]

The right panel of Figure 3 shows the histogram of the residuals, and we observe that about 86% of the test set observations (17183/20000) generate predictions that are higher than the actual 2004 IBM SALES. As discussed in Section 3, we can consider this a high-level summary which indicates the reasonable performance of our approach.

[Figure 3: A sample of model predictions from the 50-NN model, plotted against 2004 IBM SALES (left), and a histogram of test set residuals from this model, i.e., model prediction minus IBM SALES (right), demonstrating that about 90% of predictions are higher than actual sales, as we would expect.]
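This "fraction of predictions above actual sales" check is a convenient one-line diagnostic for any wallet model. A small self-contained sketch (ours, with synthetic stand-in data; not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
actual = rng.lognormal(mean=10.0, sigma=2.0, size=20_000)  # stand-in 2004 sales
pred = np.full_like(actual, np.quantile(actual, 0.9))      # stand-in wallet preds
coverage = np.mean(pred > actual)  # share of firms predicted above actual sales
print(f"{coverage:.1%}")           # close to 90% for a well-calibrated 0.9-quantile model
```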
Quantile regression approach. Another way of accounting for the relative definition of good customers is to treat this as a predictive modeling problem, but model the pth quantile of the conditional spending distribution of y (future IBM SALES) given x (firmographics and relationship variables). As we have seen in Section 3, the quantile regression loss function L_p of equation (1) is optimized by predicting this pth quantile. This loss function can be used for training as well. This is the method typically known as quantile regression [5, 6]. Given a set of training data {(x_i, y_i)}, i = 1, ..., n, linear quantile regression seeks the linear model in x which minimizes the empirical quantile error L_p:

    β̂ = arg min_β Σ_i L_p(y_i, β · x_i)

The R package quantreg implements linear quantile regression and is the one we used for our experiments below.
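The paper's experiments use R's quantreg; for readers working in Python, the sketch below shows a comparable fit using statsmodels' QuantReg, which minimizes the same empirical L_p objective. The data and feature names here are our own synthetic illustration, not the paper's:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-in for the training data: log 2003 sales and log employees.
rng = np.random.default_rng(2)
n = 5_000
X = pd.DataFrame({
    "log_sales_2003": rng.normal(10.0, 2.0, size=n),
    "log_employees": rng.normal(4.0, 1.0, size=n),
})
y = X["log_sales_2003"] + rng.normal(0.0, 1.0, size=n)  # synthetic log 2004 sales

X = sm.add_constant(X)               # add an intercept column
fit = sm.QuantReg(y, X).fit(q=0.9)   # linear model for the 0.9 conditional quantile
wallet_log = fit.predict(X)          # predicted log REALISTIC wallet
print(fit.params)
```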
It should be noted that due to issues such as over-fitting and bias, it is not guaranteed that using the correct loss function for modeling will result in a good prediction model for future data with regard to the same loss function. It can be shown, however, that this loss function is consistent, i.e., with infinite training data and correct model specification it will indeed result in the pth quantile of P(y | x) as the model predictions.

We have experimented with several quantile regression models, varying the choice of independent variables x chosen for the model. With our large training data set, we concluded that the saturated main-effects model, with no interactions, is the best one to use. This model included numerous firmographic and IBM relationship variables, with a total of 45 degrees of freedom. The test set performance of this model, using the L_0.9 loss function, is shown in Figure 2. We see that its test loss of 0.280 is comparable to, though slightly higher than, that of the best k-NN model (k = 50). Given the very large size of our test set, this difference is marginally significant (t-test p-value 0.02), but it is clearly too small to be considered meaningful.

[Figure 4: Same plots as in Figure 3, for the quantile regression model. The conclusions are similar.]

Figure 4 shows the residual analysis of the test set performance of the quantile regression model, in the same spirit as Figure 3. The conclusions are also similar. Note that there are no 0 predictions from this model, but the cloud of low predictions now contains nine points, as compared to four for the nearest neighbor model. This is due to the fact that the 2003 IBM SALES play a dominant role in determining the model prediction. Thus, all companies with no 2003 IBM SALES get very low predictions, although none are exactly zero. Overall, the percentage of test set data points for which the prediction is above the 2004 IBM SALES is almost exactly 90% (18009/20000). This supports our conclusion that our model is doing what we expect and suffers little or no overfitting.

5 Summary

We have proposed three definitions for customer wallet: TOTAL, SERVED and REALISTIC, and argued that the last two are the ones that should be modeled. We have described some of the difficulties in evaluating performance of wallet models without relying on primary research results, which are expensive, of questionable quality and usually relevant only for TOTAL wallets. In this context, we have
suggested the use of the quantile regression loss function for evaluating REALISTIC wallet predictions. Our main focus is on developing predictive modeling approaches for SERVED and REALISTIC wallet estimation. We have proposed and discussed in detail two new methodologies for modeling the REALISTIC wallet: quantile nearest neighbor and quantile regression. Our experiments in applying these approaches to actual data show promising results. Our future work will report on our third methodology, the decomposition approach, which adapts a standard regression model to estimate either the SERVED or REALISTIC wallet by relying on some conditional independence assumptions.

Acknowledgements

We thank Alexey Ershov and John Waldes from IBM for useful discussions.

References

[1] Du, R. Thesis proposal, Fuqua School of Business, Duke University.
[2] Epsilon Data Management. Solving the Customer Value Puzzle.
[3] Garland, R. Share of wallet's role in customer profitability. Journal of Financial Services Marketing, volume 8, number 8.
[4] Keiningham, T. L., Perkins-Munn, T. and Evans, H. The Impact of Customer Satisfaction on Share-of-Wallet in a Business-to-Business Environment. Journal of Service Research, volume 6, number 1.
[5] Koenker, R. Quantile Regression. Econometric Society Monograph Series, Cambridge University Press.
[6] Koenker, R. and Hallock, K. Quantile Regression: An Introduction. Journal of Economic Perspectives.
[7] Malthouse, E. C. and Wang, P. Database Segmentation using Share of Customer. Journal of Database Marketing, volume 6, number 3.
[8] Pompa, N., Berry, J., Reid, J. and Webber, R. Adopting share of wallet as a basis for communications and customer relationship management. Interactive Marketing, volume 2, number 1.
[9] Rosset, S., Neumann, E., Eick, U. and Vatnik, N. Lifetime Value Models for Decision Support. Data Mining and Knowledge Discovery Journal, Vol. 7.
[10] Rosset, S., Perlich, C. and Zadrozny, B. Ranking-Based Evaluation of Regression Models. IEEE Conference on Data Mining 2005, to appear.