Myth or Fact: The Diminishing Marginal Returns of Variable Creation in Data Mining Solutions




Data mining practitioners will tell you that much of the real value of their work lies in the ability to derive and create new variables from the source information within a database or file. For example, the calculation of averages or totals over a specific timeframe or period represents information that is unlikely to be directly extracted from source files. Another good example is postal area variables based on the first character of the postal code. Although these examples seem simplistic, it is not uncommon for derived variables to comprise over 90% of the information within an analytical exercise.

From the data analyst's perspective, there is no limit to the number of derived variables that can be created; variable creation is confined only by the imagination of the analyst or practitioner. In theory, creating variables and adding new information should provide incremental benefit to any data mining solution, but as with any exercise or project, there are diminishing returns as one begins to explore the many possibilities and permutations that exist within the variable creation exercise. In this article, I will address this issue not in an academic manner but in the practical sense of whether it provides incremental business benefit to a given data mining solution. Specifically, I explore the impact of exploding the number of variables beyond our traditional techniques of variable creation.

The Traditional Techniques of Variable Creation

Our traditional techniques attempt to yield insights by creating variables in the following manner:

1. Binary variables (a yes/no outcome that represents the occurrence of some activity or event)
2. Mean/median or sum of a variable
3. Index/ordinal variables, whereby variable outcomes and values are placed into ranked groups. For example, age might be grouped into three outcomes, with 1 being under 30 years, 2 being 31-50, and 3 being 51+.
4. Change/velocity variables that look at how behaviour has changed over time.

Across these four areas, it is not unusual to discover that the analyst has created hundreds of variables for a given analytical exercise. But are there other transformations we should consider when building solutions? The notion behind looking at a whole new suite of additional variables is that this potential new information can provide incremental benefit and insight to a given solution. In our research, we looked at predictive models that had been built using our traditional approach to variable creation, as noted above; the sketch below illustrates these four techniques.
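As a minimal sketch of these four techniques, assuming a hypothetical pandas DataFrame of monthly customer transactions (the column names and data are illustrative, not from the article), the derivations might look like this:

import pandas as pd

# Hypothetical monthly transaction data (illustrative names and values).
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 2],
    "month":       [1, 2, 3, 1, 2, 3],
    "spend":       [120.0, 0.0, 85.0, 40.0, 55.0, 60.0],
    "age":         [28, 28, 28, 45, 45, 45],
})
df = df.sort_values(["customer_id", "month"])

# 2. Mean/median or sum of a variable, summarized per customer.
summary = df.groupby("customer_id").agg(
    avg_spend=("spend", "mean"),
    total_spend=("spend", "sum"),
    age=("age", "first"),
)

# 1. Binary variable: did any purchase activity occur (yes/no)?
summary["ever_purchased"] = (summary["total_spend"] > 0).astype(int)

# 3. Index/ordinal variable: age placed into ranked groups
#    (1 = under 30, 2 = 31-50, 3 = 51+).
summary["age_group"] = pd.cut(summary["age"], bins=[0, 30, 50, 200], labels=[1, 2, 3])

# 4. Change/velocity variable: most recent month's spend minus the prior month's.
summary["spend_change"] = df.groupby("customer_id")["spend"].apply(
    lambda s: s.iloc[-1] - s.iloc[-2]
)

print(summary)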

Exploding the Number of Variables

We then looked at additional approaches that would explode the number of variables in our analytical file. Further mathematical transformations were employed, consisting of the following:

1. Log transformation
2. Square root transformation
3. Sine transformation
4. Cosine transformation
5. Tangent transformation

We also looked at combining pairs of the top variables that were significant within the correlation analysis of the given predictive model. A good example is age and gender, where we can capture the impact of age and gender together and observe their joint effect on the modeled behaviour. In the variable-pair routines, we attempted to look at all possible combinations: if there are 20 variables derived using the traditional approach, then the number of possible variable pairs is (20 × 19) / 2 = 190.

Both these transformations (mathematical and pairs) dramatically increase the number of variables. The mathematical transformations add 100 variables (5 × 20), and the variable-pair transformations add the 190 calculated above. In this simple exercise, the 20 variables from the traditional approach thus grow to 310 (20 + 100 + 190), as the sketch below illustrates.
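As a minimal sketch of this explosion, again assuming a hypothetical DataFrame of 20 traditionally derived numeric variables (names illustrative), the new columns might be generated as follows. The article does not specify how a pair of variables is combined; a simple product is used here as one common way of capturing a joint effect:

import itertools
import numpy as np
import pandas as pd

# Hypothetical analytical file of 20 traditionally derived numeric variables.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((100, 20)), columns=[f"var{i}" for i in range(1, 21)])

exploded = X.copy()

# Five mathematical transformations per variable: 5 x 20 = 100 new columns.
for col in X.columns:
    exploded[f"log_{col}"] = np.log1p(X[col])   # log1p guards against zero values
    exploded[f"sqrt_{col}"] = np.sqrt(X[col])
    exploded[f"sin_{col}"] = np.sin(X[col])
    exploded[f"cos_{col}"] = np.cos(X[col])
    exploded[f"tan_{col}"] = np.tan(X[col])

# All possible variable pairs: (20 x 19) / 2 = 190 new columns.
for a, b in itertools.combinations(X.columns, 2):
    exploded[f"{a}_x_{b}"] = X[a] * X[b]

print(exploded.shape[1])  # 310 = 20 + 100 + 190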
Business Cases to Demonstrate the Point

The challenge in this type of exercise is to determine whether there is a real business benefit in exploding the number of variables from 20 to 310. The approach we took to assess this benefit was to compare models developed before this variable-creation explosion with models developed after it. Four models built by our company were examined; for each, the gains charts before and after extensive variable creation are compared below.

Customer Acquisition Model
[Figure: Acquisition Model Gains Chart Comparison, before vs. after extensive variable creation]

Product Up-Sell Model
[Figure: Up-Sell Model Gains Chart Comparison, before vs. after extensive variable creation]

Product Cross-Sell Model
[Figure: Cross-Sell Model Gains Chart Comparison, before vs. after extensive variable creation; y-axis from 0 to 0.16]

Insurance Claim Loss Model
[Figure: Claim Loss Model Gains Chart Comparison, before vs. after extensive variable creation; y-axis: average loss amount, $0 to $12,000]
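The reports behind these comparisons are decile (gains chart) reports, described in prose below. As a minimal sketch of how such a report can be assembled, assuming a hypothetical scored file with illustrative column names score and response:

import numpy as np
import pandas as pd

# Hypothetical scored sample: "score" is the model score, "response" the
# observed behaviour (names and data are illustrative).
rng = np.random.default_rng(0)
scored = pd.DataFrame({"score": rng.random(1000)})
scored["response"] = (rng.random(1000) < 0.2 * scored["score"]).astype(int)

# Sort by score and cut into 10 equal groups; decile 1 = highest scores.
ranks = scored["score"].rank(ascending=False, method="first")
scored["decile"] = pd.qcut(ranks, 10, labels=range(1, 11))

# Observed behaviour per decile: a good model trends steeply downward.
gains = scored.groupby("decile", observed=True)["response"].mean()
print(gains)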

The charts for each of these four models are decile (gains chart) reports, in which a sample of names is sorted by model score into 10 deciles, with decile 1 representing the top (highest) model scores and decile 10 the bottom (lowest). The deciles are plotted on the x-axis and the observed modeled behaviour on the y-axis. As discussed in previous articles, a good model produces a line that is quite steep and trends downward from decile 1 to decile 10.

Our results from these four models indicate that the lines are virtually identical before and after the variable explosion. There is no tendency toward a steeper downward slope (i.e., a better model providing more differentiation between customers) with the explosion in variables. This indicates that there is no significant business benefit to increasing the number of variables using mathematical transformations or variable pairs. But beyond the numbers and results, we may still want a more complete understanding of why these results occur. Let's look at variable creation in each of its stages.

Considering Demographic Variables

The first stage actually represents the source information, or variables that are not manipulated by the analyst. Good examples of these variables are customer tenure, income, household size, etc. This information is extremely useful as it captures the basic demographics of the individual. The subsequent stages all comprise derived variables. For now, we are going to focus on stages that deal with the following:

1. Grouping of variable values
2. Calculating basic arithmetic diagnostics (total, mean, median, standard deviation, min, and max)
3. Calculating change variables

1. Grouping of Variable Values

This stage looks at variables that represent the grouping of values into categories. A good example is postal code, where postal codes are grouped into regions (e.g., all postal codes beginning with the first character "M" represent Toronto). Other examples include the grouping of specific product types or service codes into much broader product or service categories.

2. Calculating Basic Arithmetic Diagnostics

This stage deals with the ability to summarize numerical information into a meaningful metric. However, it is only meaningful if we have historical information that can be used to calculate these kinds of diagnostics. For example, if we are calculating the average spend or the variation in spend, the question we must address is: what timeframe are we looking at? Is the average or variation based on 6 months, 12 months, etc.? The sketch below makes this timeframe dependence concrete.
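As a minimal sketch of this timeframe question, assuming a hypothetical transaction table with illustrative column names, the same diagnostics can be computed over different windows and compared:

import pandas as pd

# Hypothetical transaction history; month 1 is the most recent month.
tx = pd.DataFrame({
    "customer_id": [1] * 12 + [2] * 12,
    "month": list(range(1, 13)) * 2,
    "spend": [50, 60, 40, 55, 70, 30, 45, 50, 65, 40, 35, 60,
              20, 25, 30, 15, 22, 28, 19, 24, 31, 18, 26, 21],
})

def spend_diagnostics(window_months):
    """Average and variation in spend over the last `window_months` months."""
    recent = tx[tx["month"] <= window_months]
    return recent.groupby("customer_id")["spend"].agg(["mean", "std", "min", "max"])

# The same diagnostic takes different values depending on the timeframe chosen.
print(spend_diagnostics(6))    # based on the last 6 months
print(spend_diagnostics(12))   # based on the last 12 months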

3. Calculating Change Variables

Extending this logic of using historical information, we may want to identify how summarized behaviour changes over time. Has spending or product purchase behaviour changed over a period of time? Has it changed drastically over the last 3 months, or over the last 6 months?

Putting Some Perspective on the Traditional Approach to Variable Creation

In each of these stages, key information is being produced that is unique in explaining the behaviour we are modeling. Let's explore this thinking in more detail. The source-level information in many cases yields demographic information, such as age or customer tenure, that at one time or another has served as key model variables. Grouping of values, like assigning postal codes to regions, can produce more meaningful insight when attempting to look at geography in a broader context. Arithmetic diagnostics show how summarized behaviour on key metrics can add value in explaining the desired modeled behaviour; we have all seen summarized metrics such as total products purchased or average spend appear as key variables within models. These diagnostic-type variables differ from source variables in that source variables capture information at a point in time, while diagnostics capture information over a period of time. Meanwhile, the change-type variables add another dimension: they look at how this summarized behaviour changes over time.

Each of these stages (demographics, point-in-time variables from the source data, summarized data from calculating basic diagnostics, and change variables from the summarized data) represents a unique way of looking at the information. Because of these different perspectives, models will typically incorporate variables from each of these approaches.

Extensive Variable Creation

Having provided some rationale for why the creation of variables using the traditional approach adds significant value to the modeling process, one may ask whether a more extensive process can continue to add value. Based on our analysis, it appears that this extensive expansion of variable creation provides minimal value to any model. The results indicate that there is no real additional unique information that delivers incremental benefit to the model solution. It is our contention that unique views of information represent the real nuggets that add value to a data mining solution.

What does this mean going forward? Considering that variable creation can be the most laborious part of any data mining exercise, these findings can provide some direction in how the analyst should best focus his or her time in any data mining project. Given the time pressure that analysts and practitioners face from business users, analysts can better focus their variable creation efforts and build solutions that are both timely and optimal.