# Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets


1 Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets August 2015 Salford Systems

2 Course Outline

- Demonstration of two classification examples in SPM
  - Bank Marketing
  - KDD Cup 2009
- Predictive modeling package used for the examples
  - Core Statistics
  - Logistic Regression
  - CART Decision Tree (original, by Jerome Friedman)
  - MARS Spline Regression (original, by Jerome Friedman)
  - TreeNet gradient boosting machine (original, by Jerome Friedman)
  - RandomForests (original, Breiman and Cutler)
  - Automation and model acceleration

3 Bank Marketing Data

- Portuguese bank marketing data
  - 41,188 records
  - 20 attributes, such as age, job, education, housing status
  - The goal is to predict whether the client will subscribe to a term deposit
  - Output variable (desired target): has the client subscribed to a term deposit? (binary: 'yes', 'no')
- Dataset is publicly available at the UCI Machine Learning Repository
- Challenges
  - Missing values
  - Mixed categorical and numerical variables
  - Variable selection

Copyright Salford Systems 2013

4 Sample Data

AGE JOB MARITAL DEF HOUSING LOAN CONTACT EMP_VAR_RATE CPI CCI EURIBOR NUM_EMP Y

56 housemaid married no no no telephone no
57 services married no no telephone no
37 services married no yes no telephone no
40 admin. married no no no telephone no
56 services married no no yes telephone no
45 services married no no telephone no
59 admin. married no no no telephone no
41 blue-collar married no no telephone no
24 technician single no yes no telephone no
25 services single no yes no telephone no

Other variables include: level of education, date of last contact, outcome of last campaign, days since last contact, etc.

Note: missing values, categorical and numeric variables

5 Open Raw Data: bank.csv

6 Character Variables and Missing Values

7 Request Descriptive Statistics: all variables are included by default.

8 Brief Descriptive Stats

- Always check for the prevalence of missing data
- Always review the number of distinct values (too few? too many?)
- Look for anything that appears wrong in the dataset

9 Full Descriptive Stats: the output contains detailed descriptive statistics for every variable.

10 Frequency of Target Variable: 0 means non-subscriber, 1 means subscriber. It is not surprising that only a small percentage of people subscribed to a term deposit.

11 Data Preparation

- The records in this dataset are ordered by date (from May 2008 to November 2010)
- The 2008 economic crisis complicates this dataset, because time has to be considered as a factor in the analysis
- We partitioned the data in time order: 80% as learning data and the remaining 20% as testing data
- Note: pdays = 999 means the client was never contacted before this phone call
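
A time-ordered partition can be sketched in a few lines of Python. This is only an illustration of the idea, not how SPM implements it; the function name `time_ordered_split` and the toy rows are assumptions, and the sketch assumes the earliest 80% of records form the learning sample.

```python
def time_ordered_split(records, learn_fraction=0.8):
    """Split records that are already sorted by date.

    The earliest `learn_fraction` of rows become the learning sample and
    the most recent rows become the test sample (no shuffling).
    """
    cut = int(len(records) * learn_fraction)
    return records[:cut], records[cut:]


# Hypothetical rows: (contact_order, subscribed) pairs standing in for dated records.
rows = [(i, i % 10 == 0) for i in range(100)]
learn, test = time_ordered_split(rows)
print(len(learn), len(test))  # 80 20
```

Because nothing is shuffled, every test record is later in time than every learning record, which is what lets the learn/test gap reveal a time effect.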

12 Build LOGIT Model

13 LOGIT Model Summary

- The learn ROC value of 0.94 should prompt you to examine whether it is too good to be true
- The difference between the learn and test ROC tells us that time does have an impact
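
To make the learn/test comparison concrete, here is a minimal sketch of the area under the ROC curve computed by pairwise comparison. The labels and scores are made-up toy values, not output from the LOGIT model, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (label 1)
    scores higher than a randomly chosen negative (label 0); ties count 1/2.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy scores (assumed): perfect separation on learn, weaker on test.
learn_auc = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
test_auc = roc_auc([0, 1, 0, 1], [0.3, 0.4, 0.6, 0.7])
print(learn_auc, test_auc, learn_auc - test_auc)  # 1.0 0.75 0.25
```

A large positive gap between the two values is the kind of signal the slide describes: the model fits the learning period better than the later test period.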

14 LOGIT Model Coefficients: partial coefficients are shown in the table above.

15 CART

- Classification and Regression Trees
  - Separates relevant from irrelevant predictors
  - Yields simple, easy-to-understand results
  - Doesn't require variable transformations
  - Impervious to outliers and missing values
- Fastest, most versatile predictive modeling algorithm available to analysts
- Provides the foundation for modern data mining techniques such as bagging and boosting

16 Build CART Model

17 Testing Method

18 CART Model

- The learn and test samples perform quite differently with this model, which means time does influence the outcome
- The learning sample performance also looks too good to be true

19 Variable Importance

Duration: this attribute strongly affects the output target (e.g., if duration=0 then y='no'). Yet the duration is not known before a call is performed.

20 Rerun CART Model Excluding Duration

21 Variable Importance Ranking: CART gives an initial look at which variables are important, which is useful when there are many predictors in your dataset.

22 Root Node Split Very Effective

- We can view node details by clicking Tree Details in the CART output window
- The first splitter is month, which also appears in the variable importance ranking table as the most influential predictor
- The whole tree with details can be viewed as well

23 MARS

- Multivariate Adaptive Regression Splines
- Uses knots to impose local linearities
- These knots create basis functions to decompose the information in each variable individually

[Figure: two plots of MV against LSTAT]
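
The knot idea can be illustrated with hinge basis functions, the standard building block of MARS. The sketch below is an assumption-laden toy: the knot at 5.0, the coefficients, and the helper names `hinge` and `mars_predict` are invented for illustration and do not come from the slides.

```python
def hinge(x, knot, direction=1):
    """MARS-style basis function.

    direction=+1 gives max(0, x - knot); direction=-1 gives max(0, knot - x).
    Each function is zero on one side of the knot and linear on the other.
    """
    return max(0.0, direction * (x - knot))


def mars_predict(x, intercept, terms):
    """Evaluate a one-variable model: intercept + sum(coef * hinge)."""
    return intercept + sum(coef * hinge(x, knot, d) for coef, knot, d in terms)


# Hypothetical model: slope -(-1.0) = +1.0 below the knot at 5, slope +2.0 above it.
terms = [(2.0, 5.0, 1), (-1.0, 5.0, -1)]
for x in (3.0, 5.0, 7.0):
    print(x, mars_predict(x, 1.0, terms))
# 3.0 -1.0
# 5.0 1.0
# 7.0 5.0
```

The fitted curve is piecewise linear with a change of slope at the knot, which is what "local linearities" refers to.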

24 Build MARS Model

25 MARS Model Setup: the default maximum number of basis functions is 15, and the model often hits this limit and stops before reaching the optimal model. We set it to 60 after a couple of runs.

26 MARS Output Window: this output window shows the number of basis functions in the model against the performance of the model. Because MARS is a regression engine, the MSE and R-squared values are still reported, but they can be ignored here.

27 Summary: this model improved customer targeting, with an ROC of [value missing].

28 MARS Basis Functions: here the logistic regression equation is laid out in terms of the basis functions (transformations of the predictors). Each basis function is described and the final model is listed at the bottom. This form of output is especially valued by those who are comfortable with standard regression.

29 MARS Plots. Note the presence of nonlinearity in this dataset.

30 TreeNet Stochastic Gradient Boosting

Small decision trees built in an error-correcting sequence:

1. Begin with a small tree as the initial model
2. Compute residuals from this model for all records
3. Grow a second small tree to predict these residuals
4. And so on
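
The sequence above can be sketched with one-split trees (stumps) on a single numeric predictor. This is a simplified illustration, not TreeNet itself: it assumes squared-error residuals, omits the random subsampling that makes the method "stochastic", and the names `fit_stump`, `boost`, `n_trees`, and `learn_rate` are all hypothetical.

```python
def fit_stump(xs, targets):
    """Least-squares one-split tree on one predictor.

    Requires at least two distinct x values.
    Returns (threshold, left_mean, right_mean).
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = sum((t - lmean) ** 2 for t in left) + sum((t - rmean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    return best[1], best[2], best[3]


def stump_predict(stump, x):
    threshold, lmean, rmean = stump
    return lmean if x <= threshold else rmean


def boost(xs, ys, n_trees=50, learn_rate=0.1):
    """Grow stumps in sequence, each one fitted to the current residuals."""
    preds = [0.0] * len(ys)          # before the first tree, residuals equal ys
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learn_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return stumps, preds


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
stumps, preds = boost(xs, ys)
mse = sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)
print(len(stumps), mse < 0.001)  # 50 True
```

Each tree contributes only a small, shrunken correction, so the ensemble's error falls gradually as trees are added.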

31 Build TreeNet Model

32 TreeNet Output Window: the output window shows a graph of the number of trees in the ensemble against the corresponding ROC value. The vertical green bar denotes the model with the optimal ROC: 9 trees at [value missing].

33 Partial Dependency Plots: using TreeNet for targeted marketing improves on random calling and gives you an idea of how the predictors affect subscription.

34 Random Forests

- Ensemble of trees built on bootstrap samples
- Algorithm:
  - Each tree is grown on a bootstrap sample from the learning data
  - During tree growing, only P predictors are selected and tried at each node
  - By default, P is the square root of the total number of predictors
- The overall prediction is determined by averaging
- The Law of Large Numbers ensures convergence
- The key to accuracy is low correlation and bias
- To keep bias low, trees are grown to maximum depth
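
The three ingredients listed above (bootstrap sampling, a random subset of P predictors at each node, and averaging) can be sketched as small helpers. Tree growing itself is left out, and the function names and the seed are assumptions made for this illustration.

```python
import math
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one sample per tree)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_predictors(n_predictors, rng):
    """Pick P = floor(sqrt(total predictors)) distinct predictors to try at a node."""
    p = max(1, math.isqrt(n_predictors))
    return rng.sample(range(n_predictors), p)


def forest_average(tree_predictions):
    """Overall prediction: the mean of the individual trees' predictions."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(0)                  # fixed seed so the sketch is repeatable
rows = bootstrap_sample(10, rng)        # some indices repeat, some are left out
tried = candidate_predictors(20, rng)   # 4 of the 20 predictors
print(len(rows), len(tried))            # 10 4
```

Rows left out of a given bootstrap sample differ from tree to tree, and the small predictor subset at each node is what keeps the trees weakly correlated with one another.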

35 Build RandomForests Model

36 RandomForests Output 1: the optimal RandomForests model is always the one with the most trees.

37 RandomForests Summary

38 Prediction Success Table 1: we want to minimize the false non-subscriber rate, so that we reach the most subscribers with the least effort.

39 Adjust Class Weights: the Class Weights default is BALANCED, which upweights small classes to equal the size of the largest target class. Here we manually upweight class 1, the small class, even more than the BALANCED setting does.
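
As a rough illustration of what balanced weighting means, the sketch below gives each class a weight so that its weighted size matches the largest class, then upweights class 1 further. The 90/10 class split, the extra factor of 1.5, and the name `balanced_weights` are assumptions, not values from the slides.

```python
from collections import Counter


def balanced_weights(labels):
    """Weight each class so its weighted count equals the largest class count."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {cls: largest / n for cls, n in counts.items()}


labels = [0] * 90 + [1] * 10          # hypothetical 10% subscriber rate
weights = balanced_weights(labels)
print(weights)                        # {0: 1.0, 1: 9.0}

# Manual upweighting beyond the balanced setting (factor chosen arbitrarily here).
weights[1] *= 1.5
print(weights[1])                     # 13.5
```

Raising the weight of the small class makes errors on subscribers more costly to the model, which typically trades more false alarms for fewer missed subscribers.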

40 Prediction Success Table 2

41 Conclusion

- CART, MARS, TreeNet and RandomForests
  - Handle missing values automatically
  - Detect interactions and nonlinearity automatically
  - Models can be translated into other programming languages
  - Model performance usually exceeds that of traditional classification algorithms
  - Advanced settings boost model performance
- CART provides initial insights into the dataset
- MARS gives equations in a linear regression format with transformations of the original predictors
- TreeNet generates more accurate models
- RandomForests performs especially well on wide datasets

42 KDD Cup 2009

- Knowledge Discovery and Data Mining competition, held once a year to challenge modelers to a task
  - [...] - competitions from [...]
  - Includes tasks, data, rules, results, and FAQs
- KDD Cup 2009 was about customer relationship prediction
- French telecom company Orange provided large marketing databases
- Overall goal was to beat the in-house system implemented by Orange

43 Datasets

- 50,000 customers
- 15,000 predictors
  - e.g., demographic, geographic, behavioral
- Three binary classification tasks:
  - Appetency: customer buys a new product or service
  - Churn: customer switches providers
  - Upselling: customer buys an upgrade offered to them
- Training and testing datasets
- Smaller subsets of the data available for practice

44 Challenges

- Large database
  - 50,000 x 15,000
- Numerical and categorical variables
- Missing data
- Unbalanced class distributions
  - Many more customers NOT doing these things
- Sanitized data: no intuition

45 Data Preparation

- Combine multiple datasets
  - Large dataset broken into 5 chunks, 53 MB each
  - True target values needed to be appended
- Delete or impute missing values
  - Not necessary in SPM
- Handle categorical variables
  - Create dummy indicators
  - Combine levels in variables with many levels
  - Again, not necessary in SPM
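
For tools that do require this preparation, the two categorical steps can be sketched as follows. The helper names, the `OTHER` label, and the toy job values are assumptions for illustration only.

```python
from collections import Counter


def combine_rare_levels(values, min_count, other="OTHER"):
    """Replace levels seen fewer than min_count times with a single catch-all level."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def dummy_indicators(values):
    """One 0/1 indicator column per distinct level, returned as one dict per row."""
    levels = sorted(set(values))
    return [{f"is_{lvl}": int(v == lvl) for lvl in levels} for v in values]


jobs = ["admin.", "services", "admin.", "student", "services", "retired"]
grouped = combine_rare_levels(jobs, min_count=2)
print(grouped)   # ['admin.', 'services', 'admin.', 'OTHER', 'services', 'OTHER']
print(dummy_indicators(grouped)[0])
# {'is_OTHER': 0, 'is_admin.': 1, 'is_services': 0}
```

Grouping rare levels first keeps the number of indicator columns manageable when a variable has many levels.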

46 Open Prepared Data

47 View Data

48 Run Descriptive Statistics

49 Target Frequencies

50 Appetency: in this context, appetency is the propensity of the customer to buy a new product or service.

51 CART Model Setup

- Choose CART as the Analysis Engine
- Our target is coded -1/1, so we choose Classification/Logistic Binary as the Target Type
- Appetency is our response variable and VAR1-VAR15000 are our predictors

52 Setting a Testing Method

- A separate test dataset is provided in the competition, but true target values were not included
- For model-building, we use a 20% random partition of the training dataset to monitor performance
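
A seeded random partition of row indices might look like the sketch below. The function name `random_partition` and the seed values are assumptions; the 50,000 rows and the 20% test share follow the slides.

```python
import random


def random_partition(n_rows, test_fraction=0.2, seed=0):
    """Return (learn_idx, test_idx): a reproducible random split of row indices."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    n_test = round(n_rows * test_fraction)
    return sorted(idx[n_test:]), sorted(idx[:n_test])


learn_idx, test_idx = random_partition(50_000)
print(len(learn_idx), len(test_idx))  # 40000 10000
```

Changing `seed` yields a different partition, which is a simple way to check whether measured performance depends on one particular split.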

53 Restricting Tree Size

- We are interested in the CART ranking of important predictors
- By forcing the tree to only one split, we can quickly create a tree to access this information
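
One way to picture a single-split ranking is to score each predictor by the best Gini improvement it can achieve at the root node. The sketch below is a simplified stand-in for what CART reports, not its actual computation: it assumes numeric predictors, a target recoded to 0/1, no missing values, and hypothetical names `gini` and `best_split_improvement`.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split_improvement(xs, ys):
    """Largest impurity reduction achievable by one split of the form x <= t."""
    n = len(ys)
    parent = gini(ys)
    best = 0.0
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        gain = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
        best = max(best, gain)
    return best


# Toy data (assumed): VAR1 separates the classes perfectly, VAR2 only partly.
data = {"VAR1": [1, 2, 3, 4], "VAR2": [5, 1, 4, 2]}
target = [0, 0, 1, 1]
ranking = sorted(data, key=lambda v: best_split_improvement(data[v], target), reverse=True)
print(ranking)  # ['VAR1', 'VAR2']
```

Scoring every predictor at the root only is cheap, which is why a one-split tree is a quick way to get a first ranking across thousands of candidates.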

54 Penalties

- We are aware that some variables have many missing values and some have a high number of categorical levels
- Setting penalties on these cases makes it harder for them to enter the model

55 Results - Single Split CART Tree

56 Variable Improvement Measures

57 TreeNet Model Setup

58 Results - TreeNet Ensemble

59 Variable Selection

- Improvement measures are averaged across all trees in the ensemble
- Only 185 of the original 15,000 predictors are flagged as important

60 Recursive Feature Elimination (RFE): remove one variable at a time from the TOP of the variable importance list to eliminate predictors that look too good to be true.

61 RFE, Step 2: remove one variable at a time from the BOTTOM of the variable importance list to eliminate weak predictors. Final ROC: [value missing]
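
The bottom-up step can be sketched as a loop that drops the weakest remaining variable as long as the score does not get worse. Everything named here is hypothetical: `shave_from_bottom`, the `tolerance` parameter, and especially `toy_score`, which stands in for rebuilding the model and measuring test ROC on the reduced predictor list.

```python
def shave_from_bottom(ranked_vars, score_fn, tolerance=0.0):
    """Drop the least important variable repeatedly while the score holds up.

    ranked_vars: variable names, most important first.
    score_fn:    callable taking a list of variables and returning a score
                 (higher is better), e.g. test ROC of a refitted model.
    """
    kept = list(ranked_vars)
    best = score_fn(kept)
    while len(kept) > 1:
        trial = kept[:-1]                 # remove the current weakest variable
        score = score_fn(trial)
        if score < best - tolerance:      # performance dropped: stop shaving
            break
        kept, best = trial, max(best, score)
    return kept, best


# Toy scorer (assumed): only "a" and "b" carry signal.
def toy_score(variables):
    return 0.5 + 0.2 * len({"a", "b"} & set(variables))


kept, score = shave_from_bottom(["a", "b", "c", "d"], toy_score)
print(kept)  # ['a', 'b']
```

In practice each call to the scoring function means refitting the model, so the loop is expensive; a positive `tolerance` accepts small losses in exchange for a shorter predictor list.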

62 Parameter Variation - Automates

- Each TreeNet control parameter can be automatically varied over its values
- A model is built at each step and summarized

63 Stability of the Model: Automate PARTITION varies the learn/test partition so the user can observe the stability of model performance.

64 Repeat on Churn

- Churn is the propensity of the customer to switch providers
- We repeat the same model-building steps to reach a final model
- Final ROC: [value missing]

65 Repeat on Upsell

- Upsell is the propensity of the customer to buy an upgrade offered to them
- We repeat the same model-building steps to reach a final model
- Final ROC: [value missing]

66 Summary of Results

Table columns: Rank, Team, Appetency, Churn, Upselling, Score. Teams listed, starting at rank 1: IBM Research; You!; ID Analytics, Inc; Old dogs with new tricks; Crusaders; Financial Engineering Group, Inc. Japan.

- We are unable to compare to the true target values because these were only seen by the competition judges
- However, we are confident in our results (2 of the above groups used SPM)
- Results can vary based on the optimal selection criterion, random number seed, etc.

67 Overall Conclusions

- We were able to narrow down the predictor list significantly using TreeNet and Automate SHAVING
  - Of the original 15,000 predictors: Appetency: 167, Churn: 249, Upselling: 165
- Handling of categorical variables and missing values was automatic and didn't cause any issues
- Small rates in the class of interest didn't pose a problem
  - Priors/Costs and Class Weights can control for this in CART and TreeNet
- We couldn't draw any insight into which variables affect appetency, churn, and upsell

### Tree Ensembles: The Power of Post- Processing. December 2012 Dan Steinberg Mikhail Golovnya Salford Systems

Tree Ensembles: The Power of Post- Processing December 2012 Dan Steinberg Mikhail Golovnya Salford Systems Course Outline Salford Systems quick overview Treenet an ensemble of boosted trees GPS modern

### THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell

THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether

### Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

### CART 6.0 Feature Matrix

CART 6.0 Feature Matri Enhanced Descriptive Statistics Full summary statistics Brief summary statistics Stratified summary statistics Charts and histograms Improved User Interface New setup activity window

### Gerry Hobbs, Department of Statistics, West Virginia University

Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

### Data Mining Approaches to Modeling Insurance Risk. Dan Steinberg, Mikhail Golovnya, Scott Cardell. Salford Systems 2009

Data Mining Approaches to Modeling Insurance Risk Dan Steinberg, Mikhail Golovnya, Scott Cardell Salford Systems 2009 Overview of Topics Covered Examples in the Insurance Industry Predicting at the outset

### Data Mining Practical Machine Learning Tools and Techniques

Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

### Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product Sagarika Prusty Web Data Mining (ECT 584),Spring 2013 DePaul University,Chicago sagarikaprusty@gmail.com Keywords:

### Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign

Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign Arun K Mandapaka, Amit Singh Kushwah, Dr.Goutam Chakraborty Oklahoma State University, OK, USA ABSTRACT Direct

### Benchmarking of different classes of models used for credit scoring

Benchmarking of different classes of models used for credit scoring We use this competition as an opportunity to compare the performance of different classes of predictive models. In particular we want

### Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Xavier Conort xavier.conort@gear-analytics.com Motivation Location matters! Observed value at one location is

### Better credit models benefit us all

Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis

### Package acrm. R topics documented: February 19, 2015

Package acrm February 19, 2015 Type Package Title Convenience functions for analytical Customer Relationship Management Version 0.1.1 Date 2014-03-28 Imports dummies, randomforest, kernelfactory, ada Author

### Risk pricing for Australian Motor Insurance

Risk pricing for Australian Motor Insurance Dr Richard Brookes November 2012 Contents 1. Background Scope How many models? 2. Approach Data Variable filtering GLM Interactions Credibility overlay 3. Model

### EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models

### A Property & Casualty Insurance Predictive Modeling Process in SAS

Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING

BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING Xavier Conort xavier.conort@gear-analytics.com Session Number: TBR14 Insurance has always been a data business The industry has successfully

### An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century Nora Galambos, PhD Senior Data Scientist Office of Institutional Research, Planning & Effectiveness Stony Brook University AIRPO

### Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

### The Predictive Data Mining Revolution in Scorecards:

January 13, 2013 StatSoft White Paper The Predictive Data Mining Revolution in Scorecards: Accurate Risk Scoring via Ensemble Models Summary Predictive modeling methods, based on machine learning algorithms

### Addressing Analytics Challenges in the Insurance Industry. Noe Tuason California State Automobile Association

Addressing Analytics Challenges in the Insurance Industry Noe Tuason California State Automobile Association Overview Two Challenges: 1. Identifying High/Medium Profit who are High/Low Risk of Flight Prospects

### Leveraging Ensemble Models in SAS Enterprise Miner

ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

### Churn Modeling for Mobile Telecommunications:

Churn Modeling for Mobile Telecommunications: Winning the Duke/NCR Teradata Center for CRM Competition N. Scott Cardell, Mikhail Golovnya, Dan Steinberg Salford Systems http://www.salford-systems.com June

### Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation

### Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

### Data Mining: Overview. What is Data Mining?

Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,

### Identifying SPAM with Predictive Models

Identifying SPAM with Predictive Models Dan Steinberg and Mikhaylo Golovnya Salford Systems 1 Introduction The ECML-PKDD 2006 Discovery Challenge posed a topical problem for predictive modelers: how to

### Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

### Paper AA-08-2015. Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM

Paper AA-08-2015 Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM Delali Agbenyegah, Alliance Data Systems, Columbus, Ohio 0.0 ABSTRACT Traditional

### STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

### Comparison of Data Mining Techniques used for Financial Data Analysis

Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract

### Data Mining. Nonlinear Classification

Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

### Simple Predictive Analytics Curtis Seare

Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

### Why Ensembles Win Data Mining Competitions

Why Ensembles Win Data Mining Competitions A Predictive Analytics Center of Excellence (PACE) Tech Talk November 14, 2012 Dean Abbott Abbott Analytics, Inc. Blog: http://abbottanalytics.blogspot.com URL:

### New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

### 15.564 Information Technology I. Business Intelligence

15.564 Information Technology I Business Intelligence Outline Operational vs. Decision Support Systems What is Data Mining? Overview of Data Mining Techniques Overview of Data Mining Process Data Warehouses

### Data Mining Part 5. Prediction

Data Mining Part 5. Prediction 5.7 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Linear Regression Other Regression Models References Introduction Introduction Numerical prediction is

### Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

### UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee

UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee 1. Introduction There are two main approaches for companies to promote their products / services: through mass

### How To Make A Credit Risk Model For A Bank Account

TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

### Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics

### Predicting borrowers chance of defaulting on credit loans

Predicting borrowers chance of defaulting on credit loans Junjie Liang (junjie87@stanford.edu) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc.

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc. Introduction: The Basel Capital Accord, ready for implementation in force around 2006, sets out

### Evaluation and Comparison of Data Mining Techniques Over Bank Direct Marketing

Evaluation and Comparison of Data Mining Techniques Over Bank Direct Marketing Niharika Sharma 1, Arvinder Kaur 2, Sheetal Gandotra 3, Dr Bhawna Sharma 4 B.E. Final Year Student, Department of Computer

### Didacticiel Études de cas

1 Theme Data Mining with R The rattle package. R (http://www.r project.org/) is one of the most exciting free data mining software projects of these last years. Its popularity is completely justified (see

### CART: Classification and Regression Trees

Chapter 10 CART: Classification and Regression Trees Dan Steinberg Contents 10.1 Antecedents... 180 10.2 Overview... 181 10.3 A Running Example... 181 10.4 The Algorithm Briefly Stated... 183 10.5 Splitting

### Easily Identify Your Best Customers

IBM SPSS Statistics Easily Identify Your Best Customers Use IBM SPSS predictive analytics software to gain insight from your customer database Contents: 1 Introduction 2 Exploring customer data Where do

### Customer Life Time Value

Customer Life Time Value Tomer Kalimi, Jacob Zahavi and Ronen Meiri Contents Introduction... 2 So what is the LTV?... 2 LTV in the Gaming Industry... 3 The Modeling Process... 4 Data Modeling... 5 The

### MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

### Binary Logistic Regression

Binary Logistic Regression Main Effects Model Logistic regression will accept quantitative, binary or categorical predictors and will code the latter two in various ways. Here s a simple model including

### What is Data Mining? MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling

MS4424 Data Mining & Modelling MS4424 Data Mining & Modelling Lecturer : Dr Iris Yeung Room No : P7509 Tel No : 2788 8566 Email : msiris@cityu.edu.hk 1 Aims To introduce the basic concepts of data mining

### Modeling Lifetime Value in the Insurance Industry

Modeling Lifetime Value in the Insurance Industry C. Olivia Parr Rud, Executive Vice President, Data Square, LLC ABSTRACT Acquisition modeling for direct mail insurance has the unique challenge of targeting

### Chapter 12 Bagging and Random Forests

Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts

### Understanding Characteristics of Caravan Insurance Policy Buyer

Understanding Characteristics of Caravan Insurance Policy Buyer May 10, 2007 Group 5 Chih Hau Huang Masami Mabuchi Muthita Songchitruksa Nopakoon Visitrattakul Executive Summary This report is intended

### Data Mining for Fun and Profit

Data Mining for Fun and Profit Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. - Ian H. Witten, Data Mining: Practical Machine Learning Tools

### NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

Churn Prediction Vladislav Lazarov Technische Universität München vladislav.lazarov@in.tum.de Marius Capota Technische Universität München mariuscapota@yahoo.com ABSTRACT The rapid growth of the market

### Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

Predictive Analytics Techniques: What to Use For Your Big Data March 26, 2014 Fern Halper, PhD Presenter Proven Performance Since 1995 TDWI helps business and IT professionals gain insight about data warehousing,

### Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

### THE RISE OF THE BIG DATA: WHY SHOULD STATISTICIANS EMBRACE COLLABORATIONS WITH COMPUTER SCIENTISTS XIAO CHENG. (Under the Direction of Jeongyoun Ahn)

THE RISE OF THE BIG DATA: WHY SHOULD STATISTICIANS EMBRACE COLLABORATIONS WITH COMPUTER SCIENTISTS by XIAO CHENG (Under the Direction of Jeongyoun Ahn) ABSTRACT Big Data has been the new trend in businesses.

### !"!!"#\$\$%&'()*+\$(,%!"#\$%\$&'()*""%(+,'-*&./#-\$&'(-&(0*".\$#-\$1"(2&."3\$'45"

!"!!"#\$\$%&'()*+\$(,%!"#\$%\$&'()*""%(+,'-*&./#-\$&'(-&(0*".\$#-\$1"(2&."3\$'45"!"#"\$%&#'()*+',\$\$-.&#',/"-0%.12'32./4'5,5'6/%&)\$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:

### Easily Identify the Right Customers

PASW Direct Marketing 18 Specifications Easily Identify the Right Customers You want your marketing programs to be as profitable as possible, and gaining insight into the information contained in your

### The Data Mining Process

Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

### Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through

### Lecture 10: Regression Trees

Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

### Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

### Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering

IEICE Transactions on Information and Systems, vol.E96-D, no.3, pp.742-745, 2013. Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering Ildefons

### IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

### Predictive Modeling and Big Data

Predictive Modeling and Big Data Presented by Eileen Burns, FSA, MAAA Milliman Agenda Current uses of predictive modeling in the life insurance industry Potential applications of 2 1 June 16, 2014 [Enter presentation

### MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS IN COMMERCIAL BANKING CS 229 Project: Final Report Oleksandra Onosova INTRODUCTION Recent innovations in cloud computing and unified communications have made a

### Data Mining with SAS

Data Mining with SAS Mathias Lanner mathias.lanner@swe.sas.com Copyright 2010 SAS Institute Inc. All rights reserved. Agenda Data mining Introduction Data mining applications Data mining techniques SEMMA

### Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern KTI, TU Graz 2015-03-05 Outline 1 Introduction 2 Classification

### Data Mining from A to Z: Better Insights, New Opportunities WHITE PAPER

Data Mining from A to Z: Better Insights, New Opportunities WHITE PAPER SAS White Paper Table of Contents: Introduction; How Do Predictive Analytics and Data Mining Work?; The Data Mining Process

### An Overview and Evaluation of Decision Tree Methodology

An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX terri.moore@motorola.com Carole Jesse Cargill, Inc. Wayzata, MN carole_jesse@cargill.com

### CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

### STORMS PREDICTION: LOGISTIC REGRESSION VS RANDOM FOREST FOR UNBALANCED DATA

STORMS PREDICTION: LOGISTIC REGRESSION VS RANDOM FOREST FOR UNBALANCED DATA Anne Ruiz-Gazen Institut de Mathématiques de Toulouse and Gremaq, Université Toulouse I, France Nathalie Villa Institut de Mathématiques

### Business Analytics and Credit Scoring

Study Unit 5 Business Analytics and Credit Scoring ANL 309 Business Analytics Applications Introduction Process of credit scoring The role of business analytics in credit scoring Methods of logistic regression

### Stepwise Regression. Chapter 311. Introduction. Variable Selection Procedures. Forward (Step-Up) Selection

Chapter 311 Introduction Often, theory and experience give only general direction as to which of a pool of candidate variables (including transformed variables) should be included in the regression model.

### IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

### IBM SPSS Data Preparation 22

IBM SPSS Data Preparation 22 Note Before using this information and the product it supports, read the information in Notices on page 33. Product Information This edition applies to version 22, release

### Chapter 12 Discovering New Knowledge Data Mining

Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to

### Getting Even More Out of Ensemble Selection

Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand qs12@cs.waikato.ac.nz ABSTRACT Ensemble Selection uses forward stepwise

### Why do statisticians "hate" us?

Why do statisticians "hate" us? David Hand, Heikki Mannila, Padhraic Smyth "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data

### Some vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.

Bonus Chapter Ten Major Predictive Analytics Vendors In This Chapter Angoss FICO IBM RapidMiner Revolution Analytics Salford Systems SAP SAS StatSoft, Inc. TIBCO This chapter highlights ten of the major

### Get to Know the IBM SPSS Product Portfolio

IBM Software Business Analytics Product portfolio Get to Know the IBM SPSS Product Portfolio Offering integrated analytical capabilities that help organizations use data to drive improved outcomes 123

### Beating the MLB Moneyline

Beating the MLB Moneyline Leland Chen llxchen@stanford.edu Andrew He andu@stanford.edu 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series

### Increasing Marketing ROI with Optimized Prediction

Increasing Marketing ROI with Optimized Prediction Yottamine's Unique and Powerful Solution Smart marketers are using predictive analytics to make the best offer to the best customer for the least cost.

### A Property and Casualty Insurance Predictive Modeling Process in SAS

Paper 11422-2016 A Property and Casualty Insurance Predictive Modeling Process in SAS Mei Najim, Sedgwick Claim Management Services ABSTRACT Predictive analytics is an area that has been developing rapidly

### Advanced Ensemble Strategies for Polynomial Models

Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer

### Model Validation Techniques

Model Validation Techniques Kevin Mahoney, FCAS kmahoney@travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost

### Simple Linear Regression

STAT 101 Dr. Kari Lock Morgan Simple Linear Regression SECTIONS 9.3 Confidence and prediction intervals (9.3) Conditions for inference (9.1) Want More Stats??? If you have enjoyed learning how to analyze

### Numerical Algorithms Group

Title: Using the Component Approach to Craft Customized Data Mining Solutions Summary: One definition of data mining is the non-trivial extraction of implicit, previously unknown and potentially useful

### A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND

Paper D02-2009 A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND ABSTRACT This paper applies a decision tree model and logistic regression

### Data Mining Applications in Higher Education

Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction

### Working with telecommunications

Working with telecommunications Minimizing churn in the telecommunications industry Contents: 1 Churn analysis using data mining 2 Customer churn analysis with IBM SPSS Modeler 3 Types of analysis 3 Feature