Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets
1 Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets August 2015 Salford Systems
2 Course Outline
Demonstration of two classification examples in SPM
o Bank Marketing
o KDD Cup 2009
Predictive modeling package used for the examples
o Core Statistics
o Logistic Regression
o CART Decision Tree (original, by Jerome Friedman)
o MARS Spline Regression (original, by Jerome Friedman)
o TreeNet gradient boosting machine (original, by Jerome Friedman)
o RandomForests (original, Breiman and Cutler)
o Automation and model acceleration
3 Bank Marketing Data
Portuguese bank marketing data
o 41,188 records
o 20 attributes, such as age, job, education, housing status
o The goal is to predict whether the client will subscribe to a term deposit
o Output variable (desired target): has the client subscribed to a term deposit? (binary: 'yes', 'no')
Dataset is publicly available at the UCI Machine Learning Repository
Challenges
o Missing values
o Mixed categorical and numerical variables
o Variable selection
Copyright Salford Systems 2013
4 Sample Data
AGE  JOB          MARITAL  DEF  HOUSING  LOAN  CONTACT    Y
56   housemaid    married  no   no       no    telephone  no
57   services     married       no       no    telephone  no
37   services     married  no   yes      no    telephone  no
40   admin.       married  no   no       no    telephone  no
56   services     married  no   no       yes   telephone  no
45   services     married       no       no    telephone  no
59   admin.       married  no   no       no    telephone  no
41   blue-collar  married       no       no    telephone  no
24   technician   single   no   yes      no    telephone  no
25   services     single   no   yes      no    telephone  no
Numeric columns also shown: EMP_VAR_RATE, CPI, CCI, EURIBOR, NUM_EMP
Other variables include: level of education, date of last contact, outcome of last campaign, days since last contact, etc.
Note: missing values, categorical and numeric variables
5 Open Raw Data: bank.csv
6 Character Variables and Missing Values
7 Request Descriptive Statistics
All variables are included by default
8 Brief Descriptive Stats
We always check for the prevalence of missing data
Always review the number of distinct values (too few? too many?)
Check whether anything looks wrong in the dataset
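The two checks named on this slide (missing-value prevalence and number of distinct values) can be sketched outside SPM as well. The following is only an illustrative sketch in plain Python; the function name `brief_stats`, the list of tokens treated as missing, and the three-row sample are assumptions made for the example and do not come from the slides.

```python
import csv
import io

def brief_stats(rows, missing_tokens=("", "NA", "unknown")):
    """Return {column: (n_missing, n_distinct)} for a list of dict rows.

    missing_tokens is an assumed convention for what counts as missing.
    """
    stats = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        present = [v for v in values if v not in missing_tokens]
        stats[col] = (len(values) - len(present), len(set(present)))
    return stats

# Tiny made-up sample standing in for a file such as bank.csv.
SAMPLE = """AGE,JOB,DEF
56,housemaid,no
57,services,
37,services,no
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summary = brief_stats(rows)
for column, (n_missing, n_distinct) in summary.items():
    print(column, "missing:", n_missing, "distinct:", n_distinct)
```

A column with a single distinct value, or with as many distinct values as records, is usually worth a second look before modeling.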
9 Full Descriptive Stats
Output contains detailed descriptive statistics for every variable
10 Frequency of Target Variable
Target variable: 0 means non-subscriber, 1 means subscriber
It is not surprising that only a small percentage of people subscribed to a term deposit
11 Data Preparation
The records in this dataset are ordered by date (from May 2008 to November 2010)
Note that the 2008 economic crisis complicates this dataset, because time has to be considered as a factor in the analysis
We partitioned the data in time order: the first 80% as learning data and the remaining 20% as testing data
Note: pdays = 999 means the client was never contacted before this phone call
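A time-ordered partition simply cuts the already-sorted records at a fixed position instead of sampling at random. The sketch below assumes the records are already in date order, as the slide states; the function names and the choice to recode the 999 sentinel as `None` are illustrative assumptions, not something the slides prescribe.

```python
def time_ordered_split(records, learn_fraction=0.8):
    """Split date-ordered records into an early learn part and a late test part."""
    cut = int(len(records) * learn_fraction)
    return records[:cut], records[cut:]

def recode_pdays(value, never_contacted_code=999):
    """One possible treatment of the sentinel: map 999 to None (never contacted)."""
    return None if value == never_contacted_code else value

# Usage on ten placeholder records that are assumed to be in date order.
learn, test = time_ordered_split(list(range(10)))
print(len(learn), len(test))
```

Because the test records are all later than the learn records, a gap between learn and test performance can reflect drift over time as well as overfitting.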
12 Build LOGIT Model
13 LOGIT Model Summary
The learn ROC value is 0.94, which should prompt you to examine whether it is too good to be true
The difference between the learn and test ROC values tells us that time does have an impact
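The ROC value compared between learn and test samples is the area under the ROC curve, which equals the probability that a randomly chosen subscriber receives a higher score than a randomly chosen non-subscriber. A minimal pure-Python sketch of that pairwise definition follows; the function name `roc_auc` and the toy scores are assumptions for illustration, and the quadratic loop is only practical for small samples.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ordered correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)
```

Computing this separately on the learn and test partitions gives the two numbers whose gap the slide draws attention to.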
14 LOGIT Model Coefficients
Some of the coefficients are shown in the table above
15 CART Classification and Regression Trees
o Separates relevant from irrelevant predictors
o Yields simple, easy-to-understand results
o Doesn't require variable transformations
o Impervious to outliers and missing values
Fastest, most versatile predictive modeling algorithm available to analysts
Provides the foundation for modern data mining techniques such as bagging and boosting
16 Build CART Model
17 Testing Method
18 CART Model
The learn and test samples perform quite differently with this model, which means time does contribute as a factor influencing the outcome
The learning sample performance also looks too good to be true
19 Variable Importance
Duration: this attribute strongly affects the output target (e.g., if duration=0 then y='no'). Yet the duration is not known before a call is performed, so it cannot serve as a predictor when deciding whom to call.
20 Rerun CART Model Excluding Duration
21 Variable Importance Ranking
CART gives an initial look at which variables are important; this is useful when there are many predictors in your dataset
22 Root Node Split Very Effective
We can view node details by clicking Tree Details in the CART output window
The first splitter is month, which the variable importance ranking table also shows as the most influential predictor
The whole tree with details can be viewed as well
23 MARS Multivariate Adaptive Regression Splines
Uses knots to impose local linearities
These knots create basis functions to decompose the information in each variable individually
[Figure: two plots of MV against LSTAT]
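A knot turns one predictor into a pair of hinge-shaped basis functions, each of which is zero on one side of the knot and linear on the other. The sketch below illustrates that idea only; the function names, the knot location, and the coefficients are made up for the example and are not taken from any fitted MARS model.

```python
def hinge_pair(x, knot):
    """Return the two mirrored hinge basis functions max(0, x - knot) and max(0, knot - x)."""
    return max(0.0, x - knot), max(0.0, knot - x)

def mars_like_prediction(x, intercept, knot, coef_right, coef_left):
    """Piecewise-linear prediction built from one knot on one predictor."""
    right, left = hinge_pair(x, knot)
    return intercept + coef_right * right + coef_left * left

# With a knot at 10, a value of 12 activates only the right-hand basis function.
print(hinge_pair(12.0, 10.0))
print(mars_like_prediction(12.0, 1.0, 10.0, 0.5, -2.0))
```

A model with several knots on several predictors is a weighted sum of such terms, which is why the result can be read like a regression equation.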
24 Build MARS Model
25 MARS Model Setup
The default Max Basis Functions setting is 15; the model often hits this limit and stops before reaching the optimal model
So we set it to 60 after a couple of runs
26 MARS Output Window
This output window shows the number of basis functions in the model against the performance of the model. Because MARS is a regression engine, the MSE and R-squared values are still reported, but they can be ignored here.
27 Summary
This model improved in targeting customers, with an ROC of
28 MARS Basis Functions
Here the logistic regression equation is laid out in terms of the basis functions (transformations of the predictors). Each basis function is described, and the final model is listed at the bottom. This form of output is especially valued by those who are comfortable with standard regression.
29 MARS Plots
Note the presence of nonlinearity in this dataset
30 TreeNet Stochastic Gradient Boosting
Small decision trees built in an error-correcting sequence
1. Begin with a small tree as the initial model
2. Compute residuals from this model for all records
3. Grow a second small tree to predict these residuals
4. And so on
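The error-correcting sequence can be sketched in a few lines. This is a deliberately simplified illustration, not TreeNet itself: it uses one predictor, one-split regression stumps, squared-error residuals, and the mean of the target as the initial model, and it omits the random subsampling that makes the method stochastic. The names `fit_stump`, `boost`, and `learn_rate` are assumptions for the example.

```python
def fit_stump(xs, residuals):
    """Fit the best single-split regression stump on one predictor.

    Requires at least two distinct values in xs.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean

def boost(xs, ys, n_trees=20, learn_rate=0.5):
    """Build an additive model: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    trees = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learn_rate * tree(x) for x, p in zip(xs, preds)]
    return lambda x: base + learn_rate * sum(t(x) for t in trees)

# Toy data with a step between x = 3 and x = 4.
model = boost([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
print(round(model(1), 3), round(model(6), 3))
```

Each added stump removes part of the remaining error, so the fitted values move toward the targets as the number of trees grows.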
31 Build TreeNet Model
32 TreeNet Output Window
The output window shows a graph of the ROC value against the number of trees in the ensemble. The vertical green bar denotes the model with the optimal ROC: 9 trees at
33 Partial Dependency Plots
Using TreeNet for targeted marketing has improved on random calling and given you an idea of how the predictors affect subscription
34 Random Forests
Ensemble of trees built on bootstrap samples
Algorithm:
o Each tree is grown on a bootstrap sample from the learning data
o During tree growing, only P predictors are selected and tried at each node
o By default, P is the square root of the total number of predictors
The overall prediction is determined by averaging
The Law of Large Numbers ensures convergence
The key to accuracy is low correlation and low bias
To keep bias low, trees are grown to maximum depth
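The two sources of randomness in the algorithm above, plus the final averaging, can be sketched without growing any actual trees. The helper names below are hypothetical, and rounding the square root down to an integer is an assumption; implementations differ in how they round.

```python
import math
import random

def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def candidate_predictors(n_predictors, rng):
    """Pick about sqrt(n_predictors) distinct predictors to try at one node."""
    k = max(1, int(math.sqrt(n_predictors)))
    return rng.sample(range(n_predictors), k)

def average_vote(tree_predictions):
    """Combine per-tree predictions (0/1 or probabilities) by averaging."""
    return sum(tree_predictions) / len(tree_predictions)

rng = random.Random(0)
rows_for_one_tree = bootstrap_sample(100, rng)
predictors_for_one_node = candidate_predictors(20, rng)
print(len(set(rows_for_one_tree)), "distinct rows in the bootstrap sample")
print(predictors_for_one_node)
print(average_vote([1, 0, 1, 1]))
```

Because each tree sees a different sample and each node a different subset of predictors, the trees are less correlated with one another, which is what makes averaging them effective.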
35 Build RandomForests Model
36 RandomForests Output 1
The optimal RandomForests model is always the one with the most trees
37 RandomForests Summary
38 Prediction Success Table 1
We want to minimize the rate of false non-subscribers, so that we reach the most subscribers with the least effort
39 Adjust Class Weights
The Class Weights default is BALANCED, which upweights small classes to equal the size of the largest target class
Now we manually upweight class 1, the small class, even more than the BALANCED setting does
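Read literally, the BALANCED description means each class gets a weight equal to the largest class count divided by its own count. The sketch below follows that reading; it is an interpretation of the slide's wording rather than SPM's documented formula, and the function names and the extra factor of 2 are illustrative assumptions.

```python
from collections import Counter

def balanced_weights(labels):
    """Weight each class so its weighted size equals the largest class."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {cls: largest / n for cls, n in counts.items()}

def upweight(weights, cls, factor):
    """Return a copy of the weights with one class multiplied by an extra factor."""
    adjusted = dict(weights)
    adjusted[cls] = adjusted[cls] * factor
    return adjusted

# Nine non-subscribers and one subscriber.
weights = balanced_weights([0] * 9 + [1])
heavier = upweight(weights, 1, 2.0)
print(weights, heavier)
```

Raising the weight of the small class trades more false subscribers for fewer missed subscribers, which is the direction the prediction success tables are meant to show.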
40 Prediction Success Table 2
41 Conclusion
CART, MARS, TreeNet and RandomForests
o Handle missing values automatically
o Detect interactions and nonlinearity automatically
o Models can be translated into other programming languages
o Model performance usually exceeds that of traditional classification algorithms
o Advanced settings boost model performance
CART provides initial insights into the dataset
MARS gives equations in a linear regression format with transformations of the original predictors
TreeNet generates more accurate models
RandomForests excels on wide datasets
42 KDD Cup 2009
Knowledge Discovery and Data Mining competition held once a year to challenge modelers with a task
o - competitions from
o Includes tasks, data, rules, results, and FAQs
KDD Cup 2009 was about customer relationship prediction
French telecom company Orange provided large marketing databases
Overall goal was to beat the in-house system implemented by Orange
43 Datasets
50,000 customers
15,000 predictors
o e.g., demographic, geographic, behavioral
Three binary classification tasks:
o Appetency: customer buys new product or service
o Churn: customer switches providers
o Upselling: customer buys upgrade offered to them
Training and testing datasets
Smaller subsets of data available for practice
44 Challenges
Large database
o 50,000 x 15,000
Numerical and categorical variables
Missing data
Unbalanced class distributions
o Many more customers NOT doing these things
Sanitized data - no intuition
45 Data Preparation
Combine multiple datasets
o Large dataset broken into 5 chunks, 53 MB each
o True target values needed to be appended
Delete or impute missing values
o Not necessary in SPM
Handle categorical variables
o Create dummy indicators
o Combine levels in variables with many levels
o Again, not necessary in SPM
46 Open Prepared Data
47 View Data
48 Run Descriptive Statistics
49 Target Frequencies
50 Appetency
In this context, appetency is the propensity of the customer to buy a new product or service
51 CART Model Setup
Choose CART as the Analysis Engine
Our target is coded -1/1, so we choose Classification/Logistic Binary as the Target Type
Appetency is our response variable and VAR1-VAR15000 are our predictors
52 Setting a Testing Method
A separate test dataset is provided in the competition, but true target values were not included
For model-building, we will use a 20% random partition of the training dataset to monitor performance
53 Restricting Tree Size
We are interested in looking at the CART ranking of important predictors
By forcing the tree to only one split, we can quickly create a tree to access this information
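One way to picture a single-split ranking is to score every predictor by the best impurity reduction it can achieve at the root node. The sketch below uses Gini impurity on numeric predictors with a target coded -1/1; it is a rough stand-in only, since CART's own importance scores also credit surrogate splits, and all names and the toy columns are assumptions.

```python
def gini(labels):
    """Gini impurity of a binary node whose positive class is coded 1."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(1 for y in labels if y == 1) / n
    return 2.0 * p * (1.0 - p)

def split_improvement(values, labels, threshold):
    """Impurity reduction from splitting the root at values <= threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return gini(labels) - len(left) / n * gini(left) - len(right) / n * gini(right)

def rank_predictors(columns, labels):
    """Order predictor names by their best single-split improvement."""
    scores = {
        name: max(split_improvement(values, labels, t) for t in set(values))
        for name, values in columns.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: VAR_A separates the classes perfectly, VAR_B does not.
toy_columns = {"VAR_A": [1, 2, 3, 4], "VAR_B": [1, 4, 2, 3]}
toy_labels = [-1, -1, 1, 1]
ranking = rank_predictors(toy_columns, toy_labels)
print(ranking)
```

Scoring each predictor once at the root is cheap, which is why a one-split tree is a quick first pass over 15,000 candidates.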
54 Penalties
We are aware that there are variables with many missing values and variables with a high number of categorical levels
Setting penalties on these cases makes it harder for such variables to enter the model
55 Results - Single Split CART Tree
56 Variable Improvement Measures
57 TreeNet Model Setup
58 Results - TreeNet Ensemble
59 Variable Selection
Improvement measures are averaged across all trees in the ensemble
Only 185 of the original 15,000 predictors are flagged as important
60 Recursive Feature Elimination (RFE)
Remove one variable at a time from the TOP of the variable importance list to eliminate too-good predictors
61 RFE, Step 2
Remove one variable at a time from the BOTTOM of the variable importance list to eliminate weak predictors
Final ROC:
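Elimination from the bottom of the importance list can be pictured as a loop that drops the weakest remaining predictor while test performance does not get worse. The sketch below is a simplified illustration: it keeps the initial importance order instead of re-ranking after every refit, and `fit_and_score` is a hypothetical callable standing in for "build a model on these predictors and return its test ROC".

```python
def shave_from_bottom(predictors, fit_and_score, tolerance=0.0):
    """Drop predictors from the end of an importance-ordered list.

    predictors: names ordered from most to least important.
    fit_and_score: hypothetical callable, list of names -> test ROC.
    Stops as soon as dropping the next predictor lowers the score
    by more than tolerance.
    """
    kept = list(predictors)
    best = fit_and_score(kept)
    while len(kept) > 1:
        trial = kept[:-1]
        score = fit_and_score(trial)
        if score + tolerance < best:
            break
        kept, best = trial, max(best, score)
    return kept, best

# Toy scorer: only "a" and "b" carry signal, so "c" and "d" can be shaved.
useful = {"a", "b"}
toy_scorer = lambda cols: 0.5 + 0.2 * len(useful & set(cols))
kept, final_score = shave_from_bottom(["a", "b", "c", "d"], toy_scorer)
print(kept, final_score)
```

Removing from the top of the list works the same way with `kept[1:]` in place of `kept[:-1]`, with the aim of spotting predictors that look too good to be legitimate.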
62 Parameter Variation - Automates
Each TreeNet control parameter can be automatically varied over its values
A model is built at each step and summarized
63 Stability of the Model
Automate PARTITION varies the learn/test partition so the user can observe the stability of model performance
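The idea of varying the learn/test partition can be sketched as repeated random splits followed by a summary of the resulting scores. This is an illustration of the concept, not of the Automate PARTITION implementation; `evaluate` is a hypothetical callable that fits a model on the learn rows and returns a score on the test rows.

```python
import random
import statistics

def partition_stability(n_rows, evaluate, n_repeats=10, test_fraction=0.2, seed=0):
    """Repeat a random learn/test split and summarize the test scores.

    evaluate: hypothetical callable (learn_idx, test_idx) -> score.
    Returns (mean score, population standard deviation of the scores).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        idx = list(range(n_rows))
        rng.shuffle(idx)
        cut = int(n_rows * test_fraction)
        test_idx, learn_idx = idx[:cut], idx[cut:]
        scores.append(evaluate(learn_idx, test_idx))
    return statistics.mean(scores), statistics.pstdev(scores)

# Placeholder evaluator that only reports the share of rows held out.
share_held_out = lambda learn_idx, test_idx: len(test_idx) / (len(learn_idx) + len(test_idx))
mean_score, spread = partition_stability(50, share_held_out)
print(mean_score, spread)
```

A small spread across partitions suggests that the reported performance does not depend on one lucky split.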
64 Repeat on Churn
Churn is the propensity of the customer to switch providers
We repeat the same model-building steps to achieve a final model
Final ROC:
65 Repeat on Upsell
Upsell is the propensity of the customer to buy an upgrade offered to them
We repeat the same model-building steps to achieve a final model
Final ROC:
66 Summary of Results
Rank  Team                                     Appetency  Churn  Upselling  Score
1     IBM Research
      You!
      ID Analytics, Inc
      Old dogs with new tricks
      Crusaders
      Financial Engineering Group, Inc. Japan
Unable to compare to true target values because these were only seen by competition judges
However, we are confident in our results (2 of the above groups used SPM)
Results can vary based on optimal selection criterion, random number seed, etc.
67 Overall Conclusions
We were able to narrow down the predictor list significantly using TreeNet and Automate SHAVING
o Of the original 15,000 predictors: Appetency: 167; Churn: 249; Upselling: 165
Handling of categorical variables and missing values was automatic and didn't cause any issues
Small rates in the class of interest didn't pose a problem
o Priors/Costs and Class Weights can control for this in CART and TreeNet
Couldn't draw any insight as to the variables affecting appetency, churn, and upsell
Predictive Modeling and Presented by Eileen Burns, FSA, MAAA Milliman Agenda Current uses of predictive modeling in the life insurance industry Potential applications of 2 1 June 16, 2014 [Enter presentation
More informationMAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS
MAXIMIZING RETURN ON DIRET MARKETING AMPAIGNS IN OMMERIAL BANKING S 229 Project: Final Report Oleksandra Onosova INTRODUTION Recent innovations in cloud computing and unified communications have made a
More informationData Mining with SAS. Mathias Lanner mathias.lanner@swe.sas.com. Copyright 2010 SAS Institute Inc. All rights reserved.
Data Mining with SAS Mathias Lanner mathias.lanner@swe.sas.com Copyright 2010 SAS Institute Inc. All rights reserved. Agenda Data mining Introduction Data mining applications Data mining techniques SEMMA
More informationEnsemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
More informationData Mining from A to Z: Better Insights, New Opportunities WHITE PAPER
Data Mining from A to Z: Better Insights, New Opportunities WHITE PAPER SAS White Paper Table of Contents Introduction.... 1 How Do Predictive Analytics and Data Mining Work?.... 2 The Data Mining Process....
More informationAn Overview and Evaluation of Decision Tree Methodology
An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX terri.moore@motorola.com Carole Jesse Cargill, Inc. Wayzata, MN carole_jesse@cargill.com
More informationCI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
More informationII. Methods - 2 - X (i.e. if the system is convective or not). Y = 1 X ). Usually, given these estimates, an
STORMS PREDICTION: LOGISTIC REGRESSION VS RANDOM FOREST FOR UNBALANCED DATA Anne Ruiz-Gazen Institut de Mathématiques de Toulouse and Gremaq, Université Toulouse I, France Nathalie Villa Institut de Mathématiques
More informationBusiness Analytics and Credit Scoring
Study Unit 5 Business Analytics and Credit Scoring ANL 309 Business Analytics Applications Introduction Process of credit scoring The role of business analytics in credit scoring Methods of logistic regression
More informationStepwise Regression. Chapter 311. Introduction. Variable Selection Procedures. Forward (Step-Up) Selection
Chapter 311 Introduction Often, theory and experience give only general direction as to which of a pool of candidate variables (including transformed variables) should be included in the regression model.
More informationIBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
More informationIBM SPSS Data Preparation 22
IBM SPSS Data Preparation 22 Note Before using this information and the product it supports, read the information in Notices on page 33. Product Information This edition applies to version 22, release
More informationChapter 12 Discovering New Knowledge Data Mining
Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to
More informationGetting Even More Out of Ensemble Selection
Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand qs12@cs.waikato.ac.nz ABSTRACT Ensemble Selection uses forward stepwise
More informationWhy do statisticians "hate" us?
Why do statisticians "hate" us? David Hand, Heikki Mannila, Padhraic Smyth "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data
More informationSome vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.
Bonus Chapter Ten Major Predictive Analytics Vendors In This Chapter Angoss FICO IBM RapidMiner Revolution Analytics Salford Systems SAP SAS StatSoft, Inc. TIBCO This chapter highlights ten of the major
More informationGet to Know the IBM SPSS Product Portfolio
IBM Software Business Analytics Product portfolio Get to Know the IBM SPSS Product Portfolio Offering integrated analytical capabilities that help organizations use data to drive improved outcomes 123
More informationBeating the MLB Moneyline
Beating the MLB Moneyline Leland Chen llxchen@stanford.edu Andrew He andu@stanford.edu 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series
More informationIncreasing Marketing ROI with Optimized Prediction
Increasing Marketing ROI with Optimized Prediction Yottamine s Unique and Powerful Solution Smart marketers are using predictive analytics to make the best offer to the best customer for the least cost.
More informationA Property and Casualty Insurance Predictive Modeling Process in SAS
Paper 11422-2016 A Property and Casualty Insurance Predictive Modeling Process in SAS Mei Najim, Sedgwick Claim Management Services ABSTRACT Predictive analytics is an area that has been developing rapidly
More informationAdvanced Ensemble Strategies for Polynomial Models
Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer
More informationModel Validation Techniques
Model Validation Techniques Kevin Mahoney, FCAS kmahoney@ travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost
More informationSimple Linear Regression
STAT 101 Dr. Kari Lock Morgan Simple Linear Regression SECTIONS 9.3 Confidence and prediction intervals (9.3) Conditions for inference (9.1) Want More Stats??? If you have enjoyed learning how to analyze
More informationNumerical Algorithms Group
Title: Summary: Using the Component Approach to Craft Customized Data Mining Solutions One definition of data mining is the non-trivial extraction of implicit, previously unknown and potentially useful
More informationA Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND
Paper D02-2009 A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND ABSTRACT This paper applies a decision tree model and logistic regression
More informationData Mining Applications in Higher Education
Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2
More informationWorking with telecommunications
Working with telecommunications Minimizing churn in the telecommunications industry Contents: 1 Churn analysis using data mining 2 Customer churn analysis with IBM SPSS Modeler 3 Types of analysis 3 Feature
More informationData Mining III: Numeric Estimation
Data Mining III: Numeric Estimation Computer Science 105 Boston University David G. Sullivan, Ph.D. Review: Numeric Estimation Numeric estimation is like classification learning. it involves learning a
More informationTOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
More information