Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems
Course Outline
Demonstration of two classification examples in SPM:
o Bank Marketing
o KDD Cup 2009
Predictive modeling package used for the examples:
o Core Statistics
o Logistic Regression
o CART decision trees (original, by Jerome Friedman)
o MARS spline regression (original, by Jerome Friedman)
o TreeNet gradient boosting machine (original, by Jerome Friedman)
o RandomForests (original, by Breiman and Cutler)
o Automation and model acceleration
Bank Marketing Data
Portuguese bank marketing data:
o 41,188 records
o 20 attributes, such as age, job, education, housing status
o The goal is to predict whether the client will subscribe to a term deposit
o Output variable (desired target): has the client subscribed to a term deposit? (binary: 'yes'/'no')
Dataset is publicly available at the UCI Machine Learning Repository:
o http://mlr.cs.umass.edu/ml/datasets/bank+marketing
Challenges:
o Missing values
o Mixed categorical and numerical variables
o Variable selection
Copyright Salford Systems 2013
Sample Data

AGE | JOB         | MARITAL | DEF | HOUSING | LOAN | CONTACT   | EMP_VAR_RATE | CPI    | CCI   | EURIBOR | NUM_EMP | Y
56  | housemaid   | married | no  | no      | no   | telephone | 1.1          | 93.994 | -36.4 | 4.857   | 5191    | no
57  | services    | married |     | no      | no   | telephone | 1.1          | 93.994 | -36.4 | 4.857   | 5191    | no
37  | services    | married | no  | yes     | no   | telephone | 1.1          | 93.994 | -36.4 | 4.857   | 5191    | no
40  | admin.      | married | no  | no      | no   | telephone | 1.1          | 93.994 | -36.4 | 4.857   | 5191    | no
56  | services    | married | no  | no      | yes  | telephone | 1.1          | 93.994 | -36.4 | 4.857   | 5191    | no
45  | services    | married |     | no      | no   | telephone | 1.1          | 93.994 | -36.4 | 4.857   | 5191    | no
59  | admin.      | married | no  | no      | no   | telephone | 1.1          | 93.994 | -36.4 | 4.857   | 5191    | no
41  | blue-collar | married |     | no      | no   | telephone | 1.1          | 93.994 | -36.4 | 4.857   | 5191    | no
24  | technician  | single  | no  | yes     | no   | telephone | 1.1          | 93.994 | -36.4 | 4.857   | 5191    | no
25  | services    | single  | no  | yes     | no   | telephone | 1.1          | 93.994 | -36.4 | 4.857   | 5191    | no

Other variables include: level of education, date of last contact, outcome of last campaign, days since last contact, etc.
Note: missing values (blank cells), categorical and numeric variables
Open Raw Data: bank.csv
Character Variables and Missing Values
Request Descriptive Statistics
All variables are included by default
Brief Descriptive Stats
We always check the prevalence of missing data
Always review the number of distinct values (too few? too many?)
Does anything look wrong in the dataset?
Full Descriptive Stats
Output contains detailed descriptive statistics for every variable
Frequency of Target Variable
Target variable: 0 means non-subscriber, 1 means subscriber
It is not surprising that only a small percentage of clients subscribed to a term deposit
Data Preparation
The records in this dataset are ordered by date (from May 2008 to November 2010)
Note that the 2008 economic crisis complicates this dataset: time has to be considered as a factor in the analysis
We partitioned the first 80% of records (in time order) as learning data and the remaining 20% as test data
Note: pdays = 999 means the client has never been contacted before this phone call
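The time-ordered 80/20 partition above takes only a few lines; a minimal Python sketch (the 41,188 record count is from the slides, everything else is generic):

```python
# Minimal sketch of a time-ordered 80/20 partition: no shuffling, so the
# test set is strictly later in time than the learning set.
def time_ordered_split(rows, learn_fraction=0.8):
    """Split an ordered sequence into learn and test sets, keeping order."""
    cut = int(len(rows) * learn_fraction)
    return rows[:cut], rows[cut:]

rows = list(range(41188))      # stand-in for the 41,188 ordered bank records
learn, test = time_ordered_split(rows)
print(len(learn), len(test))   # 32950 8238
```

Keeping the split in time order, rather than sampling at random, is what lets the learn/test gap reveal time effects later in the deck.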
Build LOGIT Model
LOGIT Model Summary
The learn-sample ROC is 0.94, which should prompt you to examine whether it is too good to be true
The difference between learn and test ROC tells us that time does have an impact
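The LOGIT step is run from SPM's GUI; for readers who want a command-line analogue, here is a hedged sketch with scikit-learn on synthetic data (bank.csv is not loaded here, so the ROC values will not match the slide's 0.94):

```python
# Open-source analogue of the LOGIT model and its learn/test ROC check.
# Synthetic data stands in for bank.csv; the split mirrors the deck's
# time-ordered 80/20 partition.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

cut = int(0.8 * len(X))                       # ordered 80/20 partition
model = LogisticRegression(max_iter=1000).fit(X[:cut], y[:cut])

# A large gap between learn and test ROC suggests overfitting or drift
learn_roc = roc_auc_score(y[:cut], model.predict_proba(X[:cut])[:, 1])
test_roc = roc_auc_score(y[cut:], model.predict_proba(X[cut:])[:, 1])
print(round(learn_roc, 3), round(test_roc, 3))
```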
LOGIT Model Coefficients
Partial coefficients are shown in the table above
CART
Classification and Regression Trees:
o Separates relevant from irrelevant predictors
o Yields simple, easy-to-understand results
o Doesn't require variable transformations
o Impervious to outliers and missing values
The fastest, most versatile predictive modeling algorithm available to analysts
Provides the foundation for modern data mining techniques such as bagging and boosting
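A minimal open-source analogue of a CART run, using scikit-learn's DecisionTreeClassifier (an optimized CART-style implementation); the data here is synthetic, whereas the deck builds the model on bank.csv:

```python
# Sketch of a CART-style tree. Like CART's variable importance ranking,
# feature_importances_ separates relevant predictors from irrelevant ones.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = (X[:, 2] > 0.3).astype(int)   # only predictor 2 drives the target

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.feature_importances_.argmax())   # index of the dominant splitter
```

Note that scikit-learn's trees do not handle missing values or string categories natively the way SPM's CART does, so real bank.csv data would need imputation and encoding first.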
Build CART Model
Testing Method
CART Model
Learn and test samples perform quite differently with this model, which means time does contribute as a factor influencing the outcome
The learning-sample performance also looks too good to be true
Variable Importance
Duration: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed.
Rerun CART Model Excluding Duration
Variable Importance Ranking
CART gives an initial look at which variables are important; this is especially useful when there are many predictors in your dataset
Root Node Split Very Effective
We can view node details by clicking Tree Details in the CART output window
The first splitter is month, which the variable importance ranking table also shows as the most influential predictor
The whole tree, with details, can be viewed as well
MARS
Multivariate Adaptive Regression Splines:
o Uses knots to impose local linearities
o These knots create basis functions to decompose the information in each variable individually
[Figure: two scatter plots of MV versus LSTAT, contrasting a single linear fit with a piecewise-linear fit through a knot]
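The knot idea can be made concrete: a knot at t turns one predictor x into the mirrored pair of hinge basis functions max(0, x - t) and max(0, t - x), giving the fit a piecewise-linear "elbow" at t. A pure-Python sketch (the knot value is illustrative):

```python
# MARS-style hinge basis functions for a single knot t.
def hinge_pair(x, t):
    """Return the forward and mirrored hinge basis values for knot t."""
    return max(0.0, x - t), max(0.0, t - x)

# With a knot at t = 10, only the mirrored hinge is active below the
# knot, and only the forward hinge is active above it
print(hinge_pair(4.0, 10.0))    # (0.0, 6.0)
print(hinge_pair(25.0, 10.0))   # (15.0, 0.0)
```

A linear regression on these basis functions is what lets MARS fit a different slope on each side of the knot.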
Build MARS Model
MARS Model Setup
The Max Basis Functions default setting is 15; the model often hits this limit and stops before reaching the optimal model
So we set it to 60 after a couple of runs
MARS Output Window
This output window shows the number of basis functions in the model against the performance of the model. Because MARS is a regression engine, the MSE and R-squared values will still be reported, but can be ignored here.
Summary
This model improved in targeting customers, with an ROC of 0.72
MARS Basis Functions
Here the logistic regression equation is laid out in terms of the basis functions (transformations of the predictors). Each basis function is described, and the final model is listed at the bottom. This form of output is especially welcome to those comfortable with standard regression.
MARS Plots
Note the presence of nonlinearity in this dataset
TreeNet
Stochastic Gradient Boosting: small decision trees built in an error-correcting sequence
1. Begin with a small tree as the initial model
2. Compute residuals from this model for all records
3. Grow a second small tree to predict these residuals
4. And so on
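The four steps above can be sketched as a toy boosting loop for regression, where the "small trees" are stumps (single-split trees). This is an illustration of the error-correcting idea, not TreeNet's actual implementation; the data and shrinkage rate are made up:

```python
# Toy gradient boosting on residuals, with stumps as the small trees.
def fit_stump(x, y):
    """Find the single split on x that minimizes squared error."""
    best = None
    for t in sorted(set(x))[:-1]:
        ml = sum(yi for xi, yi in zip(x, y) if xi <= t) / sum(1 for xi in x if xi <= t)
        mr = sum(yi for xi, yi in zip(x, y) if xi > t) / sum(1 for xi in x if xi > t)
        sse = sum((yi - (ml if xi <= t else mr)) ** 2 for xi, yi in zip(x, y))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda xi: ml if xi <= t else mr

def boost(x, y, n_trees=20, rate=0.5):
    pred = [0.0] * len(x)
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]               # step 2
        stump = fit_stump(x, resid)                                # steps 1 and 3
        pred = [pi + rate * stump(xi) for pi, xi in zip(pred, x)]  # step 4
    return pred

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
pred = boost(x, y)
sse = sum((p - t) ** 2 for p, t in zip(pred, y))
print(round(sse, 4))   # residual error shrinks as trees accumulate
```

Each stump is deliberately weak; the shrinkage rate keeps any one tree from dominating, which is the "error-correcting sequence" in the slide.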
Build TreeNet Model
TreeNet Output Window
The Output window graphs the number of trees in the ensemble against the corresponding ROC value
The vertical green bar denotes the model with the optimal ROC: 9 trees at 0.69
Partial Dependency Plots
Using TreeNet for targeted marketing has improved on random calling and given you an idea of how the predictors affect subscription
Random Forests
Ensemble of trees built on bootstrap samples
Algorithm:
o Each tree is grown on a bootstrap sample from the learning data
o During tree growing, only P predictors are selected and tried at each node
o By default, P is the square root of the total number of predictors
The overall prediction is determined by averaging
The Law of Large Numbers ensures convergence
The key to accuracy is low correlation and low bias
To keep bias low, trees are grown to maximum depth
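The algorithm above maps directly onto scikit-learn's RandomForestClassifier: rows are bootstrapped, max_features='sqrt' tries only sqrt(p) predictors per node, and tree votes are averaged. A hedged sketch on synthetic data (not SPM's RandomForests engine):

```python
# Open-source analogue of the Random Forests recipe in the slide.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 9))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # P = sqrt(9) = 3 predictors tried per node
    bootstrap=True,        # each tree sees a bootstrap sample
    random_state=0,
).fit(X, y)

print(round(rf.score(X, y), 2))
```

Restricting each split to a random subset of predictors is what decorrelates the trees, which per the slide is the key to the ensemble's accuracy.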
Build RandomForests Model
RandomForests Output
The RandomForests optimal model is always the one with the most trees
RandomForests Summary
Prediction Success Table 1
We want to minimize the rate of falsely predicted non-subscribers, so that we spend the least effort to reach the most subscribers
Adjust Class Weights
The Class Weights default is BALANCED, which upweights small classes to the effective size of the largest target class
Now we manually upweight class 1, the small class, even more than the BALANCED setting does
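The same idea exists in open-source tools: 'balanced' weighting raises the rare class to parity, and an explicit weight dictionary can push it further still. A sketch with scikit-learn on synthetic data (the weight of 10 is an arbitrary illustration, not a recommendation):

```python
# Class weighting for a rare positive class: BALANCED-style vs. a
# manual, heavier upweight of class 1.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 3))
y = (X[:, 0] + rng.normal(size=2000) > 1.5).astype(int)  # rare positives

balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
heavier = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X, y)

# Upweighting class 1 further trades more false positives for fewer
# missed members of the class of interest
print((heavier.predict(X) == 1).sum(), (balanced.predict(X) == 1).sum())
```

This mirrors the slide's goal: accept calling more non-subscribers in exchange for missing fewer subscribers.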
Prediction Success Table 2
Conclusion
CART, MARS, TreeNet, and RandomForests:
o Handle missing values automatically
o Detect interactions and nonlinearity automatically
o Models can be translated into other programming languages
o Model performance usually exceeds traditional classification algorithms
o Advanced settings boost model performance
CART provides initial insights into the dataset
MARS gives equations in a linear regression format with transformations of the original predictors
TreeNet generates more accurate models
RandomForests excels on wide datasets
KDD Cup 2009
Knowledge Discovery and Data Mining competition held once a year to challenge modelers with a task
o http://www.kdd.org/kddcup/index.php - competitions from 1997-2010
o Includes tasks, data, rules, results, and FAQs
KDD Cup 2009 was about customer relationship prediction
French telecom company Orange provided large marketing databases
Overall goal was to beat the in-house system implemented by Orange
50,000 customers, 15,000 predictors
Datasets
o e.g. demographic, geographic, behavioral
Three binary classification tasks:
o Appetency: customer buys a new product or service
o Churn: customer switches providers
o Upselling: customer buys an upgrade offered to them
Training and testing datasets
Smaller subsets of data available for practice
Challenges
Large database
o 50,000 x 15,000
Numerical and categorical variables
Missing data
Unbalanced class distributions
o Many more customers NOT doing these things
Sanitized data - no intuition
Data Preparation
Combine multiple datasets
o Large dataset broken into 5 chunks, 53 MB each
o True target values needed to be appended
Delete or impute missing values
o Not necessary in SPM
Handle categorical variables
o Create dummy indicators
o Combine levels in variables with many levels
o Again, not necessary in SPM
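For tools that do need these steps, the preparation above can be sketched with pandas: stitch the chunks back together, append the true target column, and dummy-code categoricals. Column names and values here are illustrative, not the competition's:

```python
# Sketch of the data-preparation steps: concatenate chunks, append
# targets, dummy-code a categorical (with a column for missing values).
import pandas as pd

chunks = [
    pd.DataFrame({"var1": [1.0, None], "var2": ["a", "b"]}),
    pd.DataFrame({"var1": [3.0, 4.0], "var2": ["b", None]}),
]

# Combine the chunks and append the true target values
data = pd.concat(chunks, ignore_index=True)
data["appetency"] = [-1, -1, 1, -1]

# Dummy-code the categorical column; SPM handles categoricals and
# missing values natively, so this is only needed for other tools
coded = pd.get_dummies(data, columns=["var2"], dummy_na=True)
print(coded.shape)
```

With 15,000 predictors, dummy-coding high-cardinality categoricals can explode the column count, which is one reason the deck combines rare levels first.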
Open Prepared Data
View Data
Run Descriptive Statistics
Target Frequencies
Appetency
In this context, appetency is the propensity of the customer to buy a new product or service
CART Model Setup
Choose CART as the Analysis Engine
Our target is coded -1/1, so we will choose Classification/Logistic Binary as the Target Type
Appetency is our response variable and VAR1-VAR15000 are our predictors
Setting a Testing Method
A separate test dataset is provided in the competition, but true target values were not included
For model-building, we will use a 20% random partition of the training dataset to monitor performance
Restricting Tree Size
We are interested in CART's ranking of important predictors
By forcing the tree to only one split, we can quickly create a tree to access this information
Penalties
We are aware there are variables with many missing values and variables with a high number of categorical levels
Setting penalties on these cases makes it harder for such variables to enter the model
Results - Single Split CART Tree
Variable Improvement Measures
TreeNet Model Setup
Results - TreeNet Ensemble
Variable Selection
Improvement measures are averaged across all trees in the ensemble
Only 185 of the original 15,000 predictors are flagged as important
Recursive Feature Elimination (RFE)
Remove one variable at a time from the TOP of the variable importance list to eliminate "too good to be true" predictors
RFE, Step 2
Remove one variable at a time from the BOTTOM of the variable importance list to eliminate weak predictors
Final ROC: 0.9048
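A toy version of shaving from the bottom of the importance list: drop the weakest remaining predictor, refit, and repeat. This uses a scikit-learn forest as a stand-in for the SPM automation described above; the data, feature count, and stopping point are all synthetic:

```python
# Bottom-up shaving: repeatedly remove the least important predictor.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # only features 0 and 1 matter
keep = list(range(X.shape[1]))

while len(keep) > 2:
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X[:, keep], y)
    # bottom of the importance list = weakest remaining predictor
    weakest = keep[int(np.argmin(rf.feature_importances_))]
    keep.remove(weakest)

print(sorted(keep))   # the informative predictors should survive
```

In practice one would also track test ROC at each step, as the deck does, and stop shaving once performance starts to drop.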
Parameter Variation - Automates
Each TreeNet control parameter can be automatically varied over its values
A model is built at each step and summarized
Stability of the Model
Automate PARTITION varies the learn/test partition so the user can observe the stability of model performance
Repeat on Churn
Churn is the propensity of the customer to switch providers
We repeat the same model-building steps to achieve a final model
Final ROC: 0.7320
Repeat on Upsell
Upsell is the propensity of the customer to buy an upgrade offered to them
We repeat the same model-building steps to achieve a final model
Final ROC: 0.9059
Summary of Results

Rank | Team                                    | Appetency | Churn  | Upselling | Score
1    | IBM Research                            | 0.8830    | 0.7611 | 0.9038    | 0.8493
     | You!                                    | 0.9048    | 0.7320 | 0.9059    | 0.8476
2    | ID Analytics, Inc.                      | 0.8724    | 0.7565 | 0.9056    | 0.8448
3    | Old dogs with new tricks                | 0.8740    | 0.7541 | 0.9050    | 0.8443
4    | Crusaders                               | 0.8688    | 0.7569 | 0.9034    | 0.8430
5    | Financial Engineering Group, Inc. Japan | 0.8732    | 0.7498 | 0.9057    | 0.8429

Unable to compare to true target values because these were seen only by competition judges
However, we are confident in our results (two of the above groups used SPM)
Results can vary based on optimal selection criterion, random number seed, etc.
Overall Conclusions
We were able to narrow down the predictor list significantly using TreeNet and Automate SHAVING
o Of the original 15,000 predictors: Appetency: 167, Churn: 249, Upselling: 165
Handling of categorical variables and missing values was automatic and didn't cause any issues
Small rates in the class of interest didn't pose a problem
o Priors/Costs and Class Weights can control for this in CART and TreeNet
Couldn't draw any insight as to the variables affecting appetency, churn, and upsell, because the data were sanitized