# Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

1 Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets August 2015 Salford Systems

2 Course Outline
- Demonstration of two classification examples in SPM
  - Bank Marketing
  - KDD Cup 2009
- Predictive modeling package used for the examples
  - Core Statistics
  - Logistic Regression
  - CART Decision Tree (original, by Jerome Friedman)
  - MARS Spline Regression (original, by Jerome Friedman)
  - TreeNet gradient boosting machine (original, by Jerome Friedman)
  - RandomForests (original, Breiman and Cutler)
  - Automation and model acceleration

3 Bank Marketing Data
- Portuguese bank marketing data
  - 41,188 records
  - 20 attributes, such as age, job, education, housing status
  - The goal is to predict whether the client will subscribe to a term deposit
  - Output variable (desired target): has the client subscribed to a term deposit? (binary: 'yes', 'no')
- Dataset is publicly available at the UCI Machine Learning Repository
- Challenges
  - Missing values
  - Mixed categorical and numerical variables
  - Variable selection

Copyright Salford Systems 2013

4 Sample Data
Columns: AGE, JOB, MARITAL, DEF, HOUSING, LOAN, CONTACT, EMP_VAR_RATE, CPI, CCI, EURIBOR, NUM_EMP, Y
- 56 housemaid married no no no telephone no
- 57 services married no no telephone no
- 37 services married no yes no telephone no
- 40 admin. married no no no telephone no
- 56 services married no no yes telephone no
- 45 services married no no telephone no
- 59 admin. married no no no telephone no
- 41 blue-collar married no no telephone no
- 24 technician single no yes no telephone no
- 25 services single no yes no telephone no

Other variables include: level of education, date of last contact, outcome of last campaign, days since last contact, etc.
Note: the data contain missing values and both categorical and numeric variables.

5 Open Raw Data: bank.csv

6 Character Variables and Missing Values

7 Request Descriptive Statistics
All variables are included by default.

8 Brief Descriptive Stats
- Always check for the prevalence of missing data
- Always review the number of distinct values (too few? too many?)
- Check whether anything looks wrong in the dataset
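
Outside SPM, the same two checks (missing counts and distinct-value counts per variable) can be sketched in a few lines of plain Python. This is only an illustration: the helper name `brief_stats`, the tokens treated as missing, and the three toy rows are assumptions, not part of the original slides or of bank.csv.

```python
def brief_stats(rows, missing_tokens=("", "NA")):
    """Return {column: (n_missing, n_distinct)} for a list of dict records."""
    stats = {}
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        values = [row.get(col) for row in rows]
        # A value counts as present if it is not None and not a missing token.
        present = [v for v in values if v is not None and v not in missing_tokens]
        stats[col] = (len(values) - len(present), len(set(present)))
    return stats


# Made-up rows for illustration only (not taken from bank.csv).
rows = [
    {"AGE": "56", "JOB": "housemaid", "HOUSING": "no"},
    {"AGE": "57", "JOB": "services", "HOUSING": ""},
    {"AGE": "37", "JOB": "services", "HOUSING": "yes"},
]

for col, (n_missing, n_distinct) in brief_stats(rows).items():
    print(f"{col}: missing={n_missing}, distinct={n_distinct}")
```

A variable with a single distinct value carries no information, while one with nearly as many distinct values as records may be an identifier; both are worth a second look before modeling.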

9 Full Descriptive Stats
Output contains detailed descriptive statistics for every variable.

10 Frequency of Target Variable
- Target variable: 0 means non-subscriber, 1 means subscriber
- It is not surprising that only a small percentage of clients subscribed to a term deposit

11 Data Preparation
- The records in this dataset are ordered by date (from May 2008 to November 2010)
- The 2008 economic crisis complicates this dataset, because time has to be considered as a factor in the analysis
- We partitioned the data in time order: 80% as learning data and the remaining 20% as testing data
- Note: a pdays value of 999 means the client was never contacted before this phone call
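
A time-ordered partition is simply a cut at a fixed position in the date-sorted file, with no shuffling. The sketch below assumes the earliest 80% of records form the learning sample; the function name `time_ordered_split` and the integer stand-ins for records are hypothetical.

```python
def time_ordered_split(records, learn_fraction=0.8):
    """Split records that are already sorted by date.

    The earliest records become the learning sample and the most
    recent ones the test sample; nothing is shuffled.
    """
    cut = int(len(records) * learn_fraction)
    return records[:cut], records[cut:]


# Integers stand in for the 41,188 date-ordered rows.
records = list(range(41188))
learn, test = time_ordered_split(records)
print(len(learn), len(test))
```

Because every test record is later in time than every learning record, a gap between learn and test performance points to drift over time rather than to sampling noise alone.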

12 Build LOGIT Model

13 LOGIT Model Summary
- The ROC on the learn sample is 0.94, which should prompt you to examine whether it is too good to be true
- The difference between the learn and test ROC tells us that time does have an impact
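
For readers who want to reproduce the learn/test ROC comparison outside SPM, the area under the ROC curve can be computed directly from labels and scores. The sketch below uses the pairwise definition (the probability that a randomly chosen subscriber is scored above a randomly chosen non-subscriber). The function name `roc_auc` and the toy data are assumptions, and this is not necessarily how SPM computes the figure.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the share of (positive, negative) pairs in which the positive
    record has the higher score; ties count one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and model scores, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 3 of 4 pairs are ordered correctly: 0.75
```

Running the same function on the learn and test samples gives the two numbers compared on this slide. The quadratic loop is fine for a sketch but slow on tens of thousands of records.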

14 LOGIT Model Coefficients
Partial coefficients are shown in the table above.

15 CART: Classification and Regression Trees
- Separates relevant from irrelevant predictors
- Yields simple, easy-to-understand results
- Doesn't require variable transformations
- Impervious to outliers and missing values
- Fastest, most versatile predictive modeling algorithm available to analysts
- Provides the foundation for modern data mining techniques such as bagging and boosting

16 Build CART Model

17 Testing Method

18 CART Model
- The learn and test samples perform quite differently with this model, which means time does contribute as a factor influencing the outcome
- The learning sample performance also looks too good to be true

19 Variable Importance
Duration: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet the duration is not known before a call is performed.

20 Rerun CART Model Excluding Duration

21 Variable Importance Ranking
CART gives an initial look at which variables are important. This is useful when there are quite a few predictors in your dataset.

22 Root Node Split Very Effective
- We can view node details by clicking Tree Details in the CART output window
- The first splitter is month, which also appears in the variable importance ranking table as the most influential predictor
- The whole tree with details can be viewed as well

23 MARS: Multivariate Adaptive Regression Splines
- Uses knots to impose local linearities
- These knots create basis functions to decompose the information in each variable individually

[Figure: two plots of MV against LSTAT]
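
The knot idea can be illustrated with the standard MARS hinge pair, max(0, x - t) and max(0, t - x), which lets the fitted slope differ on each side of a knot t. The knot location and coefficients below are invented for illustration and are not fitted values from any dataset.

```python
def hinge_pair(x, knot):
    """Return the two mirrored hinge basis functions at a knot.

    right = max(0, x - knot) is active above the knot,
    left  = max(0, knot - x) is active below it.
    """
    return max(0.0, x - knot), max(0.0, knot - x)


def piecewise_linear(x, intercept, knot, slope_right, slope_left):
    """A one-variable MARS-style model: intercept plus two weighted hinges."""
    right, left = hinge_pair(x, knot)
    return intercept + slope_right * right + slope_left * left


# Hypothetical knot at 10 with different slopes on either side.
for x in (6.0, 8.0, 10.0, 12.0):
    print(x, piecewise_linear(x, intercept=20.0, knot=10.0,
                              slope_right=-0.5, slope_left=1.5))
```

Each basis function is zero on one side of its knot, so the model stays linear within each region while the overall curve can bend.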

24 Build MARS Model

25 MARS Model Setup
- The default setting for the maximum number of basis functions is 15; the model often hits this limit and stops before reaching the optimal model
- So we set it to 60 after a couple of runs

26 MARS Output Window
This output window shows you the number of basis functions in the model against the performance of the model. Because MARS is a regression engine, the MSE and R-squared values will still be reported, but can be ignored here.

27 Summary
This model improved in targeting customers, with an ROC of

28 MARS Basis Functions
Here is where the logistic regression equation is laid out in terms of the basis functions (transformations of the predictors). Each basis function is described and the final model is listed at the bottom. This form of output is especially desired by those who are comfortable with standard regression.

29 MARS Plots
Note: the presence of nonlinearity in this dataset

30 TreeNet: Stochastic Gradient Boosting
Small decision trees built in an error-correcting sequence:
1. Begin with a small tree as the initial model
2. Compute residuals from this model for all records
3. Grow a second small tree to predict these residuals
4. And so on
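
The error-correcting sequence can be sketched with one-split trees (stumps) and squared-error residuals. This is a simplified illustration, not TreeNet's actual algorithm: it starts from the mean rather than from a small tree, uses a single numeric predictor, and omits the random subsampling that makes the method "stochastic". The names `fit_stump` and `boost` and the learning rate are assumptions.

```python
def fit_stump(xs, residuals):
    """Fit the best one-split regression tree on a single numeric predictor."""
    thresholds = sorted(set(xs))[:-1]
    if not thresholds:
        raise ValueError("need at least two distinct predictor values")
    best = None
    for t in thresholds:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def boost(xs, ys, n_trees=50, learn_rate=0.1):
    """Grow stumps in sequence, each one fitted to the current residuals."""
    base = sum(ys) / len(ys)            # initial model: the mean response
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Each new tree corrects only a fraction of the remaining error.
        preds = [p + learn_rate * tree(x) for x, p in zip(xs, preds)]
    return lambda x: base + learn_rate * sum(tree(x) for tree in trees)


# Toy data for illustration: the response switches from 0 to 1 above x = 3.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
print(round(model(1), 3), round(model(6), 3))
```

The small learning rate means each tree removes only part of the remaining error, which is why many small trees are needed and why performance is tracked against the number of trees.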

31 Build TreeNet Model

32 TreeNet Output Window
The Output window shows a graph of the number of trees in the ensemble against the corresponding ROC value. The vertical green bar denotes the model with the optimal ROC: 9 trees at

33 Partial Dependency Plots
Using TreeNet for targeted marketing improves on random calling and gives you an idea of how the predictors affect subscription.

34 Random Forests
- Ensemble of trees built on bootstrap samples
- Algorithm:
  - Each tree is grown on a bootstrap sample from the learning data
  - During tree growing, only P predictors are selected and tried at each node
  - By default, P is the square root of the total number of predictors
- The overall prediction is determined by averaging
- The Law of Large Numbers ensures convergence
- The key to accuracy is low correlation and low bias
- To keep bias low, trees are grown to maximum depth
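
The two sources of randomness and the final averaging step can be sketched as follows. Tree growing itself is omitted; the function names are hypothetical, and rounding the square root down is an assumption about a detail the slide does not specify.

```python
import math
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement from the learning data."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_predictors(n_predictors, rng):
    """Pick the P predictors to try at one node (P = floor(sqrt(total)))."""
    p = max(1, int(math.sqrt(n_predictors)))
    return rng.sample(range(n_predictors), p)


def average_vote(tree_predictions):
    """Combine the individual tree predictions by simple averaging."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(0)
rows_for_tree = bootstrap_sample(100, rng)
print(len(set(rows_for_tree)), "distinct rows out of 100 drawn")
print(candidate_predictors(20, rng))
print(average_vote([1, 0, 1, 1]))
```

Because each tree sees a different bootstrap sample and a different predictor subset at every node, the trees are only weakly correlated, and averaging them reduces variance.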

35 Build RandomForests Model

36 RandomForests Output 1
The RandomForests optimal model is always the one with the most trees,

37 RandomForests Summary

38 Prediction Success Table 1
We want to minimize the rate of false non-subscribers, so that we spend the least effort to reach the most subscribers.
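
A prediction success table is a cross-tabulation of actual against predicted class. The sketch below builds one for a 0/1 target and derives the false non-subscriber rate, read here as the share of actual subscribers predicted as non-subscribers. The function names and the toy vectors are assumptions for illustration.

```python
def prediction_success(actual, predicted):
    """Cross-tabulate actual vs. predicted class for a 0/1 target.

    Keys are (actual, predicted) pairs; values are record counts.
    """
    table = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        table[(a, p)] += 1
    return table


def false_non_subscriber_rate(table):
    """Share of actual subscribers (class 1) predicted as non-subscribers."""
    subscribers = table[(1, 0)] + table[(1, 1)]
    return table[(1, 0)] / subscribers if subscribers else 0.0


# Toy vectors for illustration only.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0]
table = prediction_success(actual, predicted)
print(table)
print(false_non_subscriber_rate(table))
```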

39 Adjust Class Weights
- The Class Weights default is BALANCED, which upweights small classes to equal the size of the largest target class
- Here we manually upweight class 1, the small class, even more than the BALANCED setting does
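
One way to read the BALANCED description is that each class receives a weight equal to the largest class count divided by its own count, so that all weighted class totals match. The sketch below follows that reading; the function name, the 90/10 toy split, and the extra factor of 2 are assumptions, not SPM's internals or the settings used in the slides.

```python
from collections import Counter


def balanced_weights(labels):
    """Weight each class so its weighted size equals the largest class."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {cls: largest / n for cls, n in counts.items()}


# Toy target: 90 non-subscribers (0) and 10 subscribers (1).
labels = [0] * 90 + [1] * 10
weights = balanced_weights(labels)
print(weights)

# Manual adjustment: upweight the rare class beyond BALANCED
# (the factor 2 is an arbitrary illustration).
weights[1] *= 2
print(weights)
```

Raising the weight of class 1 further trades more false alarms among non-subscribers for fewer missed subscribers.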

40 Prediction Success Table 2

41 Conclusion
- CART, MARS, TreeNet and RandomForests
  - Handle missing values automatically
  - Detect interactions and nonlinearity automatically
  - Models can be translated into other programming languages
  - Model performance usually exceeds that of traditional classification algorithms
  - Advanced settings boost model performance
- CART provides initial insights into the dataset
- MARS gives equations in a linear regression format with transformations of the original predictors
- TreeNet generates more accurate models
- RandomForests excels on wide datasets

42 KDD Cup 2009
- Knowledge Discovery and Data Mining competition held once a year to challenge modelers to a task
  - competitions from
  - Includes tasks, data, rules, results, and FAQs
- KDD Cup 2009 was about customer relationship prediction
- French telecom company Orange provided large marketing databases
- Overall goal was to beat the in-house system implemented by Orange

43 Datasets
- 50,000 customers
- 15,000 predictors
  - e.g., demographic, geographic, behavioral
- Three binary classification tasks:
  - Appetency: customer buys a new product or service
  - Churn: customer switches providers
  - Upselling: customer buys an upgrade offered to them
- Training and testing datasets
- Smaller subsets of data available for practice

44 Challenges
- Large database
  - 50,000 x 15,000
- Numerical and categorical variables
- Missing data
- Unbalanced class distributions
  - Many more customers NOT doing these things
- Sanitized data: no intuition

45 Data Preparation
- Combine multiple datasets
  - Large dataset broken into 5 chunks, 53 MB each
  - True target values needed to be appended
- Delete or impute missing values
  - Not necessary in SPM
- Handle categorical variables
  - Create dummy indicators
  - Combine levels in variables with many levels
  - Again, not necessary in SPM
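
For tools that do need this preparation, the two categorical steps can be sketched as below: rare levels are first pooled into a single level, then each remaining level becomes a 0/1 indicator column. The function names, the `min_count` threshold, and the `OTHER` label are hypothetical choices.

```python
from collections import Counter


def combine_rare_levels(values, min_count=2, other="OTHER"):
    """Replace levels seen fewer than min_count times with one pooled level."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def dummy_indicators(values):
    """Return the sorted level names and one 0/1 indicator row per record."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Toy categorical variable for illustration only.
values = ["a", "b", "a", "c", "a", "b"]
combined = combine_rare_levels(values)
levels, indicators = dummy_indicators(combined)
print(combined)
print(levels)
print(indicators)
```

Pooling first keeps the number of indicator columns manageable when a variable has many sparsely populated levels.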

46 Open Prepared Data

47 View Data

48 Run Descriptive Statistics

49 Target Frequencies

50 Appetency
In this context, appetency is the propensity of the customer to buy a new product or service.

51 CART Model Setup
- Choose CART as the Analysis Engine
- Our target is coded -1/1, so we choose Classification/Logistic Binary as the Target Type
- Appetency is our response variable and VAR1-VAR15000 are our predictors

52 Setting a Testing Method
- A separate test dataset is provided in the competition, but true target values were not included
- For model-building, we use a 20% random partition of the training dataset to monitor performance

53 Restricting Tree Size
- We are interested in the CART ranking of important predictors
- By forcing the tree to only one split, we can quickly create a tree to access this information

54 Penalties
- We are aware that there are variables with many missing values and variables with a high number of categorical levels
- Setting penalties on these cases makes it harder for such variables to enter the model

55 Results - Single Split CART Tree

56 Variable Improvement Measures

57 TreeNet Model Setup

58 Results - TreeNet Ensemble

59 Variable Selection
- Improvement measures are averaged across all trees in the ensemble
- Only 185 of the original 15,000 predictors are flagged as important

60 Recursive Feature Elimination (RFE)
Remove one variable at a time from the TOP of the variable importance list to eliminate predictors that look too good.

61 RFE, Step 2
Remove one variable at a time from the BOTTOM of the variable importance list to eliminate weak predictors.
Final ROC:
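
Elimination from the bottom of the list can be sketched as a loop that drops the weakest remaining predictor for as long as the score does not get worse. This is a generic illustration rather than the SPM automation itself: a real run would refit the model and re-rank the predictors after every removal, whereas here the ranking is fixed and `score` is a made-up stand-in for "fit a model on these predictors and return the test ROC".

```python
def shave_from_bottom(ranked_predictors, score, tolerance=0.0):
    """Drop the least important predictor while the score does not degrade.

    ranked_predictors: names ordered from most to least important.
    score: callable taking a list of names and returning a number
           (higher is better).
    """
    kept = list(ranked_predictors)
    best = score(kept)
    while len(kept) > 1:
        trial = kept[:-1]               # remove the weakest remaining predictor
        trial_score = score(trial)
        if trial_score < best - tolerance:
            break                       # removal hurt performance: stop
        kept, best = trial, max(best, trial_score)
    return kept, best


# Hypothetical scoring rule: two predictors are useful, every extra
# predictor costs a little.
useful = {"a", "b"}


def toy_score(names):
    return sum(1 for n in names if n in useful) - 0.01 * len(names)


kept, best = shave_from_bottom(["a", "b", "c", "d"], toy_score)
print(kept, best)
```

Step 1 on the previous slide is the mirror image: removing from the top of the list instead, to check whether a suspiciously strong predictor is carrying the model.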

62 Parameter Variation - Automates
- Each TreeNet control parameter can be automatically varied over its values
- A model is built at each step and summarized

63 Stability of the Model
Automate PARTITION varies the learn/test partition so the user can observe the stability of model performance.
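
The idea behind varying the partition can be sketched as a loop over random seeds: redraw the learn/test split, re-evaluate, and look at the spread of the results. The function name `partition_stability`, the seed list, and the `evaluate` callback are assumptions. In practice `evaluate` would fit a model on the learn rows and return its test ROC; here a trivial stand-in is used so the example runs on its own.

```python
import random
import statistics


def partition_stability(evaluate, n_rows, seeds, test_fraction=0.2):
    """Re-draw the learn/test partition once per seed and summarize the scores.

    evaluate(learn_idx, test_idx) must return a single number.
    Returns (mean, standard deviation) of that number across partitions.
    """
    scores = []
    for seed in seeds:
        idx = list(range(n_rows))
        random.Random(seed).shuffle(idx)
        n_test = round(n_rows * test_fraction)
        test_idx, learn_idx = idx[:n_test], idx[n_test:]
        scores.append(evaluate(learn_idx, test_idx))
    return statistics.mean(scores), statistics.stdev(scores)


# Stand-in evaluation: the rate of the rare class in each test partition.
toy_labels = [1 if i % 10 == 0 else 0 for i in range(1000)]


def toy_evaluate(learn_idx, test_idx):
    return sum(toy_labels[i] for i in test_idx) / len(test_idx)


mean_score, sd_score = partition_stability(toy_evaluate, 1000, seeds=range(5))
print(round(mean_score, 3), round(sd_score, 3))
```

A small standard deviation across partitions suggests that the reported performance does not hinge on one lucky split.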

64 Repeat on Churn
- Churn is the propensity of the customer to switch providers
- We repeat the same model-building steps to achieve a final model
- Final ROC:

65 Repeat on Upsell
- Upsell is the propensity of the customer to buy an upgrade offered to them
- We repeat the same model-building steps to achieve a final model
- Final ROC:

66 Summary of Results
Leaderboard columns: Rank, Team, Appetency, Churn, Upselling, Score. Teams listed: 1 IBM Research; You!; ID Analytics, Inc; Old dogs with new tricks; Crusaders; Financial Engineering Group, Inc. Japan.
- We are unable to compare against the true target values because these were seen only by the competition judges
- However, we are confident in our results (2 of the above groups used SPM)
- Results can vary based on optimal selection criterion, random number seed, etc.

67 Overall Conclusions
- We were able to narrow down the predictor list significantly using TreeNet and Automate SHAVING
  - Of the original 15,000 predictors: Appetency: 167, Churn: 249, Upselling: 165
- Handling of categorical variables and missing values was automatic and didn't cause any issues
- Small rates in the class of interest didn't pose a problem
  - Priors/Costs and Class Weights can control for this in CART and TreeNet
- Couldn't draw any insight as to the variables affecting appetency, churn, and upsell
