Data Mining Using SAS Enterprise Miner 7.1 Lorne Rothman Lorne.rothman@sas.com Principal Statistician SAS Institute (Canada) Inc. Copyright 2010 SAS Institute Inc. All rights reserved.
Data Mining
The process of data selection, exploration, and model building using vast data stores to uncover previously unknown patterns that lead to proactive decision making.
What statisticians and scientists were taught not to do.
The Data
              Experimental          Opportunistic
Purpose       Research              Operational
Value         Scientific            Commercial
Generation    Actively controlled   Passively observed
Size          Small                 Massive
Hygiene       Clean                 Dirty
State         Static                Dynamic
Data Deluge
Data Mining Techniques
Market Basket Analysis: exploring the frequency of co-occurrences of events
Unsupervised Classification: classifying cases based on their attributes
Predictive Modeling: predicting the near future using the recent past
Market Basket Analysis
Most commonly applied in business (e.g., product bundling and marketing), though it has applications in many fields, including health (e.g., the frequency of co-occurrences of medical conditions in patients).
Market Basket Analysis
Associations can be visualized in link diagrams.
Unsupervised Classification
Unsupervised classification: grouping of cases based on similarities in input values.
[Figure: inputs grouped into cluster 1, cluster 2, and cluster 3]
k-means Clustering Algorithm
Steps, applied to the training data:
1. Select inputs.
2. Select k cluster centers.
3. Assign cases to closest center.
4. Update cluster centers.
5. Reassign cases.
6. Repeat steps 4 and 5 until convergence.
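The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not Enterprise Miner's implementation: it assumes squared Euclidean distance and initial centers drawn at random from the cases, and the names `kmeans`, `dist2`, `max_iter`, and `seed` are hypothetical.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of numeric tuples; returns (centers, assignment)."""
    rng = random.Random(seed)
    # Step 2: pick k distinct cases as the initial cluster centers (assumption).
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every case to its closest center.
        new_assignment = [
            min(range(k), key=lambda j: dist2(p, centers[j])) for p in points
        ]
        # Step 6: stop once the assignments no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: move each center to the mean of its assigned cases.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

# Made-up data: two well-separated groups.
points = [(0.0, 0.0), (1.0, 0.0), (100.0, 0.0), (101.0, 0.0)]
centers, labels = kmeans(points, k=2)
```

Step 1 (selecting inputs) happens before the call: each tuple holds only the inputs chosen for clustering.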
Predictive Modeling
Errors, Outliers, and Missings
[Table: sample records with columns cking, #cking, ADB, NSF, dirdep, SVG, bal; entries include missing values (.), an inconsistent code (y vs. Y), and extreme values such as ADB = 89981.12]
Separate Sampling for Rare Events
[Figure: cases labeled OK and Rare Condition]
High Dimensionality
[Figure: cases plotted against inputs X1, X2, and X3]
Model Selection
[Figure: three fits labeled Overfitting, Underfitting, and Just Right]
Data Splitting
Neural Networks
[Figure: network diagram with the input layer labeled]
Decision Trees
Generalized Linear Model
And Other Modeling Tools
Gradient Boosting
Rule Induction
Memory Based Reasoning
Support Vector Machines
Least Angle Regression
Partial Least Squares
SAS Rapid Predictive Modeler
Two Stage Models
Ensemble Models
Scenario: Early Detection for Low Birth Weight
The data cover North Carolina births for 2000 and 2001. The original data sets included over 120,000 births in each year and contained data on the race, age, education level, and marital status of the parents; prenatal medical care received; and the mother's reproductive history, including number of previous pregnancies and live births (State Center for Health Statistics, 2001, 2002). Plural births were filtered from the data.
The data set DEVELOP00 is an oversample (50% LBWT=1, 50% LBWT=0) of 17,097 records from 2000, used for training and validation. The percentage of low birth weight babies prior to oversampling is 7.2%.
The data set TEST01 is an oversample (50% LBWT=1, 50% LBWT=0) of 16,687 records from 2001, used as a future test set. The percentage of low birth weight babies prior to oversampling was also 7.2%.
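Because the modeling data are a 50/50 oversample of a population in which 7.2% of births are LBWT, probabilities predicted by a model fit to them are on the oversampled scale. The slides do not say how the flow accounts for this; the sketch below shows one standard adjustment back to the population scale, using the two percentages quoted above. The function name `adjust_posterior` and its parameter names are hypothetical.

```python
def adjust_posterior(p, population_prior=0.072, sample_prior=0.5):
    """Rescale a posterior probability from the oversampled scale to the population scale.

    p                : predicted P(LBWT=1) from a model fit to the oversample
    population_prior : proportion of LBWT=1 before oversampling (7.2% here)
    sample_prior     : proportion of LBWT=1 in the oversample (50% here)
    """
    # Weight each class by (population proportion / sample proportion), then renormalize.
    event = p * population_prior / sample_prior
    non_event = (1.0 - p) * (1.0 - population_prior) / (1.0 - sample_prior)
    return event / (event + non_event)

# A case the oversampled model rates as 50/50 sits at the base rate in the population.
print(round(adjust_posterior(0.5), 3))
```

The adjustment preserves the ranking of cases, so rank-based measures are unaffected; only the probability scale, and therefore the meaning of any cutoff, changes.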
Early Detection for Low Birth Weight
General socioeconomic, demographic, and behavioural characteristics of parents: age, education, race, place of residence, smoking, etc.
Prior pregnancy related data: number of pregnancies, last outcome, fetal deaths, etc.
Medical history for pregnancy: hypertension, cardiac disease, etc.
Obstetric procedures: amniocentesis, ultrasound, etc.
Events of labor: breech, fetal distress, etc.
Method of delivery: vaginal, c-section, etc.
Newborn characteristics: congenital anomalies (spina bifida, heart), Apgar score, anemia
Temporal Infidelity
I.e., building a model with information that will not yet be available when the model is deployed.
Parent socioeconomic, demographic, and behavioural characteristics
Prior pregnancy related data
Medical history for pregnancy
(Early) obstetric procedures
Events of labor
Method of delivery
Newborn characteristics
[Figure label: Data Cutoff]
Data Partitioning for Model Development
Training and Validation: 2000 data, 17,097 females
Test: 2001 data, 16,687 females
Model Assessment
                Predicted**
                1      0      Total
Actual 1        TP     FN     AP
Actual 0        FP     TN     AN
Total           PP     PN     n

Accuracy = (TP+TN)/n
Sensitivity = TP/AP
Specificity = TN/AN
Lift = (TP/PP)/π1
** Predicted 1 where Posterior Probability > Cutoff
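As a worked illustration of these measures, the sketch below classifies cases with a cutoff and computes accuracy, sensitivity, specificity, and lift. The function name `assess`, the argument names, and the example data are made up; the caller supplies the prior π1, and the sketch assumes both actual classes are present and at least one case is predicted 1.

```python
def assess(actual, posterior, cutoff, prior):
    """Confusion-matrix measures for a binary target coded 1/0."""
    # Predicted 1 where the posterior probability exceeds the cutoff.
    predicted = [1 if p > cutoff else 0 for p in posterior]
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, y in pairs if a == 1 and y == 1)
    fn = sum(1 for a, y in pairs if a == 1 and y == 0)
    fp = sum(1 for a, y in pairs if a == 0 and y == 1)
    tn = sum(1 for a, y in pairs if a == 0 and y == 0)
    n = len(pairs)
    ap, an, pp = tp + fn, fp + tn, tp + fp  # actual positives/negatives, predicted positives
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": tp / ap,
        "specificity": tn / an,
        "lift": (tp / pp) / prior,
    }

# Made-up example: 8 cases, 3 actual positives, cutoff 0.5.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
posterior = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.1, 0.05]
measures = assess(actual, posterior, cutoff=0.5, prior=3 / 8)
```

Here TP = 2, FN = 1, FP = 1, and TN = 4, so accuracy is 6/8, sensitivity is 2/3, and specificity is 4/5.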
Model Assessment
Lift Charts and ROC Charts
Explore measures across a range of decreasing cutoffs.
[Figure: a sequence of confusion matrices (TP, FN, FP, TN), one per cutoff]
Model Deployment
Pregnant women go to the doctor. Relevant attributes are measured. Measures are supplied to a scoring engine, and a score indicating propensity for low birth weight is generated. Decisions about future care are made based on this score.
Example: inputs $x = (1.1, 3.0)$ are passed to the scoring code, $\operatorname{logit}(\hat{p}) = -1.6 + 0.14\,x_1 - 0.50\,x_2$, which returns $\hat{p} = 0.05$, the predicted probability of an LBWT baby.
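The scoring step on this slide can be reproduced by hand. The sketch below reads the slide's equation as logit(p̂) = -1.6 + 0.14·x1 - 0.50·x2; the signs are an assumption, chosen because they reproduce the slide's p̂ = .05 at x = (1.1, 3.0). The function name `score` is hypothetical.

```python
import math

def score(x1, x2):
    """Predicted probability of an LBWT baby from the slide's example logit model."""
    # Linear predictor on the logit scale (coefficient signs are assumed, see above).
    logit = -1.6 + 0.14 * x1 - 0.50 * x2
    # Invert the logit: p = 1 / (1 + exp(-logit)).
    return 1.0 / (1.0 + math.exp(-logit))

p_hat = score(1.1, 3.0)
print(round(p_hat, 2))
```

At x = (1.1, 3.0) the linear predictor is -2.946, which corresponds to a probability of about 0.05.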
Predictive Modeling in Enterprise Miner
Enterprise Miner LBWT Flow
Configure the Metadata
Define variable roles and levels.
Partition the Data and Define a Test Set
A 60% training, 40% validation data partition is used. A separate test set containing the 2001 data is added to the flow.
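A 60/40 partition can be sketched outside Enterprise Miner as a seeded shuffle followed by a split. This is only an illustration: it uses simple random assignment, which may differ from how the Data Partition node allocates cases, and the names `partition`, `train_fraction`, and `seed` are hypothetical.

```python
import random

def partition(records, train_fraction=0.6, seed=12345):
    """Split records into (training, validation) lists by simple random assignment."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(records)    # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in record IDs; the 2001 test data would be kept entirely separate.
train, valid = partition(list(range(100)))
```

Every record lands in exactly one of the two sets, and the test set from a later year never enters the split.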
Replace Variable Values using a Code Node
The SAS Code node is a powerful tool that enables analysts to integrate SAS code into an Enterprise Miner flow.
Fit a Decision Tree
Trees are simple modeling tools in that they require very little data preparation. Here we use a CHAID-like tree with validation data.
Explore Decision Tree Results
The tree is tuned on validation Average Square Error. A 28-leaf tree has minimum error on the validation set. Father's race, hypertension during pregnancy, and smoking are the three most important variables in the model.
Explore Decision Tree Results
For father's race = 1, the highest probability of LBWT occurs among women who smoke and have uterine bleeding (or missing uterine bleeding values).
Impute Missing Values
Further data preparation is required for regression and neural networks. Decision tree models are used to impute class and interval variables in the Impute node. Indicator variables are created to flag prior missing values among the inputs.
Select Variables using Decision Trees
A CART-type tree is fit to screen variables for subsequent models. All variables with importance values greater than 0.05 are passed on as inputs to subsequent modeling nodes.
Consolidate Categorical Variables using Decision Trees
A tree is used to further reduce dimensions by consolidating the 19 levels of parent race into 6 categories.
Change Variable Roles
The Metadata node enables you to change the roles or measurement scales of variables in mid-flow. Here RACEMOM and RACEDAD are rejected because their information has been consolidated into a variable called _NODE_, output by the Collapse RACE decision tree.
Tune Regression and Neural Network Models
The iteration in a neural network that minimizes validation data error is selected as the final model. The step in a stepwise regression that minimizes validation error is selected as the final model.
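The selection rule is the same in both cases: record the validation error after each iteration (or stepwise step) and keep the model from the point where it is lowest. A minimal sketch, with a hypothetical function name and made-up error values:

```python
def best_iteration(validation_errors):
    """Return the 0-based index of the iteration or step with the lowest validation error."""
    # Ties resolve to the earliest, i.e. simplest, candidate.
    return min(range(len(validation_errors)), key=lambda i: validation_errors[i])

# Made-up validation errors: they fall, bottom out, then rise as the model overfits.
errors = [0.250, 0.231, 0.224, 0.226, 0.229]
chosen = best_iteration(errors)
```

With these values the third candidate (index 2) is kept, even though training error would typically keep falling in later iterations.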
Explore Regression Results
Assess and Compare Models
Models can be assessed and compared using the Model Comparison node.
Assess and Compare Models
The neural network had the lowest error and the highest ROC index and lift. Regression results are similar to the neural network results. Individuals in the top 5% predicted most likely to have LBWT babies are 3.8 times more likely to have LBWT babies than the average.
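The "top 5%" statement is a lift at a depth of 5%: rank cases by predicted probability, take the top 5%, and divide their LBWT rate by the overall rate. The sketch below uses made-up data and a hypothetical function name, and takes the overall rate from the scored data itself, which is an assumption.

```python
def lift_at_depth(actual, posterior, depth=0.05):
    """Event rate among the top `depth` fraction of cases, relative to the overall rate."""
    n = len(actual)
    # Rank cases from highest to lowest predicted probability.
    ranked = sorted(zip(posterior, actual), key=lambda pair: pair[0], reverse=True)
    top = ranked[:max(1, round(n * depth))]
    top_rate = sum(a for _, a in top) / len(top)
    overall_rate = sum(actual) / n
    return top_rate / overall_rate

# Made-up data: 100 cases, 10 events, 4 of them among the 5 highest-scored cases.
posterior = [i / 100 for i in range(100)]
actual = [0] * 100
for i in (0, 1, 2, 3, 4, 5, 96, 97, 98, 99):
    actual[i] = 1
lift = lift_at_depth(actual, posterior)
```

In this example the top 5% has an event rate of 0.8 against an overall rate of 0.1, a lift of 8.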
Generate Scoring Code
A Score node can be added to generate Base SAS code that will apply model results to new patients. Score code is not simply a model equation; it includes all data preparation steps, such as replacement, missing value imputation, and collapsing categorical variables.
Apply Scoring Code
Score code can be run against new data in Base SAS. Enterprise Miner is not required.
Apply Model Results to Decision Making
A dataset containing predictions is produced by the score code. A cutoff is applied to these predicted probabilities to classify cases as LBWT or normal, and decisions are then applied. E.g., every mother with a predicted probability of having an LBWT baby greater than 0.10 will be given prenatal education; scheduled for special postnatal classes and care facilities; etc.
THANK YOU