Data Mining Using SAS Enterprise Miner 7.1

Data Mining Using SAS Enterprise Miner 7.1 Lorne Rothman Lorne.rothman@sas.com Principal Statistician SAS Institute (Canada) Inc. Copyright 2010 SAS Institute Inc. All rights reserved.

Data Mining
The process of data selection, exploration, and model building using vast data stores to uncover previously unknown patterns that lead to proactive decision making.
What statisticians and scientists were taught not to do.

The Data
              Experimental          Opportunistic
Purpose       Research              Operational
Value         Scientific            Commercial
Generation    Actively controlled   Passively observed
Size          Small                 Massive
Hygiene       Clean                 Dirty
State         Static                Dynamic

Data Deluge

Data Mining Techniques
Market Basket Analysis: exploring the frequency of co-occurrences of events.
Unsupervised Classification: classifying cases based on their attributes.
Predictive Modeling: predicting the near future using the recent past.

Market Basket Analysis
Most commonly applied in business (e.g., product bundling and marketing), though it has applications in many fields, including health (e.g., the frequency of co-occurrences of medical conditions in patients).

Market Basket Analysis
Associations can be visualized in link diagrams.

Unsupervised Classification
Unsupervised classification: grouping of cases based on similarities in input values.
(Figure: inputs grouped into cluster 1, cluster 2, and cluster 3.)

k-means Clustering Algorithm
(Figure: training data.)
1. Select inputs.
2. Select k cluster centers.
3. Assign cases to closest center.
4. Update cluster centers.
5. Reassign cases.
6. Repeat steps 4 and 5 until convergence.
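The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Enterprise Miner's clustering implementation: the function names, the use of squared Euclidean distance, and the toy data are all hypothetical, and the initial centers (step 2) are supplied by the caller.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two cases (assumed distance measure)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centers, max_iter=100):
    """Run k-means from the given initial centers; return (centers, assignment)."""
    centers = list(centers)
    assignment = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign each case to its closest center.
        new_assignment = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Step 6: stop once no case changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: move each center to the mean of its assigned cases.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

# Hypothetical toy data: two well-separated groups of cases with two inputs each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
initial = random.Random(0).sample(points, 2)  # step 2: pick k = 2 starting centers
centers, assignment = kmeans(points, initial)
```

Passing the starting centers explicitly keeps the sketch deterministic for a given start; in practice the result can depend on that choice.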

Predictive Modeling

Errors, Outliers, and Missings
(Figure: sample data table with columns cking, #cking, ADB, NSF, dirdep, SVG, and bal, showing inconsistent codes such as Y and y, extreme values, and missing entries.)

Separate Sampling for Rare Events
(Figure: cases labeled OK and Rare Condition.)

High Dimensionality
(Figure: cases plotted against one input X1, two inputs X1 and X2, and three inputs X1, X2, and X3.)

Model Selection
(Figure panels: Overfitting, Underfitting, Just Right.)

Data Splitting

Neural Networks
(Figure: network diagram; labeled element: Input Layer.)

Decision Trees

Generalized Linear Model

And Other Modeling Tools
Gradient Boosting
Rule Induction
Memory Based Reasoning
Support Vector Machines
Least Angle Regression
Partial Least Squares
SAS Rapid Predictive Modeler
Two Stage Models
Ensemble Models

Scenario: Early Detection for Low Birth Weight
North Carolina births for 2000 and 2001. The original data sets include over 120,000 births in each year and contain data on the race, age, education level, and marital status of the parents; prenatal medical care received; and information on the mother's reproductive history, including number of previous pregnancies and live births (State Center for Health Statistics, 2001, 2002). Plural births were filtered from the data.
The data set DEVELOP00 represents an oversample (50% LBWT=1, 50% LBWT=0) of 17,097 records from 2000 to be used for training and validation. The percentage of low birth weight babies prior to oversampling is 7.2%.
The data set TEST01 represents an oversample (50% LBWT=1, 50% LBWT=0) of 16,687 records from 2001 to be used as a future test set. The percentage of low birth weight prior to oversampling was also 7.2%.

Early Detection for Low Birth Weight
General socio-, eco-, and demographics and behaviour of parents: age, education, race, place of residence, smoking, etc.
Prior pregnancy related data: number of pregnancies, last outcome, fetal deaths, etc.
Medical history for pregnancy: hypertension, cardiac disease, etc.
Obstetric procedures: amniocentesis, ultrasound, etc.
Events of labor: breech, fetal distress, etc.
Method of delivery: vaginal, c-section, etc.
Newborn characteristics: congenital anomalies (spina bifida, heart), Apgar score, anemia.

Temporal Infidelity
That is, building a model with information that will not yet be available when the model is deployed.
Parent socio-, eco-, and demographics and behaviour
Prior pregnancy related data
Medical history for pregnancy
(Early) Obstetric procedures
Events of labor
Method of delivery
Newborn characteristics
(Figure label: Data Cutoff.)

Data Partitioning for Model Development
Training and Validation: 2000 data, 17,097 females
Test: 2001 data, 16,687 females

Model Assessment
             Predicted** 1   Predicted** 0   Total
Actual 1     TP              FN              AP
Actual 0     FP              TN              AN
Total        PP              PN              n

Accuracy = (TP+TN)/n
Sensitivity = TP/AP
Specificity = TN/AN
Lift = (TP/PP)/π₁
** Predicted 1 where Posterior Probability > Cutoff
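As a worked illustration of these measures, the sketch below classifies cases as predicted 1 when the posterior probability exceeds the cutoff and then computes accuracy, sensitivity, specificity, and lift. The function name, the toy data, and the choice to pass the class-1 proportion π₁ in as a `prior` argument are assumptions made for this example.

```python
def assess(actual, posterior, cutoff, prior):
    """Compute the slide's measures for one cutoff.

    actual    : list of 0/1 target values
    posterior : list of posterior probabilities for class 1
    prior     : proportion of class 1 used in the lift denominator (pi_1)
    """
    predicted = [1 if p > cutoff else 0 for p in posterior]
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, y in pairs if a == 1 and y == 1)
    fn = sum(1 for a, y in pairs if a == 1 and y == 0)
    fp = sum(1 for a, y in pairs if a == 0 and y == 1)
    tn = sum(1 for a, y in pairs if a == 0 and y == 0)
    n = len(pairs)
    ap, an, pp = tp + fn, fp + tn, tp + fp
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": tp / ap if ap else None,
        "specificity": tn / an if an else None,
        # Lift is undefined when no case is predicted 1.
        "lift": (tp / pp) / prior if pp else None,
    }

# Hypothetical toy data: 8 cases, 3 of them actual positives.
actual = [1, 1, 0, 0, 1, 0, 0, 0]
posterior = [0.9, 0.8, 0.7, 0.2, 0.4, 0.1, 0.3, 0.6]
measures = assess(actual, posterior, cutoff=0.5, prior=3 / 8)
```

With these toy values the confusion matrix is TP = 2, FN = 1, FP = 2, TN = 3, so accuracy is 5/8 and sensitivity is 2/3.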

Model Assessment
Lift Charts and ROC Charts
Explore measures across a range of decreasing cutoffs.
(Figure: a sequence of TP/FN/FP/TN confusion matrices, one per cutoff.)
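The idea of sweeping the cutoff can be sketched as follows: for each cutoff, taken in decreasing order, the confusion-matrix counts are recomputed and one point of an ROC chart (sensitivity against 1 - specificity) is recorded. The function name and toy data are hypothetical, and the sketch assumes both classes are present in the data.

```python
def roc_points(actual, posterior, cutoffs):
    """Return (cutoff, sensitivity, 1 - specificity) for decreasing cutoffs."""
    ap = sum(actual)            # actual positives
    an = len(actual) - ap       # actual negatives
    points = []
    for c in sorted(cutoffs, reverse=True):
        tp = sum(1 for a, p in zip(actual, posterior) if a == 1 and p > c)
        fp = sum(1 for a, p in zip(actual, posterior) if a == 0 and p > c)
        points.append((c, tp / ap, fp / an))
    return points

# Hypothetical toy data: 8 cases, 3 of them actual positives.
actual = [1, 1, 0, 0, 1, 0, 0, 0]
posterior = [0.9, 0.8, 0.7, 0.2, 0.4, 0.1, 0.3, 0.6]
curve = roc_points(actual, posterior, cutoffs=[0.25, 0.5, 0.75])
```

As the cutoff decreases, more cases are predicted 1, so both sensitivity and the false positive fraction can only stay the same or increase.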

Model Deployment
Pregnant women go to the doctor. Relevant attributes are measured. Measures are supplied to a scoring engine and a score indicating propensity for low birth weight is generated. Decisions are made as to future care based upon this score.
Scoring Code: for inputs $x = (1.1, 3.0)$, $\operatorname{logit}(\hat{p}) = -1.6 + 0.14\,x_1 - 0.50\,x_2$, giving $\hat{p} = 0.05$, the predicted probability of an LBWT baby.
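The scoring step on this slide can be illustrated with a short sketch. It assumes the slide's equation reads logit(p̂) = -1.6 + 0.14·x1 - 0.50·x2; the signs are an assumption, chosen because they reproduce the slide's predicted probability of .05 at x = (1.1, 3.0). The function name is hypothetical, and real Enterprise Miner score code would also include the data preparation steps.

```python
import math

def score(x1, x2):
    """Predicted probability of an LBWT baby from the slide's example equation.

    Coefficients and signs are assumed: logit(p) = -1.6 + 0.14*x1 - 0.50*x2.
    """
    logit = -1.6 + 0.14 * x1 - 0.50 * x2
    # Invert the logit link to get a probability.
    return 1.0 / (1.0 + math.exp(-logit))

p_hat = score(1.1, 3.0)  # the slide's example inputs
```

Here the logit is -2.946, which corresponds to a probability of about 0.05.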

Predictive Modeling in Enterprise Miner

Enterprise Miner LBWT Flow

Configure the Metadata
Define variable roles and levels.

Partition the Data and Define a Test Set
A 60% training, 40% validation data partition is used. A separate test set containing the 2001 data is added to the flow.
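A 60/40 partition of this kind can be sketched as a simple random split. This is an illustration only: the function name and seed are hypothetical, and the sketch uses plain random sampling, which may differ from the method the Data Partition node applies.

```python
import random

def partition(records, train_fraction=0.6, seed=12345):
    """Randomly split records into (training, validation) lists."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(records)       # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical example: 100 record identifiers.
train, valid = partition(list(range(100)))
```

The test set is not produced by this split; in the flow described here it is a separate data set from a later year.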

Replace Variable Values using a Code Node
The SAS Code node is a powerful tool that enables analysts to integrate SAS code into an Enterprise Miner flow.

Fit a Decision Tree
Trees are simple modeling tools in that they require very little data preparation. Here we use a CHAID-like tree with validation data.

Explore Decision Tree Results
The tree is tuned on validation average squared error. A 28-leaf tree has minimum error on the validation set. Father's race, hypertension during pregnancy, and smoking are the three most important variables in the model.

Explore Decision Tree Results
For father's race = 1, the highest probability of LBWT occurs amongst women who smoke and have uterine bleeding (or missing uterine bleeding values).

Impute Missing Values
Further data preparation is required for regression and neural networks. Decision tree models are used to impute class and interval variables in the Impute node. Indicator variables are created to flag prior missing values amongst the inputs.
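The pairing of an imputed input with a missing-value indicator can be sketched as below. Note the substitution: for brevity this example fills missing values with the mean of the observed values, not with a decision tree model as the flow does. The function name and data are hypothetical.

```python
def impute_with_indicator(values):
    """Return (imputed values, 0/1 indicator of originally missing values).

    Missing values are represented as None and replaced by the observed mean
    (a simple stand-in for tree-based imputation).
    """
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    imputed = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator

# Hypothetical interval input with two missing values.
imputed, was_missing = impute_with_indicator([2.0, None, 4.0, None])
```

Keeping the indicator alongside the imputed input lets a later model use the fact that a value was missing as information in its own right.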

Select Variables using Decision Trees
A CART-type tree is fit to screen variables for subsequent models. All variables with importance values greater than 0.05 are passed on as inputs to subsequent modeling nodes.
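The screening rule amounts to a threshold filter on variable importance. In the sketch below the function name, variable names, and importance values are all made up for illustration; only the 0.05 threshold comes from the slide.

```python
def select_inputs(importance, threshold=0.05):
    """Return variable names with importance above the threshold, most important first."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, value in ranked if value > threshold]

# Hypothetical importance values from a screening tree.
importance = {
    "VAR_A": 1.00,
    "VAR_B": 0.62,
    "VAR_C": 0.40,
    "VAR_D": 0.05,   # not strictly greater than 0.05, so it is dropped
    "VAR_E": 0.01,
}
kept = select_inputs(importance)
```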

Consolidate Categorical Variables using Decision Trees
A tree is used to further reduce dimensions by consolidating the 19 levels of parent race into 6 categories.

Change Variable Roles
The Metadata node enables you to change the roles or measurement scales of variables in mid-flow. Here RACEMOM and RACEDAD are rejected because their information has been consolidated in a variable called _NODE_, which is output by the Collapse RACE decision tree.

Tune Regression and Neural Network Models
The iteration in a neural network that minimizes validation data error is selected as the final model. The step in a stepwise regression that minimizes validation error is selected as the final model.
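Selecting the iteration or step with the lowest validation error can be sketched as follows. The error measure here is assumed to be the mean squared difference between the 0/1 target and the predicted probability; the function names and the error values are hypothetical.

```python
def average_squared_error(actual, predicted):
    """Mean squared difference between 0/1 targets and predicted probabilities."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def best_step(validation_errors):
    """Index of the iteration (or stepwise step) with the lowest validation error."""
    return min(range(len(validation_errors)), key=validation_errors.__getitem__)

# Hypothetical validation errors recorded after each iteration or step.
errors_by_step = [0.25, 0.21, 0.19, 0.20, 0.22]
chosen = best_step(errors_by_step)

# Hypothetical check of the error measure on four validation cases.
ase = average_squared_error([1, 0, 1, 0], [0.8, 0.2, 0.6, 0.4])
```

Validation error typically falls and then rises again as the model grows more complex, which is why the minimum is taken on validation data and not on training data.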

Explore Regression Results

Assess and Compare Models
Models can be assessed and compared using the Model Comparison node.

Assess and Compare Models
The neural network had the lowest error and the highest ROC index and lift. Regression results are similar to the neural network results. Individuals in the top 5% predicted most likely to have LBWT babies are 3.8 times more likely to have LBWT babies than the average.
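A lift figure like the one quoted for the top 5% can be computed by ranking cases on predicted probability and comparing the event rate in the top slice with the overall event rate. The sketch below assumes that simple definition; the function name and toy data are hypothetical, and it makes no adjustment for oversampling.

```python
def lift_at_depth(actual, posterior, depth=0.05):
    """Event rate among the top `depth` fraction of cases divided by the overall rate."""
    ranked = sorted(zip(posterior, actual), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * depth))   # at least one case in the slice
    top_rate = sum(a for _, a in ranked[:n_top]) / n_top
    overall_rate = sum(actual) / len(actual)
    return top_rate / overall_rate

# Hypothetical toy data: 20 cases, 4 events, scores decreasing with position.
posterior = [1.0 - i / 20 for i in range(20)]
actual = [1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
lift = lift_at_depth(actual, posterior, depth=0.25)
```

In this toy example the top quarter holds 3 events in 5 cases (rate 0.6) against an overall rate of 0.2, so the lift is about 3.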

Generate Scoring Code
A Score node can be added to generate Base SAS code that will apply model results to new patients. Score code is not simply a model equation; it includes all data preparation steps, such as replacement, missing value imputation, and collapsing of categorical variables.

Apply Scoring Code
Score code can be run against new data in Base SAS. Enterprise Miner is not required.

Apply Model Results to Decision Making
A data set containing predictions is produced by the score code. A cutoff is applied to these predicted probabilities to classify cases as LBWT or normal, and decisions are then applied. For example, every mother whose predicted probability of having an LBWT baby is greater than 0.10 will be given prenatal education and scheduled for special postnatal classes, care facilities, and so on.
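Applying the cutoff to a scored data set is a one-line rule, sketched here for concreteness. The function name, the patient identifiers, and the probabilities are hypothetical; only the 0.10 cutoff comes from the example above.

```python
def flag_for_intervention(predictions, cutoff=0.10):
    """Return identifiers whose predicted LBWT probability exceeds the cutoff."""
    return [patient_id for patient_id, probability in predictions if probability > cutoff]

# Hypothetical scored records: (patient identifier, predicted probability).
scored = [("P001", 0.04), ("P002", 0.27), ("P003", 0.10), ("P004", 0.12)]
flagged = flag_for_intervention(scored)
```

A probability exactly equal to the cutoff is not flagged, matching the "greater than 0.10" wording.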

THANK YOU Lorne Rothman Principal Statistician SAS Institute (Canada) Inc. Lorne.rothman@sas.com Copyright 2010 SAS Institute Inc. All rights reserved.