Didacticiel Études de cas



Similar documents
2 Decision tree + Cross-validation with R (package rpart)

Didacticiel Études de cas. Association Rules mining with Tanagra, R (arules package), Orange, RapidMiner, Knime and Weka.

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

In this tutorial, we try to build a roc curve from a logistic regression.

Predictive Data modeling for health care: Comparative performance study of different prediction models

Using multiple models: Bagging, Boosting, Ensembles, Forests

Data Science with R. Introducing Data Mining with Rattle and R.

Introduction Predictive Analytics Tools: Weka

An Introduction to WEKA. As presented by PACE

Didacticiel - Études de cas

SAS Analyst for Windows Tutorial

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc.

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Tutorial Segmentation and Classification

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS

APPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

How To Understand Data Mining In R And Rattle

Quick Start. Creating a Scoring Application. RStat. Based on a Decision Tree Model

1 Topic. 2 Scilab. 2.1 What is Scilab?

Didacticiel - Etudes de cas

How To Make A Credit Risk Model For A Bank Account

Developing Credit Scorecards Using Credit Scoring for SAS Enterprise Miner TM 12.1

Knowledge Discovery and Data Mining

Data Mining. Nonlinear Classification

Prof. Pietro Ducange Students Tutor and Practical Classes Course of Business Intelligence

ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node

Data Science with R Ensemble of Decision Trees

Better credit models benefit us all

WEKA KnowledgeFlow Tutorial for Version 3-5-8

Didacticiel - Études de cas

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Easily Identify Your Best Customers

Advanced analytics at your hands

COC131 Data Mining - Clustering

Predictive Modeling of Titanic Survivors: a Learning Competition

GeoGebra Statistics and Probability

Viewing and Troubleshooting Perfmon Logs

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

Data Analytics and Business Intelligence (8696/8697)

Paper AA Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM

Tutorial for proteome data analysis using the Perseus software platform

IBM SPSS Statistics 20 Part 1: Descriptive Statistics

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining with R. Decision Trees and Random Forests. Hugh Murrell

Hands-On Data Science with R Dealing with Big Data. Graham.Williams@togaware.com. 27th November 2014 DRAFT

Big Data Analytics. Benchmarking SAS, R, and Mahout. Allison J. Ames, Ralph Abbey, Wayne Thompson. SAS Institute Inc., Cary, NC

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

Fast Analytics on Big Data with H20

An Overview and Evaluation of Decision Tree Methodology

Azure Machine Learning, SQL Data Mining and R

Scatter Plots with Error Bars

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Directions for using SPSS

Regression Clustering

What is Data Mining? MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling

How to Deploy Models using Statistica SVB Nodes

New Work Item for ISO Predictive Analytics (Initial Notes and Thoughts) Introduction

2015 Workshops for Professors

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

ABSTRACT INTRODUCTION EXERCISE 1: EXPLORING THE USER INTERFACE GRAPH GALLERY

THE TOP TEN TIPS FOR USING QUALTRICS AT BYU

Data Mining with Weka

Business Intelligence. Tutorial for Rapid Miner (Advanced Decision Tree and CRISP-DM Model with an example of Market Segmentation*)

Creating Online Surveys with Qualtrics Survey Tool

2011 Data Miner Survey Highlights

Instructions for data-entry and data-analysis using Epi Info

The Predictive Data Mining Revolution in Scorecards:

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

DATA MINING TOOL FOR INTEGRATED COMPLAINT MANAGEMENT SYSTEM WEKA 3.6.7

Exploratory data analysis (Chapter 2) Fall 2011

Affiliation Security

NaviCell Data Visualization Python API

Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA

M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest

SPSS: Getting Started. For Windows

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Business Analytics and Credit Scoring

To do a factor analysis, we need to select an extraction method and a rotation method. Hit the Extraction button to specify your extraction method.

Data Exploration Data Visualization

Tutorial 2 Online and offline Ship Visualization tool Table of Contents

II. Methods X (i.e. if the system is convective or not). Y = 1 X ). Usually, given these estimates, an

Data Mining Tools. Jean- Gabriel Ganascia LIP6 University Pierre et Marie Curie 4, place Jussieu, Paris, Cedex 05 Jean-

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

MultiExperiment Viewer Quickstart Guide

Tutorial #7A: LC Segmentation with Ratings-based Conjoint Data

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

SAS Software to Fit the Generalized Linear Model

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

A Demonstration of Hierarchical Clustering

A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND

This means that any user from the testing domain can now logon to Cognos 8 (and therefore Controller 8 etc.).

Tutorial: Get Running with Amos Graphics

Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL

Transcription:

1 Theme Data Mining with R The rattle package. R (http://www.r project.org/) is one of the most exciting free data mining software projects of these last years. Its popularity is completely justified (see Kdnuggets Polls Data Mining/ Analytic Tools Used 2011). Among the reasons which explain this success, we distinguish two very interesting characteristics: (1) we can extend almost indefinitely the features of the tool with the packages; (2) we have a programming language which allows to perform easily sequences of complex operations. But this second property can be also a drawback. Indeed, some users do not want to learn a new programming language before being able to realize projects. For this reason, tools which allow to define the sequence of commands with diagrams (such as Tanagra, Knime, RapidMiner, etc.) still remain a valuable alternative with the data miners. In this tutorial, we present the "Rattle" package which allows to the data miners to use R without needing to know the associated programming language. All the operations are performed with simple clicks, such as for any software driven by menus. But, in addition, all the commands are stored. We can save them in a file. Then, in a new working session, we can easily repeat all the operations. Thus, we find one of the important properties which miss to the tools driven by menus. To describe the use of the rattle package, we perform an analysis similar to the one suggested by the rattle's author in its presentation paper (G.J. Williams, «Rattle : A Data Mining GUI for R», in The R Journal, volume 1 / 2, pages 45 55, December 2009, http://journal.r project.org/archive/2009 2/RJournal_2009 2_Williams.pdf). We perform the following steps: loading the data file; partitioning the instances into learning and test samples; specifying the types of the variables (target or input); computing some descriptive statistics; learning the predictive models from the learning sample; assessing the models on the test sample (confusion matrix, error rate, some curves). 2 Dataset We use the «heart» 1 data file. We want to explain the occurrence of the DISEASE from the characteristics of patients. We show here the first instances of the dataset. 1 http://eric.univ lyon2.fr/~ricco/tanagra/fichiers/heart_for_rattle.txt ; a description of this data file is available on the following website: http://archive.ics.uci.edu/ml/datasets/heart+disease 26 août 2011 Page 1 sur 14

3 Data Mining with Rattle 3.1 Loading the rattle package First, we load the rattle package [library()]. Then, we start the GUI with the command rattle(). > #loading the package > library(rattle) > #lauching the GUI > rattle() Into the R console, we have From now, we perform all the operations by clicking on the appropriate menu or button. All these operations are recorded as R commands by rattle. The rattle GUI is displayed. 26 août 2011 Page 2 sur 14

The use of rattle is always the same: we define the command by working in the appropriate tab (Data: load the dataset; Explore: some descriptive statistics; Test: some statistical tests, etc.); then, we launch the calculations by clicking on the EXECUTER button into the toolbar. 3.2 Importing the data file Into the Data tab, we click on the FILENAME button. We select the heart_for_rattle.txt data file. We specify the column separator: «SEPARATOR = \t». Then we click on EXECUTER. 26 août 2011 Page 3 sur 14

The dataset is loaded. The variable type is automatically detected from the distinct values into each column (discrete or continuous). We can define the TARGET attribute and the INPUT ones. Last, we specify the size of the training (70% of instances, drawn randomly) and test (30%) samples. 3.3 Dataset description Into the Explore tab, we obtain some descriptive statistics indicators about the variables (SUMMARY / SUMMARY option). For the discrete variables, rattle lists the values (levels). For the continuous ones, we have the min, max, mean, quartiles. All the indicators are computed on the learning sample. 26 août 2011 Page 4 sur 14

With the SUMMARY / DESCRIBE option, we obtain a more detailed description. Among others, for the continuous variables, the indications are useful to detect unusual values (outliers). Into the Explore tab still, with the DISTRIBUTIONS option, we obtain some graphical representations of the distributions. We have for instance the conditional box plots of AGE and CHOL according to the values of DISEASE. 26 août 2011 Page 5 sur 14

We can obtain also the conditional distribution functions. 26 août 2011 Page 6 sur 14

About the discrete variables, we can obtain the Mosaic of the variables, according still to the values of the target attribute. For instance, about SEX, the men (MALE) are more numerous than women (FEMALE) into the sample; and the proportion of disease is higher for the men. We can also obtain the correlations about the continuous input attributes. The correlations are described in a hierarchical structure. It is useful for instance for the detection of the redundant variables. 26 août 2011 Page 7 sur 14

3.4 Data transformation The "Transform" tab is dedicated to the variable transformation. Some usual operators are available (e.g. logarithm, rank, etc.). 3.5 Supervised learning This step is at the heart of our analysis. We select the "Model" tab. We want to evaluate three methods: decision tree induction, random forest, logistic regression. 26 août 2011 Page 8 sur 14

About the decision tree, rattle uses the rpart command from the rpart package. We note the default parameters used. We click on the EXECUTER button. We obtain the rules associated to the tree by clicking on the RULES button. We can obtain also a graphical representation of the tree with the DRAW option. 26 août 2011 Page 9 sur 14

About the random forest approach, rattle uses the randomforest command from the randomforest package. We obtain the following results with the default settings. The OOB (out of bag) error estimation is 16.5%. We will compare this value to the one obtained on the test set below. About the logistic regression, we use the glm() command. It automatically transforms the discrete predictors using dummy variables. We obtain the following results. 26 août 2011 Page 10 sur 14

3.6 Measuring the generalization performance Last step of our analysis, we want to evaluate the performances of the classifiers on the test sample (30% of the whole dataset). We activate the Evaluate tab. First, we want to obtain the confusion matrix and the associated error rate. We select the Error Matrix option. For the Data item, we must select the Testing option. Only the models learned into the Model tab are available here. We click on the EXECUTER menu. We observe that the logistic regression is the better here with a test error rate equal to 18.18%. We note also that the OOB error rate (16.5%) seems underestimate the error rate for the random forest (20.45% on the test set). But, because the test set size is small, and the test error rate being also an estimation of the true error rate, we consider with many cautions this result. 26 août 2011 Page 11 sur 14

Actually, the error rate is not a good criterion here. We note that the differences between the methods are based only on one misclassified instance. In our context, it is perhaps more interesting to use the ROC curve which highlights the ability of the methods to assign higher score to the positive instances compared with the negative ones (see http://data miningtutorials.blogspot.com/2008/11/roc curve for classifier comparison.html or http://data miningtutorials.blogspot.com/2008/10/computing roc curve.html). We select the ROC option under rattle. 26 août 2011 Page 12 sur 14

According the AUC criterion, the decision tree is definitely the worst compared with the two other classifiers, which are similar in terms of performance. It is not surprising. We know that the decision tree is not well adapted to the scoring process. 3.7 R commands associated to the treatments 26 août 2011 Page 13 sur 14

One of the main criticisms which we make for the software driven by menu is that once the process is finalized, when we close the software, we have no recollection of the sequence of operations we performed. In the next working session, it is complicated to reproduce them as before. It is necessary to have an excellent memory, or to have taken care of noting all that we made. Rattle allows to overtake this drawback by translating all the operations (corresponding to a click on the EXECUTER menu) performed by the user in a sequence of R commands. We can visualize them in the "Log" tab. We can store these commands (and the comments) into a file. In the next working session, it is very easy to perform the same data processing by loading these commands. 4 Rattle under Linux (Ubuntu) The installation of the Rattle package under Linux is not easy. We must follow carefully the description available on the website. In case of problem, a troubleshooting procedure is proposed. This is the one that I used (see http://datamining.togaware.com/survivor/install_gnu_linux.html). When the installation is finalized, Rattle works properly under Linux (Ubuntu) as we see below. 5 Conclusion In this tutorial, we showed that it was possible to use R without knowledge about its programming language with the help of the rattle package. This package is rather specialized about the data mining methods. For the statisticians, there are other packages such as "R Commander". 26 août 2011 Page 14 sur 14