In this tutorial, we try to build a roc curve from a logistic regression.



Similar documents
Didacticiel Études de cas. Association Rules mining with Tanagra, R (arules package), Orange, RapidMiner, Knime and Weka.

WEKA KnowledgeFlow Tutorial for Version 3-5-8

2 Decision tree + Cross-validation with R (package rpart)

1 Topic. 2 Scilab. 2.1 What is Scilab?

Advanced analytics at your hands

Didacticiel Études de cas

DATA MINING TOOL FOR INTEGRATED COMPLAINT MANAGEMENT SYSTEM WEKA 3.6.7

Tutorial #7A: LC Segmentation with Ratings-based Conjoint Data

Chapter 4 Displaying and Describing Categorical Data

Data Mining. SPSS Clementine Clementine Overview. Spring 2010 Instructor: Dr. Masoud Yaghini. Clementine

How To Understand How Weka Works

Supervised DNA barcodes species classification: analysis, comparisons and results. Tutorial. Citations

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

Microsoft Project 2010

Developing Credit Scorecards Using Credit Scoring for SAS Enterprise Miner TM 12.1

Azure Machine Learning, SQL Data Mining and R

Regression Clustering

Prof. Pietro Ducange Students Tutor and Practical Classes Course of Business Intelligence

VWF. Virtual Wafer Fab

APPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING

Create Reports Utilizing SQL Server Reporting Services and PI OLEDB. Tutorial

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

Jump-Start Tutorial For ProcessModel

Data Mining with Weka

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Expedia

Data mining techniques: decision trees

WEKA Explorer User Guide for Version 3-4-3

BIG DATA What it is and how to use?

Tutorial Segmentation and Classification

Example 3: Predictive Data Mining and Deployment for a Continuous Output Variable

Data Mining. Nonlinear Classification

How to Deploy Models using Statistica SVB Nodes

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Plot and Solve Equations

COC131 Data Mining - Clustering

Data quality in Accounting Information Systems

Information Technology Solutions

Didacticiel - Études de cas

Counters & Polls. Dynamic Content 1

Tutorial Bass Forecasting

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Polynomial Neural Network Discovery Client User Guide

Benefits of Upgrading to Phoenix WinNonlin 6.2

Petrel TIPS&TRICKS from SCM

DEPLOYING A VISUAL BASIC.NET APPLICATION

Expense Reports Training Document. Oracle iexpense

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley

Subscribe to RSS in Outlook Find RSS Feeds. Exchange Outlook 2007 How To s / RSS Feeds 1of 7

Implementation of Breiman s Random Forest Machine Learning Algorithm

Creating a Tableau Data Visualization on Cincinnati Crime By Jeffrey A. Shaffer

Crystal Converter User Guide

Exercise 10: Basic LabVIEW Programming

About SharePoint Server 2007 My Sites

Final Software Tools and Services for Traders

Tableau Your Data! Wiley. with Tableau Software. the InterWorks Bl Team. Fast and Easy Visual Analysis. Daniel G. Murray and

Audit TM. The Security Auditing Component of. Out-of-the-Box

Combination Chart Extensible Visualizations. Product: IBM Cognos Business Intelligence Area of Interest: Reporting

TIBCO Spotfire Business Author Essentials Quick Reference Guide. Table of contents:

Predict Influencers in the Social Network

ArcGIS Tutorial: Adding Attribute Data

Ad Hoc Reporting. Usage and Customization

Intellect Platform - The Workflow Engine Basic HelpDesk Troubleticket System - A102

How To Use Query Console

MadCap Software. SharePoint Guide. Flare 11.1

Oracle Data Mining Hands On Lab

New Work Item for ISO Predictive Analytics (Initial Notes and Thoughts) Introduction

InfiniteInsight 6.5 sp4

Data Mining III: Numeric Estimation

Universal Simple Control, USC-1

Studying Auto Insurance Data

Oracle Data Miner (Extension of SQL Developer 4.0)

SPSS: Getting Started. For Windows

An Introduction to WEKA. As presented by PACE

Introduction to SAS Business Intelligence/Enterprise Guide Alex Dmitrienko, Ph.D., Eli Lilly and Company, Indianapolis, IN

Data Mining - Evaluation of Classifiers

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

Product Guide Nintex. All rights reserved. Errors and omissions excepted.

Tutorial: Mobile Business Object Development. Sybase Unwired Platform 2.2 SP02

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Detecting Spam. MGS 8040, Data Mining. Audrey Gies Matt Labbe Tatiana Restrepo

Working with SQL Server Integration Services

Introduction to IBM SPSS Statistics

Leveraging Ensemble Models in SAS Enterprise Miner

Bill Burton Albert Einstein College of Medicine April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1

Knowledge Discovery and Data Mining

Data Analysis Tools. Tools for Summarizing Data

Tutorial: Mobile Business Object Development. SAP Mobile Platform 2.3

Didacticiel - Études de cas

Data Mining Tools. Jean- Gabriel Ganascia LIP6 University Pierre et Marie Curie 4, place Jussieu, Paris, Cedex 05 Jean-

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

C:\Users\<your_user_name>\AppData\Roaming\IEA\IDBAnalyzerV3

CSC 177 Fall 2014 Team Project Final Report

Data Integration with Talend Open Studio Robert A. Nisbet, Ph.D.

Software Application Tutorial

Introduction Predictive Analytics Tools: Weka

Transcription:

Subject In this tutorial, we try to build a roc curve from a logistic regression. Regardless the software we used, even for commercial software, we have to prepare the following steps when we want build a ROC curve. Import the dataset in the soft; Compute descriptive statistics; Select target and input attributes; Select the positive value of the target attribute; Split the dataset into learning (e.g. 70%) and test set (30%); Choose the learning algorithm. Be careful, the softwares can have different implementation and present a slightly different results; Build the prediction model on the learning set and visualize the results; Build the ROC curve on the test set. According the softwares, the progression can be different but it is clear that we must, explicitly or not, process these steps. Dataset We use the dataset from the Komarek s website which implement the LR-TRIRLS logistic regression library (http://komarix.org/ac/lr). The DS1-10 dataset contains 26733 examples, 10 continuous descriptors; the frequency of the positive value of the target attribute is 5%. We use ARFF file format for WEKA and TANAGRA, TXT for ORANGE (TANAGRA can also handle TXT file format). Building a ROC curve with WEKA The number of methods is impressive in WEKA, but it is also the main weakness of this software, a through initiation is necessary. The needed components for the construction of a roc curve are not obvious. Page 1 / 9

Execution of WEKA When we execute WEKA, a dialog box enables to choose the execution mode. We select the KNOWLEDGE FLOW option. We have used the 3.4.7 version in this tutorial; the results can be slightly different on the others versions. The organization of the software is classic. In the top of the window, we find the tools, machine learning components, in some palettes. The KNOWLEDGE FLOW layout allows us to define the succession of data treatments. Tool palettes Components : Data Mining tools << Stream pane Load the dataset The ARFF LOADER component (DATASOURCES palette) enables to load a dataset. We add it in the workspace, we can select the file with the CONFIGURE menu of the component. Descriptive statistics The ATTRIBUT SUMMARIZER component (VISUALIZATION) shows the data distribution; we can quickly detect abnormalities in the dataset. We add this component and we to connect ARFF LOADER to this new component (DATASET connection). Page 2 / 9

The treatment will be executed when we select the START LOADING menu of the ARFF LOADER component. To see the results, we select the SHOW SUMMARIES option of ATTRIBUTES SUMMARIZER. We see that there are not abnormalities on the descriptors; we see also that we have imbalanced dataset (804 positive vs. 25929 negative ). Page 3 / 9

Define target and input attributes The last column is the default class attribute for WEKA, but we can also explicitly select the column of the class attribute. In this case, we use the CLASS ASSIGNER component (EVALUATION palette). We add this component in our diagram; we connect ARFF LOADER to this new component (DATASET connection). We select the CONFIGURE menu in order to select the right class attribute. In the following step, we must specify the positive value of the class attribute. We use the CLASSVALUE PICKER (EVALUATION palette) component and select the TRUE value in the parameter dialog box.

Subdivision of the dataset into learning and test set We want to build our prediction model on the 70% of the whole dataset, and compute the ROC curve on the remaining. So, we set the TRAINTEST SPLIT MAKER (EVALUATION) in the diagram and configure its parameters. Page 5 /9

Logistic regression The logistic regression component is in the CLASSIFIERS palette. We set it in the diagram, we connect twice the TRAIN TEST SPLIT MAKER to this new component: twice because we must use together the training and the test set which are produced by the same component. The results To see the results of the regression, we connect the LOGISTIC component to TEXT VIEWER (VISUALIZATION palette) that we set in the diagram. 20/02/2006 Page 6 sur 21

We execute again the diagram (START LOADING of ARFF LOADER component). The SHOW RESULTS of TEXT VIEWER opens a new window with the results of the learning process. We obtain the regression coefficients. ROC curve In order to evaluate the learning process, we add the CLASSIFIER PERFORMANCE EVALUATOR component (EVALUATION). We connect the BATCH CLASSIFIER output of LOGISTIC to this new component. Page 7 / 9

We set MODEL PERFORMANCE CHART (VISUALIZATION) in the diagram; we use the THRESOLD DATA (!) output of the CLASSIFIER PERFORMANCE EVALUATOR when we connect the two components. We run again the diagram with the START LOADING menu of ARFF LOADER. We select the SHOW PLOT menu of the last component (MODEL PERFORMANCE CHART). We obtain the ROC curve. Page 8 / 9

Page 9 / 9