Example 3: Predictive Data Mining and Deployment for a Continuous Output Variable



STATISTICA Data Miner includes a complete deployment engine with various options for deploying solutions derived from predictive data mining projects. This example illustrates the basic mechanism by which STATISTICA Data Miner automatically generates all the information necessary for deployment, i.e., for automatically predicting values for new observations based on the parameters of one or more estimated models.

The example is based on the data file Patients.sta (also used in Example 3 of Nonlinear Estimation Analysis) reported in Neter, Wasserman, and Kutner (1985, page 469). Suppose you want to predict the number of days that patients are likely to spend in a hospital based on some prognostic information. The Patients.sta data file contains observed ("learning") data for 15 patients on two variables: the number of days that each patient was hospitalized (variable Days) and an index of the prognosis for recovery for each patient (variable Prognosis; larger values reflect a better prognosis).

The purpose of this project is to build a deployed system that allows users to enter data for variable Prognosis and compute an estimate of the number of days the respective patient will likely stay in the hospital. In similar real-world applications of STATISTICA Data Miner, you would most likely have many variables related to patients' prognosis for recovery; those variables could simply be treated as additional predictors. If many thousands of possible predictors are available, you may want to use the Feature Selection and Variable Screening methods of STATISTICA to preselect likely predictors before applying analyses that build models for prediction (such as neural networks, regression, etc.).

Also, in real-world applications the input data are likely to be "noisy," requiring some initial cleaning and filtering (as illustrated in Example 1). The data may also reside in a remote database that needs to be connected to STATISTICA Data Miner for in-place database processing. However, this example only illustrates the basic mechanism of building data miner projects for prediction and deployment.

Setting up the Project; Connecting the Data

Start by selecting Build Your Own Project from the Statistics - Data Mining - General Modeler and Multivariate Explorer submenu (see also Data Mining Tools). Instead of Data Miner - All Procedures, in this case we select the more specialized General Modeler and Multivariate Explorer. This automatically connects the General Modeler and Multivariate Explorer Node Browser configuration to the project, with the specialized nodes for automatic deployment. Note that you can also select these nodes from the general All Procedures Node Browser configuration, but you would have to scroll to find them. As described further in Node Browser, each project is associated with a default Node Browser configuration; however, you can also choose nodes from any of the Node Browsers currently open and insert them into the currently active data miner workspace.

Next, click the New Data Source button on the data miner workspace and open the example data file Patients.sta.

On the Select dependent variables and predictors dialog, click the Variables button and select variable Days as the Dependent; continuous variable, and variable Prognosis as the Predictor; continuous variable. Then close the dialogs (click the OK button on the variable selection dialog and on the Select dependent variables and predictors dialog) to insert this data source into the Data Acquisition area of the data miner workspace.

In a real-world application, at this point we would carefully review the input data to ensure that the data are "clean," i.e., do not contain erroneous values, miscoded entries, etc. (see also Example 1, or Crucial Concepts in Data Mining). For this example we will skip this step and proceed directly to the data analysis (model building) portion of the project.

Selecting and Inserting Analysis Nodes

Because we are not sure about the nature of the relationship between the prognostic variables (a single variable in this example) and the outcome variable of interest (the number of Days likely to be spent in the hospital), we will select several linear and nonlinear prediction methods to tackle this problem. We will select these from the Regression Modeling and Multivariate Exploration folder of the Node Browser, so that the estimated models (solutions) are automatically available for deployment, i.e., for prediction of new observations (to predict the likely length of the hospital stay from prognostic information as patients check into the hospital).

Specifically, select for this example the Standard Multiple Regression with Deployment node and the two neural network nodes. Click the Insert into workspace button to insert these nodes into the workspace; if the data source is currently highlighted, these nodes will automatically be connected to the data file. Then click Run.
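The idea of fitting more than one kind of model to the same single predictor can be sketched outside the GUI. The following Python fragment is only an illustration, not what STATISTICA does internally: it fits an ordinary least-squares line and, as a simple stand-in for the nonlinear (neural network) methods, an exponential curve fitted on the log scale. The function names are hypothetical, and the numbers are invented toy values, not the contents of Patients.sta.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def make_linear_model(xs, ys):
    """Return a prediction function for the fitted straight line."""
    intercept, slope = fit_line(xs, ys)
    return lambda x: intercept + slope * x


def make_exponential_model(xs, ys):
    """Fit log(y) = a + b * x, then predict with exp(a + b * x).

    This is a simple nonlinear alternative used for illustration only;
    it is NOT a neural network.
    """
    a, b = fit_line(xs, [math.log(y) for y in ys])
    return lambda x: math.exp(a + b * x)


def rmse(model, xs, ys):
    """Root mean squared error of a model on the given observations."""
    squared = [(model(x) - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


# Invented toy values (assumption): NOT taken from Patients.sta.
prognosis = [10.0, 20.0, 30.0, 40.0, 50.0]
days = [40.0, 28.0, 19.0, 13.0, 9.0]

models = {
    "linear": make_linear_model(prognosis, days),
    "exponential": make_exponential_model(prognosis, days),
}
training_error = {name: rmse(m, prognosis, days) for name, m in models.items()}
```

Comparing the entries of `training_error` plays the same role as inspecting the observed and predicted values in the Training... spreadsheets: it shows which functional form follows the learning data more closely.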

After a few seconds, the program will have fitted a linear regression model, a multilayer perceptron neural network, and a radial basis function neural network. You can review the results by double-clicking on the workbook icons in the Reports section of the data miner workspace, or change specific analysis parameters by double-clicking on the respective analysis icons. You can also review the predicted values in the spreadsheet nodes (icons) labeled Training..., which contain the observed and predicted values for each respective model; it is often very informative to connect additional graphics nodes to these data sources to inspect visually the quality of the fit for each model (see also Example 2: Visual Data Mining). For this example, however, we will proceed directly to the deployment stage.

Computing Predicted Values for New Data

Suppose that the purpose of this project is to implement an automatic system for predicting the number of Days a patient is likely to stay in the hospital, based on prognostic information. Because we chose analysis nodes explicitly labeled ... with Deployment from the Regression Modeling and Multivariate Exploration folder, the information required for deployment, i.e., for making predictions from new data, is readily available at this point.

Specifying Data for Deployment

Suppose we have prognostic information (data) for 3 new patients, and that information is entered (or transferred automatically) into a data file NewPatients.sta. You can create this data file for this example; when you do, make sure that you use the same variable names as those used in the data file from which the current models were estimated, i.e., name the variables Days and Prognosis, respectively.

Next, insert this new data file as a new data source into the Data Acquisition area of the data miner project in the same manner as the original data file. On the Select dependent variables and predictors dialog, select the same variables as before: specify variable Days as the continuous dependent variable, and variable Prognosis as the continuous predictor variable. In addition, make sure to select the check box Data for deployed project; do not reestimate models.
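The requirement that the new file use the same variable names and the same variable roles as the training file can be expressed as a small check. The Python sketch below is a hypothetical helper, not a STATISTICA function; the role strings and the misnamed variable `Prog` are assumptions made for illustration.

```python
def check_deployment_structure(training_roles, deployment_roles):
    """Compare variable names and roles of a deployment file with the training file.

    Both arguments map a variable name to its role. The function returns a
    list of problems; an empty list means the two structures match.
    """
    problems = []
    for name, role in training_roles.items():
        if name not in deployment_roles:
            problems.append(f"missing variable: {name}")
        elif deployment_roles[name] != role:
            problems.append(
                f"variable {name}: expected {role}, found {deployment_roles[name]}"
            )
    for name in deployment_roles:
        if name not in training_roles:
            problems.append(f"unexpected variable: {name}")
    return problems


# Roles as selected for the training data in this example.
training_roles = {
    "Days": "dependent; continuous",
    "Prognosis": "predictor; continuous",
}

# A deployment file with matching structure, and one with a misnamed
# predictor (hypothetical mistake).
matching_roles = dict(training_roles)
misnamed_roles = {
    "Days": "dependent; continuous",
    "Prog": "predictor; continuous",
}

ok_report = check_deployment_structure(training_roles, matching_roles)
bad_report = check_deployment_structure(training_roles, misnamed_roles)
```

Here `ok_report` is empty, while `bad_report` lists the missing and the unexpected variable, which is the kind of mismatch that would prevent the stored models from being applied to the new data.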

As also described in the section on Deploying Solutions, the nodes labeled ... with Deployment will automatically apply the most recently estimated model to the new data to compute predicted values.

Deployment: Computing Predicted Values

After inserting the new data source marked for deployment into the workspace, connect it to the analysis nodes in this project. At this point you can also disable the other arrows (from the data source used to estimate the models) so that the models are not reestimated when the project is updated (see also Data Miner Workspace). Then click Run to compute predicted values.

The predicted values are available in the data sources (spreadsheet documents) generated by the analysis nodes. By default these are labeled Testing_xxx, where xxx is usually an abbreviation that references the respective method and the ID of the node that generated the prediction. For example, right-click on the Testing_RRBFx data source and select View Document from the shortcut menu (see also STATISTICA Data Miner Workspace Options).
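The distinction between a data source used for estimation and one marked for deployment can be illustrated with a small class that re-fits its model only when the incoming data are not marked for deployment. This is a minimal sketch under stated assumptions: the class name, the `for_deployment` flag, and the training values are hypothetical and do not reflect how STATISTICA implements its nodes.

```python
class RegressionNodeWithDeployment:
    """Hypothetical stand-in for an analysis node that supports deployment."""

    def __init__(self):
        self.coefficients = None  # (intercept, slope) from the most recent fit
        self.fit_count = 0        # how many times the model was estimated

    def _estimate(self, rows):
        """Fit Days = intercept + slope * Prognosis by ordinary least squares."""
        xs = [x for x, _ in rows]
        ys = [y for _, y in rows]
        n = len(rows)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
        slope = sxy / sxx
        self.coefficients = (mean_y - slope * mean_x, slope)
        self.fit_count += 1

    def run(self, rows, for_deployment=False):
        """Return predictions; re-estimate first unless the data are for deployment."""
        if not for_deployment:
            self._estimate(rows)
        if self.coefficients is None:
            raise RuntimeError("no estimated model available for deployment")
        intercept, slope = self.coefficients
        return [intercept + slope * x for x, _ in rows]


# Rows are (Prognosis, Days) pairs; the values are invented for illustration.
node = RegressionNodeWithDeployment()
node.run([(10.0, 40.0), (20.0, 30.0), (30.0, 20.0)])

# New observation: Days is not yet known, so it is given as None.
new_predictions = node.run([(25.0, None)], for_deployment=True)
```

The second call reuses the stored coefficients, so `fit_count` stays at 1; this mirrors the effect of the Data for deployed project; do not reestimate models check box.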

The predictions from the radial basis function network are shown in the first column of the spreadsheet. As you can see, patient G. Hill had the best prognosis (highest value for variable Prognosis), and is predicted to stay in the hospital between 9 and 10 days.

Predicting new observations when observed values are not (yet) available

In general, one of the main purposes of predictive data mining (see Crucial Concepts in Data Mining) is to allow accurate prediction (or predicted classification) of new observations for which observed values or classifications are not (yet) available. When connecting data for deployment (prediction or predicted classification) to the nodes for Classification and Discrimination or Regression Modeling and Multivariate Exploration, make sure that the structure of the input file for deployment is the same as that used for building the models (see also the Data for deployed project; do not reestimate models option on the Select dependent variables and predictors dialog). Specifically, make sure that the same numbers and types of predictor variables are specified, that a (continuous or categorical) dependent variable is specified (even if all values for that variable are missing), and that the variable names match those in the data file used to build the models (this is particularly important for the deployment of neural networks, which rely on this information).

Computing an Average Prediction

The Node Browser folder Regression Modeling and Multivariate Exploration contains a node called Compute Best Prediction From All Models.
This node automatically takes the most recent deployment information generated by the nodes (see also Analysis Nodes with Automatic Deployment) and computes predicted values from each model. The node can also compute an average prediction over all current models and, for advanced applications (see also Example 4), even choose the best prediction from all models currently available (see also the terms boosting, bagging, and meta-learning). Insert this node into the current workspace and connect it to the data source containing the Prognosis data for the new observations (for prediction); then choose Run to Node (or SHIFT F5) to generate the predictions. You can right-click on the generated data source and choose View Document to display the spreadsheet with the predictions.

The average prediction from all three models (methods) is also computed automatically. Note that these results may be slightly different for your analyses because, for example, the neural network algorithms use (by default) random sub-sampling to create validation samples for "steering" the estimation algorithm (e.g., to avoid over-fitting and to terminate the estimation procedures).

Deploying the Solution to the "Field"

As described in Analysis Nodes with Automatic Deployment, the deployment information is kept along with the data miner project in a Global Dictionary, which is a workspace-wide repository of parameters. (You can review the parameters currently available in the global dictionary via the Edit Global Dictionary Parameters dialog.) This means that you could now save this data miner project under a different name, and then delete all analysis nodes and related information except the Compute Best Prediction from All Models node and the data source with new observations (marked for deployment). A user could then simply enter values for variable Prognosis and run this project (with the Compute Best Prediction from All Models node only) to quickly compute predicted values for new patients. Because STATISTICA Data Miner, like all analyses in STATISTICA, can be called from other applications, advanced applications could involve projects like this one being called automatically, with data passed to them from some other (e.g., data entry) application.

Making sure that deployment info is up to date

To reiterate, the deployment information for the nodes named ... with Deployment is in general stored in various forms locally along with each node, as well as globally, where it is visible to other nodes in the same project. This is an important point to remember because, for Classification and Discrimination as well as Regression Modeling and Multivariate Exploration, the node Compute Best Prediction from All Models computes predictions based on all deployment information currently available in the global dictionary.
Therefore, when building models for deployment using these options, make sure that all deployment information is up to date, i.e., based on models trained on the most current set of data. You can also use the Clear All Deployment Info nodes in the data miner workspace to programmatically clear out-of-date deployment information every time the project is updated ("re-trained").

See also Data Mining Definition, Data Mining with STATISTICA Data Miner, Structure and User Interface of STATISTICA Data Miner, STATISTICA Data Miner Summary, and Getting Started with STATISTICA Data Miner.
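The interplay described above between a shared store of deployment information, an average prediction over all stored models, and the need to clear out-of-date entries can be sketched in a few lines of Python. Everything in this fragment is a hypothetical illustration: the class and method names are not STATISTICA APIs, the `data_version` tag is an assumed way of tracking which data a model was trained on, and the two prediction functions are invented stand-ins for fitted models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DeploymentInfo:
    predict: Callable[[float], float]  # maps a Prognosis value to predicted Days
    data_version: int                  # which version of the training data was used


class GlobalDictionary:
    """Hypothetical stand-in for a workspace-wide store of deployment info."""

    def __init__(self) -> None:
        self._entries: Dict[str, DeploymentInfo] = {}

    def store(self, node_name: str, predict: Callable[[float], float],
              data_version: int) -> None:
        """Save (or overwrite) the deployment info produced by one node."""
        self._entries[node_name] = DeploymentInfo(predict, data_version)

    def clear_all(self) -> None:
        """Remove all deployment info, e.g. before the project is re-trained."""
        self._entries.clear()

    def stale_nodes(self, current_version: int) -> List[str]:
        """Names of nodes whose models were not trained on the current data."""
        return sorted(
            name for name, info in self._entries.items()
            if info.data_version != current_version
        )

    def predict_all(self, prognosis: float) -> Dict[str, float]:
        """Prediction from every model currently stored."""
        return {name: info.predict(prognosis) for name, info in self._entries.items()}

    def average_prediction(self, prognosis: float) -> float:
        """Unweighted mean of the predictions of all stored models."""
        predictions = self.predict_all(prognosis)
        if not predictions:
            raise ValueError("no deployment information available")
        return sum(predictions.values()) / len(predictions)


# Invented stand-ins for two fitted models (assumption, for illustration only).
dictionary = GlobalDictionary()
dictionary.store("regression", lambda x: 30.0 - 0.5 * x, data_version=1)
dictionary.store("network", lambda x: 28.0 - 0.4 * x, data_version=2)

average = dictionary.average_prediction(20.0)
stale = dictionary.stale_nodes(current_version=2)
```

Because `average_prediction` uses every entry in the store, the model listed in `stale` still contributes to the average until it is re-trained or `clear_all` is called, which is why out-of-date deployment information should be cleared whenever the project is updated.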