Example 3: Predictive Data Mining and Deployment for a Continuous Output Variable

STATISTICA Data Miner includes a complete deployment engine with various options for deploying solutions derived from predictive data mining projects. This example illustrates the basic mechanism by which STATISTICA Data Miner automatically generates all information necessary for deployment, i.e., for automatically predicting values for new observations from the parameters of one or more estimated models.

This example is based on the example data file Patients.sta (also used in Example 3 of Nonlinear Estimation Analysis), reported in Neter, Wasserman, and Kutner (1985, page 469). Suppose you want to predict the number of days that patients are likely to spend in a hospital based on some prognostic information. The Patients.sta data file contains observed ("learning") data for 15 patients on two variables: the number of days that each patient was hospitalized (variable Days) and an index of the prognosis for recovery for each patient (variable Prognosis; larger values reflect a better prognosis). The purpose of this project is to build a deployed system that allows users to enter data for variable Prognosis and compute an estimate of the number of days the respective patient will likely stay in the hospital.

In similar real-world applications of STATISTICA Data Miner, you would most likely have many variables that are related to patients' prognosis for recovery; those variables could simply be treated as additional predictors. If many thousands of possible predictors are available, you may want to use the Feature Selection and Variable Screening methods of STATISTICA to preselect likely predictors before applying analyses that build models for prediction (such as neural networks, regression, etc.).
Also, in real-world applications the input data are likely to be "noisy," requiring some initial cleaning and filtering (as illustrated in Example 1). The data may also reside in a remote database that needs to be connected to STATISTICA Data Miner for in-place database processing. However, this example will illustrate only the basic mechanism of building data miner projects for prediction and deployment.

Setting up the Project; Connecting the Data. Start by selecting Build Your Own Project from the Statistics - Data Mining - General Modeler and Multivariate Explorer submenu (see also Data Mining Tools). Instead of Data Miner - All Procedures, in this case we select the more specialized General Modeler and Multivariate Explorer. This will automatically "connect" the General Modeler and Multivariate Explorer Node Browser configuration to the project, with the specialized nodes for automatic deployment. Note that you can also select these nodes from the general All Procedures Node Browser configuration, but you would have to scroll to find them. As described further in Node Browser, each project is associated with a default Node Browser configuration; however, you can also insert nodes into the currently active data miner workspace from any of the Node Browsers that are currently open.

Next, click the New Data Source button on the data miner workspace and open the example data file Patients.sta.

On the Select dependent variables and predictors dialog, click the Variables button and select variable Days as the Dependent; continuous variable, and variable Prognosis as the Predictor; continuous variable. Then close the dialogs (click the OK button on the variable selection dialog and on the Select dependent variables and predictors dialog) to insert this data source into the Data Acquisition area of the data miner workspace.

In a real-world application, at this point we would want to carefully review the input data to ensure that the data are "clean," i.e., do not contain erroneous values, miscoded entries, etc. (see also Example 1, or Crucial Concepts in Data Mining). However, for this example we will skip this step and proceed directly to the data analysis (model building) portion of the project.

Selecting and Inserting Analysis Nodes. Because we are not sure about the nature of the relationship between the prognostic variables (a single variable in this example) and the outcome variable of interest (the number of Days likely to be spent in the hospital), we will select several linear and nonlinear prediction methods to tackle this problem. We will select these from the Regression Modeling and Multivariate Exploration folder of the Node Browser, so that the estimated models (solutions) are automatically available for deployment, i.e., for prediction of new observations (to predict the likely length of the hospital stay from prognostic information as patients check into the hospital). Specifically, for this example select the Standard Multiple Regression with Deployment node and the two neural network nodes. Click the Insert into workspace button to insert these nodes into the workspace; if the data source is also highlighted at that point, these nodes will automatically be connected to the data file. Then click Run.

After a few seconds, the program will fit a linear regression model, a multilayer perceptron neural network, and a radial basis function neural network. You can review the results by double-clicking on the workbook icons in the Reports section of the data miner workspace, or change specific analysis parameters by double-clicking on the respective analysis icons. You can also review the predicted values in the spreadsheet nodes (icons) labeled Training..., which contain the observed and predicted values for each respective model; it is often very informative to connect additional graphics nodes to these data sources in order to visually inspect the quality of the fit for each model (see also Example 2: Visual Data Mining). However, for this example, we will proceed directly to the deployment stage.

Computing Predicted Values for New Data. Suppose that the purpose of this project is to implement an automatic system for predicting the number of Days a patient is likely to stay in the hospital, i.e., to predict the length of the hospital stay based on prognostic information. Because we chose analysis nodes explicitly labeled ...with Deployment from the Regression Modeling and Multivariate Exploration folder, the information required for deployment, i.e., for making predictions from new data, is readily available to us at this point.

Specifying Data for Deployment. For example, suppose we have prognostic information (data) for 3 new patients, and that information is entered (or transferred automatically) into a data file NewPatients.sta. You can create this data file for this example; when you do, make sure that you use the same variable names as those used in the data file from which the current models were estimated, i.e., make sure to name the variables DAYS and PROGNOSIS, respectively.
Next, insert this new data file as a new data source into the Data Acquisition area of the data miner project in the same manner as the original data file. On the Select dependent variables and predictors dialog, select the same variables as before: specify variable Days as the continuous dependent variable, and variable Prognosis as the continuous predictor variable. In addition, make sure to select the Data for deployed project; do not reestimate models check box.
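The requirement that the deployment file mirror the training file can be pictured as a simple comparison of variable lists. The following is only a minimal sketch in plain Python, not STATISTICA code: the `Variable` class and the `schema_mismatches` function are hypothetical names introduced for illustration, and the case-insensitive name comparison is an assumption of this sketch (the text above writes both Days and DAYS).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """Hypothetical description of one variable in a data source."""
    name: str
    role: str  # "dependent" or "predictor"
    kind: str  # "continuous" or "categorical"


def schema_mismatches(training, deployment):
    """List the differences between the training and deployment variables."""
    problems = []
    # Assumption of this sketch: names are compared without regard to case.
    train_by_name = {v.name.upper(): v for v in training}
    deploy_by_name = {v.name.upper(): v for v in deployment}
    for key, var in train_by_name.items():
        other = deploy_by_name.get(key)
        if other is None:
            problems.append(f"missing variable: {var.name}")
        elif (other.role, other.kind) != (var.role, var.kind):
            problems.append(f"role/type differs for: {var.name}")
    for key, var in deploy_by_name.items():
        if key not in train_by_name:
            problems.append(f"unexpected variable: {var.name}")
    return problems


training = [
    Variable("Days", "dependent", "continuous"),
    Variable("Prognosis", "predictor", "continuous"),
]
matching = [
    Variable("DAYS", "dependent", "continuous"),
    Variable("PROGNOSIS", "predictor", "continuous"),
]
renamed = [
    Variable("Days", "dependent", "continuous"),
    Variable("Prog", "predictor", "continuous"),  # deliberately misnamed
]

print(schema_mismatches(training, matching))  # no differences reported
print(schema_mismatches(training, renamed))   # reports the misnamed predictor
```

A check of this kind, run before new data are connected, would flag a misnamed or missing variable early rather than after predictions have been computed.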

As also described in the section on Deploying Solutions, the nodes labeled ...with Deployment will automatically apply the most recently estimated model to the new data to compute predicted values.

Deployment: Computing Predicted Values. After inserting the new data source marked for deployment into the workspace, connect it to the analysis nodes in this project. At this point you can also disable the other arrows (from the data source used to estimate the models) so that the models will not be reestimated when the project is updated (see also Data Miner Workspace). Then click Run to compute predicted values. The predicted values are available in the data sources (spreadsheet documents) generated by the analysis nodes. These are labeled by default Testing_xxx, where xxx is usually an abbreviation that references the respective method and the ID of the node that generated the prediction. For example, right-click on the Testing_RRBFx data source and select View Document from the shortcut menu (see also STATISTICA Data Miner Workspace Options).
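The principle behind this step, estimating parameters once and then reusing them for new observations without reestimation, can be sketched outside STATISTICA. The example below is a plain-Python illustration for a one-predictor least-squares line only; the function names `fit_line` and `predict` are hypothetical, and the numbers are made-up placeholders, not the contents of Patients.sta or NewPatients.sta.

```python
def fit_line(x, y):
    """Estimate intercept and slope of a least-squares line (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {"intercept": mean_y - slope * mean_x, "slope": slope}


def predict(params, x_new):
    """Apply stored parameters to new observations; nothing is reestimated."""
    return [params["intercept"] + params["slope"] * xi for xi in x_new]


# Placeholder "learning" data (NOT the Patients.sta values).
prognosis = [10.0, 20.0, 30.0, 40.0]
days = [50.0, 40.0, 30.0, 20.0]

# Estimation happens once; the result plays the role of deployment information.
stored = fit_line(prognosis, days)

# New observations have predictor values only; the stored parameters are reused.
new_prognosis = [25.0, 50.0]
print(predict(stored, new_prognosis))
```

The new observations never enter `fit_line`, which mirrors the effect of marking a data source as data for a deployed project.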

The predictions from the radial basis function network are shown in the first column of the spreadsheet. As you can see, patient G. Hill had the best prognosis (highest value for variable Prognosis), and is predicted to stay in the hospital between 9 and 10 days.

Predicting new observations, when observed values are not (yet) available. In general, one of the main purposes of predictive data mining (see Crucial Concepts in Data Mining) is to allow for accurate prediction (predicted classification) of new observations for which observed values or classifications are not (yet) available. When connecting data for deployment (prediction or predicted classification) to the nodes for Classification and Discrimination or Regression Modeling and Multivariate Exploration, make sure that the "structure" of the input file for deployment is the same as that used for building the models (see also the Data for deployed project; do not reestimate models option on the Select dependent variables and predictors dialog). Specifically, make sure that the same numbers and types of predictor variables are specified, that a (continuous or categorical) dependent variable is specified (even if all values for that variable are missing), and that the variable names match those in the data file used to build the models (this is particularly important for the deployment of neural networks, which rely on this information).

Computing an Average Prediction. The Node Browser folder Regression Modeling and Multivariate Exploration contains a node called Compute Best Prediction From All Models.
This node will automatically take the most recent deployment information generated by the nodes (see also Analysis Nodes with Automatic Deployment) and compute predicted values from each; the node can also compute an average prediction for all current models and, for advanced applications (see also Example 4), even choose the best prediction from all models currently available (see also the terms boosting, bagging, and meta-learning). Insert this node into the current workspace, and connect it to the data source containing the Prognosis data for the new observations (for prediction); then choose Run to Node (or SHIFT F5) to generate the predictions. You can right-click on the generated data source and choose the View Document option to display the spreadsheet with the predictions. The average prediction from all three models (methods) is also automatically computed; note that these results may be slightly different for your analyses because, for example, the neural network algorithms use (by default) random sub-sampling to create validation samples for "steering" the estimation algorithm (e.g., to avoid over-fitting, and to terminate the estimation procedures).

Deploying the Solution to the "Field". As described in Analysis Nodes with Automatic Deployment, the deployment information is kept along with the data miner project in a Global Dictionary, which is a workspace-wide repository of parameters. (You can review the current parameters available in the global dictionary via the Edit Global Dictionary Parameters dialog.) This means that you could now save this data miner project under a different name, and then delete all analysis nodes and related information except the Compute Best Prediction From All Models node and the data source with new observations (marked for deployment). A user could now simply enter values (for variable Prognosis) and run this project (with the Compute Best Prediction From All Models node only), and thus quickly compute predicted values for new patients. Because STATISTICA Data Miner, like all analyses in STATISTICA, can be called from other applications, advanced applications could involve projects like these being called automatically, with data passed to them from some other (e.g., data entry) application.

Making sure that deployment info is up-to-date. To reiterate, in general the deployment information for the different nodes that are named ...with Deployment is stored in various forms locally along with each node, as well as globally, "visible" to other nodes in the same project. This is an important point to remember, because for Classification and Discrimination, as well as Regression Modeling and Multivariate Exploration, the node Compute Best Prediction From All Models will compute predictions based on all deployment information currently available in the global dictionary.
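The consequence of computing predictions from everything currently stored can be sketched with an ordinary Python dictionary standing in for the global dictionary. This is an illustrative sketch only, not the actual storage format or node logic: the labels, the `predict_all` function, and the three linear stand-in "models" with their coefficients are all hypothetical.

```python
from statistics import mean

# Hypothetical stand-in for the global dictionary: label -> prediction function.
# The coefficients are made-up placeholders, not estimates from Patients.sta.
deployment_info = {
    "regression": lambda prognosis: 60.0 - 1.0 * prognosis,
    "network_a": lambda prognosis: 52.0 - 0.5 * prognosis,
    "network_b": lambda prognosis: 68.0 - 1.5 * prognosis,
}


def predict_all(info, prognosis):
    """Return each stored model's prediction and the average over all of them."""
    per_model = {label: model(prognosis) for label, model in info.items()}
    return per_model, mean(per_model.values())


per_model_now, average_now = predict_all(deployment_info, 20.0)

# A leftover entry from an earlier training run is used just like the others.
stale_info = dict(deployment_info)
stale_info["stale_regression"] = lambda prognosis: 90.0 - 1.0 * prognosis
per_model_stale, average_stale = predict_all(stale_info, 20.0)

print(average_now, average_stale)
```

In this sketch the averaging step has no way of telling a current entry from a leftover one, so a single stale entry shifts the average for every new observation.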
Therefore, when building models for deployment using these options, make sure that all deployment information is up to date, i.e., based on models trained on the most current set of data. You can also use the Clear All Deployment Info nodes in the data miner workspace to programmatically clear out-of-date deployment information every time the project is updated ("re-trained").

See also Data Mining Definition, Data Mining with STATISTICA Data Miner, Structure and User Interface of STATISTICA Data Miner, STATISTICA Data Miner Summary, and Getting Started with STATISTICA Data Miner.