How to Deploy Models using Statistica SVB Nodes

Abstract

Dell Statistica is an analytics software package that offers data preparation, statistics, data mining and predictive analytics, machine learning, model deployment, and reporting. This technical brief provides step-by-step instructions for using Statistica to build models from training data and run those models against new data. Specifically, this brief explains how to use Statistica nodes that offer scripting ability using the industry-standard Statistica Visual Basic (SVB) language.

Introduction

One important aspect of data mining is building models from training data and running those models against new data or testing data. The process of applying a trained model to new data is known as deployment. After a satisfactory model or set of models has been identified (the trained model), you deploy those models with new data to quickly obtain predictions or predicted classifications. When new data needs to be analyzed, you do not have to train the models again; instead, you can simply connect the new data to a prediction node. For example, a credit card company might use a trained model to predict credit risk based on the information provided on credit card applications.

This technical brief details, step by step, how to deploy models using Statistica nodes that offer scripting ability with the Statistica Visual Basic language. For more information about using Statistica nodes, as well as detailed instructions on deploying nodes that do not offer SVB scripting, see the related technical brief, How to Deploy Models using Statistica Nodes.

This technical brief assumes that you have a basic understanding of how to navigate through the workspace. If you need a refresher, see How to Navigate the Statistica Workspace.
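The idea of deployment (train once, then score new cases without retraining) can be illustrated outside Statistica with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the one-feature rule learner, the field names "savings" and "credit", and the sample rows. It is not how Statistica nodes are implemented.

```python
from collections import Counter, defaultdict

def train(rows, feature, target):
    """Learn the most common target value for each value of one feature."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature]][row[target]] += 1
    rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    # Fall back to the overall most common class for unseen feature values.
    default = Counter(row[target] for row in rows).most_common(1)[0][0]
    return {"feature": feature, "rules": rules, "default": default}

def predict(model, row):
    """Apply the already-trained model to one new case (no retraining)."""
    return model["rules"].get(row[model["feature"]], model["default"])

# Hypothetical historical (training) data.
history = [
    {"savings": "high", "credit": "good"},
    {"savings": "high", "credit": "good"},
    {"savings": "low", "credit": "bad"},
    {"savings": "low", "credit": "bad"},
    {"savings": "low", "credit": "good"},
]

model = train(history, feature="savings", target="credit")

# Deployment: score new applications with the stored model.
new_applications = [{"savings": "low"}, {"savings": "high"}]
predictions = [predict(model, row) for row in new_applications]
print(predictions)
```

The trained model is an ordinary value that can be kept and reused, which is the property the deployment nodes described in this brief rely on.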
Deployment using SVB scripted nodes

For this tech brief, we will use historical data on customers who either satisfy their loans (have Good credit) or default on their loans (have Bad credit), which is provided in the sample spreadsheet Creditscoring.sta. We will split this data into training and testing data, use two nodes to model the training data set, and then compare the predictions from those models.

The sample data set

To open the sample data set, go to the Home tab and click Open Examples from the Open drop-down menu, as shown in Figure 1. From the Examples folder, select the Datasets folder and then double-click the file Creditscoring.sta.

Figure 1. Opening the sample data set

To add the spreadsheet to a workspace, go to the Home tab, and in the Output group, click Add to Workspace and then click Add to New Workspace, as illustrated in Figure 2.

Figure 2. Adding the sample data set to the workspace
Step 1: Prepare to split the sample data into training and testing samples by adding a node.

First, we must split the sample data into training and testing samples. Select Scripted Procedures, as shown in Figure 3. On the Data tab in the Manage group, click Sampling. On the resulting submenu, select Split Input Data into Training and Testing Samples (Classification).

Figure 3. Accessing the scripted procedures

Double-click the node to display the Edit Parameters dialog box. Specify 25 in the Approximate percent of cases for testing box, as shown in Figure 4. Click OK to close this dialog box.

Figure 4. Specifying the percentage of cases for testing
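To make the effect of the "approximate percent of cases for testing" setting concrete, here is a minimal Python sketch of one way such a split can work. It is an assumption for illustration only: this brief does not describe how the Statistica node selects cases, and the function name, the seed parameter, and the per-case random draw are all hypothetical.

```python
import random

def split_cases(cases, percent_testing=25, seed=0):
    """Assign each case to testing with probability percent_testing / 100.

    Because every case is drawn independently, the testing share is only
    approximately the requested percentage.
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    training, testing = [], []
    for case in cases:
        if rng.random() * 100 < percent_testing:
            testing.append(case)
        else:
            training.append(case)
    return training, testing

# Hypothetical data: 1000 case identifiers standing in for spreadsheet rows.
cases = list(range(1000))
training, testing = split_cases(cases, percent_testing=25)
print(len(training), len(testing))
```

Each case lands in exactly one of the two samples, so the models never see the testing cases during training.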
Step 2: Select the dependent variables and predictors.

Double-click the CreditScoring data set. In the variable selection dialog box, click the Variables button and specify the variables as shown in Figure 5. Then click OK to close the dialog box.

Figure 5. Selecting dependent variables and predictors

Step 3: Split the sample data into training and testing samples by running the node.

Run the Split Input Data into Training and Testing Samples (Classification) node in either of two ways: click the green arrow icon on the lower-left corner of the node, or right-click the node and click Run to Node on the shortcut menu. After you run the node, your workspace should look like Figure 6.

Figure 6. The workspace after running the node

Step 4: Add the appropriate nodes to the workspace.

Now we will use the Boosted Classification Trees and Random Forest Classification nodes to model the training data set. Scripted procedures have two types of nodes: one with deployment and one without. Since the models in our example will be deployed to the testing data, it is important to select the nodes that contain the deployment feature.

Click the Node Browser button to display the Node Browser. In the left pane, expand the Data Mining folder and then the Deployment folder. Then expand the Scripted Deployment folder and select Classification. In the right pane, double-click the Boosting Classification Trees with Deployment node (as shown in Figure 7) to add it to the workspace. Then double-click Random Forest Classification with Deployment and Compute Best Predicted Classification from all Models to add those nodes to the workspace as well. Alternatively, you can drag the nodes to the workspace.

Figure 7. Adding the Boosting Classification Trees with Deployment node to the workspace

Step 5: Connect the data to the nodes.

Now, connect the Training Data node to the Boosting Classification Trees with Deployment and Random Forest Classification with Deployment nodes, and connect the Testing Data node to the Compute Best Predicted Classification from all Models node. To connect two nodes, click the gold diamond icon on the center-right side of one node, hold down the mouse button, draw an arrow to the other node, and then release the mouse button.
Step 6: Run the models.

Click Run All in the upper-left corner of the workspace. The workspace should look like Figure 8.

Figure 8. The workspace after running the models

The resulting Final Prediction for Credit Rating spreadsheet, shown in Figure 9, contains the final credit predictions from each model (in the CBT_Prediction and RF_Prediction columns), plus the voted predictions (in the Ensemble_Prediction column).

Figure 9. The resulting spreadsheet, which contains the final credit predictions from each model, plus the voted predictions
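As a rough illustration of what a voted prediction means, the following Python sketch combines per-model predictions by majority vote. The column names come from the spreadsheet described in this brief; the sample rows, the vote function, and especially the tie-breaking rule (the first model listed wins) are assumptions, since the brief does not state how Statistica resolves disagreements between two models.

```python
from collections import Counter

def vote(predictions):
    """Return the most frequent prediction.

    Assumed tie-break: when counts are equal, the prediction that appears
    first in the list wins (max returns the first maximal key it meets,
    and Counter keeps keys in first-seen order).
    """
    counts = Counter(predictions)
    return max(counts, key=counts.get)

# Hypothetical rows of the prediction spreadsheet.
rows = [
    {"CBT_Prediction": "good", "RF_Prediction": "good"},
    {"CBT_Prediction": "bad", "RF_Prediction": "good"},
]

for row in rows:
    row["Ensemble_Prediction"] = vote(
        [row["CBT_Prediction"], row["RF_Prediction"]]
    )

for row in rows:
    print(row)
```

With only two models, every disagreement is a tie, so the tie-breaking rule decides the ensemble result; with three or more models, a plain majority usually settles it.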
Troubleshooting

If a node with no deployment is selected and you run the Compute Best Predicted Classification from all Models node, the error shown in Figure 10 will be displayed. Return to Step 4 and select nodes whose names include the words "with Deployment".

Figure 10. Error message displayed if a node with no deployment is selected

Conclusion

Statistica delivers advanced and predictive analytics, data mining, statistical analysis and advanced machine learning algorithms for building predictive models. This technical brief shows how easy it is to create an analysis using Statistica nodes with SVB scripting ability. Be sure to explore the many other features of these nodes not illustrated here.

About Dell Software

Dell Software helps customers unlock greater potential through the power of technology, delivering scalable, affordable and simple-to-use solutions that simplify IT and mitigate risk. The Dell Software portfolio addresses five key areas of customer needs: data center and cloud management, information management, mobile workforce management, security and data protection. This software, when combined with Dell hardware and services, drives unmatched efficiency and productivity to accelerate business results. www.dellsoftware.com.
For More Information

© 2014 Dell, Inc. ALL RIGHTS RESERVED. This document contains proprietary information protected by copyright. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording for any purpose without the written permission of Dell, Inc. ("Dell").

Dell, Dell Software, the Dell Software logo and products as identified in this document are registered trademarks of Dell, Inc. in the U.S.A. and/or other countries. All other trademarks and registered trademarks are property of their respective owners.

The information in this document is provided in connection with Dell products. No license, express or implied, by estoppel or otherwise, to any intellectual property right is granted by this document or in connection with the sale of Dell products. EXCEPT AS SET FORTH IN DELL'S TERMS AND CONDITIONS AS SPECIFIED IN THE LICENSE AGREEMENT FOR THIS PRODUCT, DELL ASSUMES NO LIABILITY WHATSOEVER AND DISCLAIMS ANY EXPRESS, IMPLIED OR STATUTORY WARRANTY RELATING TO ITS PRODUCTS INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT. IN NO EVENT SHALL DELL BE LIABLE FOR ANY DIRECT, INDIRECT, CONSEQUENTIAL, PUNITIVE, SPECIAL OR INCIDENTAL DAMAGES (INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS OF PROFITS, BUSINESS INTERRUPTION OR LOSS OF INFORMATION) ARISING OUT OF THE USE OR INABILITY TO USE THIS DOCUMENT, EVEN IF DELL HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Dell makes no representations or warranties with respect to the accuracy or completeness of the contents of this document and reserves the right to make changes to specifications and product descriptions at any time without notice. Dell does not make any commitment to update the information contained in this document.
If you have any questions regarding your potential use of this material, contact:

Dell Software
5 Polaris Way
Aliso Viejo, CA 92656
www.dellsoftware.com

Refer to our Web site for regional and international office information.

TechBrief-Statistica-SVBnodes-US-VG-24976