IBM SPSS Modeler Server performance and optimization

IBM Software
Business Analytics
SPSS Modeler Server

Improve performance and scalability in high-volume environments

Contents
Introduction
Performance and scalability
Optimizing performance
Advanced performance optimization
Scoping and sizing SPSS Modeler Server
Conclusion

Introduction

Predictive analytics offers organizations the ability to add predictive intelligence to the process of making decisions. Predictive intelligence improves the decisions made by individuals, groups, systems and organizations in multiple business areas such as customer analytics, operational analytics and proactive risk and fraud mitigation. Data mining is at the core of predictive analytics because it helps organizations understand the patterns in their data. As a result, organizations can make the smart decisions that drive superior outcomes.

IBM SPSS Modeler is a data mining workbench that enables improved decision-making with quick development of predictive models and quick deployment of these models into business operations. SPSS Modeler:
- Works in a variety of operating environments
- Can scale from a single desktop to an enterprise-wide deployment
- Supports virtually any data source (including Hadoop when used with IBM SPSS Analytics Server)
- Provides the ability to incorporate structured and unstructured data

It is available in three editions:
- IBM SPSS Modeler Professional uncovers hidden patterns in structured data with advanced algorithms, data manipulation and automated modeling and preparation techniques.
- IBM SPSS Modeler Premium adds the ability to use natural language processing and sentiment analysis on text data as part of a predictive analytics project. Entity analytics disambiguates identities, and social network analysis identifies influencers in social networks.
- IBM SPSS Modeler Gold includes the full range of predictive capabilities for structured and unstructured data. Users can combine, optimize and deploy predictive models and business rules to an organization's processes and operational systems to provide recommended actions at the point of impact. As a result, people and systems can make the right decision every time.

All editions of SPSS Modeler use a client/server architecture. The client provides the visual workbench for predictive analytics. The server adds increased performance and efficiency, along with features that support additional scale.

IBM SPSS Modeler Server is designed to improve performance by minimizing the need to move data in the client environment and by pushing memory-intensive operations such as scoring and data preparation to the server. SPSS Modeler Server also supports SQL pushback and in-database modeling, so users can take better advantage of their existing infrastructure and data warehouse and further improve overall performance.

This paper highlights the capabilities and possibilities of SPSS Modeler Server, and it serves as a guide to understanding and maximizing SPSS Modeler Server performance. The initial sections present performance benchmarking results for IBM SPSS Modeler Professional and IBM SPSS Modeler Premium; they do not cover the performance of models after deployment. Subsequent sections describe performance optimization and sizing recommendations.

Many of the results provided in this document address SPSS Modeler Server performance as it relates to scalability. By using options available only with the server (such as SQL pushback/generation, in-database mining and scoring adapters), users can fully exploit the client/server architecture to improve performance and deliver a quicker return on IT investment.

Performance and scalability

SPSS Modeler Server has been designed and developed to provide high performance and scalability for all data mining tasks. For example, SQL generation and parallel processing are automatic. As a result, SPSS Modeler users do not need to change the way they work to get consistently high performance.
To benchmark performance, IBM measured the ability of SPSS Modeler Server to carry out the common tasks of data preparation, model building and model scoring. IBM used a variety of operating environments and varied the size of the data files. Data mining involves more than model building and model scoring; data preparation is a major component of the process. IBM's tests therefore also evaluated the performance of common steps such as reading, sorting, aggregating, merging and writing data.

Reading and writing data

Read times were recorded for the data sets with the stream shown in Figure 1. Because the test stream ends in a Sample node, IBM was able to record only the time taken to read the data (no write time is included).

Figure 1: Test stream to read the data into SPSS Modeler

To obtain the results for writing to the various formats, the stream in Figure 2 was used. To improve performance, SPSS Modeler executes the reading and writing operations at the same time; the write operation starts before all the data has been read. Therefore, the time measured for writing includes both reading and writing. For the most frequently used data formats (CSV, database table and SPSS Statistics files [.sav]), a million records are read in less than 30 seconds and written back to the source in less than a minute.

Figure 2: Test stream to write from SPSS Modeler to the file formats tested during benchmarking.

Benchmarking results show that, as the number of records in a dataset increases, so does the processing time for reading and writing (Figure 3). Overall, performance is slower for XML and Excel than for the other formats tested (CSV, database table and SPSS .sav). For CSV, database table and SPSS .sav, read time doubles when the number of records is increased from 100,000 to 1 million. The processing time for those formats remains below 25 seconds at 1 million records.

Figure 3: Execution times for reading data, and for reading and writing data, for various data formats. Results for CSV, SPSS .sav and DB2 tables are plotted in seconds on the left axis and those for XML and Excel in minutes on the right axis to better highlight the differences and similarities in performance between data sources.

Results also show that performance for reading and writing datasets is consistent across the tested environments for all of the file formats.

Sorting data

The sorting test involved sorting the data sets by a single column with the stream shown in Figure 4. For a more realistic reflection of a customer scenario, the times include both the time taken to read the data and the time taken to sort it.

Figure 4: Test stream for sorting data with SPSS Modeler.

Figure 5 shows that the sorting performance of SPSS Modeler Server is linear as the number of records sorted is increased across increasingly powerful operating environments. The test results also show that SQL pushback provides a significant increase in performance for the sorting operation. When SQL pushback is enabled in a stream, the SQL instructions are pushed back and executed in the database itself. This means that performance depends on the operating environment rather than on SPSS Modeler.

Figure 5: Sorting data with SPSS Modeler. SQL pushback improves the performance when a database is used. The CSV file was stored locally on the server system and the IBM DB2 database was running on a remote system.

Data aggregation

A 5 million record data set was used to test aggregating with SPSS Modeler. The number of unique values that appeared in the designated field was scaled with the stream in Figure 6. So that the test results reflect how aggregation would be used in an operating environment, the times measured for SQL pushback also include the time it takes to read in the data.
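The idea behind SQL pushback can be illustrated with a small, self-contained sketch. It is not the SQL that SPSS Modeler Server generates; it uses Python's built-in sqlite3 module and a hypothetical `sales` table purely to show why executing an aggregation inside the database moves far fewer rows to the client than fetching every record and aggregating afterwards.

```python
import sqlite3
from collections import defaultdict


def make_db(n_rows=1000, n_categories=10):
    """Create an in-memory database with a hypothetical 'sales' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (category INTEGER, amount REAL)")
    rows = [(i % n_categories, float(i)) for i in range(n_rows)]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    return conn


def aggregate_client_side(conn):
    """Fetch every row and sum per category outside the database."""
    totals = defaultdict(float)
    transferred = 0
    for category, amount in conn.execute("SELECT category, amount FROM sales"):
        totals[category] += amount
        transferred += 1
    return dict(totals), transferred


def aggregate_pushed_back(conn):
    """Let the database aggregate; only one row per category is returned."""
    rows = conn.execute(
        "SELECT category, SUM(amount) FROM sales GROUP BY category"
    ).fetchall()
    return dict(rows), len(rows)


if __name__ == "__main__":
    conn = make_db()
    client_totals, client_rows = aggregate_client_side(conn)
    pushed_totals, pushed_rows = aggregate_pushed_back(conn)
    print("same result:", client_totals == pushed_totals)
    print("rows transferred:", client_rows, "vs", pushed_rows)
```

Both functions return the same totals, but the pushed-back version transfers one row per category instead of one row per record.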

The test results show that the aggregation operation scales well as the number of unique categories to be aggregated increases. Figure 7 highlights the performance times. Note the dramatic improvement from SQL pushback when a database table is used as the source data for aggregation, as compared to a CSV file. The process completed in almost half the time taken with the CSV file. This result was consistent in all the operating environments used for testing.

Figure 6: Test stream to aggregate data with SPSS Modeler

Figure 7: Aggregating data with SPSS Modeler. Using SQL pushback within DB2 was almost twice as fast as a CSV file.

Merging data

To check the performance of the merge operation, the stream in Figure 8 was used. Times were recorded as the size of the data sets increased. An inner join used a unique ID column, which means that the merge was one-to-one: every record in the first data set had only one match in the second data set.

Figure 8: Test stream to merge data with SPSS Modeler.

The test results show that the merge operation scales relatively linearly in relation to the number of records being merged (Figure 9). Yet again, the improvement in time with the use of a database and SQL pushback is evident. With SQL pushback, the merge has already taken place before the data is read out of the database and brought into SPSS Modeler. The increase in performance is most notable at scale. The time recorded in these results includes the time taken to read the data into SPSS Modeler.

Figure 9: Merging data with SPSS Modeler. With SQL pushback, the merge has already taken place before the data is read out of the database into SPSS Modeler.

Text analytics

The ability to structure text is an important capability of IBM SPSS Modeler Premium. Including concepts derived from text increases modeling accuracy. For example, when predicting customer purchase propensity for a product, customer attitudes and preferences are often derived from surveys, call center notes and social media to augment behavioral and demographic data. Building a text model provides a way to apply a structure to new text based on the analysis done on historical or existing text.

The text analytics capabilities of IBM SPSS Modeler can use a variety of data sources. For testing purposes, however, IBM used . The testing for text analytics performance used the following input data:
- Approximately 500,000 s
- Average number of words per
- Average number of characters per

Text analytics model building

Tests were run to measure the performance of building a non-interactive (automatically derived) Text Mining concept model from Basic Resources and Opinions with the stream in Figure 10.

Figure 10: Test stream for text analytics model building

The tests showed that, after initial training (which uses more overhead), performance accelerates (Figure 11).

Figure 11: Initially, training time uses more overhead. After the training is complete, performance accelerates.

Text analytics model scoring

After concepts are extracted, SPSS Modeler creates a text model that can be used in predictive streams. Scoring against the text model means that new text is categorized with the patterns established during the model building process. IBM's tests assessed the speed of scoring new records against an existing model with the stream in Figure 12.

Figure 12: Test stream for model scoring

The test results showed that scoring performance is linear in relation to the number of records (Figure 13).

Figure 13: Scoring performance is linear in relation to the number of records.

TM1 Integration

SPSS Modeler supports data imports from and data exports to IBM Cognos TM1. These operations are controlled by Cognos TM1 process scripts in the Cognos TM1 server. When an SPSS Modeler TM1 import or export operation is executed, SPSS Modeler runs these process scripts first (alongside any native SPSS Modeler processing that is required).

TM1 tests: Cube complexity levels

The cube complexity levels defined in the tests are based on test cubes created by the Cognos TM1 team. The objective was to best represent the different levels of complexity that a Cognos TM1 user might have in a cube. The following table shows how the complexity levels are defined.

Table: Cube complexity levels for the five test cubes. Columns: cube, complexity level (1 = simple, 5 = complex), number of fields when data is viewed in SPSS Modeler, number of dimensions, number of measures.

TM1 import

TM1 Import works by passing a view from Cognos TM1 to SPSS Modeler for additional analysis (Figure 14). To achieve the best performance, users are encouraged to define Cognos TM1 views that are as specific as possible to reduce the overhead of moving large data files between Cognos TM1 and SPSS Modeler Server systems.

Figure 14: Cognos TM1 import that indicates that no records are output.

To represent a real user scenario, the cubes for the Cognos TM1 import test contained more information in the view (a factor of 10 in relation to the number of records) than would be required based on the settings of the Cognos TM1 Import node. For example, when you import the simple cube 1 and read 1,000 records into SPSS Modeler, the size of the dataset passed over the network from Cognos TM1 to SPSS Modeler Server is actually 10,000 records. SPSS Modeler Server processes the full set and then filters this 10,000 record data set to the 1,000 records required for display. The largest dataset tested has 1 million records, and the size of the dataset passed over the network from Cognos TM1 to SPSS Modeler (before filtering) is 10 million records.

Tests were run by importing data from TM1 with scaling by both cube complexity and cube size (number of records). The graphs in Figure 15 show how the SPSS Modeler and Cognos TM1 integration scales linearly in relation to both aspects.

Figure 15: Cognos TM1 import test results. The integration scales linearly in relation to both aspects.

TM1 export

Tests were run by exporting data to Cognos TM1 (Figure 16) and scaling by both cube size (number of records) and cube complexity.

Figure 16: Export to Cognos TM1 stream

The graphs in Figure 17 and Figure 18 show that the SPSS Modeler and Cognos TM1 integration scales linearly for both scaling aspects.

Figure 17: Cognos TM1 export test results for number of cubes

Figure 18: Cognos TM1 export test results for cube complexity.

Model building

Figure 19 shows a stream that was used to test model-building execution in SPSS Modeler. The test used datasets with 100,000 records, 500,000 records and 1 million records. Results show that performance is linearly related to the size of the data. Results are shown by model type to ease analysis and are grouped as follows:
- Classification models
- Segmentation models
- Association models
- Automated models

Figure 19: Test stream used for model building. In this case, it was a C5.0 model.

Classification models use the values of one or more input fields to predict the value of one or more output or target fields (for example, logistic regression or a decision tree). Neural Net had the slowest performance because of the sophistication and learning that the technique requires. However, all of the techniques built models in less than 3 minutes for a set of 1 million records (Figure 20).

Figure 20: Model building times for classification models. Most of the models completed in less than 30 seconds for 250,000 records and within 1 minute for 500,000 records, and all completed in less than 3 minutes for 1 million records.

Segmentation models divide the data into segments or clusters of records that have similar patterns or characteristics, as in KMeans clustering. They can also identify patterns that are unusual (anomaly detection). The KNN technique is included with this set, although it is typically used for classification. KNN classifies cases based on similarity to other cases nearby, which mirrors the computations that are done by classical segmentation models. Because more computation is used, its performance lags behind that of the other techniques shown. Anomaly, KMeans and Two Step were the quickest. Two Step completed within 1 minute for 1 million records. Figure 21 shows the results.

Figure 21: Model building times for segmentation models in SPSS Modeler. Two Step, KMeans and Anomaly were the quickest of the five models.

Association models are used to find patterns in data where one or more entities (such as events, purchases or attributes) are associated with one or more other entities. The models construct rule sets that define these relationships. For example, these techniques are used for Market Basket Analysis, which models the next likely purchase for a customer based on their previous purchases and identifies products that are typically bought together or in a certain sequence. Figure 22 shows that both the Carma and Apriori models were built in less than 30 seconds for a dataset with 1 million records.

Figure 22: Model building times for association models in SPSS Modeler. At 1,000,000 records, Apriori completed in 24 seconds and Carma completed in less than 17 seconds.

The automated models (Auto Classifier, Auto Cluster and Auto Numeric) estimate, compare and combine multiple modeling techniques in a single run. Automated models eliminate the need for users to test multiple techniques one at a time. They are designed to make modeling easier for users who are unfamiliar with all of the underlying algorithms that IBM SPSS Modeler supports. Although ALM (Automated Linear Modeling) does not use multiple algorithms to build a model, it does have an automated data preparation step that transforms the target and predictor variables automatically to maximize the predictive power of the model it creates.

Figure 23 shows that the performance of the automated techniques is directly proportional to the size of the dataset. All complete within 8 minutes for 500,000 records and within 15 minutes for 1 million records. ALM completes within 2 minutes for 1 million records, which reflects the speed of the automatic data preparation.

Figure 23: Model building times for the automated models in SPSS Modeler.

Model scoring

Scoring is defined as applying a created model to new data. This process generates new data, which is typically a prediction (score). Multiple fields are typically calculated and appended to the records. Scoring can occur in batch or in real time. Batch scoring is done as an event. For example, each month you can score customers whose contract is up for renewal against a model that calculates whether and how likely they are to cancel. An example of real-time scoring is calculating a likelihood-of-fraud score and providing it to an agent who is recording an insurance claim, as the agent gathers data. Real-time scoring is provided in SPSS Modeler Gold and is used by organizations that are integrating predictive intelligence into operational systems.

IBM's tests recorded results for batch scoring. The model was built on a data set with 10,000 rows and 20 columns. The resulting model was then used in a stream (Figure 24), and files of various sizes were scored.

Figure 24: Test stream used for model scoring. In this case, it was a C5.0 model.

The results showed that, as the number of records being scored increased, the performance of many models increased to a point and then remained constant. This increase occurs because the scoring process has an initial fixed overhead. This overhead does not depend on the number of rows scored; it is a one-off cost. Therefore, the one-off cost becomes less important as the number of rows to be scored increases.
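The effect of the fixed overhead on throughput can be made concrete with a simple cost model. The sketch below is an illustration only: it assumes scoring time is a constant one-off cost plus a constant per-row cost, and the numbers used (2 seconds of overhead, 0.1 milliseconds per row) are hypothetical, not measurements from IBM's tests.

```python
def scores_per_second(n_rows, fixed_overhead_s, per_row_s):
    """Throughput under an assumed cost model: total = overhead + n * per_row."""
    total_seconds = fixed_overhead_s + n_rows * per_row_s
    return n_rows / total_seconds


if __name__ == "__main__":
    # Hypothetical costs chosen only to show the shape of the curve.
    overhead, per_row = 2.0, 0.0001
    for n in (1_000, 10_000, 100_000, 1_000_000, 10_000_000):
        rate = scores_per_second(n, overhead, per_row)
        print(f"{n:>10} rows: {rate:,.0f} scores per second")
    # The rate approaches, but never reaches, 1 / per_row.
    print("upper bound:", 1 / per_row)
```

Under this model the scores-per-second figure rises quickly for small files and then flattens toward the per-row limit, which matches the shape described above.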

Figure 25: Model scoring times for classification

Figure 26: Model scoring times for segmentation and association models

Figure 27: Model scoring times for the automated models

Figures 25, 26 and 27 illustrate scoring performance for classification models (Figure 25), segmentation and association models (Figure 26) and automated models (Figure 27). Note that the charts show scores per second rather than elapsed time to allow for better comparisons between the test cases and your actual data scoring requirement.

Optimizing performance

SPSS Modeler Server achieves most of its high performance with optimizations that run by default. However, at times analysts and data miners will need more control over the optimization of their SPSS Modeler streams. SPSS Modeler Server supports this by providing immediate feedback upon execution.

In Figure 28, the SPSS Modeler stream is executed with SQL generation, and the nodes turn purple rather than the usual white. Purple nodes indicate that the operations they represent have been translated into SQL and executed in the database. This feedback helps ensure that as much of the stream as possible is executed in the database. Additional options enable the user to examine the SQL that is generated.

Figure 28: SQL generation and highlighting in an SPSS Modeler stream. The nodes have turned purple to indicate that they have been translated into SQL and executed in-database.

R integration

With SPSS Modeler, users can execute R syntax from SPSS Modeler R nodes. For performance testing, the focus was on the R model building and R model scoring operations. R model building can be run natively in SPSS Modeler Server. The R syntax is parsed by SPSS Modeler and sent to the R program to process. R model scoring can be run either natively in SPSS Modeler Server (with the same technique used for model building) or with the R in-database scoring functionality. For R in-database scoring, R is present in the database, so processing can be done in the same database system that stores the data. Reducing data movement improves performance. Using R in-database techniques for R model scoring is significantly faster than native R scoring in SPSS Modeler Server. For the performance tests, the focus was on running R in-database scoring with an IBM Netezza database.

R model building

IBM's tests measured model building performance for the R linear model [lm()] algorithm, and they were run by scaling the number of records used for model building. Three data points were used: 250,000 records, 500,000 records and 1 million records. The test data was 20 fields wide (1 target field, 19 model input fields). Model building times were recorded with the stream setup in Figure 29. The R syntax used in the Modeler R model building node is:

modelerModel <- lm(modelerData$offer_PROFIT_1 ~ ., modelerData)

Figure 29: The stream setup used to test R model building

R model building scales linearly in SPSS Modeler (Figure 30). An R model build operation is at the lower end of performance relative to other native SPSS Modeler model building operations. However, even with the additional overhead, model building completes within 5 minutes for 1 million records.

Figure 30: Processing time is within 5 minutes for 1 million records

R model scoring

IBM ran tests to measure the performance of scoring with the R linear model [lm()] algorithm. The model used for scoring was built against a dataset with 10,000 rows x 20 columns. The model scoring was batch scoring. Model scoring times were recorded with the stream setup in Figure 31. The syntax used in the R model scoring node is:

result <- predict(modelerModel, newdata = modelerData)
modelerData <- cbind(modelerData, result)
var1 <- c(fieldName = "predicted", fieldLabel = "", fieldStorage = "real", fieldFormat = "", fieldMeasure = "", fieldRole = "")
modelerDataModel <- data.frame(modelerDataModel, var1)

Figure 31: Stream used to test R model scoring

The test scenario represents a customer who has data stored in a database and wishes to score that data with an R model and write the results of the scoring operation back to the database.

Figure 32: R in-database scoring performance results

The test results (Figure 32) show that R model scoring performance can be significantly increased by using the R in-database scoring techniques available in SPSS Modeler. The increase comes mainly from reduced data transfer between the database and SPSS Modeler: with in-database scoring, data does not need to be transferred out of the database.

SQL pushback to improve model scoring times

Certain models in SPSS Modeler Server support SQL generation for scoring, which pushes the model scoring stage back to the database itself. For modeling streams that use these models, the full scoring procedure is pushed back to the database as SQL. The models with these functions are:
- C5.0
- C&R Tree (CART)
- CHAID
- Quest
- Decision List
- Logistic Regression
- Neural Net
- PCA
- Linear Regression

Figure 33 shows that the model scores per second metric increases dramatically when SQL pushback is enabled. Most models improve performance by about 10 times. Because the SQL generated by Logistic Regression and Neural Net is exceedingly complex, those models do not show the kind of improvement that others do.

In-database algorithm support to reduce data movement

Many organizations have invested heavily in a database infrastructure for predictive analytics and business intelligence systems, but these systems are often under-utilized. One of the key benefits of SPSS Modeler Server is that it enables organizations to fully utilize their investments in high-performance database systems. With SPSS Modeler Server, organizations can take advantage of algorithms that are native to the database environment along with the many additional data preparation and modeling procedures that are native to SPSS Modeler. Database-native algorithms are often tuned to perform better on the underlying database, and users often see performance improvements from those algorithms.
The following algorithms are available in SPSS Modeler for use with the respective database:
- IBM InfoSphere algorithms: Decision Trees, Association Rules, Demographic Clustering, Kohonen Clustering, Sequence Rules, Transform Regression, Linear Regression, Polynomial Regression, Naive Bayes, Logistic Regression, Time Series
- Netezza algorithms: Decision Trees, K-Means, Bayes Net, Naive Bayes, KNN, Divisive Clustering, PCA, Regression Tree, Linear Regression, Time Series, Generalized Linear
- Microsoft SQL Server algorithms: Decision Trees, Clustering, Association Rules, Naive Bayes, Linear Regression, Neural Network, Logistic Regression, Time Series, Sequence Clustering
- Oracle algorithms: Naive Bayes, Adaptive Bayes, Support Vector Machine (SVM), Generalized Linear Models (GLM), Decision Tree, O-Cluster, k-means, Nonnegative Matrix Factorization (NMF), Apriori, Minimum Description Length (MDL), Attribute Importance (AI)

Figure 33: Model scores per second with and without SQL pushback. SQL pushback is a feature only available with IBM SPSS Modeler Server.
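To give a sense of how a model's scoring logic can be expressed as SQL, the sketch below translates a small decision tree into a nested CASE expression and runs it in a database. This is an illustration under stated assumptions, not the SQL that SPSS Modeler Server generates: the tree structure, the `customers` table and the helper names are all hypothetical, and Python's built-in sqlite3 module stands in for an enterprise database.

```python
import sqlite3

# Hypothetical decision tree: internal nodes test "field < threshold",
# leaves carry the predicted label.
TREE = {
    "field": "age", "threshold": 30,
    "left": {"label": "low"},
    "right": {
        "field": "income", "threshold": 50000,
        "left": {"label": "medium"},
        "right": {"label": "high"},
    },
}


def tree_to_sql(node):
    """Translate the tree into a nested SQL CASE expression."""
    if "label" in node:
        return "'" + node["label"] + "'"
    return "CASE WHEN {f} < {t} THEN {l} ELSE {r} END".format(
        f=node["field"], t=node["threshold"],
        l=tree_to_sql(node["left"]), r=tree_to_sql(node["right"]),
    )


def score_in_python(node, row):
    """Reference scorer that walks the tree outside the database."""
    while "label" not in node:
        branch = "left" if row[node["field"]] < node["threshold"] else "right"
        node = node[branch]
    return node["label"]


def build_customers(conn, rows):
    """Create and fill the hypothetical 'customers' table."""
    conn.execute("CREATE TABLE customers (id INTEGER, age INTEGER, income REAL)")
    conn.executemany("INSERT INTO customers VALUES (:id, :age, :income)", rows)


def score_in_database(conn, tree):
    """Score every customer inside the database with the generated SQL."""
    sql = ("SELECT id, " + tree_to_sql(tree) +
           " AS prediction FROM customers ORDER BY id")
    return conn.execute(sql).fetchall()
```

Because the whole tree becomes a single expression in the SELECT list, the database returns only the scores, and no input rows need to leave it.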

Table: Database support in SPSS Modeler Server. Columns: SQL pushback; In-database algorithm support; Scoring Adapter provided; Read-Write (no SQL pushback); Read only (no SQL pushback). An X marks a supported capability; the column alignment of the marks is not preserved here.
IBM DB2 Enterprise Server Edition X X X
IBM DB2 for i (formerly i5/OS) X
IBM DB2 for z/OS X X
IBM Informix
IBM InfoSphere Classic Federation Server for z/OS
IBM Netezza Data Warehouse X X X
Greenplum Database X
Microsoft SQL Server X X
MySQL
Oracle Database X X
Salesforce.com
SAP HANA X
SAP Sybase IQ X
Teradata X X X X X X

SPSS Modeler Scoring Adapters

SPSS Modeler Scoring Adapters can expand the scope of in-database scoring beyond those models supported by SQL pushback. The Scoring Adapters can be applied to a much wider set of models, which provides more deployment flexibility to those who work with large data sets stored in enterprise warehouses. The Scoring Adapters enable a user to install a set of user-defined functions (UDFs) that execute the model scoring operation in the native database and eliminate the need to move data out of the database. IBM provides Scoring Adapters for Netezza, Teradata, IBM DB2 for Linux, UNIX and Windows, DB2 for AIX and DB2 for z/OS.

IBM SPSS Scoring Adapters for Netezza, Teradata and DB2 (IBM AIX and Linux only) also support scoring text analytics models in the database. The support of these models is an important feature because SQL pushback is not available for text models. Figure 34 demonstrates that Scoring Adapters can improve the performance of text models. This improvement is especially true for concept models.

Figure 34: Scoring performance for text analytics concept model with native SPSS Modeler Server and SPSS Modeler Scoring Adapters.

Intelligent SQL generation within stream execution to improve performance

SPSS Modeler Server intelligently reorders operations in the SPSS Modeler stream to maximize performance without altering results. Analysts or data miners can organize streams in a way that makes sense to them, while SPSS Modeler Server reorganizes those operations in a way that makes sense to the database. Figure 35 shows a Derive node that has an operation that cannot be carried out in the database, whereas the Select node can be pushed back to the database, as indicated by its purple color.

Figure 35: SPSS Modeler Server optimizes the process so that the Select operation is performed before the Derive operation, which reduces data transfer and improves performance.

In-database caching

IBM SPSS Modeler Server enables a user to set a cache on a given node. Caching prevents the re-reading of data that has not changed. When data passes through the node the first time, the cache is filled with data. On subsequent runs, data is read from the cache rather than from the data source. Applying caching selectively means that data that changes is read at run time, while data that is consistent between runs is not read multiple times. Caching can be a useful way to ensure that memory-intensive data processing is executed only once.

Normally, the cache is stored as a temporary file on the file system, but SPSS Modeler Server can also cache this data in a temporary table in the database. The cached data can then be accessed through the many SQL optimization options available in SPSS Modeler Server, which can result in even more significant performance gains. Automatically generating SQL for all nodes that are attached to the cache can improve performance even further. In Figure 36, the Merge operation is highlighted, indicating that the operation is being executed in the database from the filled database cache.
Figure 36: Setting a cache on a node that is likely to be re-executed will store the data in a temporary table on the database (where possible), enabling further in-database operations from that node on.
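The database cache can be pictured as materializing an intermediate result in a temporary table so that downstream operations read from it instead of recomputing it. The sketch below is a rough analogy using Python's built-in sqlite3 module; the tables, the `merge_cache` name and the helper functions are hypothetical and do not reflect how SPSS Modeler Server implements its cache.

```python
import sqlite3

# Hypothetical merge whose result is worth caching.
MERGE_SQL = (
    "SELECT c.id AS id, c.region AS region, o.amount AS amount "
    "FROM customers c JOIN orders o ON o.customer_id = c.id"
)


def build_source(conn):
    """Create two small source tables for the illustration."""
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(1, "north"), (2, "south"), (3, "north")])
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5)])


def fill_cache(conn, name="merge_cache"):
    """Run the merge once and store its result in a temporary table."""
    conn.execute("CREATE TEMP TABLE " + name + " AS " + MERGE_SQL)


def totals_by_region(conn, source):
    """A downstream operation; 'source' is a table name or a subquery."""
    sql = ("SELECT region, SUM(amount) FROM " + source +
           " GROUP BY region ORDER BY region")
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_source(conn)
    fill_cache(conn)
    # Later operations read from the cache; the join is not re-executed.
    print(totals_by_region(conn, "merge_cache"))
```

Any number of further in-database operations can then start from the cached table, which is the behavior the Merge node in Figure 36 illustrates.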

Optimizing for very large data sets

For selected models, SPSS Modeler Server provides options that let users specify that they are working with very large data sets (VLDs). These options were referred to as PSM options during benchmark testing. The VLD options divide the data into smaller data sets and build one model on each data set. The most accurate models are then automatically selected and assembled to create a single final model nugget.

IBM's tests focused on the scalability of the VLD options and compared them with a Neural Net model. These tests demonstrated considerable time savings when working with VLDs and using the SPSS Modeler Server VLD options on multi-processor machines.

Tests were run on building the ALM, CART and Neural Net models and used three data sets: 1 million records, 5 million records and 10 million records. A high-specification Windows Server system (16 CPUs, ~130 GB RAM) was used. Test results show how the Modeler PSM functionality can take advantage of a system's additional CPUs to use parallelism to increase the performance of the model building functionality (Figure 37).

Figure 37: SPSS Modeler PSM functionality can take advantage of a system's additional CPUs to use parallelism to increase the performance of the model building functionality.
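The divide, build, select and assemble pattern described above can be sketched in a few lines. This is a simplified illustration, not the PSM algorithm: it assumes a one-variable least-squares line as the "model", mean squared error on a holdout set as the accuracy measure, and simple averaging as the way to combine the selected models. It runs sequentially; because each partition's model is independent, the builds are the part that could be spread across CPUs.

```python
import random


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(model, points):
    """Mean squared error of one model on a set of points."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)


def build_ensemble(points, n_partitions, keep, holdout):
    """Build one model per partition and keep the most accurate ones."""
    size = len(points) // n_partitions
    models = [fit_line(points[i * size:(i + 1) * size])
              for i in range(n_partitions)]
    models.sort(key=lambda m: mse(m, holdout))
    return models[:keep]


def predict(ensemble, x):
    """Combine the selected models by averaging their predictions."""
    return sum(slope * x + intercept for slope, intercept in ensemble) / len(ensemble)


def make_data(n, seed):
    """Synthetic data around the hypothetical line y = 2x + 1."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 20) for _ in range(n)]
    return [(x, 2 * x + 1 + rng.gauss(0, 0.1)) for x in xs]
```

A call such as `build_ensemble(make_data(1000, 0), 4, 2, make_data(200, 1))` builds four partition models and keeps the two with the lowest holdout error.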

Advanced performance optimization

SPSS Modeler and SPSS Modeler Server provide a number of additional advanced capabilities that enable data miners to optimize the performance of their streams.

Database bulk-loading to relieve bottlenecking

Data movement is often a performance bottleneck, especially when writing to a database. SPSS Modeler provides a number of features to optimize this process for large data volumes. By default, writing to a database occurs row by row. This prevents errors and provides data security but slows performance. Enabling SPSS Modeler to commit multiple rows at a time is a good way to gain more reasonable performance, and this option is available by default.

In addition to the batch committal of records, SPSS Modeler supports two types of bulk loading. One is provided through ODBC bulk loading facilities and the other uses an external bulk loading tool for a database-native solution (Figure 38).

Figure 38: The DB Export: Advanced Options dialog box easily enables bulk loading to the database with ODBC or an external loader.

External bulk loading scripts are provided for IBM DB2, IBM Intelligent Miner for Data, IBM Netezza Performance Server, IBM Redbrick Warehouse, Microsoft SQL Server, Oracle Data Miner and Teradata Warehouse databases. These scripts can be customized, and custom scripts may be written for other databases.
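The difference between row-by-row writes and batch committal can be illustrated with a minimal Python sketch. It uses an in-memory SQLite database rather than SPSS Modeler or ODBC, and the table name scores and the helper write_in_batches are hypothetical. Rows are inserted in groups, and one commit is issued per group instead of one per row.

```python
import sqlite3
from itertools import islice


def write_in_batches(conn, rows, batch_size=1000):
    """Insert rows into the (hypothetical) scores table, one commit per batch.

    Returns the number of commits issued.
    """
    it = iter(rows)
    commits = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        conn.executemany("INSERT INTO scores VALUES (?, ?)", batch)
        conn.commit()
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER, score REAL)")

rows = ((i, i * 0.5) for i in range(2500))
commits = write_in_batches(conn, rows, batch_size=1000)

row_count = conn.execute("SELECT COUNT(*) FROM scores").fetchone()[0]
print(commits, row_count)
```

With 2,500 rows and a batch size of 1,000, the sketch issues three commits rather than 2,500. Bulk loaders go further by bypassing row-level inserts altogether, which this sketch does not attempt to show.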

Database indexing

Indexing database tables maintains the performance of in-database options. Correct indexing significantly affects many subsequent database operations. SPSS Modeler Server enables users to create indexes on tables that are exported from SPSS Modeler (Figure 39). Simple indexes can be created fairly easily. Users can also customize the SQL statement used to create the index (for instance, to create a BITMAP, UNIQUE, or FILLFACTOR index).

Figure 39: Create indexes on database tables from within IBM SPSS Modeler Server to improve database performance.

Optimized joins and sorts

By default, SPSS Modeler operates on certain assumptions about the state of data in the system. For example, SPSS Modeler cannot assume that any data has already been sorted. Therefore, many operations sort data, even if such a sort is redundant. SPSS Modeler enables a user to optimize a sort or join operation by specifying any existing sorts on the data. This eliminates redundancy and improves performance.

Users can also optimize the performance of SPSS Modeler with special-case algorithms for joins. SPSS Modeler's default join algorithm is designed for optimized performance when joining data sets of similar size. In some very common operations, such as using a join to connect an ID in one table to a label or description from another (for example, joining a product code in a table of transactions to a product name in a look-up table), the default join is inefficient. SPSS Modeler offers an alternate join algorithm for these situations, which significantly improves performance.
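The source does not say which algorithm SPSS Modeler uses for joins against a small look-up table. One common approach for this case is a hash join, sketched below in Python as an illustration only. The function name lookup_join and the sample data are hypothetical. The small table is loaded into a dictionary once, and the large table is streamed past it without sorting either input.

```python
def lookup_join(transactions, products):
    """Join (product_code, amount) rows to product names via a hash table.

    products is the small look-up table of (product_code, product_name)
    pairs. Unknown codes yield None as the name, like a left outer join.
    """
    names = dict(products)  # built once from the small table
    for code, amount in transactions:
        yield code, names.get(code), amount


products = [("A1", "Widget"), ("B2", "Gadget")]
transactions = [("A1", 9.99), ("B2", 4.5), ("A1", 1.0), ("C3", 2.0)]

joined = list(lookup_join(transactions, products))
print(joined)
```

Each transaction costs one dictionary look-up, so the work grows with the size of the large table only, whereas a sort-based join must first order both inputs.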

Parallel processing to improve performance

Symmetric Multi-Processor (SMP) machines are widely used and available for all platforms supported by SPSS Modeler Server. They consist of multiple CPUs that share access to the same memory, disk, network and other input and output resources. When a multi-threaded application runs on an SMP machine, threads can be distributed over the CPUs and execute truly in parallel. Application processes and individual threads can usually migrate dynamically between CPUs to balance processor load. This process is generally handled transparently by the operating system.

SPSS Modeler uses a parallel data sorting algorithm to improve the performance of a number of data processing operations. Sorting is used by many SPSS Modeler operations including binning, model evaluation, merge and the sort operation itself. All of these operations benefit from the parallelization of the sort operation. The parallelized sort algorithm uses a technique called record parallelism. This technique assigns records in round-robin fashion to separate sorting processes. Each process sorts its own subset of records and the results are joined. Sort times can be reduced by more than 30 percent when running on multi-processor hardware and at high data volumes.

Scoping and sizing SPSS Modeler Server

Many factors must be considered when scoping hardware requirements for an SPSS Modeler Server installation. The breadth of operations and differences in data volumes make it difficult to estimate performance for any specific hardware configuration.

Impact of CPUs on performance

The core speed of any individual CPU affects data mining performance. Almost all data mining operations, especially modeling, depend heavily on processors, so an increase in CPU speed will produce a proportional increase in performance for many SPSS Modeler processes. The main benefits of multiple CPUs (or multicore CPUs) occur when running multiple streams. Therefore, the number of users will often be the deciding factor in determining the optimum number of CPUs. Multiple CPUs will also benefit parallelized operations, but the main benefits will be from supporting multiple users as shown in the following table.

Number of users    Number of CPUs

For a production server that is running scheduled data mining via IBM SPSS Collaboration and Deployment Services, the number of CPUs should be determined by the number of separate processes that must run simultaneously. Maximum performance can be achieved, for example, by splitting a model scoring process over multiple CPUs or building multiple models simultaneously.
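The record-parallel sort described under parallel processing can be sketched in Python as follows. This is an illustration under assumptions, not the SPSS Modeler implementation. The function name is hypothetical, and a k-way merge is assumed for the step the source describes only as joining the results. Worker threads stand in for the separate sorting processes. In CPython they show the structure of the technique rather than a real speed-up.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def record_parallel_sort(records, workers=4):
    """Sort a list by dealing records round-robin to workers, then merging."""
    # Round-robin assignment: record i goes to partition i % workers.
    partitions = [records[i::workers] for i in range(workers)]

    # Each worker sorts its own subset independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_parts = list(pool.map(sorted, partitions))

    # Join the sorted subsets with a k-way merge (an assumed joining step).
    return list(heapq.merge(*sorted_parts))


data = [5, 3, 9, 1, 7, 2, 8, 6, 4, 0]
result = record_parallel_sort(data, workers=3)
print(result)
```

Round-robin assignment keeps the partitions close to equal in size regardless of the input order, so no single worker becomes the bottleneck.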

Impact of disk space on performance

Before addressing disk space requirements, you must understand the volume of data that is likely to be used for the actual data mining. Most organizations store many terabytes of data, especially transactional data, but this full volume will rarely be used. Normally the data is aggregated, selected or sampled before it is used for analysis. While large data volumes are typically used in model scoring, the model scoring processes usually rely on operations that do not use a lot of system resources.

Disk usage for data processing steps can be relatively high when you are trying to maximize performance. The user often caches data to minimize execution times, and some operations will spill to disk when physical memory is unavailable. In addition, some operations can produce a dataset larger than the raw input data, further increasing disk requirements. Given that the large data preparation steps are typically executed infrequently (it is best practice to store the results of such processing as intermediate files or tables), a conservative rule of thumb is to reserve three to five times the disk space required to store the original data.

Conclusion

The ever-growing amount of data (size and variety) created by organizations presents opportunities and challenges for data mining. SPSS Modeler enables users to use the full range of available data (structured and unstructured) to build and deploy powerful predictive models. SPSS Modeler combines high performance, scalability, performance optimization and flexible hardware requirements to handle large and complex data mining projects easily. With the features of IBM SPSS Modeler Server, your organization can:

Make the most of high performance data mining and database investments and minimize data transfer costs.
Optimize the use of multiple CPUs (or multi-core CPUs) in your operating environment by using parallel processing during a number of data preparation and model-building operations.
Use in-database caching, database write-back with indexing and optimized merging to join tables outside of the database.
Incorporate data mining algorithms from other database vendors.

The end result is that your organization can use SPSS Modeler and SPSS Modeler Server to analyze larger volumes of data more efficiently and better integrate predictive analytics into your business processes. As a result, you shorten the time needed to turn your data into better business decisions that boost ROI.

About IBM Business Analytics

IBM Business Analytics software delivers data-driven insights that help organizations work smarter and outperform their peers. This comprehensive portfolio includes solutions for business intelligence, predictive analytics and decision management, performance management and risk management. Business Analytics solutions enable companies to identify and visualize trends and patterns in such areas as customer analytics that can have a profound effect on business performance. They can compare scenarios; anticipate potential threats and opportunities; better plan, budget and forecast resources; balance risks against expected returns and work to meet regulatory requirements. By making analytics widely available, organizations can align tactical and strategic decision making to achieve business goals. For more information, see ibm.com/business-analytics.

Request a call

To request a call or to ask a question, go to ibm.com/businessanalytics/contactus. An IBM representative will respond to your inquiry within two business days.

Copyright IBM Corporation 2014

IBM Corporation Software Group Route 100 Somers, NY

Produced in the United States of America April 2014

IBM, the IBM logo, ibm.com, AIX, Cognos, DB2, InfoSphere, Intelligent Miner, SPSS, TM1, and z/OS are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at Copyright and trademark information at

Netezza is a registered trademark of IBM International Group B.V., an IBM Company.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.

The performance data discussed herein is presented as derived under specific operating conditions. Actual results may vary. It is the user's responsibility to evaluate and verify the operation of any other products or programs with IBM products and programs.

THE INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS" WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

Please Recycle

YTW03026-USEN-03