Understanding the Benefits of IBM SPSS Statistics Server

Contents:
Introduction
Performance 101: Understanding the drivers of better performance
Why performance is faster with Statistics Server
Comparing performance between the Statistics Server and the Statistics client
Increase analyst productivity
Automating jobs with Statistics Server
Scoring new data with Statistics Server
Guidelines for purchasing Statistics Server
Conclusion
Appendix A: Description of local and distributed mode
Appendix B: Benchmark test details
Appendix C: Benchmark test results
About SPSS, an IBM Company

Introduction

IBM SPSS Statistics Server is robust, powerful analytical software that scales from handling the analytical needs of a single department to hundreds of users across the enterprise. It provides all of the features of IBM SPSS Statistics, plus capabilities that deliver faster performance, more efficient processing of large datasets and enhanced security in enterprise deployments.

Statistics Server's client/server architecture, its ability to take advantage of multiple processors and cores, and its advanced analytical procedures specially tuned to work with large datasets enable organizations with massive amounts of data to optimize performance on data transformations, reporting, and analytics, whether data resides in a central data center or across distributed offices.

In benchmark testing designed to simulate a typical production environment, we found that most analytical procedures run faster on the Statistics Server than on the Statistics client [1], including:

- Data transformation procedures (add files, aggregates, match files, etc.): on average, 6 times faster on the Statistics Server
- Sort procedure: on average, 3.35 times faster on the Statistics Server
- Commonly used model-building procedures (regression, GLM, Mixed, nomreg, etc.): on average, 3 times faster on the Statistics Server

This report discusses the high-performance capabilities available with Statistics Server, provides detailed benchmarking results and addresses other important benefits such as job automation, scheduling and scoring data.

[1] The results described here are based on testing done in IBM SPSS laboratories. Although our test environments simulate typical production environments in the field, we cannot guarantee that organizations performing similar tests will see identical results. This data is presented for general guidance. Actual results will vary depending on the configuration of the Statistics Server and clients (number of CPU cores, RAM, disk speed, etc.). For more details on the benchmarking performed, see Appendix B.

Performance 101: Understanding the drivers of better performance

A number of parameters can affect the performance of an analytical procedure, including the number of central processing units (CPUs) or cores, the amount of random access memory (RAM), the speed and configuration of the disk drives, and the location of the data being analyzed.

Number of CPUs/cores

Ideally, analytical procedures should run twice as fast on two CPUs, three times as fast on three CPUs, and so on. However, such perfect scalability is rarely achieved in reality, and the performance benefits of multiple CPUs/cores vary from procedure to procedure, for two reasons:

- Degree of parallelization: This is the extent to which a procedure can be parallelized, or broken into multiple independent tasks. Procedures that can be easily parallelized and scheduled to run simultaneously on different CPUs/cores benefit the most. Procedures that are inherently serial or require a lot of disk I/O (for example, crosstabs and frequencies) will not benefit to a great extent from multiple CPUs/cores.
- Parallelization overhead: This is the overhead associated with breaking up a procedure into independent tasks, scheduling each task and then merging the results. Because operating systems and hardware platforms differ in the way tasks are partitioned and distributed across CPUs/cores, it is reasonable to expect the parallelization overhead to vary between platforms.

Memory

Memory, in the context of this paper, refers to the amount of physical RAM on the machine. For faster performance, it is best to hold the entire dataset that an analytical procedure operates on in RAM, because accessing data from RAM is much faster than accessing data from a disk. If the dataset cannot be held in its entirety in RAM, there is a cost associated with swapping parts of the dataset between RAM and disk.
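The scalability point made under "Number of CPUs/cores" can be illustrated with a small calculation. The sketch below uses Amdahl's law, a standard model that this paper does not itself cite, extended with a simple overhead term. The function name, the parallel fraction and the overhead value are all illustrative assumptions, not measurements from the benchmark.

```python
def estimated_speedup(parallel_fraction: float, n_cores: int,
                      overhead_fraction: float = 0.0) -> float:
    """Amdahl-style speedup estimate relative to a single core.

    parallel_fraction: share of the single-core run time that can be split
        across cores (hypothetical input, not a figure from this paper).
    overhead_fraction: extra time spent partitioning, scheduling and merging,
        as a share of the single-core run time. It is applied only when more
        than one core is used (a simplifying assumption).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    if overhead_fraction < 0.0:
        raise ValueError("overhead_fraction must not be negative")

    serial_time = 1.0 - parallel_fraction          # cannot be parallelized
    parallel_time = parallel_fraction / n_cores    # split evenly across cores
    overhead_time = overhead_fraction if n_cores > 1 else 0.0
    return 1.0 / (serial_time + parallel_time + overhead_time)


if __name__ == "__main__":
    # Illustrative inputs only: 90% parallelizable work, 2% overhead.
    for cores in (1, 2, 4, 8):
        print(cores, round(estimated_speedup(0.9, cores, 0.02), 2))
```

Under this model a fully parallel procedure with no overhead scales linearly, while any serial share or overhead caps the gain, which is consistent with the observation that perfect scalability is rarely achieved.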
Disk drives/computer storage devices

Although there are several storage device technologies and configurations, high-end hard drives spin at 10,000 to 15,000 rpm and can achieve sustained transfer rates of up to 125 MB/sec. High-speed storage devices can dramatically improve performance for data transformations such as sorts, merges and aggregates on large datasets.

Accessing files over a LAN vs. WAN

Simply stated, a local area network (LAN) is the network technology used within an office to access datasets. A wide area network (WAN) is the network technology used across offices to access datasets. Although the speed of the LAN and WAN will vary depending on the type of technology and the configuration, accessing files over a LAN is anywhere from 20 to 40 times faster than accessing files over a WAN. An analytical procedure therefore runs much faster when the dataset is accessed over a LAN than when it is accessed over a WAN.

Why performance is faster with Statistics Server

No need to transfer datasets between distributed offices

The Statistics Server, when configured with the Statistics client in distributed mode (see Appendix A for a description of distributed mode), supports a client/server architecture. In this configuration, the Statistics Server is installed in the central data center, in close proximity to the data. Users across the enterprise (in central and distributed offices) use the Statistics client to connect to the Statistics Server. All of the analytical processing and data access takes place on the Statistics Server; only the results of the analysis are transferred over the network to the Statistics client. This makes the Statistics Server an ideal solution for users in remote offices or users who travel frequently and require access to analytical capabilities on the go.

Because the need to transfer large datasets to end users' desktops is eliminated, the data transferred over the network is minimized and performance is improved. This prevents bandwidth saturation and improves the performance not only of the Statistics application but of other mission-critical applications as well, including enterprise resource planning (ERP), customer relationship management (CRM) and other network applications.

We recommend Statistics Server for organizations with distributed offices that need to access files greater than 25 MB across offices.
Time to access a data file:

File size | Statistics client connecting directly to the data over a WAN (T1, 3.0 Mbps) | Statistics client connecting to the Statistics Server at the data center over a WAN (T1, 3.0 Mbps) | Time saved with Statistics Server
50 MB | 2 min, 10 secs | 4 secs | 2 min, 6 secs
250 MB | 10 min, 50 secs | 40 secs | 10 min, 10 secs
1 GB | 43 min, 17 secs | 80 secs | 41 min, 57 secs

Table 1. Comparing time to access data using the Statistics client in local mode (accessing files in the data center directly over the WAN) vs. accessing the same data using the Statistics client to connect to the Statistics Server over the WAN, with data access handled by the Statistics Server [2]

[2] The results are based on the assumption that the available bandwidth is 3.0 Mbps. In reality, the time saved will be greater because bandwidth is also taken up by other applications such as network backups. The data presented here is for illustrative purposes only. Actual results will vary depending on the configuration, bandwidth, and latency of the WAN; therefore, organizations performing similar tests may not see identical results.
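The arithmetic behind Table 1 can be sketched in a few lines. The code below recomputes the "time saved" column from the two timing columns and estimates an ideal transfer time from file size and bandwidth. The helper names are hypothetical, and the ideal-transfer formula assumes 1 MB = 8 megabits and ignores latency, protocol overhead and competing traffic, so it only approximates the reported timings.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'min, secs' pair into seconds."""
    return minutes * 60 + seconds


def format_duration(total_seconds: int) -> str:
    """Render seconds in the 'M min, S secs' style used in Table 1."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes} min, {seconds} secs"


def ideal_transfer_seconds(size_mb: float, bandwidth_mbps: float) -> float:
    """Ideal time to move size_mb megabytes over a bandwidth_mbps link.

    Assumes 1 MB = 8 megabits and no latency or overhead (a simplification).
    """
    return size_mb * 8.0 / bandwidth_mbps


# Timings copied from Table 1: (direct over WAN, via Statistics Server).
TABLE_1 = {
    "50 MB": (to_seconds(2, 10), 4),
    "250 MB": (to_seconds(10, 50), 40),
    "1 GB": (to_seconds(43, 17), 80),
}


def time_saved(file_size: str) -> str:
    """Time saved for one Table 1 row, formatted like the table."""
    direct, via_server = TABLE_1[file_size]
    return format_duration(direct - via_server)


if __name__ == "__main__":
    for size in TABLE_1:
        print(size, "->", time_saved(size))
    print("ideal 50 MB at 3.0 Mbps:", round(ideal_transfer_seconds(50, 3.0), 1), "s")
```

The recomputed savings agree with the last column of Table 1, and the ideal estimate for a 50 MB file at 3.0 Mbps (about 133 seconds) is close to the 2 min, 10 secs reported for direct access.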

As shown in Table 1, significant time savings can be achieved with Statistics Server when accessing files in distributed offices: for example, about 2 minutes for a 50 MB file, 10 minutes for a 250 MB file, and 42 minutes for a 1 GB file.

Multithreading

Multithreading is the technique of breaking a task into multiple threads that can be executed in parallel. As discussed above, not all analytical procedures can take advantage of multithreading. The procedures that are multithreaded in Statistics are listed in Table 2 below. In Statistics Server, there is no limit to the number of threads supported per procedure. The number of threads can be configured automatically for a user or group, or can be set manually. Users can also set the number of threads on a per-procedure basis.

Procedure family: Correlations, Regression, Data Reduction, Survival Analysis, Multiple Imputation
Procedure name: Bivariate, Partial, Linear, Ordinal, Multinomial Logistic, Factor Analysis, Cox Regression, Logistic Regression, Impute missing values

Table 2: List of multithreaded analytical procedures

As shown in Appendix C, the benefits of multithreading become more pronounced as the number of variables [3] increases (wide datasets). The results of the benchmark testing show that the performance of the following commonly used analytical procedures improved significantly as the number of threads increased from 4 to 16 [4]:

- Linear regression procedure: improved by 52 percent
- Factor procedure: improved by 43 percent
- Cox regression procedure: improved by 24 percent
- Correlation procedure: improved by 24 percent

Additional details on the benchmark tests that demonstrate the benefits of multithreading can be found in Appendix C.

[3] The term "variables" refers to the columns or predictors in your dataset.

[4] The results shown are based on testing done in the laboratories of SPSS, an IBM Company. Although our test environments simulate typical production environments in the field, we cannot guarantee that organizations performing similar tests will see identical results. This data is presented for general guidance. Actual results will vary depending on the configuration of the Statistics Server and clients (number of CPU cores, RAM, disk speed, etc.).

Support for 64-bit computing

The total amount of RAM supported depends on the processor. Theoretically, 32-bit processors are limited to accessing 4 GB of RAM. Typically, the RAM available to an application on a 32-bit machine is much lower, for several reasons:

- Most machines with 32-bit processors are not configured with 4 GB of RAM because RAM is expensive
- The operating system requires some RAM as well

Hence, on machines with 32-bit processors configured with the maximum amount of RAM, the RAM available to the application is approximately 2 to 3 GB. On machines with 64-bit processors, the amount of RAM supported is several multiples higher. Analytical procedures that run on large datasets will run much more slowly on a 32-bit machine than on a 64-bit machine because of the disk activity required to swap parts of the dataset into and out of RAM.

SQL pushback

The Statistics Server supports the pushback of sorts and aggregates to a SQL database. When large datasets are sourced from a SQL database, SQL pushback ensures that operations that can be performed more efficiently in the database are performed there.

Support for advanced analytical procedures tuned to work with large datasets with a lot of predictors

Statistics Server supports advanced procedures such as Naïve Bayes and the Predictor Selector algorithm that are specially designed for wide datasets with a large number of predictors. These analytical procedures are not available in the Statistics client when configured in local mode.

Support for server operating systems and hardware

The Statistics Server is designed to support server operating systems and hardware. Desktop operating systems, namely Windows XP and Vista, are limited to two processors or sockets [5]. Server operating systems in general support a greater number of processors or sockets. As discussed above, procedures that can be parallelized run much faster on an operating system that supports a greater number of sockets or processors. Additionally, server operating systems have several sophisticated features that improve performance, scalability, and resilience.

Unlike the Statistics Base client, which is limited to a maximum of four CPUs or cores, an analytical procedure performed on the Statistics Server can access an unlimited number of CPUs and cores.

[5] Windows Vista and XP do not limit the number of cores per socket.
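The SQL pushback idea described earlier in this section can be illustrated generically. The sketch below is not how Statistics Server implements pushback; it uses Python's built-in sqlite3 module and a made-up `sales` table purely to contrast pulling every row to the client and aggregating there with letting the database run the aggregate and return only the summary rows.

```python
import sqlite3
from collections import defaultdict


def build_demo_db(n_rows: int = 1000) -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical 'sales' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    rows = [(f"region_{i % 4}", float(i)) for i in range(n_rows)]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def aggregate_on_client(conn: sqlite3.Connection) -> dict:
    """Without pushback: fetch every row, then sum per region locally."""
    totals = defaultdict(float)
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] += amount
    return dict(totals)


def aggregate_in_database(conn: sqlite3.Connection) -> dict:
    """With pushback: the database aggregates and returns one row per region."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    return dict(conn.execute(query).fetchall())


if __name__ == "__main__":
    connection = build_demo_db()
    print(aggregate_on_client(connection) == aggregate_in_database(connection))
```

Both functions produce the same totals, but the second moves only a handful of summary rows out of the database, which is the kind of saving pushback is intended to provide on large tables.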

Statistics Server is ideal for organizations with a single office that need to perform analysis on files that are greater than 100 MB.

Comparing performance between the Statistics Server and the Statistics client

Results of specific procedures [6] run on both the Statistics Server and the Statistics client demonstrate that:

- Data transformation procedures (add files, aggregates, match files, etc.) run on average 6 times faster on the Statistics Server
- The sort procedure runs on average 3.35 times faster on the Statistics Server
- Commonly used modeling procedures such as regression, GLM, Mixed, and nomreg run on average 3 times faster on the Statistics Server

Rather than simply timing several procedures independently, the benchmarking test was structured to simulate a typical job run in a production environment. Groups of related procedures were assembled into test suites, each meant to reflect a certain type of analysis or data processing that a Statistics user might execute in the course of a day's work. Five test suites were developed, as listed below:

1. Data transformations: add files, aggregates, case to variables, sort, etc.
2. Simple multi-threaded procedures: correlation, factor, etc.
3. Building models: GLM, mixed, nomreg
4. Data mining: trees
5. Statistical calculations: beta, srange, smod, poisson, etc.

Groups of related procedures | Time saved with Statistics Server | Average speedup with Statistics Server
Data transformations | 64.95% | 5.92
Sort | 69.90% | 3.35
Commonly used multi-threaded procedures (N=10M cases) | 47.52% | 2.31
Building models | 62.19% | 2.90
Data mining | 43.98% | 1.44
Statistical calculations | 62.44% | 2.90
AVERAGE | 60.60% | 2.54

Table 3: Benchmarking results for jobs run on Statistics Server and the Statistics client [7]

[6], [7] The results shown, including those in Table 3, are based on testing done in IBM SPSS laboratories. Although our test environments simulate typical production environments in the field, we cannot guarantee that organizations performing similar tests will see identical results. This data is presented for general guidance. Actual results will vary depending on the configuration of the Statistics Server and clients (number of CPU cores, RAM, disk speed, etc.).
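The overall row of Table 3 reports both a percentage of time saved (60.60%) and a speedup (2.54). If both figures are derived from the same pair of total run times, which is an assumption here rather than something the paper states, they are linked by time saved = 1 - 1/speedup. The short sketch below checks only that overall row; the function names are illustrative, and no claim is made about how the per-group figures were derived.

```python
def time_saved_fraction(speedup: float) -> float:
    """Fraction of run time saved when a job runs `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def speedup_from_time_saved(fraction_saved: float) -> float:
    """Speedup implied by saving `fraction_saved` of the original run time."""
    if not 0.0 <= fraction_saved < 1.0:
        raise ValueError("fraction_saved must be in [0, 1)")
    return 1.0 / (1.0 - fraction_saved)


if __name__ == "__main__":
    # Overall figures reported in Table 3.
    print(round(time_saved_fraction(2.54) * 100, 1))   # percent of time saved
    print(round(speedup_from_time_saved(0.606), 2))    # implied speedup
```

Applied to the overall row, a 2.54 times speedup corresponds to roughly 60.6 percent of the time saved, in line with the reported average.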

The results in Table 3 show that, on average, the Statistics Server is 2.54 times faster than the Statistics client (on a procedure basis), and the time saved on a typical Statistics job is 60.6 percent.

Description of capability | Statistics Server | Statistics client configured in local mode
Supports client/server architecture. Datasets don't have to be downloaded to a user's desktop. | Yes | No. All files need to be downloaded to the user's desktop.
Supports multiple processors and cores | No limit to the number of CPUs and cores supported. | Number of threads is limited to 4. This limits the number of CPUs and cores supported to 4.
Supports server operating system and hardware | Yes | No

Table 4. The reasons why a job run on Statistics Server is faster than a job run on the Statistics client.

Table 4 compares the capabilities of Statistics Server with those of the Statistics client configured in local mode, to illustrate why jobs can run significantly faster using the server software. Additional information on the benchmarking tests, including the test suite procedures, dataset sizes and configuration of the Statistics Server and client, is provided in Appendix B.

Analysts can run multiple analytical jobs at the same time while continuing to work on their desktops.

Increase analyst productivity

Statistics Server's high-performance capabilities enable organizations to achieve significant gains in productivity. When users are connected to a Statistics Server in distributed mode, they can initiate multiple analytical jobs concurrently. This is an important advantage over the client software, particularly when performing data transformation jobs on large datasets. Because all of the processing is done on the Statistics Server, users can continue to work on their desktops while running several jobs at the same time.
Automating jobs with Statistics Server

The Statistics batch facility available with Statistics Server is ideal for performing jobs that are repetitive and need to be performed at regular intervals. Efficiencies are realized as the manual tasks associated with running weekly, monthly or quarterly reports are minimized.

Additionally, when Statistics Server is used with IBM SPSS Collaboration and Deployment Services, these jobs can be scheduled automatically, leveraging that platform's content management and scheduling capabilities. Run-time variables are supported, allowing the same job to be run multiple times with different input parameters. More importantly, the output of the job (the report, etc.) can be stored in the repository and accessed directly by business users through a dashboard. (A Web interface is available with Collaboration and Deployment Services.)

Scoring new data with Statistics Server

The Statistics Server ships with a scoring engine that allows new data to be scored. Users connected to Statistics Server in distributed mode can open one or more models created in Statistics, IBM SPSS Modeler or IBM SPSS AnswerTree, and score new data. This capability is not available with the Statistics client in local mode.

Guidelines for purchasing Statistics Server

The Statistics Server is especially designed for the following scenarios:

- Organizations with distributed offices looking to centralize their data and IT infrastructure in one or more data centers
- Organizations with distributed offices that need to analyze and share files greater than 25 MB across offices
- Organizations looking to virtualize applications and desktops using enabling technologies like Citrix/Terminal Server. These servers are tuned for presenting applications and user interfaces and are not designed to handle the CPU- and I/O-intensive workload of analytic jobs. Statistics Server offloads the heavy processing from the Citrix/Terminal Server machine, which ensures better performance and availability.
- Organizations that need to perform analysis on large datasets (greater than 100 MB) sourced from a SQL server or a data warehouse

Conclusion

Statistics Server is sophisticated analytical server software that provides robust, scalable analytical capabilities when working with large datasets. It supports a client/server architecture that enables organizations to pursue a centralization strategy. Because large datasets do not have to move across offices for analysis, performance improves, resulting in greater analyst productivity and efficiency in distributed offices.

In addition, because Statistics Server is a foundational technology, organizations that invest in it can leverage it in many ways. For example, Statistics Server, when integrated with Collaboration and Deployment Services, enables them to:

- Automate scheduling of Statistics jobs
- Store the output of a Statistics job in a portal where it can be accessed by business users
- Deploy simplified analytical capabilities targeted to business users via a Web interface for jobs executed on Statistics Server

When integrated with Modeler, Statistics Server enables organizations to:

- Take advantage of advanced data mining algorithms and a complementary, process-driven approach for building and scoring models
- Integrate advanced model management and deployment capabilities seamlessly with existing business processes
- Excel in today's fast-paced business environment by building and deploying many highly accurate models without requiring deep statistical expertise

Appendix A: Description of local and distributed mode

Local mode

When running in local mode, all the analysis is performed on the user's desktop computer using the CPU resources of the desktop itself. All of the data being analyzed needs to be transferred to the local user's desktop (see Figure 1). If users are performing transformations on data located on a shared network resource, the transformed data must be transferred back across the network to be saved on the file server or database. As the size of the data and the number of users increase, these data transfers can take up an appreciable amount of network bandwidth, adversely impacting network performance and the performance of other mission-critical applications, such as ERP and CRM, that run on the network. This makes local mode more suitable for organizations with single offices and relatively small datasets.

Figure 1. Statistics run in local mode.

Distributed mode

In distributed mode, all the analysis is performed on the Statistics Server, located at the central data center (typically co-located with the data files). Because the analysis is performed on the Statistics Server, there is no need to transfer data to individual users' desktops. As all the data transfers are localized between the Statistics Server and the file server/database, performance is greatly improved. Only the results of the analysis (typically a fraction of the size of the original data) are transferred to the Statistics client.

Figure 2. Statistics in distributed mode.

Appendix B: Benchmark test details

Configuration

All the testing was done using the batch facility [8] of Statistics. Datasets were local to the Statistics Server. It is reasonable to expect similar results when using a Statistics client to connect to the Statistics Server (distributed mode). Compared with running a job using the batch facility, running the same job in distributed mode carries a small overhead, because in distributed mode the results of the analysis are transferred across the network from the Statistics Server to the end user's machine, whereas the batch facility writes the results to a disk drive or network share accessible to the Statistics Server. As the output of the analysis is typically small, the overhead associated with transferring it on a properly configured network is minimal.

Repeated trials

To help control for the chance variation of any single test run, each test suite was repeated three times. The average time in seconds is reported.

[8] Typically the client for Statistics Server is the Statistics client running on a desktop computer. The Statistics Server batch facility is an alternative way to use the power of the Statistics Server. StatisticsB is a command-line executable that runs on the server computer where the Statistics Server is installed. StatisticsB is intended for automated production of statistical reports, that is, running analyses without user intervention. Automated production is advantageous if users are required to perform repetitive, time-consuming analyses, such as weekly reports. StatisticsB takes as its input a syntax file containing the data transformation and/or analytical procedures to run, with several command-line arguments to control the format of the output or customize it.

Configuration of the Statistics Server

- CPU: 4 CPUs, Intel Xeon 3 GHz, dual core, Hyper-Threaded
- RAM: 8 GB
- Operating system: Windows 2003 Server, 64-bit

Configuration of Statistics client

- CPU: 1 CPU, Intel T 7500, 2.19 GHz, dual core
- RAM: 3 GB
- Operating system: Windows XP, 32-bit

Details on the dataset

Two datasets were used:

- Dataset 1: Size 2.1 GB, 5 million cases, 127 variables
- Dataset 2: Size 3 GB, 10 million cases, 127 variables (used for simple multithreaded procedures; see Table 5 for details)

Groups of related procedures | Statistics Server (64-bit)* | Statistics client (32-bit)** | Time saved | Average speedup (times faster)
Data transformations | | | |
ADD FILES | | | | 9.18
AGGREGATE | | | | 2.86
CASESTOVARS | | | | 0.90
MATCH FILES | | | |
VARSTOCASES | | | | 3.88
UNIFORM (Simple COMPUTE) | | | | 5.71
Average | | | 61.44% | 6.02
Sort | | | |
SORT NUMERIC | | | | 3.94
SORT STRING | | | | 2.87
Average | | | 69.90% | 3.35

Table 5: Benchmarking data comparing Statistics Server with the Statistics client [9]

* Number of threads: 8
** Number of threads: 2

[9] The data shown is based on testing done in IBM SPSS laboratories. Although our test environments simulate typical production environments in the field, we cannot guarantee that organizations performing similar tests will see identical results. This data is presented for general guidance. Actual results will vary depending on the configuration of the Statistics Server and clients (number of CPU cores, RAM, disk speed, etc.).

Groups of related procedures | Statistics Server (64-bit)* | Statistics client (32-bit)** | Time saved | Average speedup (times faster)
Simple multithreaded procedures (N=10M) | | | |
CORRELATION | | | | 3.47
FACTOR | | | | 1.56
PARTIAL CORR | | | | 1.54
REGRESSION (120 dependent variables) | | | | 1.94
Average | | | 47.52% | 2.31
Building models | | | |
GLM | | | | 5.01
MIXED | | | | 1.50
NOMREG | | | | 2.97
REGRESSION | | | | 3.24
Average | | | 62.19% | 2.90
Data mining | | | |
TREES | | | | 1.44
Average | | | 43.98% | 1.44
Statistical calculations | | | |
BETA | | | | 2.65
CFVAR & BETA | | | | 3.39
POISSON BERNOULLI | | | | 2.20
Average | | | 62.44% | 2.90
Total time | | | | 2.54

Table 5 (continued)

* Number of threads: 8
** Number of threads: 2

Appendix C: Benchmark test results

Multi-threaded procedure | File size | Number of cases | Time saved, 4 to 8 threads | Time saved, 4 to 16 threads
Discriminant | 351 MB | 200, | | 5.88%
Csscoxreg | | | | 23.76%
Sort | 2.7 GB | 2,000, | | 13.54%
Csordinal | | | | -8.97%
Cslogistic | 48 MB | 100, | | 18.48%
Linear regression | 703 MB | 200, | | 52.34%
Factor | 703 MB | 200, | | 43.30%
Correlation | | | | 24.35%
Partial correlation | | | | 29.80%
Nomreg | | | | 18.47%
Csselect | | | | 1.34%
Percentage time saved overall (total time) | | | 20.76% | 27.60%

Table 6: Benchmarking results demonstrating performance improvements as the number of threads increases [10]

As the number of threads increases from 4 to 16:

- The linear regression procedure improves by 52 percent
- The factor procedure improves by 43 percent
- The Cox regression procedure improves by 24 percent
- The correlation procedure improves by 24 percent

Overall, performance for the multithreaded procedures also improves as the number of threads increases from 4 to 8 (see the overall row of Table 6).

[10] The data shown is based on testing done in IBM SPSS laboratories. Although our test environments simulate typical production environments in the field, we cannot guarantee that organizations performing similar tests will see identical results. This data is presented for general guidance. Actual results will vary depending on the configuration of the Statistics Server and clients (number of CPU cores, RAM, disk speed, etc.).

About SPSS, an IBM Company

SPSS, an IBM Company, is a leading global provider of predictive analytics software and solutions. The company's complete portfolio of products (data collection, statistics, modeling and deployment) captures people's attitudes and opinions, predicts outcomes of future customer interactions, and then acts on these insights by embedding analytics into business processes. IBM SPSS solutions address interconnected business objectives across an entire organization by focusing on the convergence of analytics, IT architecture and business process. Commercial, government and academic customers worldwide rely on IBM SPSS technology as a competitive advantage in attracting, retaining and growing customers, while reducing fraud and mitigating risk. SPSS was acquired by IBM in October

For further information, or to reach a representative, visit

Copyright IBM Corporation 2010

SPSS Inc., an IBM Company Headquarters, 233 S. Wacker Drive, 11th floor, Chicago, Illinois

SPSS is a registered trademark and the other SPSS products named are trademarks of SPSS Inc., an IBM Company. SPSS Inc., an IBM Company. All Rights Reserved.

IBM and the IBM logo are trademarks of International Business Machines Corporation in the United States, other countries or both. For a complete list of IBM trademarks, see

Other company, product and service names may be trademarks or service marks of others.

References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

YTW03038USEN-00