Evaluating SPDS to Improve Performance of a Large Data Warehouse


Clare Somerville and Colin Harris
Ministry of Social Development, Wellington, New Zealand

ABSTRACT

The New Zealand Ministry of Social Development data warehouse has been in production for over five years. During that time the number of analysts has increased to one hundred, and more data sources have been introduced. The hardware has been upgraded several times, and we're now running on a large HP UNIX box with 2 TB of disk, 12 GB of memory and 13 TB of online tape storage. For a number of reasons, most of them related to performance, the Oracle data replicas have been converted to SAS data sets.

Though the move from Oracle to SAS greatly improved performance on the warehouse, it raised a number of issues, including: how to span data sources across file systems, the use of views for security, and the ability to update and query at the same time. Scalable Performance Data Server (SPDS) was evaluated for use on the warehouse with a view to answering these and other questions, including the constant search for better performance.

This paper gives some background to the warehouse and its configuration. It covers the set-up and parameters of SPDS, security and backups, and gives performance results on stress testing, queries, updates, SQL code, use of views, split file systems, and other areas.

BACKGROUND

The Warehouse

The warehouse first came into production in January. At that time Oracle was the database standard for the Ministry, so the warehouse used Oracle to store the data, and SAS was used for just about everything else. The warehouse has expanded over the years: there are more data sources, and more users from more government departments. It now acts as a management information and analysis platform for about 100 SAS analysts across five government departments. It also provides ad hoc dynamic Web reporting over the Intranet for around 10,000 users across the country.

The trend in New Zealand is towards combining government departments into networks of related agencies, or super ministries, and sharing information across these departments. The warehouse is seen as complementing this move.

Environment

The current warehouse environment is an HP N4000 server with six PA-RISC processors, 12 GB of memory, 2 TB of available disk in an HP XP256 disk array, and 13 TB of online tape storage. The hardware is upgraded every two years. We're currently in the process of doubling the available disk to 4 TB and reassessing our memory and CPU requirements. We're running SAS 8.2 as the default on the system.

THE REASON BEHIND SPDS

SPDS is a SAS product designed for use in data warehouse repositories such as ours. It has a number of features which can be used to enhance the management and performance of the warehouse, particularly when dealing with large quantities of data. It's designed to improve performance through parallel processing and enhanced indexing techniques.

For some time we have been progressively moving our data replicas out of Oracle and into SAS data sets. By converting the warehouse from Oracle to SAS we have gained enormously in performance: updates and queries all run several times faster. During this conversion we found that some features of Oracle were difficult to replace in a fully SAS-based system: the ability to span file systems, simultaneous update and query access, and the use of views for security.
These issues, plus the constant search for better performance on the warehouse, led us to investigate SPDS.

TESTING CRITERIA

We were anxious to test SPDS before committing ourselves to its use in this environment. We had been unable to get much information about SPDS before we began this work. It seemed that people were using it around the world, but few seemed prepared to talk about it.

We had seen some figures comparing SPDS to Oracle, and these showed the expected performance gain. But we already had a large performance gain from moving from Oracle to SAS. We needed to be sure that SPDS gave us gains in performance, wouldn't entail large-scale hardware changes or analyst coding changes, and would assist us in the efficient management of large quantities of data.

The Three SAS Issues

We had three main concerns about maintaining replicas in SAS that we hoped SPDS could rectify:

Spanning File Systems
We have large data sources and need to be able to span them across multiple file systems. We wanted SPDS to transparently and efficiently handle the spanning of data across file systems.

Simultaneous Update and Query Access
We need to be able to query the data and update it at the same time, without either process hindering the other.

Security and Views
Good security on the warehouse is vital. We needed to be able to centrally control the security of multiple data sources, without revealing any passwords to users, and without any performance overhead.

Other Test Areas

There were a number of other considerations which we included in our test criteria:

Minimal Code Changes
The departments have a history of over ten years of SAS programming, with many hundreds of programs in use. In addition, they maintain comprehensive Business Rules documents for each data source, which contain sample code that can be cut and pasted into programs. Any code change which impinges on these rules requires comprehensive testing and a rewrite of the Business Rules. We would not have the support of the analysts in implementing SPDS if it required large-scale code changes to their programs.

Hardware
We had hardware which was optimised for SAS performance. We required minimal changes to the hardware configuration.

Maintenance/Ease of Use
We're a small warehouse team of three people. We would not have sufficient resources if SPDS required a heavy maintenance overhead to keep it running efficiently.

Conversion
We currently have about twelve data sources containing hundreds of gigabytes of data. We would need to be able to convert this data in a minimum amount of time, while maintaining user access.

Backup and Restore
Before we started work on SPDS we had queries about how SPDS files were backed up, and whether the backups could be handled by our normal Omniback processes. We wondered if there was any rollback facility in SPDS and how that might work. We would like to make use of any SPDS backup features, but still retain the Omniback backup systems that we have in place.

Performance
On a warehouse such as this there is always the search for better performance. The continued increase in the number of users, in the size and complexity of the jobs they run, and in the number of data sources and the resources used to maintain them, all require the best performance possible. With SPDS we hoped to improve the query and update performance times on the machine.

GETTING STARTED

Set-up

Getting SPDS installed and running was not too difficult, and mainly involved following the SPDS UNIX Installation Guide. This included:

- Loading the SPDS software components
- Updating the host services file to define a port to communicate with SPDS
- Setting up a separate userid to run and administer SPDS. This is a good security mechanism, as this userid has minimal rights except for managing the SPDS environment
- Defining SPDS users and security

- Updating the SAS installation with additional files to allow SAS to communicate with SPDS
- Updating user SAS config files to access new message and executable files (again, to allow SAS to communicate with SPDS)
- Defining the disk configuration
- Starting up the SPD Server session, using the supplied rc.spds shell script
- Loading data into SPDS

Some of these tasks are discussed in more detail below. The more difficult and time-consuming part was determining parameter settings and disk configuration for optimal SPDS performance.

SPD Server consists of a number of executable modules that perform different tasks. The main ones are:

- spdsserv - the data server
- spdsnsrv - the SPDS name server, with which SPDS LIBNAME domains are registered
- spdsbase - the LIBNAME proxy, which connects a SAS LIBNAME to the appropriate name server
- spdslog - the SPDS message logger
- spdsaud - the SPDS audit logger
- psmgr - the password file utility

SPDS also includes rc.spds, a shell script that starts an SPD Server environment. This is customised for a specific site, and could, for example, start the name server, data server, LIBNAME proxy, message logger and audit logger.

Defining SPDS Users and Security

Before anyone can use SPDS, users and an access level have to be defined using the line-mode utility psmgr. (There is now an experimental GUI available for this.) One or more administration usernames are defined, and then user (non-privileged) accounts are created. We did not create usernames for every individual, but rather a username for each group of users needing particular access to a set of data. For example, the username SWFTMSD is used for all the people in the Ministry of Social Development (MSD) needing to access SWFT benefits data.

Security is defined by ACLs (access control lists) using PROC SPDO from SAS. Tables, columns and rows can have security set for each username or group of usernames.

When a user from MSD invokes SAS, an automatic SAS autoexec assigns the relevant LIBNAMEs for them. A SWFTDATA LIBNAME could be defined to point to SWFT data in SPDS using the SWFTMSD username, allowing them access only to appropriate data. They don't need to know (and we don't want them to know) how the LIBNAME works, just that a LIBNAME is available for them to access the data they require. Further details are provided in the Security and Views section later in this paper.

Defining Disk Configuration

Most parameters can be left at their defaults initially and fine-tuned later to improve performance. However, disk configuration needs to be considered carefully beforehand, as it can significantly enhance or degrade performance.

SPDS data are split into three parts: metadata, data and indexes. Each of these can be placed in a different area of disk to increase performance, by allowing each area to be accessed concurrently. This contrasts with a SAS data set, which must reside in one physical location. The recommendation is that each of these three elements be put on a different file system, to spread processing as much as possible.

Modern disk arrays are complex and sophisticated beasts, and it is well worth your while planning the optimal approach with your systems administrator or hardware engineer. Most are configured with several gigabytes of cache, multiple disk controllers and clever techniques for serving data at a faster rate.

We identified different file systems for the data, metadata and indexes. We had a fourth area for the SPDS work area, all separate from the SAS work areas.
When we looked under the covers we found that some of the file systems shared the same physical spindles. We then rearranged the file systems so that each file system for SPDS used different spindles, and so that the SPDS and SAS work areas were on separate file systems that used different spindles again. In the end our SPDS environment was spread over an impressive number of different physical spindles: the work area over 28 spindles, the SPDS data area over 23 spindles, the indexes over 12 spindles and the metadata over 12 spindles.

An example of the syntax for defining an SPDS LIBNAME domain (the equivalent of a SAS data library) is:

   libname=swft pathname=/r06/spdsmeta/swft
      roptions="metapath=('/r06/spdsmeta/swft')
                datapath=('/s01/spdsdata/swft'
                          '/s02/spdsdata/swft'
                          '/s03/spdsdata/swft')
                indexpath=('/s09/spdsindx/swft')
                owner=swftmgr backup=yes";

This is set within SPDS. In this example we have spread the data by placing the actual data over three file systems, with the index and metadata portions on separate file systems again.

All of our disk is configured as RAID level 5, which gives full redundancy. This means that if a disk fails the system just carries on, and the faulty disk can be replaced at leisure. The SPDS documentation recommends that the work library be on RAID level 0 disk (no redundancy), as this provides better performance, and there is no real data loss if a work library disk fails. Both points are true, but if a RAID 0 disk fails the system must stop while the disk is repaired or replaced. We have followed other recommendations to have all disk at RAID 5 level, including work libraries. This lowers performance slightly and uses 20% more disk space, but gives increased reliability for the overall system.

Using SPDS from SAS

We were pleasantly surprised how easy it was for a user to use SPDS from SAS. Once the SAS config file has been updated and LIBNAMEs assigned (which in our environment we do centrally), a user can simply use the new LIBNAME with any SAS code as though it were a standard SAS library. After an SPDS LIBNAME domain is defined in SPDS (see the example above), a user SAS session can issue a LIBNAME pointing to that SPDS domain, for example:

   libname swftdata sasspds 'swft' host='xxxxx004'
           user='swftmsd' password='xxx';

A user does not even need to be aware that they are using SPDS. However, it is far better that they are aware, as some jobs can run much faster if coded slightly differently for SPDS. There is also SPDS-specific syntax that is faster for some tasks, including the new GROUP BY statement and SQL pass-through to SPDS.

Performance Tuning

The first choice is whether to use 32-bit or 64-bit SPDS. We opted for 32-bit, since we were running 32-bit SAS (although they are independent), even though our operating system supports 64-bit processing. We tried the 64-bit SPDS version at a later time with no noticeable difference in performance.

SPDS has a number of configuration parameters. For those that were similar to SAS configuration parameters (e.g. SORTSIZE and other memory-related values) we increased the default values to try to match the high values we have for our SAS environment, since we have a large server and process large volumes of data.

MAXWHTHREADS is a parameter that determines the maximum number of threads each SPDS job can use. There is a clever program you can run (called scale.sas) that exercises SPDS with different numbers of threads, to show which is the optimum for your specific environment. Ours turned out to be 24, i.e. 24 threads spread over 6 processors, or 4 threads per processor. When we got disappointing performance running ten jobs concurrently, SAS suggested we reduce MAXWHTHREADS to 6 or 8. With the original setting of 24 we could have 240 threads running for ten jobs, or 40 per processor, and they might be overwhelming each other. Unfortunately, reducing the number to either 6 or 8 made no significant difference.
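For reference, server-level settings such as these live in the server parameter file (spdsserv.parm) that the rc.spds start-up script points the data server at. A minimal sketch follows; the memory values are illustrative rather than our production settings, and option names and syntax should be checked against the documentation for your SPDS release:

   SORTSIZE=256M;
   INDEX_SORTSIZE=256M;
   MAXWHTHREADS=24;
   WORKPATH='/s20/spdswork';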
SPDSNETP sets the packet size for data transfer from SPDS to SAS. The default is 32K. We tried settings of 10K, 64K and 256K with no difference.

The file partition size (PARTSIZE) seemed another promising value to tinker with. The default is 16 MB, meaning that a 160 MB file is split into 10 separate data files. These can be spread across different file systems for faster access, and to allow processing in parallel. A rule-of-thumb recommendation is that big files should be split into approximately 100 partitions, with PARTSIZE set accordingly. One of our large test files was 14 GB, which by default split into 875 files. We doubled PARTSIZE to 32 MB, expecting to see a performance change, but performance was the same. This seemed to imply that it was not worth the effort of adjusting PARTSIZE for each different file.
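PARTSIZE is set per table when the table is created, as a data set option, with the value in megabytes. A minimal sketch, with an illustrative source library name (saslib):

   /* create the SPDS table with 32 MB partitions instead of the 16 MB default */
   data swftdata.bigtable (partsize=32);
      set saslib.bigtable;
   run;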

Work Directories

In our current environment the user work directories are separated from the warehouse development team work directories. In this way we are cushioned from any errant users who may periodically go wild and use up 100 GB of disk space. Similarly, the users are protected from the large jobs we may run, especially during the verification of the data. Under SPDS we found that we could not easily separate user and development team work directories onto different file systems.

Conversion

The conversion of the data was easy. Once the LIBNAMEs were established, it was simply a matter of writing out the data to SPDS using standard SAS code. There were two ways we could do this. We could use a simple DATA step SET statement. But it is faster to create the empty data sets, with their indexes, and then use PROC APPEND to append the data into them, as sketched below.
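A sketch of that append-based load, assuming a source SAS library saslib, the SWFTDATA LIBNAME from earlier, and illustrative index definitions:

   /* create an empty SPDS table carrying the required indexes */
   data swftdata.sben (index=(swn datekey=(filedate repdate)));
      set saslib.sben (obs=0);
   run;

   /* bulk-load the rows in a single pass */
   proc append base=swftdata.sben data=saslib.sben;
   run;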
Auditing

SPDS can create an audit log if desired. There's a DATA step in SPDS which can read this log and convert it into a SAS data set for reporting and analysis. It should then be possible to process this data set using IT Service Vision.

SPANNING FILE SYSTEMS

Using SAS

The largest file system currently allowed by our operating system is 128 GB. We have several different data source replicas on this warehouse: multiple data sources on one file system, and single data sources which span different file systems. A number of our data sources are larger than the file systems: the largest spans two 128 GB file systems, consuming around 175 GB of disk. Many of our tables are large, up to 20 GB or more, and the rate of growth can vary.

When we set up new SAS replica systems we need to calculate the space required by individual data sets and data sources, and estimate their projected growth, in order to spread them evenly across file systems. This is a less than scientific method. We then have to monitor the use of space on these file systems, to ensure that an unexpectedly large update does not consume all the space on a file system, disrupt the update and corrupt the replica. We have UNIX cron jobs which monitor the disk consumption on the SAS replica file systems on a daily basis. These jobs check the amount of space left on each file system and notify the warehouse support team of the details.

Using SPDS

Under SPDS a LIBNAME is defined for a data source. This LIBNAME defines where the data portions of the file will be held, and also where the indexes and metadata will be stored. For example:

   libname=swft pathname=/r06/spdsmeta/swft
      roptions="metapath=('/r06/spdsmeta/swft')
                datapath=('/s01/spdsdata/swft'
                          '/s02/spdsdata/swft'
                          '/s03/spdsdata/swft')
                indexpath=('/s09/spdsindx/swft')
                owner=swftmgr backup=yes";

In this example the data files are stored in the /s01/spdsdata/swft, /s02/spdsdata/swft and /s03/spdsdata/swft directories. The indexes are stored in /s09/spdsindx/swft, and the metadata in /r06/spdsmeta/swft.

The SWFT data are written to /s01, /s02 and /s03. SPDS divides the data into partitions of the size set by the PARTSIZE option, writing the first partition to /s01, the second to /s02, the third to /s03, the fourth back to /s01 and so on. The various file systems could be different sizes and from different vendors.

When the file systems defined for the data files are full, it is necessary to add additional file systems. SPDS will then skip the full file systems and use the new ones. For example, a file may be written out as several data files spanning file systems one and two. If file systems one and two become full, and file system three is added, SPDS will continue to write the data files to file system three only. The problem here is that there is a performance impact, with all new partitions hitting the same file system.

To solve this problem a PROC COPY must be used to relocate the older partitions of data across both the new and the old file systems: in this example, across file systems one, two and three. It would be nice if there were an SPDS utility which would do this for you: recognise that some file systems are full but others are available, and rebalance the data automatically, preferably while still providing access.

Summary

Though SPDS does span data across file systems, it does so in a limited way. The requirement to rewrite the whole data source under certain circumstances is not a good option. SPDS should be able to dynamically spread the data across the available file systems.

It's worth noting that with the ability under HP-UX to now create file systems of up to 1 TB, this will become less of an issue.

SIMULTANEOUS UPDATE AND QUERY ACCESS

Using SAS

Under SAS we're unable to update and query at the same time. Data are fed to the warehouse overnight, and by 8 a.m. each morning the data are up to date as at the end of the previous business day. In most cases the updating is done at night, but there can be exceptions: when data feeds are interrupted, for example, updating must take place during the day. In most cases analysts do their work during the day, but if they know they have a particularly resource-intensive job to run, they may run it at night while the updates are taking place. The warehouse is a 24-by-7 platform, so it is important that we can update and query the data at the same time.

Under SAS this is difficult. If someone is querying a table, an update cannot get the lock it needs on the data and will fail, and vice versa. Initially we investigated SAS/SHARE. Testing of this product gave us perfectly satisfactory results: it was easy to set up and maintain, simple for users to access the data, and the performance seemed fine. But when all the users were accessing the data replicas through SAS/SHARE, performance took a major hit. User jobs which had taken four minutes were taking forty minutes, ten times as long. Performance was so bad that we had to rapidly remove SAS/SHARE and find other means of controlling simultaneous update and query access.

To solve this dilemma, we developed a UNIX shell script which checks for file handles on the data about to be updated. If a user job is found to be accessing the data, that job is deleted. The rights on the data files are then changed to prevent further user access, and the update is run. The rights are changed back after completion of the update. Initially we did this from within a UNIX shell. Now we lock each data file individually, using an X command from within the SAS program, so that the period of time each individual file is locked is minimised. Users can continue their work while we're updating, and if they're lucky their jobs won't be killed. This is not an ideal solution, but it does work.
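A minimal sketch of that per-file locking, with a hypothetical replica path; the real process also checks for, and removes, jobs holding file handles first:

   /* revoke group read access while this member is updated */
   x "chmod g-r /r01/sasdata/swft/sben.sas7bdat";

   /* ... run the update against this member ... */

   /* restore group read access once the update completes */
   x "chmod g+r /r01/sasdata/swft/sben.sas7bdat";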
Using SPDS

One of the key reasons for testing SPDS was to get around this difficulty of simultaneous update and query access. Our testing of SPDS met with mixed results. Yes, it is possible to update and query the data at the same time under SPDS. But we ran into a problem with our update process. Our update consists of two parts: a DATA step MODIFY and an APPEND. During the MODIFY it was possible to query the data at the same time without problem. But the APPEND puts a table lock on the file, and user query access is blocked by it. We could have recoded to remove the APPEND, but this would have made the update take too long.

Summary

SPDS does allow simultaneous update and query access. In our case this was not achieved, because of the method by which we update the data. Simultaneous update and query could not be achieved without recoding our update process, and then the update would take too long to run.

SECURITY AND VIEWS

Using SAS

The warehouse has multiple layers of security. In addition to the network security, UNIX groups and SAS data set passwords, we also use SAS views. On this warehouse, users from different departments see different data sources, but they also see different views of the same data. Analysts in the department owning the data see a full view, including names, addresses and other identifying material. Analysts from other departments see a restricted view with no identifying data.
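The restricted views are ordinary SAS SQL views that simply never select the identifying columns. A minimal sketch, with hypothetical library, table and column names and a masked read password:

   proc sql;
      create view othviews.clint as
         select swn, filedate, repdate, region
         from saslib.clint (read='xxx');
   quit;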

When logging on to the warehouse server, users are reassigned a primary UNIX group and a series of secondary groups. These groups determine the data sources which will be available to them, and the level of access to those sources: a full view with all identifying data, or a restricted view. Under UNIX there is a limit of twenty groups to which a user can belong. While this limit is a bother for the warehouse development team, it is not yet an issue for the users.

When the users invoke SAS, an autoexec is automatically assigned for each group of users, which includes the LIBNAMEs for each of the data sources they are allowed to access. The SAS replicas are set up in file systems with security set by the group ownership of the file system. All the SAS data sets have read, write and alter passwords set on them. The analysts access the data through views of the data sets which are generated for them. The LIBNAMEs point to the file systems containing the sets of views; the analysts have no access to the data set passwords. These views are stored in file systems which again have security determined by UNIX group ownership. The code generating the LIBNAMEs is hidden from users by UNIX group ownership. If a user tries to describe a view, the password is masked.

This system allows us to change data locations without users having to modify their SAS programs. In reality, users don't know where the data are located, and few would have the skills to find them: they simply use the LIBNAMEs assigned for them. The views are also used in some systems to rename variables, to give consistency across historical systems.

This all works fine, except for the fact that analysts have to use views to access the data. There is an overhead of over 50% CPU time in the use of views. We use SQL views, as we found them to be faster than DATA step views. We would like to be able to manage secure data access without using views.

Using SPDS

Under SPDS it's possible to control table- and column-level access. This security information is held in an ACL database which is encrypted, and is updated using PROC SPDO. SPDS defines a level of security, rather than creating a view through which to control security. By default, SPDS data are owned by the SPDS user who created them. This user must then grant rights to other users. These rights are read, write, alter (i.e. to delete) and control (to control security). PROC SPDO is used to set the security. All our SAS security is set at group level; similarly, in SPDS we did not create usernames for every individual, but instead created a username for each group of users requiring access to particular sets of data.

Libnames

The warehouse has a standard set of LIBNAMEs set up for users, which encodes department, data source and, by implication, the level of access. The LIBNAME SWFTMSD indicates access for analysts in the MSD department to the SWFT data. The LIBNAME SWFTOTH would give access to the SWFT data for analysts in other departments, who have a restricted view of the data. Under SPDS we set up separate LIBNAMEs for users with different levels of access. For example:

   Libname    SPDS user   Rights
   swft       swftmgr     Full
   swftmsd    swftmsd     Full read
   swftoth    swftdol     Restricted read
   swftoth    swfthnz     Restricted read

All these LIBNAMEs point to the SWFT data. The LIBNAME domain for SWFT is defined for SPDS in the libnames.parm file. The swftmgr user is the owner of the data and must grant rights to the others.

Under SPDS we still don't want users to know any passwords, so we had to devise a way to hide the LIBNAME statement run from their autoexec.sas. The best we could come up with was to run the LIBNAME from an AF catalog that had NOSOURCE set. As users don't have Display Manager access on the warehouse server (they run SAS/CONNECT from their desktops), they can't issue a recall statement.
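As a sketch of what the ACL set-up involves, run with the owning swftmgr LIBNAME; the table name and permission tuples are illustrative, and the exact PROC SPDO statement syntax should be checked against the SPDS documentation for your release:

   proc spdo library=swft;
      /* create an ACL for the SBEN table, owned by swftmgr */
      add acl sben;
      /* grant read-only access: flags are (read, write, alter, control) */
      modify acl sben swftmsd=(y,n,n,n)
                      swftdol=(y,n,n,n);
   quit;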
We had thought initially that the limitation of being able to assign only five groups within SPDS would be a major disadvantage. In reality, we found we didn't need to use these groups, so the small number of groups didn't matter.

Views

Using views under SAS does have a performance overhead: around 50% or more. For example (times are mm:ss):

   Job                      SAS Direct   SAS View
   Extract using WHERE      1:45         2:52 (164%)
   Extract using IF         1:52         2:50 (152%)
   PROC FREQ using WHERE    1:17         2:24 (187%)
   PROC MEANS using WHERE   1:21         2:36 (193%)

Views can also be defined on SPDS tables. The performance hit is as bad as, if not worse than, views on SAS data sets:

   Job                      SPDS Direct   SPDS View
   Extract using WHERE      0:40          2:08 (320%)
   Extract using IF         1:08          2:08 (188%)
   PROC FREQ using WHERE    0:15          1:45 (700%)
   PROC MEANS using WHERE   0:36          1:57 (325%)

For these tests we used SAS SQL views on an SPDS table. Other options are possible, including an SQL pass-through view or a view defined within SPDS. Limited testing of these other options didn't improve the performance.

Luckily we don't need to use views within SPDS to obtain the security we require, as ACLs provide this capability. One advantage of views from our perspective is that we can rename variables as required by the analysts for some systems. The disadvantage is obvious: the performance.

Summary

SPDS security is very good and doesn't add any performance overhead, certainly nothing like the 50%-plus CPU overhead we get with views. A disadvantage of using SPDS security from our point of view is that we are unable to rename variables. We have extensive renaming of variables in some data systems; doing without it would require a great deal of re-programming on the part of analysts and would be met with stiff resistance from them.

PERFORMANCE

We have a test suite of programs which we run. It should be noted that these results are specific to our environment (data structures, hardware, system settings and so on) and our processing needs. Almost all our queries require specific date selection logic, which may be quite different from queries in other organisations. For example:

   data extract;
      set replica (where=(filedate le '01jan2002'd and
                          (repdate is null or repdate gt '01jan2002'd)));
   run;

This code uses the date the records were created (the filedate) and the replacement date (the repdate), which is the date on which they were updated or deleted, to get the current records as at 1 January 2002.

For our test data we used two tables: SBEN and CLINT. SBEN was 1.6 GB (compressed) with 8.3 million observations; CLINT was 1.4 GB (compressed) with 4.9 million observations. Ideally we would have liked a more comprehensive mixture of tables of different sizes, numbers of observations, proportions of replaced records and so on. But we limited ourselves to these two tables for much of the testing, because comprehensive testing with only these two tables took several weeks.

Queries

A number of different queries were run to test SPDS. These included a DATA step extract using a WHERE, and another using an IF. We ran PROC MEANS and PROC FREQ, and we joined tables in different ways. These queries were run against SAS and against SPDS, first as a single-stream job and then as multiple jobs in parallel. Timings were also taken for views. A number of different parameters were changed and tested to try to improve performance. In most cases the performance figures given here are the averaged result of multiple runs. Over 1,000 jobs were run and their results recorded and analysed. In all cases, performance times are given as mm:ss.

Test Suite in Single Stream

We started by running our test suite of programs in a single stream, first against SAS and then against SPDS. A lot of the work our analysts do involves extracting the required data from the replica data sets as at a particular point in time. Running a single-stream job of a simple extract using a WHERE from SPDS took 38% of the time that the same extract from SAS did. As expected, this percentage increased if an IF statement was used, since all data would be returned to SAS for sub-setting, rather than the subset happening in SPDS.
   Job                   SAS    SPDS
   Extract using WHERE   1:45   0:40 (38%)
   Extract using IF      1:52   1:08 (61%)

Simple queries run as a single stream also showed similarly promising results. A PROC FREQ with a WHERE was five times faster against SPDS; a PROC MEANS was more than twice as fast.

   Job                      SAS    SPDS
   PROC FREQ using WHERE    1:17   0:15 (19%)
   PROC MEANS using WHERE   1:21   0:36 (44%)

Joins looked particularly promising. We compared single-step joins using a MERGE with BY against multi-step joins. The single-step join code was:

   data together;
      merge data1 (where=(repdate is null) read='xxx')
            data2 (where=(repdate is null) read='xxx');
      by swn;
   run;

The data sets are not sorted by swn, so an index is used to retrieve the data in the correct order. In the multi-step joins we extracted the data, did a sort, then a DATA step MERGE:

   data testdata1;
      set data1 (where=(repdate is null) read='xxx');
   proc sort data=testdata1;
      by swn;
   data testdata2;
      set data2 (where=(repdate is null) read='xxx');
   proc sort data=testdata2;
      by swn;
   data together;
      merge testdata1 testdata2;
      by swn;
   run;

   Job                SAS     SPDS
   Join single step   19:00   3:10 (17%)
   Join multi-step    8:30    7:30 (88%)

In reality we would never do a single-step join against SAS in this way. But comparing the multi-step join in SAS, at eight and a half minutes, to a single-step SPDS join is still very favourable: more than twice as fast as SAS. However, this would require users to recode their queries.

Single Stream with Views

We created SAS views using SQL on both SAS and SPDS data. An extract with a WHERE took three times longer when run against a view on SPDS than against SPDS direct. This pushed the SPDS time well past the time for direct access to a SAS data set. Comparing SAS views to SPDS views, we do not see the same improvement in SPDS performance as we did comparing SPDS and SAS direct jobs.

   Job                      SAS View   SPDS View
   Extract using WHERE      2:52       2:08 (74%)
   Extract using IF         2:50       2:08 (75%)
   PROC FREQ using WHERE    2:24       1:45 (73%)
   PROC MEANS using WHERE   2:36       1:57 (71%)

It is valid for us to compare the SPDS non-view times to the SAS view access, as we can apply SPDS security in place of using SPDS views. This gives a bigger gain for SPDS.

   Job                      SAS View   SPDS Direct
   Extract using WHERE      2:52       0:40 (23%)
   Extract using IF         2:50       1:08 (40%)
   PROC FREQ using WHERE    2:24       0:15 (10%)
   PROC MEANS using WHERE   2:36       0:36 (23%)

SPDS direct access with security is around three to four times faster than SAS view access in single-stream jobs.

Test Suites in Parallel

In our environment it's a lucky user who gets the machine to themselves to run a query, so it was vital to our testing to run multiple jobs at the same time. We ran our standard suite of tests with five parallel jobs, and also with ten. The test of five consisted of our test query suite running against four other jobs: five jobs in parallel. The ten-job test consisted of two test suites against eight other jobs, giving a total of ten parallel jobs. These jobs all ran directly against SAS data sets or SPDS tables.

Extracts with five jobs running:

   Job                   SAS    SPDS
   Extract using WHERE   1:50   2:00 (109%)
   Extract using IF      2:02   6:00 (300%)

Extracts with ten jobs running:

   Job                   SAS    SPDS
   Extract using WHERE   3:10   3:20 (105%)
   Extract using IF      3:05   9:30 (308%)

We found we could get a great deal of variation with these tests. The extract with an IF is three times longer here, but some tests showed it up to five times longer. We don't really know why we got this variation.

Queries with five parallel jobs:

   Job                      SAS     SPDS
   PROC FREQ using WHERE    1:20    0:40 (33%)
   PROC MEANS using WHERE   1:28    3:00 (205%)
   Join multi-step          11:00   22:30 (205%)
   Join single step         -       8:30 (77%)

Queries with ten parallel jobs:

   Job                      SAS     SPDS
   PROC FREQ using WHERE    2:30    1:25 (57%)
   PROC MEANS using WHERE   3:00    3:30 (117%)
   Join multi-step          23:30   33:00 (140%)
   Join single step         -       13:00 (55%)

(The single-step join was not run against SAS; its percentages are relative to the SAS multi-step join.)

In these tests the PROC FREQ looked faster against SPDS, but again we got wide variation, and it could fluctuate to around the same as the SAS time. The others were consistently longer against SPDS.

During the day our machine is fully used by analysts running queries: different queries (SAS jobs) using different processors. SPDS may rebalance the processing across the CPUs, but there really aren't any spare CPU cycles for it to play with. SPDS might perform better on a machine that was not being fully used; it would then spread the load and make better use of resources. It might also do better on larger machines, particularly ones with more, smaller CPUs, rather than the six large ones we had.

SQL and SQL Pass-Through

It was suggested to us that SQL and SQL pass-through to SPDS might give better performance, especially for joins. We tested this out:

   Job                          Multi-Step   Single Step
   Data step to SAS data sets   8:30         19:00
   Data step to SPDS            7:30         3:10
   SQL to SPDS                  25:00        -
   SQL pass-through to SPDS     30:00        Didn't complete

SQL pass-through didn't give us any performance benefits. The single-step DATA step to SPDS provided the best results.
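For reference, the pass-through variant looks roughly like this. This is a sketch only: the connection options mirror the earlier LIBNAME example, the column names are illustrative, and the exact CONNECT TO SASSPDS options should be checked against the SPDS documentation:

   proc sql;
      connect to sasspds (dbq='swft' host='xxxxx004'
                          user='swftmsd' passwd='xxx');
      create table together as
         select * from connection to sasspds
            (select a.swn, a.filedate, b.clidate
               from sben a, clint b
              where a.swn = b.swn
                and a.repdate is null
                and b.repdate is null);
      disconnect from sasspds;
   quit;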
Other Changes

A number of parameter changes were explored, including UNIXDOMAIN=YES and NETCOMP=NO. We made a couple of changes initially which seemed to help:

   Job                      Original   Change
   Extract using WHERE      0:52       0:44
   Extract using IF         1:22       1:10
   PROC FREQ using WHERE    0:16       0:16
   PROC MEANS using WHERE   0:51       0:35

SAS suggested to us that, instead of starting ten jobs at once, it would replicate the environment more accurately to stagger the starts. When scheduling jobs using the AT command under UNIX, jobs are automatically started a second apart; we tested this option by explicitly setting the jobs to start one minute after each other. Unfortunately it made little difference to our results.

In place of one file system on three array groups (12 spindles), we tried splitting to three file systems on the same three array groups, i.e. one file system per array group. But this change gave us mixed results, with some tests faster and some slower.

We then reduced the maximum number of threads per job from 24 to 6, to cut down on the number of processes running. As you can see below, this generally slowed SPDS down (ten parallel jobs):

   Job                      24 Threads   6 Threads
   Extract using WHERE      3:20         4:15
   Extract using IF         13:30        10:00
   PROC FREQ using WHERE    1:30         1:30
   PROC MEANS using WHERE   3:50         5:25
   Join multi-step          33:00        39:00

Our conclusion was that we could find no parameter settings that gave us any performance benefit.

Updating the Data

Our update process has two parts: a DATA step MODIFY and a PROC APPEND. We receive daily log files which are processed and sorted before the DATA step MODIFY. All primary keys in the test data source consist of two fields. The second key field is a replacement date (repdate): the date at which the record was updated or deleted. It is possible to have a duplicate repdate, but there is only ever one record for each value of the first key field with a blank repdate, i.e. one current record for that key value.

The MODIFY Step

In the current SAS update, we use a DATA step MODIFY to match the records in the transaction log file with the records in the replica. If a match is found and the log record is a Delete, then the replacement date is set in the replica and the record is no longer current. If there is a match and the record is an Add, an error is reported. If no match is found and the log record is a Delete, an error is reported. A record with no match which is an Add is output to a temporary file, to be added later. Some additional error checking also takes place during the MODIFY process. Note that Update records are received as a Delete/Add combination.

The BY fields in the MODIFY are indexed. The skeleton MODIFY code is:

   data replica adds;
      modify replica translog updatemode=nomissingcheck; /* translog = sorted transaction file */
      by key1 repdate;
      select (_iorc_);
         * match found;
         when (%sysrc(_sok)) do;
            if flag = 'delete' then replace replica;
            else ... report error ...;
         end;
         * match not found;
         when (%sysrc(_dsenmr)) do;
            if flag = 'delete' then ... report error ...;
            else output adds;
         end;
      ...

The APPEND Step

Following the MODIFY step, the unmatched Add records are appended to the replica. Using PROC APPEND is significantly faster than adding the records as part of the MODIFY process.

Update Performance

The initial tests of the update were not good at all. The update in SAS took one and a half minutes. The update in SPDS took one hour fifty-five minutes: 80 times slower than in SAS! Using SAS, our large data sources may take up to an hour to update. One of our reasons for moving out of Oracle was the long update times; there is no way we could tolerate a decrease in update performance.

Our tests showed that the main performance drag was in the MODIFY. The APPEND times were comparable to SAS and were not where the problem was. The MODIFY seemed to be using the index on the first key field, but then doing a sequential scan of the repdate field.

We did a number of tests to try to improve this performance. The SPDS auditing was switched off, but this seemed to increase the update time in SPDS by a couple of minutes. Next we changed the MODIFY BY to use a KEY= instead. This made no difference to the performance time. We set a number of options: SPDSWDEB to YES; SPDSTYP to QUICK; SPDSSYRD to YES (but this caused a memory fault); SPDSNETP to 100,000. (The SPDSADD option adds multiple rows when appending. The NETPACKSIZE was set to one record only for the MODIFY, rather like setting the READBUFF size when updating Oracle.) Resetting the random I/O buffer from its default of 4K to 64K brought some improvement.

The MODIFY step not only sets the repdate on the deleted or updated records, but is also used for record error checking. This error checking helps to alert us to situations where there might be missing or duplicate records. It is considered a vital part of our update process and is not something we could drop. We found that we could speed up the MODIFY step if we took the error checking out of it and retained just the matching and replacing of the Delete records, doing the error checking in a separate DATA step outside the MODIFY.

Another alternative was explored. Under SPDS, PROC APPEND automatically updates a record if it already exists in the file, but this requires a unique primary key. Using this feature, it would be possible to overwrite the Delete records that needed replacing, instead of using the MODIFY step; Add records would simply be added. Instead of using MODIFY and APPEND, we could:

- Check for duplicate Adds and unmatched updates in a DATA step
- Report on any errors found
- Merge Add and Delete records
- Append: Delete records overwrite, and Add records are added

There was a problem with this for our situation, as it required the repdate to be unique, which we could not guarantee. We would need to create an additional field to get a unique key, which, considering the size of some of our tables, would have disk storage implications.
The additional field would also complicate the extract macros that we have set up to help users extract data as at the end of a business day, a time which can vary from day to day. SAS now say they are trying to speed up the MODIFY process.

PROC APPEND performance against SPDS was as good as the SAS times, but PROC APPEND takes a default table lock on the replica table. This default lock would mean that we could not get the simultaneous update and query access which was one of the prime reasons for getting SPDS.

We could re-code the update to remove the APPEND and use a DATA step MODIFY, but this would make the update run several times slower.

After all the tests and changes we were able to reduce the update time from around two hours down to six minutes. This was a huge reduction, but it still left us around four times slower than updating against SAS. It would also mean making coding changes and potentially adding an additional field to every table. Updates for three test tables were as follows:

   Table   SAS    SPDS
   SBEN    1:30   6:00 (400%)
   SDET    2:48   6:17 (224%)
   STAX    1:31   12:38 (833%)

The sizes of these tables varied:

   Table   Size (GB)   Observations
   SBEN    1.6         8.3 million
   SDET    ...         ... million
   STAX    ...         ... million

SBEN was our prime test table, but as you can see from these figures, the percentage difference varied between the tables. The tables varied in size and shape: STAX, for example, has many records but a shorter record length, while SBEN has fewer records but more variables. The proportion of replaced records also varies: it might be 90% in one table and 5% in another.

Indexes

SPDS is supposed to excel in its use of indexes, but this was of limited use for our requirements. Since many of our jobs extract data based on the filedate and repdate fields, we added an index on these two fields, arguing that the increase in disk space required by the additional index would be worth a real increase in performance. But the results we got with this additional index were very slow. SPDS keeps a range of values for each partition, so it can drop partitions out of a query, which speeds it up. In our case the dates were spread evenly through the table, so each partition had a full range of values. SPDS therefore decided that, as it was going to have to read each partition anyway, it might as well do a parallel sequential scan. The indexes are beneficial when the data are more or less ordered along the lines of the indexes. The SPDS indexes would no doubt be very beneficial for searching for subgroups, for example by age, location or ethnicity.
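The composite index was added in the usual way; a sketch, with an illustrative index name:

   proc datasets library=swftdata nolist;
      modify sben;
      index create datekey=(filedate repdate);
   quit;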
Summary

Single jobs run against SPDS are two or three times faster than against SAS for most jobs. Multiple jobs running in parallel against SPDS tables slow down to the same speed as SAS or slower. We were unable to make any improvements on these timings. The original update times were eighty times slower against SPDS than SAS. After a substantial amount of recoding we were able to improve this to four times slower than SAS, but with a table lock which would prevent parallel query access.

BACKUP AND RESTORE

Using SAS

We use Omniback for our backups, with a comprehensive assortment of regular and ad hoc backups. Some are incremental, some permanent. All data coming into the warehouse are backed up permanently. All replicas are backed up weekly on an incremental basis and retained for a period of months. When the replicas were in Oracle, we did weekly cold backups, and also exports of all tables to a separate 128 GB file system; in this way we could recover individual tables or data sources. All data are backed up to tape in an automated off-site tape library; Omniback manages the unmanned 13 TB tape library remotely.

Using SPDS

Under SPDS we would need the flexibility that we could maintain with Omniback backups of SAS data files. SPDS recommends doing a full backup of all files, including system tables, then doing incremental backups during the week. On paper the SPDS backup facility looked good. A standard incremental backup generally backs up a file if it has changed at all, i.e. it backs up the whole file. An SPDS incremental backup backs up only the records that have changed, instead of the whole file. The BACKUP=YES parameter is set in the libnames.parm file, and this results in an additional 17 bytes being stored with each record, presumably a timestamp. This obviously uses more disk space, but only adds something like 17 MB to a file of one million rows.

The catch with the SPDS backup is that it backs up to disk; it does not back up directly to tape.

We would have to use the SPDS backup to disk, then use Omniback to back the files up to tape. Under Oracle we had used large amounts of disk space to ensure we had the flexibility of backups and restores that we required. With SPDS it seemed we were going back to the days of Oracle exports, where we would require additional file systems assigned to backup purposes. To restore, we would need to restore from Omniback and then apply the incremental SPDS backups. The SPDS restore utility incrementally restores a table to its last backed-up state.

In our tests, a full backup of the SBEN table produced a file the same size as the compressed SPDS file: 1.6 GB. The backup took two minutes to run. An incremental backup for one day produced three logs amounting to around 10 MB, and took 10 seconds. Note that it is important to ensure that you get the same generation of metadata, data and index components in the backup.

Instead of getting SPDS to perform the backup, you can get it to publish a list of the files that make up a particular table or set of tables (metadata, data and indexes). This list can then be used by a backup program such as Omniback. But Omniback would not be able to perform a true incremental backup like SPDS; it would do a full backup.

Summary

SPDS has backup facilities, but these are limited by the requirement to back up to disk instead of directly to tape.

RELIABILITY

Due to the nature and duration of our testing, we were unable to accurately assess the reliability of SPDS. We did experience problems of running out of resources and memory when running more than ten jobs at once. This could result in jobs sitting in limbo, which we would later find still sitting there.

CONCLUSION

With regret, we had to conclude that it would not be a wise move to adopt SPDS on our warehouse at this time. One main reason for purchasing SPDS was to improve performance, and here it was disappointing. If you test SPDS on individual jobs, it looks great. But if you stress test, which more accurately reflects a normally loaded machine in the warehouse environment, then SPDS performance is no faster for our queries, and around four times slower for updates.

SPDS does have its pluses as well as its minuses. For our particular type of data queries, performance is not great; for others it may be better. With good use of indexes, for example, it may well be worth trying in your own environment. If the move is direct from Oracle to SPDS, then there is no doubt that users will reap an enormous benefit.

Like any product, its implementation on a new machine would be simpler than trying to make it work on a large, well-established warehouse. In our case the hardware already exists, the data are in place, and there are well-established traditions of how to do things and certain patterns of usage. This does not help with an implementation of this kind.

SPDS does offer a step forward in terms of managing large quantities of data across multiple file systems, but it is not far enough advanced to be truly efficient. There are gains in terms of security, and in not having to use views with the performance overhead they involve: the control of secure access is simplified by SPDS.

It will be interesting to see just what aspects of SPDS are incorporated into SAS Version 9, and how well they work in a large multi-user environment such as this.

The purpose of this paper is not to knock SPDS. It is intended as a constructive paper, giving the information which we were unable to obtain before we embarked on this work.
Hopefully, we will save others time in answering the question of whether or not to adopt SPDS, and what it could do for them in their environment.

ACKNOWLEDGEMENT

We'd like to acknowledge the contribution SAS made to this work. James Redman from SAS Australia worked with us on much of this testing. He came to Wellington for two weeks and brought the rain with him the whole time. He survived the rain and us, expertly managed his pregnant wife at home (he is now a proud father), and gave us ongoing support after his departure. Thanks.

REFERENCES

SPDS Version 3.0 UNIX Installation Guide.
Configuring SPDS Wide Mode on HP-UX: A Database Achievement. HP-UX White Paper.
SPDS 2.0. SUGI 23, Nashville, Paper #253.
SAS SPDS 3.0 Online Documentation.

AUTHOR CONTACT

Clare Somerville
Data Warehousing Consultant
Counterpoint Consulting Ltd
6 Raroa Crescent
Wellington, New Zealand
clare.somerville001@msd.govt.nz
clares@counterpoint.co.nz

Colin Harris
Data Warehousing Consultant
BI Solutions Ltd
114 Homebush Road
Khandallah, Wellington, New Zealand
colin.harris005@msd.govt.nz
cjharris@ihug.co.nz

SAS, SAS/SHARE, IT Service Vision, SAS/CONNECT and Scalable Performance Data Server are registered trademarks of SAS Institute Inc. in the USA and other countries. Oracle is a registered trademark or trademark of Oracle Corporation. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.


More information

What Is Specific in Load Testing?

What Is Specific in Load Testing? What Is Specific in Load Testing? Testing of multi-user applications under realistic and stress loads is really the only way to ensure appropriate performance and reliability in production. Load testing

More information

Big Data, Fast Processing Speeds Kevin McGowan SAS Solutions on Demand, Cary NC

Big Data, Fast Processing Speeds Kevin McGowan SAS Solutions on Demand, Cary NC Big Data, Fast Processing Speeds Kevin McGowan SAS Solutions on Demand, Cary NC ABSTRACT As data sets continue to grow, it is important for programs to be written very efficiently to make sure no time

More information

Gladinet Cloud Backup V3.0 User Guide

Gladinet Cloud Backup V3.0 User Guide Gladinet Cloud Backup V3.0 User Guide Foreword The Gladinet User Guide gives step-by-step instructions for end users. Revision History Gladinet User Guide Date Description Version 8/20/2010 Draft Gladinet

More information

Cognos Performance Troubleshooting

Cognos Performance Troubleshooting Cognos Performance Troubleshooting Presenters James Salmon Marketing Manager James.Salmon@budgetingsolutions.co.uk Andy Ellis Senior BI Consultant Andy.Ellis@budgetingsolutions.co.uk Want to ask a question?

More information

Implementing a Digital Video Archive Using XenData Software and a Spectra Logic Archive

Implementing a Digital Video Archive Using XenData Software and a Spectra Logic Archive Using XenData Software and a Spectra Logic Archive With the Video Edition of XenData Archive Series software on a Windows server and a Spectra Logic T-Series digital archive, broadcast organizations have

More information

Main Reference : Hall, James A. 2011. Information Technology Auditing and Assurance, 3 rd Edition, Florida, USA : Auerbach Publications

Main Reference : Hall, James A. 2011. Information Technology Auditing and Assurance, 3 rd Edition, Florida, USA : Auerbach Publications Main Reference : Hall, James A. 2011. Information Technology Auditing and Assurance, 3 rd Edition, Florida, USA : Auerbach Publications Suggested Reference : Senft, Sandra; Gallegos, Frederick., 2009.

More information

Comprehending the Tradeoffs between Deploying Oracle Database on RAID 5 and RAID 10 Storage Configurations. Database Solutions Engineering

Comprehending the Tradeoffs between Deploying Oracle Database on RAID 5 and RAID 10 Storage Configurations. Database Solutions Engineering Comprehending the Tradeoffs between Deploying Oracle Database on RAID 5 and RAID 10 Storage Configurations A Dell Technical White Paper Database Solutions Engineering By Sudhansu Sekhar and Raghunatha

More information

Performance Report Modular RAID for PRIMERGY

Performance Report Modular RAID for PRIMERGY Performance Report Modular RAID for PRIMERGY Version 1.1 March 2008 Pages 15 Abstract This technical documentation is designed for persons, who deal with the selection of RAID technologies and RAID controllers

More information

StreamServe Persuasion SP5 Microsoft SQL Server

StreamServe Persuasion SP5 Microsoft SQL Server StreamServe Persuasion SP5 Microsoft SQL Server Database Guidelines Rev A StreamServe Persuasion SP5 Microsoft SQL Server Database Guidelines Rev A 2001-2011 STREAMSERVE, INC. ALL RIGHTS RESERVED United

More information

Backup and Recovery. What Backup, Recovery, and Disaster Recovery Mean to Your SQL Anywhere Databases

Backup and Recovery. What Backup, Recovery, and Disaster Recovery Mean to Your SQL Anywhere Databases Backup and Recovery What Backup, Recovery, and Disaster Recovery Mean to Your SQL Anywhere Databases CONTENTS Introduction 3 Terminology and concepts 3 Database files that make up a database 3 Client-side

More information

Protecting Microsoft SQL Server with an Integrated Dell / CommVault Solution. Database Solutions Engineering

Protecting Microsoft SQL Server with an Integrated Dell / CommVault Solution. Database Solutions Engineering Protecting Microsoft SQL Server with an Integrated Dell / CommVault Solution Database Solutions Engineering By Subhashini Prem and Leena Kushwaha Dell Product Group March 2009 THIS WHITE PAPER IS FOR INFORMATIONAL

More information

Oracle Database In-Memory The Next Big Thing

Oracle Database In-Memory The Next Big Thing Oracle Database In-Memory The Next Big Thing Maria Colgan Master Product Manager #DBIM12c Why is Oracle do this Oracle Database In-Memory Goals Real Time Analytics Accelerate Mixed Workload OLTP No Changes

More information

WHITE PAPER BRENT WELCH NOVEMBER

WHITE PAPER BRENT WELCH NOVEMBER BACKUP WHITE PAPER BRENT WELCH NOVEMBER 2006 WHITE PAPER: BACKUP TABLE OF CONTENTS Backup Overview 3 Background on Backup Applications 3 Backup Illustration 4 Media Agents & Keeping Tape Drives Busy 5

More information

Storage Technologies for Video Surveillance

Storage Technologies for Video Surveillance The surveillance industry continues to transition from analog to digital. This transition is taking place on two fronts how the images are captured and how they are stored. The way surveillance images

More information

SAS Scalable Performance Data Server

SAS Scalable Performance Data Server SAS Scalable Performance Data Server Administrator s Guide 4.42 The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2007. SAS Scalable Performance Data Server 4.42: Administrator

More information

Restore and Recovery Tasks. Copyright 2009, Oracle. All rights reserved.

Restore and Recovery Tasks. Copyright 2009, Oracle. All rights reserved. Restore and Recovery Tasks Objectives After completing this lesson, you should be able to: Describe the causes of file loss and determine the appropriate action Describe major recovery operations Back

More information

Lesson Plans Microsoft s Managing and Maintaining a Microsoft Windows Server 2003 Environment

Lesson Plans Microsoft s Managing and Maintaining a Microsoft Windows Server 2003 Environment Lesson Plans Microsoft s Managing and Maintaining a Microsoft Windows Server 2003 Environment (Exam 70-290) Table of Contents Table of Contents... 1 Course Overview... 2 Section 0-1: Introduction... 4

More information

SAS Scalable Performance Data Server 5.1

SAS Scalable Performance Data Server 5.1 SAS Scalable Performance Data Server 5.1 Administrator s Guide Second Edition SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2013. SAS Scalable Performance

More information

DATABASE VIRTUALIZATION AND INSTANT CLONING WHITE PAPER

DATABASE VIRTUALIZATION AND INSTANT CLONING WHITE PAPER DATABASE VIRTUALIZATION AND INSTANT CLONING TABLE OF CONTENTS Brief...3 Introduction...3 Solutions...4 Technologies....5 Database Virtualization...7 Database Virtualization Examples...9 Summary....9 Appendix...

More information

LDA, the new family of Lortu Data Appliances

LDA, the new family of Lortu Data Appliances LDA, the new family of Lortu Data Appliances Based on Lortu Byte-Level Deduplication Technology February, 2011 Copyright Lortu Software, S.L. 2011 1 Index Executive Summary 3 Lortu deduplication technology

More information

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance. Agenda Enterprise Performance Factors Overall Enterprise Performance Factors Best Practice for generic Enterprise Best Practice for 3-tiers Enterprise Hardware Load Balancer Basic Unix Tuning Performance

More information

The Microsoft Large Mailbox Vision

The Microsoft Large Mailbox Vision WHITE PAPER The Microsoft Large Mailbox Vision Giving users large mailboxes without breaking your budget Introduction Giving your users the ability to store more e mail has many advantages. Large mailboxes

More information

EMC Backup and Recovery for Microsoft SQL Server 2008 Enabled by EMC Celerra Unified Storage

EMC Backup and Recovery for Microsoft SQL Server 2008 Enabled by EMC Celerra Unified Storage EMC Backup and Recovery for Microsoft SQL Server 2008 Enabled by EMC Celerra Unified Storage Applied Technology Abstract This white paper describes various backup and recovery solutions available for SQL

More information

1 Storage Devices Summary

1 Storage Devices Summary Chapter 1 Storage Devices Summary Dependability is vital Suitable measures Latency how long to the first bit arrives Bandwidth/throughput how fast does stuff come through after the latency period Obvious

More information

SAS Grid Manager Testing and Benchmarking Best Practices for SAS Intelligence Platform

SAS Grid Manager Testing and Benchmarking Best Practices for SAS Intelligence Platform SAS Grid Manager Testing and Benchmarking Best Practices for SAS Intelligence Platform INTRODUCTION Grid computing offers optimization of applications that analyze enormous amounts of data as well as load

More information

Data Compression in Blackbaud CRM Databases

Data Compression in Blackbaud CRM Databases Data Compression in Blackbaud CRM Databases Len Wyatt Enterprise Performance Team Executive Summary... 1 Compression in SQL Server... 2 Perform Compression in Blackbaud CRM Databases... 3 Initial Compression...

More information

How To Limit Volume In Bacula

How To Limit Volume In Bacula Disk Backup Design Using Bacula This document is intended to provide insight into the considerations and processes required to design and implement a Disk Backup strategy for a large site with Bacula.

More information

Managing your Domino Clusters

Managing your Domino Clusters Managing your Domino Clusters Kathleen McGivney President and chief technologist, Sakura Consulting www.sakuraconsulting.com Paul Mooney Senior Technical Architect, Bluewave Technology www.bluewave.ie

More information

RMAN What is Rman Why use Rman Understanding The Rman Architecture Taking Backup in Non archive Backup Mode Taking Backup in archive Mode

RMAN What is Rman Why use Rman Understanding The Rman Architecture Taking Backup in Non archive Backup Mode Taking Backup in archive Mode RMAN - What is Rman - Why use Rman - Understanding The Rman Architecture - Taking Backup in Non archive Backup Mode - Taking Backup in archive Mode - Enhancement in 10g For Rman - 9i Enhancement For Rman

More information

Copyright www.agileload.com 1

Copyright www.agileload.com 1 Copyright www.agileload.com 1 INTRODUCTION Performance testing is a complex activity where dozens of factors contribute to its success and effective usage of all those factors is necessary to get the accurate

More information

Online Backup Frequently Asked Questions

Online Backup Frequently Asked Questions Online Backup Frequently Asked Questions 1 INSTALLATION 1.1 Who installs the Redstor Online Backup service? 1.2 How does the installed client connect to Redstor s Cloud Platform? 1.3 On which machines

More information

Hardware Configuration Guide

Hardware Configuration Guide Hardware Configuration Guide Contents Contents... 1 Annotation... 1 Factors to consider... 2 Machine Count... 2 Data Size... 2 Data Size Total... 2 Daily Backup Data Size... 2 Unique Data Percentage...

More information

RAID Basics Training Guide

RAID Basics Training Guide RAID Basics Training Guide Discover a Higher Level of Performance RAID matters. Rely on Intel RAID. Table of Contents 1. What is RAID? 2. RAID Levels RAID 0 RAID 1 RAID 5 RAID 6 RAID 10 RAID 0+1 RAID 1E

More information

Availability and Disaster Recovery: Basic Principles

Availability and Disaster Recovery: Basic Principles Availability and Disaster Recovery: Basic Principles by Chuck Petch, WVS Senior Technical Writer At first glance availability and recovery may seem like opposites. Availability involves designing computer

More information

ORACLE DATABASE 10G ENTERPRISE EDITION

ORACLE DATABASE 10G ENTERPRISE EDITION ORACLE DATABASE 10G ENTERPRISE EDITION OVERVIEW Oracle Database 10g Enterprise Edition is ideal for enterprises that ENTERPRISE EDITION For enterprises of any size For databases up to 8 Exabytes in size.

More information

Using HP StoreOnce D2D systems for Microsoft SQL Server backups

Using HP StoreOnce D2D systems for Microsoft SQL Server backups Technical white paper Using HP StoreOnce D2D systems for Microsoft SQL Server backups Table of contents Executive summary 2 Introduction 2 Technology overview 2 HP StoreOnce D2D systems key features and

More information

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation

More information

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation

More information

Using HP StoreOnce Backup Systems for NDMP backups with Symantec NetBackup

Using HP StoreOnce Backup Systems for NDMP backups with Symantec NetBackup Technical white paper Using HP StoreOnce Backup Systems for NDMP backups with Symantec NetBackup Table of contents Executive summary... 2 Introduction... 2 What is NDMP?... 2 Technology overview... 3 HP

More information

Using HP StoreOnce Backup systems for Oracle database backups

Using HP StoreOnce Backup systems for Oracle database backups Technical white paper Using HP StoreOnce Backup systems for Oracle database backups Table of contents Introduction 2 Technology overview 2 HP StoreOnce Backup systems key features and benefits 2 HP StoreOnce

More information

Planning Domain Controller Capacity

Planning Domain Controller Capacity C H A P T E R 4 Planning Domain Controller Capacity Planning domain controller capacity helps you determine the appropriate number of domain controllers to place in each domain that is represented in a

More information

Data Deduplication: An Essential Component of your Data Protection Strategy

Data Deduplication: An Essential Component of your Data Protection Strategy WHITE PAPER: THE EVOLUTION OF DATA DEDUPLICATION Data Deduplication: An Essential Component of your Data Protection Strategy JULY 2010 Andy Brewerton CA TECHNOLOGIES RECOVERY MANAGEMENT AND DATA MODELLING

More information

Optimizing the Performance of Your Longview Application

Optimizing the Performance of Your Longview Application Optimizing the Performance of Your Longview Application François Lalonde, Director Application Support May 15, 2013 Disclaimer This presentation is provided to you solely for information purposes, is not

More information

Overview. Timeline Cloud Features and Technology

Overview. Timeline Cloud Features and Technology Overview Timeline Cloud is a backup software that creates continuous real time backups of your system and data to provide your company with a scalable, reliable and secure backup solution. Storage servers

More information

Workflow Templates Library

Workflow Templates Library Workflow s Library Table of Contents Intro... 2 Active Directory... 3 Application... 5 Cisco... 7 Database... 8 Excel Automation... 9 Files and Folders... 10 FTP Tasks... 13 Incident Management... 14 Security

More information

Windows Server 2008 R2 Essentials

Windows Server 2008 R2 Essentials Windows Server 2008 R2 Essentials Installation, Deployment and Management 2 First Edition 2010 Payload Media. This ebook is provided for personal use only. Unauthorized use, reproduction and/or distribution

More information

Administração e Optimização de BDs

Administração e Optimização de BDs Departamento de Engenharia Informática 2010/2011 Administração e Optimização de BDs Aula de Laboratório 1 2º semestre In this lab class we will address the following topics: 1. General Workplan for the

More information

Configuring Backup Settings. Copyright 2009, Oracle. All rights reserved.

Configuring Backup Settings. Copyright 2009, Oracle. All rights reserved. Configuring Backup Settings Objectives After completing this lesson, you should be able to: Use Enterprise Manager to configure backup settings Enable control file autobackup Configure backup destinations

More information

Eloquence Training What s new in Eloquence B.08.00

Eloquence Training What s new in Eloquence B.08.00 Eloquence Training What s new in Eloquence B.08.00 2010 Marxmeier Software AG Rev:100727 Overview Released December 2008 Supported until November 2013 Supports 32-bit and 64-bit platforms HP-UX Itanium

More information

CHAPTER - 5 CONCLUSIONS / IMP. FINDINGS

CHAPTER - 5 CONCLUSIONS / IMP. FINDINGS CHAPTER - 5 CONCLUSIONS / IMP. FINDINGS In today's scenario data warehouse plays a crucial role in order to perform important operations. Different indexing techniques has been used and analyzed using

More information

HP D2D NAS Integration with HP Data Protector 6.11

HP D2D NAS Integration with HP Data Protector 6.11 HP D2D NAS Integration with HP Data Protector 6.11 Abstract This guide provides step by step instructions on how to configure and optimize HP Data Protector 6.11 in order to back up to HP D2D Backup Systems

More information

BlueArc unified network storage systems 7th TF-Storage Meeting. Scale Bigger, Store Smarter, Accelerate Everything

BlueArc unified network storage systems 7th TF-Storage Meeting. Scale Bigger, Store Smarter, Accelerate Everything BlueArc unified network storage systems 7th TF-Storage Meeting Scale Bigger, Store Smarter, Accelerate Everything BlueArc s Heritage Private Company, founded in 1998 Headquarters in San Jose, CA Highest

More information

BackupAssist Common Usage Scenarios

BackupAssist Common Usage Scenarios WHITEPAPER BackupAssist Version 5 www.backupassist.com Cortex I.T. Labs 2001-2008 2 Table of Contents Introduction... 3 Disaster recovery for 2008, SBS2008 & EBS 2008... 4 Scenario 1: Daily backups with

More information

Maximizing Business Continuity and Minimizing Recovery Time Objectives in Windows Server Environments

Maximizing Business Continuity and Minimizing Recovery Time Objectives in Windows Server Environments Maximizing Business Continuity and Minimizing Recovery Time Objectives in Windows Server Environments StorageCraft Technology Corporation Backup Fast, Recover Faster 2010 StorageCraft Technology Corporation.

More information

VX Search File Search Solution. VX Search FILE SEARCH SOLUTION. User Manual. Version 8.2. Jan 2016. www.vxsearch.com info@flexense.com. Flexense Ltd.

VX Search File Search Solution. VX Search FILE SEARCH SOLUTION. User Manual. Version 8.2. Jan 2016. www.vxsearch.com info@flexense.com. Flexense Ltd. VX Search FILE SEARCH SOLUTION User Manual Version 8.2 Jan 2016 www.vxsearch.com info@flexense.com 1 1 Product Overview...4 2 VX Search Product Versions...8 3 Using Desktop Product Versions...9 3.1 Product

More information

Amazon Cloud Storage Options

Amazon Cloud Storage Options Amazon Cloud Storage Options Table of Contents 1. Overview of AWS Storage Options 02 2. Why you should use the AWS Storage 02 3. How to get Data into the AWS.03 4. Types of AWS Storage Options.03 5. Object

More information

Network device management solution

Network device management solution iw Management Console Network device management solution iw MANAGEMENT CONSOLE Scalability. Reliability. Real-time communications. Productivity. Network efficiency. You demand it from your ERP systems

More information

Hyper-V Protection. User guide

Hyper-V Protection. User guide Hyper-V Protection User guide Contents 1. Hyper-V overview... 2 Documentation... 2 Licensing... 2 Hyper-V requirements... 2 2. Hyper-V protection features... 3 Windows 2012 R1/R2 Hyper-V support... 3 Custom

More information

How to handle Out-of-Memory issue

How to handle Out-of-Memory issue How to handle Out-of-Memory issue Overview Memory Usage Architecture Memory accumulation 32-bit application memory limitation Common Issues Encountered Too many cameras recording, or bitrate too high Too

More information

Online Transaction Processing in SQL Server 2008

Online Transaction Processing in SQL Server 2008 Online Transaction Processing in SQL Server 2008 White Paper Published: August 2007 Updated: July 2008 Summary: Microsoft SQL Server 2008 provides a database platform that is optimized for today s applications,

More information

SCALABILITY AND AVAILABILITY

SCALABILITY AND AVAILABILITY SCALABILITY AND AVAILABILITY Real Systems must be Scalable fast enough to handle the expected load and grow easily when the load grows Available available enough of the time Scalable Scale-up increase

More information

A SURVEY OF POPULAR CLUSTERING TECHNOLOGIES

A SURVEY OF POPULAR CLUSTERING TECHNOLOGIES A SURVEY OF POPULAR CLUSTERING TECHNOLOGIES By: Edward Whalen Performance Tuning Corporation INTRODUCTION There are a number of clustering products available on the market today, and clustering has become

More information

Every organization has critical data that it can t live without. When a disaster strikes, how long can your business survive without access to its

Every organization has critical data that it can t live without. When a disaster strikes, how long can your business survive without access to its DISASTER RECOVERY STRATEGIES: BUSINESS CONTINUITY THROUGH REMOTE BACKUP REPLICATION Every organization has critical data that it can t live without. When a disaster strikes, how long can your business

More information

5Get rid of hackers and viruses for

5Get rid of hackers and viruses for Reprint from TechWorld /2007 TEChWoRLd ISSuE 2007 ThEBIG: 5 FIREWaLLS TEChWoRLd ISSuE 2007 ThEBIG: 5 FIREWaLLS TEChWoRLd ISSuE 2007 ThEBIG: 5 FIREWaLLS # # # Load balancing is basically a simple task where

More information

How To Use A Recoverypoint Server Appliance For A Disaster Recovery

How To Use A Recoverypoint Server Appliance For A Disaster Recovery One Solution for Real-Time Data protection, Disaster Recovery & Migration Built-in standby virtualisation server Backs up every 15 minutes up to 12 servers On and Off-site Backup Exchange Mailbox & Message

More information

Backup architectures in the modern data center. Author: Edmond van As edmond@competa.com Competa IT b.v.

Backup architectures in the modern data center. Author: Edmond van As edmond@competa.com Competa IT b.v. Backup architectures in the modern data center. Author: Edmond van As edmond@competa.com Competa IT b.v. Existing backup methods Most companies see an explosive growth in the amount of data that they have

More information

Server Consolidation with SQL Server 2008

Server Consolidation with SQL Server 2008 Server Consolidation with SQL Server 2008 White Paper Published: August 2007 Updated: July 2008 Summary: Microsoft SQL Server 2008 supports multiple options for server consolidation, providing organizations

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

INTRODUCTION ADVANTAGES OF RUNNING ORACLE 11G ON WINDOWS. Edward Whalen, Performance Tuning Corporation

INTRODUCTION ADVANTAGES OF RUNNING ORACLE 11G ON WINDOWS. Edward Whalen, Performance Tuning Corporation ADVANTAGES OF RUNNING ORACLE11G ON MICROSOFT WINDOWS SERVER X64 Edward Whalen, Performance Tuning Corporation INTRODUCTION Microsoft Windows has long been an ideal platform for the Oracle database server.

More information

Innovative technology for big data analytics

Innovative technology for big data analytics Technical white paper Innovative technology for big data analytics The HP Vertica Analytics Platform database provides price/performance, scalability, availability, and ease of administration Table of

More information

Running a Workflow on a PowerCenter Grid

Running a Workflow on a PowerCenter Grid Running a Workflow on a PowerCenter Grid 2010-2014 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise)

More information

7.x Upgrade Instructions. 2015 Software Pursuits, Inc.

7.x Upgrade Instructions. 2015 Software Pursuits, Inc. 7.x Upgrade Instructions 2015 Table of Contents INTRODUCTION...2 SYSTEM REQUIREMENTS FOR SURESYNC 7...2 CONSIDERATIONS BEFORE UPGRADING...3 TERMINOLOGY CHANGES... 4 Relation Renamed to Job... 4 SPIAgent

More information

I. General Database Server Performance Information. Knowledge Base Article. Database Server Performance Best Practices Guide

I. General Database Server Performance Information. Knowledge Base Article. Database Server Performance Best Practices Guide Knowledge Base Article Database Server Performance Best Practices Guide Article ID: NA-0500-0025 Publish Date: 23 Mar 2015 Article Status: Article Type: Required Action: Approved General Product Technical

More information

Gladstone Health & Leisure Technical Services

Gladstone Health & Leisure Technical Services Gladstone Health & Leisure Technical Services Plus2 Environment Server Recommendations Commercial in Confidence Database Server Specifications Database server specifications are based on sizes in use on

More information

IBM ^ xseries ServeRAID Technology

IBM ^ xseries ServeRAID Technology IBM ^ xseries ServeRAID Technology Reliability through RAID technology Executive Summary: t long ago, business-critical computing on industry-standard platforms was unheard of. Proprietary systems were

More information

Reinvent your storage infrastructure for e-business

Reinvent your storage infrastructure for e-business Reinvent your storage infrastructure for e-business Paul Wang SolutionSoft Systems, Inc. 2345 North First Street, Suite 210 San Jose, CA 95131 pwang@solution-soft.com 408.346.1400 Abstract As the data

More information

SmartSync Backup Efficient NAS-to-NAS backup

SmartSync Backup Efficient NAS-to-NAS backup Allion Ingrasys Europe SmartSync Backup Efficient NAS-to-NAS backup 1. Abstract A common approach to back up data stored in a NAS server is to run backup software on a Windows or UNIX systems and back

More information