Using Version Control and Configuration Management in a SAS Data Warehouse Environment

Steve Morton, Applied System Knowledge Ltd

Abstract

Data warehouse management involves many components in addition to the data structures themselves: programs and processes for ETL; application code; tools and add-in programs; metadata for publication; to name just the obvious ones. As the data warehouse grows over time, keeping track of past versions and tracking change in the code can become a serious issue. Even before this, the use of a multi-developer team and the need to control code in a test environment creates a requirement for configuration management and version control. SAS/Warehouse Administrator software provides a strong metadata platform, but only reflects a moment-in-time picture of the warehouse and its processes. Fortunately, SAS/AF software includes a hidden gem: Source Code Manager (SCM), which provides the basic check-in/check-out, versioning and automatic archiving that we need for the most important parts of configuration management. This paper describes the techniques and disciplines used by the author on a recent major data warehouse project, using SCM alongside SAS/Warehouse Administrator software to control versions of code and metadata through system testing and release to production.

Introduction - Why Version Control?

When managing a data warehouse there are many components to keep track of in addition to the data structures themselves. For example:

* programs and job scripts for ETL
* application code
* tools and add-in programs
* exported metadata for publication

SAS/Warehouse Administrator software provides a framework to manage all the structural and process elements, but it is a moment-in-time view of everything - it has no concept of "last version of the data warehouse or mart". It is also focussed on the data warehouse and mart management itself - applications and SAS/AF programs or tools are outside its domain.
So to manage versions of all those elements we need to look beyond SAS/Warehouse Administrator software alone. A data warehouse environment is constantly changing - new subjects are developed at the enterprise level, new data marts and applications deployed, source systems and business rules may change over time, and new sources of data need to be integrated. Add to this the progress of any new or changed element from development through integration tests, to system tests and finally to deployment. Now add the fact that this usually happens in a multi-developer environment, with differing time-scales and priorities.
It is obvious that managing this is not trivial! Software that helps us to do this is almost essential - and that is where version control software comes in. Fortunately, another part of the SAS system - SAS/AF software - offers a solution. This paper outlines how configuration management and version control were implemented on a data warehouse project by the author, using SAS/AF's SCM tool to manage the life-cycle of several of these components. In describing this, I will also show how I set up and use SCM in combination with SAS/Warehouse Administrator.

Data Warehouse Life Cycle - Wheels within Wheels

Each part of a data warehouse has its own life cycle - from design, through development, unit testing, integration and systems testing to production deployment, and finally maintenance. The different parts are on somewhat different 'rhythms'. An enterprise data warehouse itself is on a relatively long change cycle - typically several months for a new subject. This reflects the time taken to create new ETL processes, using new business rules for integration, resolving links to previously existing subjects and validating data.

[Figure: time-line comparing the release cycles of the Enterprise D.W., Data Marts, and Tools and Applications]

In contrast, new data marts in an established data warehouse will be added in a few weeks. Their processes are dependent on the enterprise data warehouse, so they are generally simpler to implement. Timing is sometimes linked to mart applications, which may themselves be dependent on other developments such as SAS/EIS customisations. Corresponding to the enterprise data warehouse cycle are all the specifically related components: code, process tools, extraction applications, descriptive metadata exported from the SAS/Warehouse Administrator management environment and so on. These usually synchronise with the deployment cycle of the component they support.
Development tools are on yet another schedule (in this case I am referring to SAS/Warehouse Administrator add-ins, code generators and the like). Since these are used internally by the project, they
tend to be on a short design-to-deploy time scale; however, their major releases also synchronise with those of the data warehouse components.

Project Data Warehouse Environment

It may be useful to understand the project environment, since that gives a context for the techniques described here. It is as follows:

* Development: Windows desktop, NT Server
* OLTP and enterprise platform: OS/390 mainframe
* Data marts: NT Server

[Figure: development and warehouse management environment (simplified) - mainframe production OLTP and Enterprise Data Warehouse with a separate System Test environment; production NT Server data marts; a development server for application testing, development warehouse and test data marts; a SAS/SHARE server for development libraries and SAS/WA metadata; developer PCs]

Unit testing is carried out on Windows or on personal OS/390 SAS/Connect sessions, as appropriate (several data sources are DB2, so requiring the OS/390 environment to run). System tests for enterprise data warehouse ETL processes are in the OS/390 batch environment. System test for any mart also runs using OS/390 jobs, to take advantage of the mainframe scheduler. These processes are all dependent on extraction from the enterprise data warehouse, so those jobs start later dependent processing on the NT server. Production is deployed on OS/390 by handing over libraries to the operations team, who use mainframe configuration management (ChangeMan) to promote to production.

Some specific points should be noted about these environments.

1. Libname definitions are all external to the process code (i.e. not generated by SAS/Warehouse Administrator). Site standards require all permanent data to be referenced by DD statements in JCL, and parameterised so that jobs can be transferred to production unchanged. This is actually fairly easy to work with, since the NT Server test environment can provide the same librefs by using an include script that assigns them for each job.

2. Extensive use is made of generated code, with several process add-in tools for site-specific requirements. A supporting tool for job generation creates both the process steps for a job and an include-script to run them.
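To picture this approach, here is a minimal sketch of what a generated job input script might look like - the fileref, catalog and entry names are invented for illustration and are not taken from the project:

```sas
/* Librefs and filerefs are assigned outside the code: by DD       */
/* statements in JCL on OS/390, or by a site include script that   */
/* each job runs first on the NT server.                           */
filename prclib catalog 'dwcode.etlsteps';  /* process-step catalog */

/* The job script itself is just %include statements and comments. */
/* Each step is a SOURCE entry generated by SAS/Warehouse          */
/* Administrator and held under SCM version control.               */
%include prclib(extrcust);   /* extract customers from source system */
%include prclib(intgcust);   /* apply integration business rules     */
%include prclib(loadcust);   /* load the warehouse customer table    */
```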
3. Code for individual SAS steps resides in SAS catalogs, so that the SAS input script for a job is a series of %include statements and comments. Input scripts are uploaded into PDS members on the mainframe, or to loose .sas files on the NT server, and then referenced as sysin input.

A golden rule is that no code should ever need to be changed or re-created when moving from Unit test through System test to Production. Only the external environment (JCL or SAS run-time definitions) is changed.

SCM - the Basics

Source Code Manager (SCM) is provided along with SAS/AF software. Its great advantage over external configuration management / version control tools is that it is part of the SAS environment. This means that it can work with all types of entries within SAS catalogs and SAS data sets, as well as whole catalogs or libraries. Non-SAS tools can only handle the items they can see in the host file system. This is a great help to SAS developers, since a SAS catalog is very useful to group together all the parts of a related group of processes. Rather than a large collection of loose source files and job scripts, libraries of SAS catalogs provide an organised grouping. It also allows the same version control environment to be used for SAS programs, SAS/AF frames, stored SCL lists and control tables. It also means that linked tools, such as a 'difference' comparison, can be context-sensitive. When a SAS data set is compared with a different version, 'proc compare' is used, while a comparison of SCL entries uses a text-file comparison such as Microsoft's 'windiff' utility.

The basic principle is this: SCM maintains one or more Team libraries, for each of which it records an archive. Developers who wish to change any code registered with SCM must check out whatever they are about to change into a personal Local library. They then make changes to the local copy, testing until satisfied with results, and finally they check in the completed article.
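The context-sensitive data set comparison mentioned above boils down to a simple PROC COMPARE step; the library and data set names here are illustrative only:

```sas
/* Compare a checked-out local copy of a control table against the */
/* team (base) version before checking the change back in.         */
proc compare base=tctrl.ctltable compare=ctrl.ctltable listall;
run;
```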
The archive then contains the previous versions of the items checked in.

[Figure: the SCM check-out/check-in cycle - items are checked out from the Team libraries to a Local library, worked on, and checked in again; SCM's control database and archives record each version]

Using SCM for Tools and Applications

This is pretty much the 'standard' use of SCM. It was designed to allow multiple developers to work cooperatively together on a project.
We have several distinct uses of SCM for directly developed code. These are:

* SAS macros to be used in the run-time environment, batch or interactive, for the data warehouse (a single MACROS catalog, one entry per macro)
* SAS/Warehouse Administrator add-in tools (the _SASWA library, various catalogs)
* SAS/AF application screens for the 'warehouse explorer' query tool that navigates the star schemas (a single library with several distinct catalogs, functionally separated)
* SCL custom overrides for SAS/EIS classes (another single library with functionally separate catalogs)
* Other SAS/AF application screens for data marts (likewise)

Each one is registered, and maintained at the appropriate time by relevant developers. Any other team members who need a copy in their own development environment can set up their local libref for the required library. They can then Copy the whole library or selected catalogs from Team to Local to get the latest team entries. This is done when, for example, development in one area relies on tools being developed or enhanced in another area.

At distinct key points during development a Version Label is defined. A Version Label is simply a reference set of specific versions of every element required to make up a working set of code at a specified level. This will have a name that includes a version number for the overall set of code - we have used a convention of component_name major_release_number.1 for the first System Test version. So for example, the second major release of the macro tools starts with MACRO TOOLS 2.1 for its System Test version. The feature Copy Version Label uses this definition to allow you to write an image of all these elements to a target library. This is then the 'master' copy which is either uploaded to the server environment (mainframe or NT) or deployed on the LAN as appropriate.

[Figure: Copy Version Label - SCM's control database holds the Version Label definition; the copied image of the labelled versions is built from the archives and sent to the System Test environment]
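Once Copy Version Label has written the release image to a target library, shipping it to the mainframe is a routine SAS/CONNECT step. A minimal sketch, with the librefs, paths and remote session name all invented for illustration:

```sas
/* Local session: the image written by SCM's Copy Version Label.   */
libname relimg 'd:\dw\release\mactools_2_1';

signon mvs;                     /* SAS/CONNECT session to OS/390   */
rsubmit;
  /* Remote session: target System Test library on the mainframe. */
  libname systest 'DWTEST.MACTOOLS.LIB';
  proc upload inlib=relimg outlib=systest memtype=catalog;
  run;
endrsubmit;
signoff;
```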
Using SCM with SAS/Warehouse Administrator software

The use of SCM with SAS/Warehouse Administrator software may not be immediately obvious - after all, there is nothing in the documentation of either that explains how to do this! However, the code generated by SAS/Warehouse Administrator benefits from version control too. An important decision when using SAS/Warehouse Administrator is: How will you use SAS/Warehouse Administrator to manage code? The choices available to you are:

1. Write code and then register it in WA processes, or
2. Generate code using built-in code generators and/or custom process library add-ins

I prefer whenever possible to use option 2 - simply because I can then guarantee that the 'meta-process' accurately describes the code that runs. With option 1 one relies on any code change being replicated in the metadata as well - an easy thing to overlook or make errors in when in a hurry. When you are sure that the metadata accurately represents the process you can confidently use it for impact analysis, cross-reference reporting and other management needs. It is this type of use that I will describe here.

While working on a new (or changed) process in SAS/Warehouse Administrator software, it is usual to run generated code and jobs from that environment to Unit Test each new piece of code. Testing can either take place locally on the PC, or by remote submit to a server test environment through a SAS/Connect link, depending on what resources the code needs to run. For example, extract process code that uses DB2 must run on the mainframe. The most important habit to adopt here is that generated code should always be stored into a Local library catalog entry when completing Unit Test, and the entry must then be checked in to the corresponding Team library. In this way, as Unit Test proceeds the Team library steadily builds up a collection of tested code ready for Integration Test and System Test.
[Figure: code generated from the SAS/Warehouse Administrator metadata is stored to the Local library, then checked in to the Team library, with versions recorded in SCM's control database and archives]
It is important to have good naming conventions for your code entries so that it is easy to locate where any item of code comes from in the SAS/Warehouse Administrator environment. Each process is defined within an appropriately-named Data Group folder in SAS/Warehouse Administrator, with a corresponding source catalog for the steps. We also supported this by using an Extended Attribute on each process, which is used by an add-in tool to automatically save the generated code in the named source entry. Note, this was originally set up in a SAS 6.12 environment - in version 8 one would probably use Job Group folders instead.

In this environment there is a strong link between Unit Test and Integration Test - since processes are tested incrementally, adding a new step to existing process code, the Integration Test happens quite naturally as Unit testing proceeds. This is done by creating finished job stream scripts, also checked in to a team library, and testing the script after uploading to the mainframe.

Version Control for the Data Warehouse Process Code

Once the data warehouse processes are ready for System Test, we reach the first major Version Control point for the process code. This is the moment when the first Version Label is defined in SCM for the warehouse itself, as distinct from the tools that are used to develop it. Using Copy Version Label in SCM we can create the library of code corresponding to a Version Label at any time from SCM's archives - so there is no need to keep any other copies. System Test begins by performing a Copy Version Label to a temporary library and then uploading the resulting code to the mainframe. This is just like the process for other application code, except that there are always at least two catalogs involved - the one containing the process steps, and the one containing the %include scripts.

Incrementing Versions

The aim is to fix most errors during Unit testing.
However, there are bound to be a few minor alterations from the x.1 System test version before everything is ready to go to Production. To handle this, the x.2 version is created - initially identical to x.1, but as altered entries are checked in to the Team library these are updated in the x.2 version label. If we ever found something severe enough to need a separate test level we have the option of incrementing again to x.3 and so on.

Moving to Production

The final step in the process is going to production. In our project environment, this means handing the final System tested version over to mainframe production control procedures, which promote the whole set of code and libraries to the production environment. Thus, the last System test version of each copied Version Label becomes Production. This is also the moment to freeze a copy of the SAS/Warehouse Administrator metadata libraries to correspond to the Production code and jobs. We did consider using SCM to 'check in' these libraries, but decided against that. It is only really necessary to keep each production version, so a simple 'copy' action for the whole warehouse environment definition works well enough.
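That 'copy' action needs nothing more than PROC COPY; the library paths below are hypothetical, not the project's own:

```sas
/* Freeze the SAS/Warehouse Administrator metadata libraries at    */
/* the point of going to Production (paths are illustrative).      */
libname wameta 'd:\dw\wa\metadata';       /* live WA metadata      */
libname frozen 'd:\dw\wa\frozen\rel2_1';  /* frozen production copy */

proc copy in=wameta out=frozen memtype=all;
run;
```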
Setup and Usage Hints and Tips

Based on experience in this environment, I can offer the following tips that may save you time if implementing a similar environment.

* Use SAS/SHARE for both SCM control libraries and your Team libraries, to minimise lockouts. Preallocate these in SAS/SHARE and allocate locally for each developer using, for example, libname tlib server=xx slibref=tlib;
* Use the 'proper' libname (i.e. the one you will use in Production) as your Local library in the Unit test environment. This will ensure that any references to libraries, for example in %include statements or class entries, are unchanged when going to Production.
* Define Team librefs centrally using a script run by the Share server; use a libname that corresponds to the Team library. Define a naming standard - for example, using a prefix letter such as 'T' for team so that all team libraries appear together in the libname list. Note, the 8-character limit for libnames means that you can only use 7 characters from the Local libname for the rest of the Team library name.
* Let SCM allocate librefs - that way everyone is reminded to use it, because they have to run it to access the libraries!
* Using 'View Differences' requires set-up of an option in the administration windows. When you do this, be sure to put quotes around the command to allow embedded blanks in the path reference. Remember also to use &1 &2 operands (also in quotes!) to identify the two comparison files in the command definition.
* Set Archive limits to 0 (unlimited archive). Space taken is tiny, and you do not risk losing back versions. Alternatively set it to a very high number, such as 999.
* 'Public' libraries and 'Team' libraries are not the same thing! If you need a public copy of what is in a Team library so that non-developers can use the application or tool that is there, create a network location and fill it by using the 'Copy Version Label' feature from SCM.
Otherwise these non-developers will find unexpected changes as developers check in altered versions of various items, and developers will experience frequent inability to check in when users' sessions are accessing the library.

* If you use SCM alongside other AF-based SAS products (such as SAS/Warehouse Administrator software or SAS/EIS) and you are developing SCL to use in those environments, remember to specify the "resident=0" option when invoking those products. If you do not, then the changed version of your tool will not be found until you close and restart that product window. This option is documented for the AF and AFA commands, but it applies equally to DW, EIS and RUNEIS commands. (Note, don't use it for your deployment environment - only development!)

Pitfalls, Limitations and Wish List

SCM is very useful, but it is some way from being perfect for these tasks. Here are a few things to watch out for, and some items I wish we had, in both SCM and SAS/Warehouse Administrator software. I've tried to be realistic in my wishing. I could wish for fully integrated version control at the element level in SAS/Warehouse Administrator software - but I know this is a major amount of work, and it is already on the product developers' wish list (I believe).
* Using a 'non-default' font size for your SAS session can confuse some of the SCM dialogs, particularly the 'wizard' ones including the initial set-up wizard. Stick to using 'SAS Monospace' at 8-point size for the windows and all should be well.
* SCM is written in SAS/AF, so it runs in an AF environment. If you develop and test your own SAS/AF tools within SCM and experience a serious abend in your own code, you should close and re-start SCM. If you do not, then you may experience a crash in SCM as well!
* SCM really does not work with SAS/EIS developments. Because some SAS/EIS 'objects' create more than one catalog entry, you would have to remember to check out all parts of the item you were about to work on. Worse, though, is that SAS/EIS has its own 'build' environment, so you cannot launch 'build' for EIS objects from SCM's windows.
* Creating a version label that is identical to an earlier one, to use as the basis for version.release+1 increments, is tedious using the point-and-click approach. SCM really needs an admin feature to easily 'duplicate' a version label definition, which can then have individual changes applied to it. (Note, you could do this by cheating - going behind the scenes to change the scm.verlabel data yourself - but this is never a good practice!)
* The procedure for using SCM with SAS/Warehouse Administrator relies on manual check-out and check-in. I would prefer to be able to call these functions within a SAS/Warehouse Administrator add-in tool to automate the process. The 'experimental' version of SCM did document an API that might have supported this - I would like to see this return in a future version.
* SCM's strength is dealing with SAS 'objects' (data sets, catalog entries etc.); this is also its weakness, as it does not allow one to register external files. SAS/Warehouse Administrator's scheduler support allows jobs and scripts to be written to external files only, and not to catalog entries.
So one cannot use SCM to version any jobs generated using the WA scheduler (this was not a problem for us, since we did not plan to use it, but would be a nuisance otherwise). It would be helpful if SAS/Warehouse Administrator software allowed one to directly write these as catalog entries, rather than just loose files.

Conclusion

Having worked with this version control environment for a while, I would not want to develop seriously without one now. Without a tool providing support, there is much that developers have to do to back up and preserve their work, and control what is released to test and production. SCM provides this and has the advantage of being built in to the SAS environment - it costs nothing, and it supports most types of SAS program development.

Steve Morton
Applied System Knowledge Ltd.
51 Blandy Road
Henley-on-Thames
England
Email: Steve.Morton@appliedsystem.co.uk

SAS, SAS/AF, SAS/EIS, SAS/SHARE and SAS/Warehouse Administrator are registered trademarks of SAS Institute Inc, Cary NC, USA. All other company and product names are trademarks or registered trademarks of their respective owners.