Tips for Constructing a Data Warehouse Part 2
Curtis A. Smith, Defense Contract Audit Agency, La Mirada, CA

ABSTRACT
Ah, yes, data warehousing. The subject of much discussion and excitement. Within the hallowed pages of this paper, the author will provide a quick overview of data warehouse modeling concepts and then discuss in detail many of his favorite data warehousing tips. Many of these tips have been highlighted in the recurring column the author writes for the San Diego SAS Users Group newsletter. The author will show how to subset and summarize SAS data sets; compress SAS data sets; screen unwanted raw data rows when importing data; read multiple raw data files into a single SAS data set; document SAS data sets; and more!

INTRODUCTION
Within Tips for Constructing a Data Warehouse Part 1 of this discussion, I discussed a typical data warehouse model and provided my favorite tips for constructing a data warehouse. To repeat myself a little, a data warehouse is simply a series of highly organized, scrubbed, integrated, transformed, and normalized SAS data libraries and SAS files that contain the data you and your customers need from which to make business decisions. Do those files you have on your hard disk for the company's 2010 operating statistics, the medical analysis data for the new drug the company is researching, and the Christmas card mailing list form a data warehouse? No, they just form a cluttered hard disk. On the other hand, the monthly indirect cost, monthly work-in-process cost, monthly accounts payable, yearly travel disbursements, and weekly labor charge files separated into libraries for each company segment for each fiscal year, scrubbed, integrated, transformed, and normalized do form a data warehouse. Now it's time to discuss my favorite programming tips for making a data warehouse easier to use and more efficient. I will begin by repeating the data warehouse model that we are using for our illustration.
OUR MODEL
For illustration purposes, we will build a data warehouse from financial-related data of the Fly By Night (FBN) Aeronautics Company. Their primary business is a government contract to build a probe that will orbit and map the surface of the sun. To prevent damage from the intense heat of the sun, the probe will fly by night. Our goal is to build a data warehouse that serves data for pre-made (automated) user applications and provides data for users to create ad hoc queries, analyses, and reports. The primary ad hoc tool will be SAS Enterprise Guide. FBN has three company segments, several years of historical data, and four types of data that we care about: labor, overhead, material transactions, and work-in-process. We get the labor data each week in flat files on tape stored under IBM MVS. And boy, are they big files. We get the overhead and work-in-process data each month from a database query against an Enterprise Resource Planning (ERP) system. And we get the material transaction data as needed from an on-line IMS database. We need the labor and overhead data detailed each month (although we receive the labor data weekly), but need the work-in-process data only as a current year-to-date total. To meet our needs, the material transactions must always be current. We also need several look-up tables containing descriptions for a variety of data entries, such as account titles, that are stored in the ERP system. Our target user platform is a PC connected to the company network via TCP/IP, which connects us to the IBM mainframe and to the Unix system where the ERP system lives.

DATA WAREHOUSE EFFICIENCIES
So far, we have been discussing the basic ways to construct our data warehouse. Optimizing our data warehouse, both our SAS data libraries and SAS data files, can have a tremendous impact on performance and cost. Let us look at a few of my favorite tips that should work in any operating environment.
Data Warehouse Efficiencies
- Size of SAS Data Set Buffers
- Delete Unneeded Variables
- Delete Unneeded Observations
- Look-Up Tables
- Compressing SAS Data Sets
- Indexes
- Format
- Eliminate Look-Up Tables
- Summarized Data vs. Detail
- Methods for Loading Data
SIZE OF SAS DATA SET BUFFERS
A factor that influences data warehouse performance is the size of the buffers SAS uses when we create a SAS data set. We can use the SAS BUFSIZE= system option to control the size of the buffers. I am indebted to Michael Raithel's book "Tuning SAS Applications in the MVS Environment" for his explanation of the BUFSIZE= option. While Michael wrote this book with the mighty mainframe in mind, the explanations hold true in other operating systems. BUFSIZE= sets the size, in bytes, that SAS uses as the page size for a data set when it is created. We specify BUFSIZE= only when we are creating a new SAS data set, and it then becomes a permanent attribute of the data set. Optimizing the size of the buffers can improve execution time. It can also reduce the amount of storage space. However, there is a trade-off (as usual): larger buffer sizes require more memory during execution. The possible values for BUFSIZE= vary with the operating system. Under MS-Windows, the value can range from 512 bytes to 16 megabytes. A value of 0 lets the SAS engine pick a value depending on the size of the observation. Using the BUFSIZE= system option is simple, as illustrated below.

options bufsize=512;
data ab2010.labor_dtl;
  set in.source;
run;

Once the size of the buffers is set, we can see the attributes in the CONTENTS procedure Engine/Host Dependent Information output (see Part 1 for an example). Here is an example where I set the size of the buffers to 2,048, 20,480, and 2,048,000 when creating a SAS data set containing 3,127,923 rows using a simple DATA step.

                           BUFSIZE=2048   BUFSIZE=20480   BUFSIZE=2048000
Data Set Page Size         2,048          20,480          2,048,000
Number of Data Set Pages   208,531        19,...          ...
First Data Page            ...            ...             ...
Max Obs per Page           ...            ...             ...,984
Obs in First Data Page     ...            ...             ...,947
File Size (KB)             417,063 KB     393,461 KB      392,001 KB

As we can see, the BUFSIZE= setting can have a significant impact on the file size!
Check out the benchmarks below from when I created a SAS data set containing 1,181,043 rows using a simple DATA step reading and filtering the SAS data set I created above. As we can see, the I/O time can vary depending on how we tweak the size of the buffers.

BUFSIZE=     DATA Step CPU Time
2,048        ... seconds
20,480       ... seconds
2,048,000    ... seconds

DELETE UNNEEDED VARIABLES
Unneeded variables cost both storage space and processing time. While we do not want to eliminate something our users will need, we also do not want to keep something they will not use. Sometimes, after our data warehouse has been in use for a while, we can reevaluate how our users are using the data warehouse and determine that we can eliminate some variables from our files. Remember, unneeded variables can cost processing time, because SAS must keep track of the variables. What is more, when SAS data sets are stored on tape, the operating system must read the tape through the end of the file. Because data on tape is stored in one long sequence, observation-by-observation, variable-by-variable, bit-by-bit, the more variables we can eliminate, the better. Often, we create a SAS file for our data warehouse from one or more other SAS files. When we do so, to save space and reduce processing time (whenever the SAS files are used), we want to delete any SAS data set variables that we do not need. We use the data set DROP= option to identify which variables to delete or the KEEP= option to identify which variables to retain. Both will accomplish the same thing; one will be easier to use than the other
depending on the number of existing variables we want to eliminate. Frequently, when creating a subset or summary SAS data set, or just processing a SAS data set with a procedure, we do not need all of the variables contained in the input data file. Thus, we should delete the variables we do not need from the source SAS data set. Of course, we save processing time and space when we create the subset or summary file if we eliminate the unneeded variables from the input file rather than from the resulting output file. However, we need to be sure not to delete any variables that we need to process within the DATA step or procedure that creates the subset or summary file. For example, if we need to do a WHERE selection on a variable but do not want that variable in the output SAS data set, we will need to keep it on the input SAS data set but can drop it from the output SAS data set. Deleting unneeded variables can have a dramatic impact on the size of the output SAS data set. For example, a variable of only five bytes in a SAS data set of one million observations will require five million bytes, or approximately 5MB. Given the number of rows in many of today's files, this can add up to a significant amount of space. Consider the following example.

proc sort data=ab2010.wip_dtl(drop=mtdotamt mtdothrs mtdtoamt mtdtohrs)
    out=ab2010.wip_dtl_sorted(drop=rec_type);
  by pbcode account prime cdate;
  where rec_type='1';
run;

proc summary data=ab2010.wip_dtl_sorted missing;
  by pbcode account prime cdate;
  var ytdothrs ytdtohrs ytdotamt ytdtoamt;
  output out=ab2010.wip_sum(drop=_type_)
    sum=ytdothrs ytdtohrs ytdotamt ytdtoamt;
run;

Notice in the SORT procedure that we do not drop the REC_TYPE variable from the input (DATA=) SAS data set because we need it for the WHERE statement. If we had included it in the DROP= option on the input SAS data set, we would have received an error that the REC_TYPE variable was not on the input SAS data set.
In this example, the SUMMARY procedure will produce a SAS data set with only the character variables identified in the BY statement and the four numeric variables identified on the VAR statement (plus, of course, the _FREQ_ and _TYPE_ variables created by PROC SUMMARY). So, why bother dropping the unwanted variables during the SORT procedure? Because by dropping those variables, our intermediate ab2010.wip_dtl_sorted data set is greatly reduced in size. The SORT procedure orders SAS data set observations by the values of one or more character or numeric variables; it either replaces the original data set or, with the OUT= option, creates a new one. [In a later code example in the section on look-up tables, I show an example using KEEP=.]

DELETE UNNEEDED OBSERVATIONS
Unneeded observations cost both storage space and processing time. Again, while we do not want to eliminate something our users will need, we also do not want to keep something they will not use. Sometimes, after our data warehouse has been in use for a while, we can reevaluate how our users are using the data warehouse and determine that we can eliminate some observations from our files. Remember, unneeded observations can cost processing time. What is more, when SAS data sets are stored on tape, the operating system must read the tape through the end of the file. Because data on tape is stored in one long sequence, observation-by-observation, variable-by-variable, bit-by-bit, the more observations we can eliminate, the better. Any observations that are not needed in our SAS data sets just take up space. So, we want to delete them. Doing so will also make our data warehouse more user-friendly: if there are observations that our users never need, then we don't want to force them to constantly filter them out every time they use the SAS data files.
We can eliminate unneeded observations using the WHERE statement when processing SAS data sets in a DATA step or procedure, or with an IF statement when processing external files in a DATA step. When designing our data warehouse, we want to consider, down to the observation level, what our users' needs are. If our users will not need all of the observations in the source data, or all of the observations in one SAS data set that we will use to create a subset SAS data set, then we want to get rid of those we do not need. Consider the sample DATA step below where we eliminate a particular record type observation that we determined our users would never need.
data ab2010.material_dtl_type1;
  set ab2010.material_dtl;
  where rec_type='1';
run;

LOOK-UP TABLES
SAS data sets containing descriptive or coded information and keyed to variables in other files can be called look-up tables. Within these look-up tables we would typically store a unique observation for each key variable. Look-up tables provide us the means to store repetitive information outside our main SAS data sets. This reduces storage space. For example, if our SAS data set contains a variable for our four-digit account, there are 100 different values for the account, and the SAS data set contains 100,000 observations, we will have many observations with the same account number. This is okay, because that is the way the data is recorded. But if we also had a variable in the SAS data set for the 30-character account description, we would carry 30 extra bytes in each of the 100,000 observations (about 3MB), much of it repeated information. In contrast, if we maintained a look-up table with the 100 account numbers and their associated descriptions (and removed the account description from the data file), our look-up table will be small and our SAS data set will be greatly reduced in size. If our source data contains variables that we have placed into a look-up table, we can choose to drop those variables when we create our SAS data sets from our source data. We can create SAS data sets as look-up tables for a variety of codes and descriptions and store them in a common SAS data library. Because our users will likely use the look-up tables with SAS files from all of our SAS data libraries, we will want to place our look-up table library on the same platform as our other SAS data libraries, or on another platform accessible to our client-server session. We will probably want to name our SAS data library with something that will suggest that the library contains our look-up tables.
So, we might use the word look-up or table as a subdirectory name or part of a physical file name. We can create a look-up table in at least two ways. If we already have a listing of all the possible values and the associated descriptions, then we can manually create the needed table. Or, we can create a table by importing just the key values and their descriptive information from the source file into a SAS data set. If our existing SAS data set has duplicate rows on the key variable, then we can sort the imported file by the key value using PROC SORT with the NODUPKEY option. SAS will eliminate all the duplicate rows, leaving us with a nice little look-up table. The sort code would look like the following:

proc sort data=tables.tbl_account nodupkey;
  by account;
run;

Here is a tip if you are working in a client-server environment and are deciding on which platforms to store the data libraries for your data and your tables. Typically, your SAS data set look-up tables will be sorted or indexed so they are ready to use. Not all operating systems sort with the same rules. For example, ASCII orders numbers greater than characters, while EBCDIC orders numbers less than characters. If we are using SAS/CONNECT between a Windows client and an MVS server, and we are using the PC CPU to process, and our look-up tables are on the mainframe server, the look-up tables will be sorted or indexed using the EBCDIC order. But the SAS data set we wish to merge with the look-up table might have just been sorted by the PC CPU, which would put it in the ASCII order. In our SAS log we would find an error message that our look-up table was not in the proper sort order. Once we have a look-up table, we can use it to add the descriptive information to our SAS data sets when we need it. We can add descriptive information from our look-up tables either with a DATA step using a MERGE statement, or with a SQL procedure join.
Below is an example of a DATA step where we use a MERGE statement to add associated information from one of our trusty look-up tables. This example assumes both the data file and the look-up table are sorted or indexed on the key variable.

libname ab2010 '\sas\data\library\ab2010';
libname tables '\sas\data\library\tables';

data work.labor_dtl;
  merge ab2010.labor_dtl(in=a)
        tables.tbl_account(keep=account account_title);
  by account;
  if a;
run;

Notice we use the IN= data set option combined with the IF statement to make the data file the boss in the merge, so that our output SAS data set will have one row for every row of the source data file.
Below is an example using PROC SQL where we add associated information from one of our trusty look-up tables. This example assumes both the data file and the look-up table are sorted or indexed on the key variable.

proc sql;
  create table work.labor_dtl as
    select a.*, b.account_title
    from ab2010.labor_dtl a, tables.tbl_account b
    where a.account=b.account;
quit;

COMPRESSING SAS DATA SETS
There was a day when storage space was at a premium. Then disk space got cheap. As a consequence, we may have stopped using some of those good ol' tricks for reducing storage space. However, in recent years, the amount of data companies have been storing has greatly increased, sometimes overwhelming our available cheap disk space. As SAS data sets get bigger, storage space becomes an issue. So, let's take a look at an old method for reducing the amount of storage space in our data warehouse: compressing data sets. To reduce storage requirements, SAS has an option to compress SAS data set observations by reducing redundancy. Compressed SAS data set observations require less I/O to process. Compression reduces the amount of space needed to store a SAS data set; it does not affect the data stored within that SAS data set. Compressing a SAS data set is done simply by using the system or data set option COMPRESS=. With the COMPRESS= system option in effect, any SAS data set created on disk will be compressed. SAS data set compression can greatly reduce the size of SAS data sets. To use the COMPRESS= system or data set option, set the option to either "YES" or "BINARY." (In newer versions of SAS, CHAR can be used as an alternative to YES with the same result.) The COMPRESS=YES value uses an algorithm that works better with SAS data sets that are primarily comprised of character variables.
On the other hand, COMPRESS=BINARY uses a different algorithm that works better with SAS data sets comprised of many variables, including many numeric variables. My experience has been that COMPRESS=YES reduces the size of a SAS data set by about 50 percent. An option to use with COMPRESS= is REUSE=. Specifying this option allows SAS to reuse space within the compressed SAS data set that has been freed by deleted observations. Otherwise, SAS cannot reclaim the space made available by deleted observations. Consider the following examples, first using the COMPRESS= data set option, then the COMPRESS= system option.

data temp.wip_itd_comp(compress=binary reuse=yes);
  set ab2010.wip_itd;
  where rec_type='1';
run;

options compress=yes reuse=yes;
data work.wip_itd_comp;
  set ab2010.wip_itd;
  where rec_type='1';
run;

After running the DATA step with COMPRESS= set to YES or BINARY, you will note a message in your SAS log that looks something like this:

NOTE: The data set WORK.WIP_ITD_COMP has ... observations and 20 variables.
NOTE: Compressing data set WORK.WIP_ITD_COMP decreased size by ... percent.
      Compressed is 935 pages; un-compressed would require 1369 pages.

I thought you might want to see some benchmark results. I took four SAS data sets with differing numbers of observations and variables. I started with uncompressed SAS data sets, then compressed with normal compression, then compressed with binary compression. File1 and File3 have 20 variables, 5 of which are numeric; File2 and File4 have 29 variables, 10 of which are numeric.
         OBS          COMPRESS=YES REDUCTION   COMPRESS=BINARY REDUCTION
File1    120,...      ...%                     23.72%
File2    183,...      ...%                     49.22%
File3    1,542,...    ...%                     21.02%
File4    6,976,...    ...%                     46.61%

In my examples, COMPRESS=YES always produced better results. However, the files with more variables and more numeric variables (File2 and File4) got almost the same benefit using COMPRESS=BINARY. (Actual results may vary.) Compressing SAS data sets does have a down side: writing and reading compressed observations requires additional CPU time. The decision of whether to compress data sets hinges on whether storage space is at a premium. If not, it generally only makes sense to compress the largest data sets, given the trade-off of additional CPU time needed to read and write compressed data sets.

INDEXES
To avoid having to sort SAS data sets before using a BY statement, we can create one or more indexes on our SAS data sets. Indexing does not rearrange the observations in the SAS data set; it creates pointers used to locate the indexed observations. Indexing is a good user-friendliness technique, as our users will not have to worry about sorting before doing tasks such as merging SAS data sets. However, using indexes for some tasks, such as PROC SUMMARY, may actually be slower than using sorted SAS data sets. I found that WHERE selections process much more quickly when done on indexed variables. However, indexing SAS data sets has a trade-off: the index is a separate file in our SAS data library. The more indexes we create and the more complicated they are, the bigger the index file. Having a couple of indexes and, perhaps, compound indexes can greatly increase the size of our SAS data libraries. I find indexes to be extremely useful if I have one or more variables that I BY-process often. Indexes are basically files with pointers used to locate the desired rows of a SAS data set.
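As a small sketch of the WHERE speed-up mentioned above (the output data set name and the account value '5100' are hypothetical), a selection on an indexed variable lets SAS use the index to go straight to the matching observations instead of scanning the whole file:

```sas
/* Sketch: ACCOUNT is assumed to be indexed on ab2010.overhead_dtl, so
   SAS can satisfy the WHERE clause via the index rather than a full scan. */
data work.one_account;
  set ab2010.overhead_dtl;
  where account='5100';
run;
```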
Indexing SAS variables provides two key benefits: 1) eliminating the need to sort SAS data sets and 2) increasing the efficiency of WHERE processing. Eliminating the need to sort improves the usability of a data warehouse; users do not need to sort a data set prior to embarking on any BY-variable processing for variables included in an index. For example, if we create an index on the variables ACCOUNT and DEPT, we can use the SAS data set as if it were sorted by ACCOUNT or by DEPT. We can also create a compound index that references two or more variables at once; this allows us to access the data set as if it were sorted by multiple variables. Creating indexes can be done with the DATASETS procedure or with a DATA step option. Here are some examples of the syntax for simple indexes, first using PROC DATASETS and then with a DATA step.

proc datasets library=ab2010;
  modify overhead_dtl;
  index create account dept;
quit;

data ab2010.overhead_dtl(index=(account dept));
  set tape.rawdata;
run;

The DATASETS procedure is a utility procedure that manages your SAS files. With PROC DATASETS, you can do the following: copy SAS files from one SAS library to another; rename SAS files; repair SAS files; delete SAS files; list the SAS files contained in a SAS library; list the attributes of a SAS data set; manipulate passwords on SAS files; append SAS data sets; modify attributes of SAS data sets and of variables within the data sets; create and delete indexes on SAS data sets; create and manage audit files for SAS data sets; and create and delete integrity constraints on SAS data sets.

In the preceding examples we created two simple indexes, one for ACCOUNT and one for DEPT. In the next example, we will create a compound index on ACCOUNT and DEPT. Notice that we assign the compound index an index name, KEY in this example. We will never use the index name when addressing the compound index. The name for the index must be a valid SAS name and cannot be the same as any variable name or any other composite index name.

data ab2010.overhead_dtl(index=(key=(account dept)));
  set tape.rawdata;
run;

We might then use this SAS data set in a PROC SUMMARY, without pre-sorting, like this (because we have two variables in the BY statement, the two variables must be a compound index):

proc summary data=ab2010.overhead_dtl;
  by account dept;
  var amount;
  output out=ab2010.overhead_sum sum=amount;
run;

When we look at the output from PROC CONTENTS after creating indexes, we will notice that the number of indexes in the SAS data set is documented. The variables that are indexed and the number of unique values of each are also noted. For fun, run PROC CONTENTS on a SAS data set before and then after creating an index and take careful notice of the change in information provided. Look at Part 1 of our discussion on constructing a data warehouse, in the section on documenting the data warehouse, and notice the sample PROC CONTENTS references to the indexes.

FORMAT
It may seem obvious, but I haven't always remembered to format the variables in my data sets. Sometimes, I would let SAS use default formats. But I found I prefer to control the format so I get what I expect. The FORMAT statement is a simple addition to the DATA step. Here is an example using the DATA step.

data mylib.myfile(label='General Ledger YTD 2004');
  set in.source;
  format account $7. amount 14.2 bu $2. cdate mmddyy10.
         hours 14.2 jv $3. pool $2. tdate mmddyy10.;
run;

When we use the CONTENTS procedure to document our data warehouse (see Part 1 of our discussion), we will see the format of each variable identified.
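To verify the formats actually stored, a quick CONTENTS run on the file we just created does the job (the libref and member name match the FORMAT example above):

```sas
/* List the attributes of the data set, including each variable's format */
proc contents data=mylib.myfile;
run;
```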
ELIMINATE LOOK-UP TABLES
Wait a minute, wait a minute, didn't I discuss earlier that look-up tables are a good thing? Now I'm saying to eliminate them? While look-up tables are a great way to reduce storage space, having data related to our SAS data sets in separate files is not user-friendly. If we add the related data to our files and have it ready for our users and developers within our main SAS data sets, our data warehouse will be much more friendly. As easy as SAS makes it to merge data files and look-up tables, it's still an extra step for our users. So we need to weigh the trade-off between user-friendliness and storage space.

SUMMARIZED DATA VS. DETAIL
In Part 1 of our discussion, we discussed the level of detail we keep in the data files in our data warehouse, and we discussed briefly how we could use the SUMMARY procedure to create summary-level files. This topic is so important that we will now look at some details of this wonderful procedure and the different output we can achieve when summarizing detail files in our data warehouse. The SUMMARY procedure provides data summarization tools that compute descriptive statistics for variables across all observations or within groups of observations.

SUMMARY PROCEDURE
Frequently, PROC SUMMARY (which, by the way, is almost identical to the MEANS procedure) is used to summarize a SAS data set into a smaller SAS data set having only one record for each occurrence of the specified variable(s). Summarizing a SAS data set is very useful when we expect to use a SAS data set more than once and don't need the original level of detail. Having the data summarized into a smaller data set reduces the amount of processing time and associated cost each time the summarized data set is used, and reduces the cost of storage. Also, original SAS data sets are sometimes too large to store on disk, but a summarized SAS data set made from that huge SAS data set might easily fit onto a disk.
There are a few key statements we use with PROC SUMMARY. Let's look at each separately. The code examples that follow include each of these key statements. Following is an example of the PROC SUMMARY code using the CLASS statement.

proc summary data=ab2010.wip_dtl missing nway;
  class div account journal;
  var amount;
  output out=ab2010.wip_sum sum=amount;
run;

And here is an example of the PROC SUMMARY code using the BY statement.

proc summary data=ab2010.wip_dtl;
  by div account journal;
  id div_name;
  var amount;
  output out=ab2010.wip_sum sum=amount;
run;

BY AND CLASS STATEMENTS
The BY and CLASS statements are used to control the order or grouping of selected variables. When we use the BY statement to identify the variables on which we want to summarize, the input file must already be sorted or indexed by the same variables. And, if we first index the input file and plan to summarize on more than one variable, the index must be a compound index. The CLASS statement can be used rather than the BY statement and does not require the input file to be sorted or indexed. However, the CLASS statement requires more memory than the BY statement. This can be a resource issue if the input file is really big. Also, when the CLASS statement is used, SAS will drop from the output file observations with a missing or null value in any of the variables in the CLASS list. This could have disastrous results. However, losing such observations will be prevented if we use the MISSING option. All variables identified in a BY or CLASS statement should be character variables, unless they are numeric variables with few, discrete values. When we use the CLASS statement, PROC SUMMARY creates summary records for each possible level of interaction. For example, in the example above, we summarized on DIV, ACCOUNT, and JOURNAL. The CLASS
statement would produce a summary observation for each unique combination of DIV, ACCOUNT, and JOURNAL, a summary observation for each unique combination of DIV and ACCOUNT, a summary observation for each unique combination of DIV and JOURNAL, and so forth. The automatic variable _TYPE_, created by PROC SUMMARY, records the level of interaction. Creating summarized data sets this way can be dangerous for our users, as they can easily double, triple, quadruple, etc., count the values in the data set unless they know how to select on the _TYPE_ variable. Frequently, only the observations for the highest level of interaction (one observation for each unique combination of all variables in the CLASS statement) are needed. If we want observations for only the highest level of interaction and we use the CLASS statement, the NWAY option will cause PROC SUMMARY to retain only the observations with the highest level of interaction in the output file. If the BY statement is used, SAS will produce observations for only the highest level of interaction among the variables in the BY list. Therefore, the BY statement will produce the same output file as the CLASS statement with the NWAY option (unless there are observations with missing values, which will be dropped when we use the CLASS statement without the MISSING option).

ID STATEMENT
The ID statement is used to specify any character variables not specified in the BY or CLASS variable list that we want to retain in the output SAS data set. Variables we specify with the ID statement are not summarized. Rather, the value of the ID variable on the last observation summarized for the BY or CLASS variable list is retained. This is useful when a variable must be retained in the output file but has the same value for each combination of the BY or CLASS variable list. Variables listed with an ID statement could be added to the BY or CLASS variable list to produce the same result.
However, doing so will create a more complicated summarization sequence that will slow the summarization process.

VAR STATEMENT
The VAR statement is used to specify the numeric variables that should be summarized (or retained using another output statistic). Any variable in the input data set not specified with a BY, CLASS, ID, or VAR statement will not be included in the output data set.

OUTPUT STATEMENT
The OUTPUT statement is used to specify the output SAS data set and the output statistic to act upon the variables identified in the VAR variable list. Within the OUTPUT statement we specify the output libref.filename with the OUT= option. We specify the output statistic, followed by an "=" and then the variables listed in the VAR statement. If our goal is to collapse the input SAS data set into a less detailed output SAS data set, we will use the SUM output statistic. There are a variety of statistics we can retain in the summary results. But when our objective is to create a summary file from a detail file in our data warehouse, the SUM statistic should be all we need. The SAS on-line help will provide you with the other available statistics.

AUTOMATIC VARIABLES
PROC SUMMARY creates two variables automatically: _FREQ_ and _TYPE_. A numeric count of the number of observations from the input SAS data set summarized into each single observation in the output file is stored in the _FREQ_ variable. The _TYPE_ variable contains a numeric value identifying the level of interaction among the CLASS variables. When a BY statement is used, the _TYPE_ variable will always equal 0. When the CLASS statement is used, the _TYPE_ variable will contain 0 for a grand total observation and values of 1 through n for the various levels of interaction among the CLASS variable list. Version 8 of the SAS System introduced the WAYS and LEVELS statements. Check out my WUSS 2002 paper "New Ways and Means to Summarize Files" for exciting details about these features.
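As a quick taste of the WAYS statement, here is a sketch (reusing the CLASS example from above): with three CLASS variables, WAYS 3 requests only the three-way combinations, which matches what the NWAY option produces.

```sas
/* Sketch: keep only the highest level of interaction (DIV*ACCOUNT*JOURNAL)
   using the WAYS statement instead of the NWAY option. */
proc summary data=ab2010.wip_dtl missing;
  class div account journal;
  ways 3;
  var amount;
  output out=ab2010.wip_sum sum=amount;
run;
```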
Also, see my WUSS 2000 paper Subsetting Files vs. Summarizing Files for even more examples of SAS code for summarizing files. METHODS FOR LOADING DATA Common techniques for getting data into SAS data sets include the IMPORT procedure, various SAS/ACCESS modules, Open DataBase Connectivity (ODBC), SAS Enterprise Guide, the Excel engine, and, of course, the DATA step. Each method has its strengths, but each also has weaknesses. For example, some require additional licensing. Some are not easily repeatable in a production environment. Some, like the IMPORT procedure, are not very flexible. But the DATA step has no limits and is my favorite technique for loading my data warehouse.
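One illustration of that flexibility: a single DATA step can read a whole series of flat files in one pass. This is a sketch under assumed file names (week1.csv through week4.csv are hypothetical weekly downloads):

```sas
/* Read several weekly download files in one DATA step.      */
/* The FILEVAR= option points INFILE at each file in turn;   */
/* END= flags the last record of the currently open file.    */
data work.labor;
   do week = 1 to 4;                 /* four hypothetical files */
      fname = cats('week', put(week, 1.), '.csv');
      done = 0;                      /* reset before each file  */
      do until (done);
         infile dummy filevar=fname dsd truncover end=done;
         input account :$4. amount hours;
         output;
      end;
   end;
   stop;                             /* all files read; quit    */
   drop fname done;
run;
```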
The DATA step offers a wealth of options and abilities, and can be included in production code for repeatability. Here are my favorite aspects of the DATA step used to enhance my data warehouse.

- Reading multiple source files at once. Often I have to break up my downloads into separate files. This ability makes importing multiple files simple.
- The LABEL= option to add a data set title. What's a data set without a descriptive title? Always add a data set title. (Check out my WUSS 2005 article Documenting Your Data Using the Contents Procedure.)
- The LABEL statement to add a description to each variable. Why rely on the variable names to convey the content of a variable? Always add variable titles.
- The INDEX= option to create variable indexes. While indexes aren't required, they sure are a good idea if our data warehouse users will be doing a lot of record selections and BY processing.
- The INFORMAT statement to accurately read imported data values. Some of the other methods for reading data into SAS don't allow you to set the informat, causing data input inaccuracies.
- The FORMAT statement to apply the desired formats to the data set. Of course, we want to control the way our data values are displayed.
- The DROP=/KEEP= options to retain just what we need. Sometimes we need to read data items into a DATA step only to affect the data step processing, and then don't want to keep those items. So, we drop what we don't want and keep just what we want.
- IF-THEN-ELSE processing to massage our data into the data set we want. We may not want all of the rows from the raw data, or we might want to conditionally modify variables. The IF-THEN-ELSE statement can be our friend.
- Easy variable assignments. We might want to add a new variable or change the value of one.

Here is some simple DATA step code demonstrating these capabilities:

data ab2010.wip_ytd(label='General Ledger YTD 2010'
                    index=(account)
                    drop=hours);
   informat account $4.
            amount 15.4
            hours 15.1;
   infile csv delimiter=',' missover dsd lrecl=32767;
   input account amount hours;
   if hours = 0 then delete;
   rate = amount/hours;
   format account $4.
          amount 15.2
          rate 5.2;
   label account = 'Account Description'
         amount  = 'Dollar Amount'
         rate    = 'Pay Rate';
run;

There is another great capability with the DATA step. With the popularity of Excel files, we have an easy way to get Excel files into SAS: Dynamic Data Exchange (DDE). We can use DDE within a SAS DATA step to make importing an Excel file routine and easily repeatable. (Check out my WUSS 2010 article Importing Excel Files Into SAS Using DDE.) CONCLUSION Useful data warehouses do not just happen. Careful and thoughtful planning will result in the blueprints for a good, user-friendly SAS data warehouse. If our data warehouse proves to be less than user-friendly, do as I did: spend the time to go back to the drafting board and redesign it. Users who have a good data warehouse from which to analyze the data they need will quickly recover the time spent designing or redesigning the data warehouse. REFERENCES SAS Companion for the MVS Environment, First Edition, Chapter 17 (BLKSIZE). Tuning SAS Applications in the MVS Environment, Michael A. Raithel, Chapter 4. SAS On-Line Documentation.
CONTACT INFORMATION The views expressed herein are solely those of the author and do not necessarily represent the views, policies, or endorsement of the Department of Defense. Your comments and questions are valued and encouraged. Contact the author at: Curtis A. Smith, P.O. Box, Fountain Valley, CA; Fax: . SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.