Automatic Generation of Accumulated Data Matrices in a Tabulating Process

CASTILLO, Jesus CASTRO, Alejandro de SANTOS, Angel
Departamento de Estadistica - Comunidad de Madrid
Informatica Comunidad de Madrid

Abstract

The SAS system offers great possibilities for presenting information in tabular form. PROC TABULATE is the procedure best suited to this kind of task. Its syntax is simple and powerful: with a few lines of code, tables with a complex structure can be obtained. However, the output produced by this procedure is not a data set but a report, so it is difficult for another SAS procedure to manipulate it. The work presented at SEUGI'94 aims to develop an automated procedure that produces an accumulated data matrix as a SAS data set, starting from a table definition.

Introduction

A basic aspect of the dissemination policy of an organization that produces statistical information is the format in which that information is distributed. The possibility of manipulating the data is of great importance. A priority in preparing the tables of the 1991 Census at the "Departamento de Estadistica de la Comunidad de Madrid" was to obtain data matrices.

PROC TABULATE does not include an option that writes its output to a SAS data set. This made it necessary to write a SAS program for each table in order to obtain a SAS data set. This method also solved two problems that appeared in the tables: the use of different where clauses in the same table, and the accumulation of a subset of the values of a variable. Figure 1 shows the second case.

                          AGE
  <=20 | 21-40 | >40   | 41-60 | 61-80 | >80
    1  |   2   | 3+4+5 |   3   |   4   |  5

Figure 1. Accumulation of a subset of the values of a variable.

Neither of these cases can be solved using PROC TABULATE.

How to Get Accumulated Data Matrices

The experience gained in developing SAS programs to obtain accumulated data matrices made it possible to work out a systematic method for producing them.

A study of the different kinds of tables that the Department generates made it possible to identify the parts that have to be obtained separately, each with its own accumulation procedure. Two characteristics lead to this conclusion:

  - If different where clauses are applied in a statistical table, it is essential to execute different accumulation procedures.
  - If the table includes concatenation in any of its dimensions (page, row or column), it is convenient, but not necessary, to execute different accumulation procedures.

For instance, given a data set with information about the population of the "Comunidad de Madrid", the following table can be proposed:

                                 SEX                     AGE
                            Female | Male   <=20 | 21-40 | 41-60 | 61-80 | >80
  CITY
    Acebeda, La
    Ajalvir                      A                        B
    Madrid
  DISTRICT      QUARTER
    Centro      Palacio
    Arganzuela  Acacias          C                        D

Figure 2. Table with different where clauses and concatenation.
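For comparison, the column concatenation of Figure 2 on its own is something PROC TABULATE can express; it is the restriction of the DISTRICT and QUARTER rows to Madrid that has no counterpart. The following is only a minimal sketch of the CITY rows, assuming a hypothetical data set named population with one observation per person; the data set name and the sample values are illustrative and do not come from the Census files.

```sas
/* Hypothetical toy data: one observation per person.          */
/* AGE holds the interval code shown in Figure 1 (1 to 5).     */
data population;
  input CITY :$12. SEX :$6. AGE;
  datalines;
Ajalvir Female 1
Ajalvir Male 3
Madrid Female 2
Madrid Male 5
Madrid Female 4
;
run;

proc tabulate data=population;
  class CITY SEX AGE;
  /* Rows: CITY. Columns: SEX and AGE concatenated; a blank    */
  /* between two class variables requests concatenation.       */
  table CITY, SEX AGE;
run;
```

The result of this step is a printed report with counts (N is the default statistic for class variables), not a data set.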

In the table of Figure 2, four parts can be identified:

  A: population classified by CITY and SEX.
  B: population classified by CITY and AGE.
  C: population of the City of Madrid classified by DISTRICT, QUARTER and SEX.
  D: population of the City of Madrid classified by DISTRICT, QUARTER and AGE.

Different where clauses are applied in the row dimension: the crossing of DISTRICT and QUARTER applies only to the City of Madrid. Two variables, SEX and AGE, appear concatenated in the column dimension.

The process for obtaining the accumulated data matrix consists of obtaining every part of the table separately and then composing the parts into a single data matrix. The accumulation procedure used is PROC SUMMARY, which produces accumulated data sets.

Strategy of Development

Programming in the SAS macro language has made it possible to develop a tool that, starting from a table definition, obtains an accumulated data matrix.

The basic element is the definition of the table whose matrix is to be obtained. The structure of the page, row and column dimensions is defined with a syntax designed for this purpose. The definition grammar is simpler than that of PROC TABULATE; nevertheless, it can define most of the tables produced at the Statistics Department. This syntax allows more than one where clause to be applied in the same table, as well as the accumulation of a subset of the values of a variable. We could not give up these advantages in the definition of a table.

The structure of a table is defined by identifying the different groups that compose the page, row and column dimensions. A group is defined by a combination of class variables. Each group can have an associated where clause that restricts the information to be processed. Naturally, different statistics can be requested for each group.
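The idea of obtaining each part with its own PROC SUMMARY and where clause can be sketched for parts A and C of Figure 2. This is only an illustration under stated assumptions, not the code of the macro: the data set population, its sample values, the output names part_a, part_c and parts_ac, and the variables PERSONS and PART are all hypothetical, and stacking the parts with a SET statement is just one possible way of composing them.

```sas
/* Hypothetical toy data: one observation per person.          */
/* A period stands for a missing DISTRICT or QUARTER.          */
data population;
  input CITY :$12. DISTRICT :$12. QUARTER :$12. SEX :$6.;
  datalines;
Ajalvir . . Female
Ajalvir . . Male
Madrid Centro Palacio Female
Madrid Centro Palacio Male
Madrid Arganzuela Acacias Female
;
run;

/* Part A: no where clause, classified by CITY and SEX.        */
/* NWAY keeps only the full CITY*SEX crossing.                 */
proc summary data=population nway;
  class CITY SEX;
  output out=part_a(drop=_type_ rename=(_freq_=PERSONS));
run;

/* Part C: restricted to Madrid, classified by DISTRICT,       */
/* QUARTER and SEX.                                            */
proc summary data=population nway;
  where CITY = 'Madrid';
  class DISTRICT QUARTER SEX;
  output out=part_c(drop=_type_ rename=(_freq_=PERSONS));
run;

/* Composition (an assumption): stack both parts in one data   */
/* set and record which part each row comes from.              */
data parts_ac;
  length PART $1;
  set part_a(in=from_a) part_c;
  if from_a then PART = 'A';
  else PART = 'C';
run;
```

Because no VAR statement is given, PROC SUMMARY only counts observations, so PERSONS is the number of people in each cell. Parts B and D would follow the same pattern with AGE in place of SEX.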

Let us look at an example of the different groups that compose a table:

                                SEX                           AGE
                        All | Female | Male   <=20 | 21-40 | >40   | 41-60 | 61-80 | >80
                                                1  |   2   | 3+4+5 |   3   |   4   |  5
  CITY
    Acebeda, La
    Ajalvir
    Madrid
  DISTRICT      QUARTER
    Centro      Palacio
    Arganzuela  Acacias

Figure 3. Identification of groups in a table.

The row dimension is composed of two groups:

  - Cities of the "Comunidad de Madrid". The syntax of its definition is very simple:

      CITY

  - Districts and quarters of the city of Madrid. Only the information referring to Madrid needs to be processed, so this group has an associated where clause, introduced by the keyword FILTRO (Spanish for "filter"). The syntax of its definition is as follows:

      DISTRICT QUARTER FILTRO: CITY='Madrid'

The structure of the column dimension is different. At first sight, two groups appear: one refers to sex and the other to age.

  - Accumulation of the population by sex. An ALL is required in the group. The syntax of its definition is as follows:

      TOTAL SEX

The columns that refer to age pose a problem. The list of values of the AGE variable is as follows:

  1 : <=20
  2 : 21-40
  3 : 41-60

  4 : 61-80
  5 : >80

The table contains a column that is really the accumulation of a subset of the values of the AGE variable:

  3+4+5 : >40

Two groups can be defined to solve this problem:

  - Accumulation of the population by age, up to 40 years:

      AGE FILTRO: AGE <= 2

  - Accumulation of the population by age, from 40 years on. An ALL is required (>40):

      TOTAL AGE FILTRO: AGE >= 3

The macro call that obtains the data matrix is as follows:

  %mda (sasuser.data,      /* input data set */
        sasuser.matrix,    /* data matrix to get */
        ,                  /* there is no page dimension */
        /* groups that compose the row dimension */
        CITY +
        DISTRICT QUARTER FILTRO: CITY = 'Madrid',
        /* groups that compose the column dimension */
        TOTAL SEX +
        AGE FILTRO: AGE <= 2 +
        TOTAL AGE FILTRO: AGE >= 3);

The table is defined by five groups: two in the row dimension and three in the column dimension. It is therefore necessary to carry out 2 x 3 = 6 accumulation procedures, one for each zone of the table. Each one is performed by a PROC SUMMARY to which the where clauses associated with that zone are applied. Then the required observations are selected (the _TYPE_ variable identifies the different accumulation levels). Next, PROC TRANSPOSE produces a data matrix for each zone. Composing the data matrices of all the zones yields the data matrix of the required table.

Developing the algorithm that handles the definition of a table structure has been really important. The definition grammar of the page, row and column dimensions is specified by graphs that express the behaviour of the recognition process. These graphs are similar to the ones used in compiler theory to define a grammar. Programming them has been very simple, and changes in the definition syntax do not involve difficult modifications to the programs. Figure 4 shows the graph that defines the page dimension syntax.

[Figure 4. Graph that defines the page dimension syntax.]

How the Statistics Technician Defines Tables

The statistics technician who wishes to define a table in order to get its data matrix can use the macro directly. However, this is not usual: tables are normally defined through a PC application. This product makes it possible not only to define tables but also to manage the definitions: to group tables by study area, to add, delete or update definitions, to define data sets, etc. The application writes the SAS code that calls the macro.

This interface is not developed in SAS. It was important that the application could be installed on many PCs without software licensing problems. A table can be executed on the same computer where the application is installed or on another computer, even under another operating system. If SAS is not available on the same computer as the definition interface, the generated code is transferred to the computer where SAS is available.

The application includes other features that make the user's work easier. It can work directly with SAS data sets, DBF files and ASCII files, and the data matrix can be obtained in any of these formats. These options are a great help to a user who does not know how to program in SAS.

Towards a Second Version

Most of the tables that the Statistics Department needs are obtained using this macro. However, some aspects are not yet covered, and it is sometimes necessary to manipulate the information before or after the macro is executed. The macros developed so far form the core of a product that continues to incorporate new options. In preparing a second version, the following aspects are being considered:

  - Definition of a general where clause associated with the table. Its incorporation is very simple.
  - Possibility of defining formats. Three kinds of formats can be distinguished:
      - description formats, which associate a label with each value of a variable;
      - grouping formats, which group several values of a variable under the same description;
      - edition formats, which define the characteristics of the cells of a table, such as length, number of decimals, etc.

Information about the list of values of a variable tells the system a great deal. It will make it possible to obtain tables in which all the possible combinations appear. The PRINTMISS option of the TABLE statement in PROC TABULATE works in a similar way. Suppose we have two variables, A and B, defined as follows:

  A = 1,2,3
  B = 1,2

and a data set with four observations:

  A  B
  1  1
  1  2
  3  1
  1  2

the statement

  TABLE A * B;

in a PROC TABULATE step produces the following columns:

  A=1 and B=1
  A=1 and B=2
  A=3 and B=1

If the PRINTMISS option is specified, a new column is added:

  A=3 and B=2

with missing values in every cell. This combination did not appear in the data set. However, there is no column for A = 2. In the second version it will be possible to obtain all the possible combinations of the class variables, even when the information in the data set does not allow these cases to be deduced. Two new columns will appear:

  A=2 and B=1
  A=2 and B=2

with missing values in all the cells.

An aspect that has not yet been considered is the generation of variables. It is a key point to solve. The accumulation process is based on PROC SUMMARY, so information can be obtained only if this procedure supplies it directly, such as the mean, the minimum, etc. It is not possible to get percentages, additions, etc. The DATA step, PROC SQL and PROC COMPUTAB in SAS/ETS will be the key to solving this problem.

The final aim of the tool is to obtain accumulated data matrices as SAS data sets. Obtaining the table as a report is the next step: PROC REPORT will generate it, making use of the formats.

The TABLE statement of PROC TABULATE makes it possible to define tables with a complex structure. If the tables to be developed in the future call for it, the present definition grammar will have to be modified to make it similar to the TABLE statement syntax.
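The complete set of combinations described above for A = 1,2,3 and B = 1,2 can be sketched with the tools the paper already names, PROC SUMMARY and PROC SQL. This is only one possible approach and not the planned implementation: the data set names obs, values_a, values_b, counts and full_matrix and the variable FREQ are hypothetical, and the lists of values are simply generated in DATA steps.

```sas
/* The four observations of the example. */
data obs;
  input A B;
  datalines;
1 1
1 2
3 1
1 2
;
run;

/* Declared lists of values: A = 1,2,3 and B = 1,2. */
data values_a;
  do A = 1 to 3;
    output;
  end;
run;

data values_b;
  do B = 1 to 2;
    output;
  end;
run;

/* Count the combinations that are present in the data. */
proc summary data=obs nway;
  class A B;
  output out=counts(drop=_type_ rename=(_freq_=FREQ));
run;

/* Cross the declared lists and attach the counts;             */
/* combinations absent from the data keep a missing FREQ.      */
proc sql;
  create table full_matrix as
  select g.A, g.B, c.FREQ
  from (select va.A, vb.B
        from values_a as va, values_b as vb) as g
       left join counts as c
       on g.A = c.A and g.B = c.B
  order by g.A, g.B;
quit;
```

The data set full_matrix has six rows, one per combination of A and B. FREQ is missing for A=2 with either value of B and for A=3 with B=2, which matches the columns with missing cells described in the text.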

Conclusions

The key point of this work is the development of an automated procedure to obtain accumulated data matrices. It is much more important to generate a data set with accumulated information than a report. This data set can be transferred directly to a database or a spreadsheet. It is accessible to any SAS procedure, and any program can manipulate the data.

What matters is that the generated information is a SAS data set. From this point on, everything is much easier.

The effort is now concentrated on the design of the tables, not on their development. Until now, the situation was the opposite. Maintenance cost and development time decrease enormously, and the user gains a much greater capacity to analyse the information that is distributed.

References

SAS and SAS/ETS are registered trademarks of SAS Institute Inc., Cary, NC, USA.

Departamento de Estadistica - Comunidad de Madrid
Informatica Comunidad de Madrid
Principe de Vergara, 132-6a
28002 Madrid
Spain
Tfn.- +34-1-580.23.43
Fax.- +34-1-563.82.45