Automatic Generation of Accumulated Data Matrices in a Tabulating Process


CASTILLO, Jesus; CASTRO, Alejandro de; SANTOS, Angel
Departamento de Estadistica - Comunidad de Madrid
Informatica Comunidad de Madrid

Abstract

The SAS System offers great possibilities for presenting information in tabular form. PROC TABULATE is the procedure best suited to this kind of task. Its syntax is simple and powerful: a few lines of code produce tables with a complex structure. However, the output of this procedure is a report, not a data set, so it is difficult to process with other SAS procedures. The work presented at SEUGI '94 aims to develop an automated procedure that produces an accumulated data matrix as a SAS data set, starting from a table definition.

Introduction

A basic aspect of the dissemination policy of an organization that produces statistical information is the format in which that information is distributed. The ability to manipulate the data is of great importance. A priority in preparing the tables of the 1991 Census at the "Departamento de Estadistica de la Comunidad de Madrid" was to obtain data matrices.

PROC TABULATE does not include an option that writes its output to a SAS data set. This made it necessary to write a SAS program for each table in order to obtain a SAS data set. This method also solved two problems that appeared in the tables: the use of different where clauses in the same table, and the accumulation of a subset of the values of a variable. Figure 1 shows the second case.

                         AGE
  ≤20 | 21-40 |  >40  | 41-60 | 61-80 | >80
   1  |   2   | 3+4+5 |   3   |   4   |  5

Figure 1. Accumulation of a subset of the values of a variable.

PROC TABULATE solves neither of these cases.

How to Get Accumulated Data Matrices

The experience gained in developing SAS programs to obtain accumulated data matrices made it possible to work out a systematic method for producing them. A study of the different kinds of tables that the Department generates identified the parts of a table that have to be obtained separately, each with its own accumulation procedure. Basically, two characteristics lead to this conclusion:

- If different where clauses are applied in a statistical table, it is essential to execute different accumulation procedures.
- If the table includes concatenation in one of its dimensions (page, row or column), it is convenient, but not necessary, to execute different accumulation procedures.

For instance, given a data set with information about the population of the "Comunidad de Madrid", the following table can be proposed:

                                  SEX                       AGE
CITY                         Female | Male    ≤20 | 21-40 | 41-60 | 61-80 | >80
Acebeda, La
Ajalvir                         A                           B
Madrid
   DISTRICT      QUARTER
   Centro        Palacio        C                           D
   Arganzuela    Acacias

Figure 2. Table with different where clauses and concatenation.

In the table of Figure 2, four parts can be identified:

- A: population classified by CITY and SEX.
- B: population classified by CITY and AGE.
- C: population of the city of Madrid classified by DISTRICT, QUARTER and SEX.
- D: population of the city of Madrid classified by DISTRICT, QUARTER and AGE.

Different where clauses are applied in the row dimension: the crossing of DISTRICT and QUARTER applies only to the city of Madrid. Two variables, SEX and AGE, are concatenated in the column dimension.

The process of obtaining the accumulated data matrix consists of obtaining every part of the table separately and then composing the parts into a single data matrix. The accumulation procedure used is PROC SUMMARY, which produces accumulated data sets.

Strategy of Development

Programming in the SAS macro language made it possible to develop a tool that, starting from a table definition, produces an accumulated data matrix.

The basic element is the definition of the table whose matrix is to be obtained. The structure of the page, row and column dimensions is defined with a syntax designed for this purpose. The definition grammar is simpler than that of PROC TABULATE; nevertheless, it can define most of the tables that are produced at the Statistics Department. This syntax allows more than one where clause to be applied in the same table, as well as the accumulation of a subset of the values of a variable. We could not give up these advantages in the definition of a table.

The structure of a table is defined by identifying the different groups that compose the page, row and column dimensions. A group is defined by a combination of class variables. Each group can have an associated where clause that restricts the information being processed. Naturally, different statistics can be requested for each group.
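As a rough illustration of how one part of the table could be accumulated separately, the following sketch obtains part C (the city of Madrid classified by DISTRICT, QUARTER and SEX) with PROC SUMMARY. It is not the authors' macro code. The data set name sasuser.data, the output name work.part_c, and the assumption that the input holds one observation per inhabitant with variables CITY, DISTRICT, QUARTER and SEX are all hypothetical.

```sas
/* Sketch only: assumes one observation per person in sasuser.data */
/* (hypothetical layout) with the class variables CITY, DISTRICT,  */
/* QUARTER and SEX.                                                */
proc summary data=sasuser.data nway;
   where city = 'Madrid';          /* where clause of this part     */
   class district quarter sex;     /* classification of part C      */
   /* With no VAR statement, _FREQ_ counts the observations in     */
   /* each cell; NWAY keeps only the full DISTRICT*QUARTER*SEX     */
   /* crossing.                                                    */
   output out=work.part_c (drop=_type_ rename=(_freq_=population));
run;
```

Parts A, B and D would follow the same pattern with a different CLASS list and, for A and B, without the WHERE statement.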

Let us look at an example of the different groups that compose a table:

                                   SEX                               AGE
CITY                        All | Female | Male    ≤20 | 21-40 |  >40  | 41-60 | 61-80 | >80
                                                     1 |   2   | 3+4+5 |   3   |   4   |  5
Acebeda, La
Ajalvir
Madrid
   DISTRICT      QUARTER
   Centro        Palacio
   Arganzuela    Acacias

Figure 3. Identification of groups in a table.

The row dimension is composed of two groups:

- Cities of the "Comunidad de Madrid". The syntax of its definition is very simple:

      CITY

- Districts and quarters of the city of Madrid. Only the information referring to Madrid needs to be processed, so this group has an associated where clause. The syntax of its definition is as follows:

      DISTRICT QUARTER FILTRO: CITY='Madrid'

The structure of the column dimension is different. At first sight, two groups appear, one referring to sex and the other to age:

- Accumulation of the population by sex. An ALL is required in the group. The syntax of its definition is as follows:

      TOTAL SEX

- The columns that refer to age pose a problem. The list of values of the AGE variable is as follows:

      1 : ≤ 20
      2 : 21-40
      3 : 41-60
      4 : 61-80
      5 : > 80

  One column in the table is really the accumulation of a subset of the values of the AGE variable:

      3+4+5 : > 40

  Two groups can be defined to solve this problem:

  - Accumulation of the population by age, up to 40 years:

        AGE FILTRO: AGE <= 2

  - Accumulation of the population by age, from 40 years on. An ALL is required (> 40):

        TOTAL AGE FILTRO: AGE >= 3

The macro call that obtains the data matrix is as follows:

      %mda ( sasuser.data,      /* input data set */
             sasuser.matrix,    /* data matrix to get */
                                /* there is no page dimension */
             /* groups that compose the row dimension */
             CITY +
             DISTRICT QUARTER FILTRO: CITY = 'Madrid',
             /* groups that compose the column dimension */
             TOTAL SEX +
             AGE FILTRO: AGE <= 2 +
             TOTAL AGE FILTRO: AGE >= 3 );

The table is defined by 5 groups: 2 in the row dimension and 3 in the column dimension. It is therefore necessary to carry out 2 x 3 = 6 accumulation procedures, one for each zone of the table. Each one is performed by a PROC SUMMARY to which the where clauses associated with that zone are applied. Then the required observations are selected (the _TYPE_ variable identifies the different accumulation levels). Next, a PROC TRANSPOSE produces the data matrix of each zone. Composing the data matrices of all the zones gives the data matrix of the required table.

Developing the algorithm that interprets the definition of a table structure was a major part of the work. The definition grammar of the page, row and column dimensions is specified by graphs that express the behaviour of the recognition process. These graphs are similar to the ones used in compiler theory to define a grammar. Programming them was very simple, and changes in the definition syntax do not involve difficult modifications to the programs. Figure 4 shows the graph that defines the page dimension syntax.

[Figure 4. Graph that defines the page dimension syntax.]

How the Statistics Technician Defines Tables

A statistics technician who wishes to define a table in order to obtain its data matrix can use the macro directly. However, this is not usual. Tables are normally defined with a PC application.

This product makes it possible not only to define tables but also to manage the definitions: to group tables by study area, to add, delete or update definitions, to define data sets, and so on. The application writes the SAS code that calls the macro.

This interface is not developed in SAS. It was important that the application could be installed on many PCs without software licensing problems. The table can be executed on the same computer where the application is installed or on another computer, even one with a different operating system. If SAS is not available on the same computer as the definition interface, the generated code is transferred to the computer where SAS is available.

The application includes other features that make the user's work easier. It can work directly with SAS data sets, DBF files and ASCII files, and the data matrix can be obtained in any of these formats. These options greatly help a user who does not know how to program in SAS.

Towards a Second Version

Most of the tables that the Statistics Department needs are obtained with this macro. However, some aspects are not yet covered. It is sometimes necessary to manipulate the information before or after the macro is executed.

The macros developed so far form the core of a product that continues to incorporate new options. In preparing a second version, the following aspects are being considered:

- Definition of a general where clause associated with the whole table. Its incorporation is very simple.
- Possibility of defining formats. Three kinds of formats can be distinguished:
  - description formats, which associate a label with each value of a variable;
  - grouping formats, which group several values of a variable under the same description;
  - edition formats, which define the characteristics of the cells of a table, such as length, number of decimals, etc.

Information about the list of values of each variable adds a great deal to the system. It will make it possible to produce tables in which all possible combinations appear. The PRINTMISS option of the TABLE statement in PROC TABULATE works in a similar way. Suppose we have two variables, A and B, defined as follows:

      A = 1,2,3
      B = 1,2

and a data set with four observations:

      A  B
      1  1
      1  2
      3  1
      1  2

The statement

      TABLE A * B;

in a PROC TABULATE produces the following columns:

      A=1 and B=1
      A=1 and B=2
      A=3 and B=1

If the PRINTMISS option is specified, a new column is added:

      A=3 and B=2

with missing values in every cell. This combination did not appear in the data set. However, there is still no column for A = 2. With the list of values, it will be possible to obtain all the possible combinations of the class variables, even those that cannot be deduced from the information in the data set. Two new columns will appear:

      A=2 and B=1
      A=2 and B=2

with missing values in all the cells.

- An aspect that has not yet been considered is the generation of variables. It is a key point to solve. The accumulation process is based on PROC SUMMARY, so information can be obtained only if this procedure supplies it directly, such as the mean, the minimum, etc. It is not possible to obtain derived values such as percentages or sums of cells. The DATA step, PROC SQL and PROC COMPUTAB in SAS/ETS will be the key to solving this problem.
- The final aim of the tool is to obtain accumulated data matrices as SAS data sets. Obtaining the table as a report is the next step. PROC REPORT will generate it, making use of the formats.
- The TABLE statement of PROC TABULATE can define tables with a complex structure. If the tables to be developed in the future require it, the present definition grammar will have to be modified to make it similar to the TABLE statement syntax.
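The example with A and B can be reproduced with a short sketch. It uses the CLASSDATA= option of PROC SUMMARY, available in current SAS releases, as one possible way to force every combination from a list of values into the accumulated data set. This is an assumption made for illustration, not the mechanism planned by the authors, and the data set names OBS, LEVELS and CELLS are hypothetical.

```sas
/* The four observations of the example. */
data work.obs;
   input a b;
   datalines;
1 1
1 2
3 1
1 2
;
run;

/* Hypothetical list of values: A = 1,2,3 and B = 1,2. */
data work.levels;
   do a = 1 to 3;
      do b = 1 to 2;
         output;
      end;
   end;
run;

/* CLASSDATA= asks PROC SUMMARY to include every combination    */
/* found in LEVELS, whether or not it occurs in OBS. With no    */
/* VAR statement, _FREQ_ holds the number of observations.      */
proc summary data=work.obs classdata=work.levels nway;
   class a b;
   output out=work.cells (drop=_type_);
run;

proc print data=work.cells noobs;
run;
```

Under these assumptions, CELLS should contain one row for each of the six combinations of A and B, with a frequency of 0 for the combinations that do not occur in the data, including the two with A = 2.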

Conclusions

The key point of this work is the development of an automated procedure to obtain accumulated data matrices. Generating a data set with accumulated information is much more important than generating a report. The data set can be transferred directly to a database or a spreadsheet, it is accessible to any SAS procedure, and any program can manipulate the data.

What matters is that the generated information is a SAS data set. From this point on, everything is much easier.

The effort is concentrated on the design of the tables, not on their development. Until now, the situation was the opposite. The maintenance cost and the development time decrease enormously, and users have a much greater capacity to analyse the information that is distributed.

References

SAS and SAS/ETS are registered trademarks of SAS Institute Inc., Cary, NC, USA.

Departamento de Estadistica - Comunidad de Madrid
Informatica Comunidad de Madrid
Principe de Vergara, 132-6a
28002 Madrid
Spain
Tfn.- +34-1-580.23.43
Fax.- +34-1-563.82.45