Preparing Real World Data in Excel Sheets for Statistical Analysis



Similar documents
A Macro to Create Data Definition Documents

Eliminating Tedium by Building Applications that Use SQL Generated SAS Code Segments

REx: An Automated System for Extracting Clinical Trial Data from Oracle to SAS

Methodologies for Converting Microsoft Excel Spreadsheets to SAS datasets

Importing Excel File using Microsoft Access in SAS Ajay Gupta, PPD Inc, Morrisville, NC

Managing Tables in Microsoft SQL Server using SAS

CHAPTER 1 Overview of SAS/ACCESS Interface to Relational Databases

A Method for Cleaning Clinical Trial Analysis Data Sets

Managing very large EXCEL files using the XLS engine John H. Adams, Boehringer Ingelheim Pharmaceutical, Inc., Ridgefield, CT

Using Macros to Automate SAS Processing Kari Richardson, SAS Institute, Cary, NC Eric Rossland, SAS Institute, Dallas, TX

Flat Pack Data: Converting and ZIPping SAS Data for Delivery

Tales from the Help Desk 3: More Solutions for Simple SAS Mistakes Bruce Gilsen, Federal Reserve Board

An Approach to Creating Archives That Minimizes Storage Requirements

Data Presentation. Paper Using SAS Macros to Create Automated Excel Reports Containing Tables, Charts and Graphs

Paper An Introduction to SAS PROC SQL Timothy J Harrington, Venturi Partners Consulting, Waukegan, Illinois

Choosing the Best Method to Create an Excel Report Romain Miralles, Clinovo, Sunnyvale, CA

SAS and Electronic Mail: Send faster, and DEFINITELY more efficiently

How to Reduce the Disk Space Required by a SAS Data Set

Search and Replace in SAS Data Sets thru GUI

CDW DATA QUALITY INITIATIVE

Analyzing the Server Log

SAS/ACCESS 9.3 Interface to PC Files

ing Automated Notification of Errors in a Batch SAS Program Julie Kilburn, City of Hope, Duarte, CA Rebecca Ottesen, City of Hope, Duarte, CA

Integrating SAS and Excel: an Overview and Comparison of Three Methods for Using SAS to Create and Access Data in Excel

ABSTRACT THE ISSUE AT HAND THE RECIPE FOR BUILDING THE SYSTEM THE TEAM REQUIREMENTS. Paper DM

ABSTRACT INTRODUCTION SAS AND EXCEL CAPABILITIES SAS AND EXCEL STRUCTURES

It s not the Yellow Brick Road but the SAS PC FILES SERVER will take you Down the LIBNAME PATH= to Using the 64-Bit Excel Workbooks.

Same Data Different Attributes: Cloning Issues with Data Sets Brian Varney, Experis Business Analytics, Portage, MI

Effective Use of SQL in SAS Programming

More Tales from the Help Desk: Solutions for Simple SAS Mistakes Bruce Gilsen, Federal Reserve Board

SAS Programming Tips, Tricks, and Techniques

Combining SAS LIBNAME and VBA Macro to Import Excel file in an Intriguing, Efficient way Ajay Gupta, PPD Inc, Morrisville, NC

Table Lookups: From IF-THEN to Key-Indexing

Providing Historical Control Data for Toxicology With the SAS Stored Process Web Application

Data-driven Validation Rules: Custom Data Validation Without Custom Programming Don Hopkins, Ursa Logic Corporation, Durham, NC

PROC SQL: Tips and Translations for Data Step Users Susan P Marcella, ExxonMobil Biomedical Sciences, Inc. Gail Jorgensen, Palisades Research, Inc.

Post Processing Macro in Clinical Data Reporting Niraj J. Pandya

Microsoft' Excel & Access Integration

SUGI 29 Coders' Corner

Labels, Labels, and More Labels Stephanie R. Thompson, Rochester Institute of Technology, Rochester, NY

How To Write A Clinical Trial In Sas

Statistics and Analysis. Quality Control: How to Analyze and Verify Financial Data

Top 10 Things to Know about WRDS

Do It Yourself (DIY) Data: Creating a Searchable Data Set of Available Classrooms using SAS Enterprise BI Server

Tips, Tricks, and Techniques from the Experts

Customized Excel Output Using the Excel Libname Harry Droogendyk, Stratia Consulting Inc., Lynden, ON

How to Use SDTM Definition and ADaM Specifications Documents. to Facilitate SAS Programming

The SET Statement and Beyond: Uses and Abuses of the SET Statement. S. David Riba, JADE Tech, Inc., Clearwater, FL

Using SAS With a SQL Server Database. M. Rita Thissen, Yan Chen Tang, Elizabeth Heath RTI International, RTP, NC

SAS ODS HTML + PROC Report = Fantastic Output Girish K. Narayandas, OptumInsight, Eden Prairie, MN

DBF Chapter. Note to UNIX and OS/390 Users. Import/Export Facility CHAPTER 7

AN ANIMATED GUIDE: SENDING SAS FILE TO EXCEL

From The Little SAS Book, Fifth Edition. Full book available for purchase here.

A Recursive SAS Macro to Automate Importing Multiple Excel Worksheets into SAS Data Sets

Using DDE and SAS/Macro for Automated Excel Report Consolidation and Generation

Alternatives to Merging SAS Data Sets But Be Careful

Paper Merges and Joins Timothy J Harrington, Trilogy Consulting Corporation

Writing cleaner and more powerful SAS code using macros. Patrick Breheny

OBJECT_EXIST: A Macro to Check if a Specified Object Exists Jim Johnson, Independent Consultant, North Wales, PA

You have got SASMAIL!

Writing Data with Excel Libname Engine

One problem > Multiple solutions; various ways of removing duplicates from dataset using SAS Jaya Dhillon, Louisiana State University

Storing and Using a List of Values in a Macro Variable

Create an Excel report using SAS : A comparison of the different techniques

Introduction to Microsoft Access 2003

Paper Creating Variables: Traps and Pitfalls Olena Galligan, Clinops LLC, San Francisco, CA

Importing Excel Files Into SAS Using DDE Curtis A. Smith, Defense Contract Audit Agency, La Mirada, CA

Web Reporting by Combining the Best of HTML and SAS

A database is a collection of data organised in a manner that allows access, retrieval, and use of that data.

Imelda C. Go, South Carolina Department of Education, Columbia, SC

B) Mean Function: This function returns the arithmetic mean (average) and ignores the missing value. E.G: Var=MEAN (var1, var2, var3 varn);

How To Create A Powerpoint Intelligence Report In A Pivot Table In A Powerpoints.Com

- Suresh Khanal. Microsoft Excel Short Questions and Answers 1

Tips and Tricks for Creating Multi-Sheet Microsoft Excel Workbooks the Easy Way with SAS. Vincent DelGobbo, SAS Institute Inc.

Programming Tricks For Reducing Storage And Work Space Curtis A. Smith, Defense Contract Audit Agency, La Mirada, CA.

SAS Credit Scoring for Banking 4.3

PharmaSUG Paper QT26

Project Request and Tracking Using SAS/IntrNet Software Steven Beakley, LabOne, Inc., Lenexa, Kansas

ACCESS Importing and Exporting Data Files. Information Technology. MS Access 2007 Users Guide. IT Training & Development (818)

Leveraging the SAS Open Metadata Architecture Ray Helm & Yolanda Howard, University of Kansas, Lawrence, KS

Let SAS Modify Your Excel File Nelson Lee, Genentech, South San Francisco, CA

Simply Accounting Intelligence Tips and Tricks Booklet Vol. 1

PharmaSUG Paper AD08

Getting Started Guide

Developing an On-Demand Web Report Platform Using Stored Processes and SAS Web Application Server

Improving Your Relationship with SAS Enterprise Guide

Release 2.1 of SAS Add-In for Microsoft Office Bringing Microsoft PowerPoint into the Mix ABSTRACT INTRODUCTION Data Access

5. Crea+ng SAS Datasets from external files. GIORGIO RUSSOLILLO - Cours de prépara+on à la cer+fica+on SAS «Base Programming»

MAS 500 Intelligence Tips and Tricks Booklet Vol. 1

Essential Project Management Reports in Clinical Development Nalin Tikoo, BioMarin Pharmaceutical Inc., Novato, CA

New Tricks for an Old Tool: Using Custom Formats for Data Validation and Program Efficiency

AN INTRODUCTION TO MACRO VARIABLES AND MACRO PROGRAMS Mike S. Zdeb, New York State Department of Health

Overview. NT Event Log. CHAPTER 8 Enhancements for SAS Users under Windows NT

Getting Started Guide SAGE ACCPAC INTELLIGENCE

THE INTRICATE ORGANIZATION AND MANAGEMENT OF CLINICAL RESEARCH LABORATORY SAMPLES USING SAS/AF"

Let There Be Highlights: Data-driven Cell, Row and Column Highlights in %TAB2HTM and %DS2HTM Output. Matthew Flynn and Ray Pass

Microsoft Office 2010: Access 2010, Excel 2010, Lync 2010 learning assets

Ad Hoc Advanced Table of Contents

SPSS: Getting Started. For Windows

ABSTRACT INTRODUCTION THE MAPPING FILE GENERAL INFORMATION

Transcription:

Paper DM03 Preparing Real World Data in Excel Sheets for Statistical Analysis Volker Harm, Bayer Schering Pharma AG, Berlin, Germany ABSTRACT This paper collects a set of techniques of importing Excel sheets and creating keys to prepare the data for statistical analysis using proc import and macro variables. The data stem from a real world problem that arose due to a possible production or transport problem. Main topics that came up were: How to name lists of variables in a set of sheets? How to preserve contents of labels in a readable form? How to import character columns with some numerical values? How to eliminate superfluous blanks and line feeds? How to rename variables in automated way? INTRODUCTION To assess factors that affect the quality of our products data were given to us in a manually collected Excel sheet. Goal of one of the projects was to collect and merge all available related logistic and production data. The data stem from various systems, which all have an Excel interface. As SAS is our system for statistical analysis, my task was to import the Excel sheets and to merge the data sets to get a consistent set of data, so that a multivariate approach as Principal Component Analysis or Logistic Regression could be applied. WHAT DOES REAL WORLD MEAN? In all we had ten Excel workbooks containing 14 Excel sheets with up to 120 columns and up to 900 lines. Some workbooks contained manually collected data. This means you have to be aware of typos, comments in numerical columns, line feeds and other special characters in cells. Especially the individual layout of header lines in Excel sheets require special attention. Sheets that were delivered from Excel interfaces of the various operational systems give further challenges: Some variables that mean the same have different names in different systems or may have different formats. Real World also means that there are lots of variables, where you do not exactly understand the meaning of the content. So you have to be careful to preserve the information, which is given in column headers. THE NAMING SCHEME The first step in preserving this information was to develop a naming scheme for the variables of the resulting merged data set. The variable names in this data set must reflect the workbook they come from, the sheet and the variable name in a way that our statistician could easily identify it while doing the analysis. The workbooks were delivered from the operational function and had typical long names as #Status_4 mit Produktions- und Versanddaten 20070413.xls. So at first we translated the ten workbook names into a list tp01 to tp10 and added another digit for the number of the sheet in the workbook. So tp041 is a typical name for a data set of an imported Excel sheet. Due to the same reason we did not use column headers as variable names, but used the automatic numbering provided by proc import while importing Excel sheets resulting in variable names F1-Fn. IMPORTING THE EXCEL SHEETS When we started importing the data, it was clear to choose a method that is quite repeatable and makes almost no assumptions about the structure of the data. Therefore we decided to use proc import after also considering the Import Wizard or the External File Interface (EFI). IMPORTING THE DATA To get a consistent structure of code little macros were used. For each Excel sheet we started importing the data part: %macro ImportData (fnm =, sheet =, drange =); * creates data set work.tp_d that contains the data of the specified Excel; * sheet; * fnm - filename of Excel worbook; * sheet - name of Excel sheet (in quotes); 1

* drange - range of data in Excel sheet (in quotes); * Global macro variables: * pjd - contains the name of directory of the Excel workbooks; proc import dbms = EXCEL2000 replace datafile = "&pjd.\&fnm." out = work.tp_d (label = "Data from &fnm."); sheet = &sheet.; range = &drange.; getnames = no; mixed = yes; By specifying the range in the Excel sheet appropriately only the data not the header information was imported. The header information was handled separately as the content of the header in most cases is not suited to form reasonable variable names. Instead we set the option getnames to no to get an automatic numbering of the variables (F1-Fn). Importing columns that contain both character and numeric data is a bit tricky. If SAS decides the column to be of type numeric character entry entries are read in as missing, which seems reasonable. You have to take care of. But the other way round, if SAS decides a column to be of type character, strings, which can be interpreted as numeric, are also read in as missing. The second case caused a real problem. Main key variables in our data are the batch numbers. There are two types of batches, filling batches and sales batches. Filling batch numbers consist of five digits (e.g. 55000). Sales batches, which are produced from filling batches normally have numbers consisting of five digits followed by a character (e.g. 55000A). But in some cases the whole filling batch becomes a sales batch and the sales batch number is the same as the filling batch number. SAS sees a numerical value, sets it to missing, and the batch is lost. The remedy is setting the option mixed to yes. Then SAS sets the variable to be of character type, if it finds character values, and converts the numerical values to character. IMPORTING THE HEADER The next step is to import the column headers. We put them in a separate data set. The idea is to use them later on as labels for the variable in the resulting data set. %macro ImportHeader (fnm =, sheet =, hrange =); * creates data set work.tp_h that contains the column headers of the; * specified Excel sheet; * fnm - filename of Excel workbook; * sheet - name of Excel sheet (in quotes); * hrange - range of header in Excel sheet (in quotes); * Global macro variables: * pjd - contains the name of directory of the Excel workbooks; proc import dbms = EXCEL2000 replace datafile = "&pjd.\&fnm." out = work.tp_h (label = "Column headers from &fnm."); sheet = &sheet.; range = &hrange.; getnames = no; The specified header range hrange should consist of only one line to get a data set with only one observation. SETTING THE LABELS The following macro combines the two temporary data sets and creates the raw import data set for each Excel sheet. %macro SetLabels (dsn =); * creates data set work.&dsn.; * line feeds and multiple are removed from the column headers; * contents of tp_h is used to create a attrib statement for tp_d; * dsn - name of the resulting data set; data _null_; set work.tp_h end = last; array vars _all_; 2

attrib labels length = $20000; retain labels " "; do i = 1 to dim(vars); vars[i] = compbl(translate(vars[i], " ", byte(10))); labels = trim(left(labels)) ' ' vname(vars[i]) ' LABEL="' vars[i] '"'; if last then call symput("labels", trim(left(labels))); data work.&dsn.; set work.tp_d; attrib &labels.; The resulting data set now has variable names F1 to Fn, which is quite convenient for printing a compact list, and a description of the variable in the variable label, if you want to know. For convenience line feeds, which can occur in Excel cells and really destroy your output, are removed as well as multiple blanks. PREPARING THE DATA FOR STATISTICAL ANALYSIS So far we have a set of robust methods to get the Excel data into SAS. In the following some steps to make the data more appropriate for statistical analysis will be described. CONVERTING CHARACTER VARIABLES TO NUMERICAL VARIABLES The use of option mixed leads to a number of character variables, which are actually numerical variables. If you use the following little macro utilizing the input function, the conversion is quite straightforward. %macro Char2Num (dsn =, var =, format = best8.); * converts a character variable to a numerical with same name and label; ; * dsn - data set name; * var - name of variable to be converted; * format - format of the new variable; data tmp; set &dsn.; call symput ('Label', vlabel (&var.)); NumValue = input (&var., &format.); drop &var.; data &dsn.; set tmp; label &var. = &label.; &var. = NumValue; drop NumValue; proc datasets nolist; delete tmp; This macro can easily be extended to convert more than one variable in a data set. If you have a lot of variables to be converted, this may be more efficient. Converting character variables to numerical variables can be done analogously by using the put function. COMPRESSING CHARACTER VARIABLES Later during the statistical analysis results are mainly presented by printing lists. As mentioned above, it happens often that line feeds and multiple blanks appear in Excel cells. The technique we used to compress character values when setting the labels can be applied to all character values of the data set. And we went even a step further. If you reduce the length of the values, the length of the variables remains the same. So if it happens that there are really masses of blanks in your character variables, you can reduce the length of the variables by constructing a length statement evaluating the actual length of the character values. Here is the code we used: %macro CompressCharVars (dsn =); * replace line feeds by blank; * reduce length of variables; 3

; * dsn name of data set; data ds; set &dsn. end = last; array ac _character_; do i = 1 to dim(ac); ac[i] = compbl(translate(ac[i],' ',byte(10))); ac[i] = left(trim(ac[i])); if last then call symput ('NoOfCharVars', dim(ac)); * get names of character variables; proc contents data = work.tp011 out = work.cont noprint; data work.ccont; set work.cont; where type = 2; n = _n_; keep name type n; proc sort data = ccont; by descending n; * build attrib statement; %let e = ; %do i = 1 %to &NoOfCharVars.; data work.ccont; set work.ccont; call symput ('var', name); if n = &i. then delete; * get length of variable; * create ATTRIB statement in e; data _null_; set ds (keep = _character_); array ac _all_; retain l1-l&noofcharvars.; array an l1-l&noofcharvars.; do i = 1 to &NoOfCharVars.; an[i]=max(an[i],length(ac[i])); attrib d length = $2000; do i = 1 to &NoOfCharVars.; d = compress( "&var.") " length = $" compress(l&i.) " format = $" compress(l&i.) ". "; call symput ('d', d); %let e = &e. &d.; % data &dsn.; attrib &e.; set ds; proc datasets nolist; delete ds cont ccont; RENAMING VARIABLES Until now our naming scheme was quite handy. But if we want to merge the different data sets, you see there are a lot of duplicate variable names. To make the variable names unique they need an identifier of the data set they come 4

from. In this way we had a complex renaming task, which we accomplished by the following macro adding the data set name as a prefix to the variable name: %macro RenameVars (lib = WORK, dsn =); * adds the data set name as prefix to variable names; * retrieves the number of variables in the data set and the variable names; * from dictionay tables; * lib - library where the data set resides; * dsn - data set name; proc sql noprint; select nvar into :num_vars from dictionary.tables where libname = upcase ("&lib") and memname = upcase ("&dsn"); select distinct (NAME) into :var1-:var%trim(%left (&num_vars)) from dictionary.columns where libname = upcase ("&lib") and memname = upcase ("&dsn"); proc datasets library = &lib; modify &dsn.; rename %do i = 1 %to &num_vars; &&var&i = &dsn._&&var&i. % ; In this way variable F1 in data set tp011 is renamed to tp011_f1. IDENTIFYING THE KEYS The final step before delivering to the statisticians is to merge all the data sets by unique keys. In finding the keys SAS cannot help you. If you have them, find short mnemonic name for them and rename the appropriate variables. Then the merge can start. RESULTS All the techniques we used are not new and were mainly collected from papers of various SAS user groups. The challenge was to apply them consistently to the given complex set of data. CONCLUSION The really difficult questions about the quality of the data, the exact representation of keys, abundant information and more could not be addressed in this paper and will be specific for any new task of preparing Excel data, but after this experience I think we have very valuable tool box to hold the course in the next project. REFERENCES Ravi, Prasad (2003). Renaming All Variables in a SAS Data Set using the Information from PROC SQL s Dictionary Tables, Paper 118-28, SUGI 28 SAS Institute, Inc. (2006). Base SAS 9.1.3 Procedures Guide, Second Edition: The Import Procedure. cary, NC: SAS Institute Inc. CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: Volker Harm Bayer Schering Pharma AG Müllerstraße 178 13342 Berlin Work Phone: +49-30-468-11208 Fax: +49-30-468-91208 Email: volker.harm@bayerhealthcare.com Web: http://www.bayerscheringpharma.de 5

Brand and product names are trademarks of their respective companies. 6