Paper 74881-2011 Creating SAS Datasets from Varied Sources Mansi Singh and Sofia Shamas, MaxisIT Inc, NJ




ABSTRACT

SAS programmers often find themselves dealing with data that comes from multiple sources, usually in different formats. The data must be logically related and converted into SAS data sets before it can be analyzed. Because these sources do not follow a common pattern, this paper serves as a collection of examples illustrating the conversion of data from various sources, such as Extensible Markup Language (XML), comma-separated values (CSV), Microsoft Excel (XLS), or tab-delimited (TXT) files, to SAS data sets.

INTRODUCTION

Data often comes from a variety of sources, and the different formats have to be brought together cohesively before they can be used for further analysis. This responsibility usually falls on the programmer. Every programmer has their own way of tackling the problem, and the best approach depends on the needs of the project and the programmer's preference.

Several tools help programmers convert data from different sources to SAS data sets, such as the IMPORT procedure, the Import Wizard, and the DATA step. Although these tools are useful and widely used, they come with some limitations:

- PROC IMPORT gives little control over the field attributes, because it scans the input file to determine the name, type, and length of each variable automatically.
- The DATA step with INFILE becomes more programming-intensive as the number of variables in the data set grows.

Although it is the more primitive approach, the DATA step with INFILE increases the programmer's control over the data. It allows the programmer to be precise in the variable definitions by specifying variable names and attributes as the file is read through the INPUT statement. Data manipulations can also be done directly within the same DATA step, which cannot be done using PROC IMPORT.
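
As a rough illustration of the contrast described above, the sketch below reads the same file both ways. It is not taken from the paper: the folder stored in PATH, the output data set names, the lengths, the COMMA informat, and the derived variable NET_SALES are all assumptions made for the example, and the file is assumed to contain a header row followed by seven comma-separated fields.

```sas
/* Hypothetical folder holding shoes.csv (assumption for this sketch) */
%let path = C:\data;

/* PROC IMPORT: names, types and lengths are guessed by scanning the file */
proc import datafile="&path\shoes.csv"
            out=work.shoes_guess
            dbms=csv
            replace;
    getnames=yes;        /* take variable names from the first row */
    guessingrows=100;    /* number of rows scanned when guessing attributes */
run;

/* DATA step with INFILE: every attribute is stated explicitly,
   and a derived variable is created while the file is read */
data work.shoes_explicit;
    infile "&path\shoes.csv" dsd firstobs=2 truncover lrecl=32767;
    length region product subsidiary $25;          /* assumed lengths */
    informat sales inventory returns comma12.;     /* accepts values such as $29,761 */
    input region $ product $ subsidiary $ stores sales inventory returns;
    net_sales = sales - returns;                   /* illustrative manipulation */
run;
```

Comparing the PROC CONTENTS output of the two data sets is one way to see how much of the structure PROC IMPORT inferred on its own.
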
The CDISC procedure is used for XML files based on the ODM structure, and it gives the user more control over the metadata content.

In this paper, the DATA step with INFILE will be used to convert:

- a comma-separated values (CSV) file to a SAS data set
- a tab-delimited (TXT) file to a SAS data set
- a Microsoft Excel (XLS) file to a SAS data set

PROC CDISC will be used to convert:

- an Extensible Markup Language (XML) file with ODM structure to a SAS data set

SASHELP.SHOES is used as the data source to illustrate these conversions.

CONVERTING CSV FILE TO SAS DATA SET

A comma-separated values (CSV) file stores tabular data, both numbers and text, in plain-text form. Each line of the file represents a row of the table, and commas separate the columns. CSV files are a common medium of data transfer, especially when dealing with external vendors.

The code used to create the SAS data set is split into three main steps, and the main features of the code are explained along the way. Let's take a look at the CSV file SHOES.CSV as an example, which will later be converted to a SAS data set.
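
For readers who want a file of this kind to experiment with, one possible way to produce it from SASHELP.SHOES is sketched below. This is an assumption about how such test files could be generated, not the authors' procedure, and the folder stored in PATH is hypothetical. Because some variables in SASHELP.SHOES carry formats, the exported values may be written as formatted text.

```sas
/* Hypothetical output folder (assumption for this sketch) */
%let path = C:\data;

/* Comma-separated copy; variable names are written to the first row */
proc export data=sashelp.shoes
            outfile="&path\shoes.csv"
            dbms=csv
            replace;
run;

/* Tab-delimited copy of the same data */
proc export data=sashelp.shoes
            outfile="&path\shoes.txt"
            dbms=tab
            replace;
run;
```
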

STEP I: CREATE THE VARIABLE NAMES

Programmers are familiar with the traditional use of the DATA step and INFILE to read external files. Although the DATA step gives the programmer more control when reading data into SAS, typing all the variable names is tedious, particularly when the data contains many variables. This part of the code illustrates a way of reading the variable names from the file itself.

    *Creating dataset with variable names from csv file;
    data all_attb;
      infile "&path\shoes.csv" pad firstobs=1 obs=1 lrecl=32767;
      *Reading in the line containing variable names;
      length randstr $500;
      *Storing variable names as one long string;
      input @1 randstr $char500.;
      *Creating variables needed in the dataset;
      array a{*} $50 a1-a7;
      do i=1 to dim(a);
        a{i}=scan(randstr,i,',');
      end;
    run;

- The INFILE statement reads the CSV file. The FIRSTOBS= and OBS= options select the row in which the variable names are stored.
- A character variable, RANDSTR, holds the variable names as one long string separated by the file delimiter, which in this case is a comma (,).
- ARRAY creates as many variables as the data set will contain. In this example there are seven variables, so the ARRAY dimension ranges from 1 to 7.
- The SCAN function reads the string stored in RANDSTR. It picks out each name separated by the delimiter and stores it in its own variable. Here seven different variable names are created.

STEP II: MACRO VARIABLES THAT STORE THE INFORMATION FOR INPUT AND ATTRIB STATEMENTS

The attributes now need to be defined for the variable names created in STEP I. Typing all the variable names and their attributes can again be tiresome. This part of the code demonstrates a more efficient way.

    *Creating the strings for INPUT & ATTRIB statements;
    data _null_;
      set all_attb;
      length inpt attb $1000 label $25;
      array a{*} $50 a1-a7;
      do i=1 to dim(a);
        *For character variables;
        if i^=4 then do;
          inpt=trim(inpt)||' '||trim(a{i})||' $ ';
          var=a{i};
          length='length=$25';
          if i=1 then label='label="Region" ';
          else if i=2 then label='label="Product" ';
          else if i=3 then label='label="Subsidiary" ';
          else if i=5 then label='label="Total Sales" ';
          else if i=6 then label='label="Total Inventory" ';
          else if i=7 then label='label="Total Returns" ';
          attb=trim(attb)||' '||trim(var)||' '||trim(length)||' '||trim(label);
        end;
        *For numeric variables;
        else do;
          inpt=trim(inpt)||' '||trim(a{i})||' ';
          var=a{i};
          length='length=8';
          if i=4 then label='label="Number of Stores" ';
          attb=trim(attb)||' '||trim(var)||' '||trim(length)||' '||trim(label);
        end;
      end;
      *Storing the finished strings in macro variables;
      call symput("inpt", trim(inpt));
      call symput("attb", trim(attb));
    run;

- INPT collects the variable names for the INPUT statement, together with the type identifier ($) for the character variables.
- For each variable defined by the ARRAY, a variable containing the length information and a variable containing the label information are created using IF-ELSE logic. These are concatenated to build the information needed for the ATTRIB statement.
- The same logic is then repeated for the numeric variable.
- The strings built for the INPUT and ATTRIB statements are converted into the macro variables INPT and ATTB, which are used in the next step.

STEP III: READ IN DATA FROM CSV FILE

This part of the code uses the information created in STEP I and STEP II to read in the data.

    *Read in all the data from csv file;
    data shoes;
      infile "&path\shoes.csv" delimiter=',' pad missover firstobs=2 lrecl=32767;
      attrib &attb;
      input &inpt;
    run;

This INFILE statement reads the data from the CSV file. The FIRSTOBS= option points to the row where the data begins. The macro variables INPT and ATTB created in the previous step supply the INPUT and ATTRIB statements that create the final SAS data set.
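
STEP I hard-codes seven array elements. Where the number of columns is not known in advance, one possible variation (an assumption, not part of the original code) is to count the delimited names in the header row first and keep the count in a macro variable. The folder stored in PATH, the macro variable NVARS, and the %PUT message are illustrative.

```sas
/* Hypothetical folder holding shoes.csv (assumption for this sketch) */
%let path = C:\data;

data _null_;
    /* Read only the header row */
    infile "&path\shoes.csv" firstobs=1 obs=1 lrecl=32767 truncover;
    length randstr $500;
    input randstr $char500.;
    /* Count the comma-separated names in the header */
    nvars = countw(randstr, ',');
    /* Store the count, with blanks stripped, in a macro variable */
    call symputx('nvars', nvars);
run;

%put NOTE: header row contains &nvars variable names.;
```

With the count available, the array could be declared as `array a{*} $50 a1-a&nvars;` instead of a fixed `a1-a7`. The IF-ELSE logic in STEP II that assigns labels by position would still need to be adapted to the actual columns.
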

CONVERTING TXT FILE TO SAS DATA SET

A tab-delimited (TXT) file is a plain-text file that uses a tab character as the separator between data fields. Each line of the text file is one record of the data table. TXT is a widely supported file format that is often used to move data between sources.

The code used to create the SAS data set is split into three main steps, and the main features of the code are explained along the way. Let's take a look at the TXT file SHOES.TXT as an example, which will later be converted to a SAS data set.

STEP I: CREATE THE VARIABLE NAMES

This step is similar to STEP I of the CSV file conversion process. The only difference is the delimiter defined for the TXT file.

    *Creating dataset with variable names from txt file;
    data all_attb;
      infile "&path\shoes.txt" dsd firstobs=1 obs=1 lrecl=32767;
      ...
      *Creating variables needed in the dataset;
      array a{*} $50 a1-a7;
      do i=1 to dim(a);
        a{i}=scan(randstr,i,'09'x);
      end;
    run;

- The INFILE statement reads the TXT file. The FIRSTOBS= and OBS= options select the row in which the variable names are stored.
- The SCAN function reads the string created in the previous statements. It picks out each name separated by the delimiter, which here is the tab character ('09'x).

STEP II: MACRO VARIABLES THAT STORE THE INFORMATION FOR INPUT AND ATTRIB STATEMENTS

This step is similar to STEP II of the CSV file conversion process.

STEP III: READ IN DATA FROM TXT FILE

This part of the code uses the information created in STEP I and STEP II to read in the data.

    *Read in all the data from txt file;
    data shoes;
      infile "&path\shoes.txt" delimiter='09'x dsd missover firstobs=2 lrecl=32767;
      attrib &attb;
      input &inpt;
    run;

This INFILE statement reads the data from the TXT file. The FIRSTOBS= option points to the row where the data begins. The macro variables INPT and ATTB created in the previous step supply the INPUT and ATTRIB statements that create the final SAS data set.

CONVERTING XLS FILE TO SAS DATA SET

A Microsoft Excel (XLS) file is a spreadsheet workbook stored in a binary format rather than plain text. When its contents are read through Dynamic Data Exchange (DDE), as in this section, the cells of each row arrive separated by tab characters. Excel is a widely available application for capturing and transmitting data. The XML-based Excel format is not used for the examples in this paper.

The code used to create the SAS data set is split into three main steps, and the main features of the code are explained along the way. Let's take a look at the XLS file SHOES.XLS as an example, which will later be converted to a SAS data set.

STEP I: CREATE THE VARIABLE NAMES

The overall method is similar to STEP I of the CSV file conversion process, but additional steps are needed for XLS files to read the variable names correctly. The delimiter defined for the XLS file is also different.

    *Creating dataset with variable names from xls file;
    options noxwait noxsync;
    x "start excel";
    filename runxls dde 'excel|system';
    data _null_;
      file runxls;
      put '[file-open("location\shoes.xls")]';
    run;
    *Creating xls file reference;
    filename dbxls dde "excel|location\shoes.xls!shoes" notab;
    data all_attb;
      infile dbxls dsd pad dlm='09'x missover firstobs=1 obs=1 lrecl=32767;
      *Reading in the line containing variable names;
      length randstr $100;
      *Storing variable names as one long string;
      input randstr $100.;
      *Creating variables needed in the dataset;
      array a{*} $50 a1-a7;
      do i=1 to dim(a);
        a{i}=scan(randstr,i,'09'x);
      end;
    run;

- The X command starts an Excel session. With the NOXWAIT and NOXSYNC options, control returns to SAS without waiting for the command window to be closed, so the current SAS session stays open.
- The first FILENAME statement, with the DDE option, creates a file reference to the Excel application itself.
- The PUT statement in the DATA _NULL_ step sends the command that opens SHOES.XLS.
- The second FILENAME statement, with the DDE option, creates a file reference for the data in SHOES.XLS. The NOTAB option leaves the tab characters between cells in place so that they can serve as delimiters.
- The INFILE statement reads the XLS file. The FIRSTOBS= and OBS= options select the row in which the variable names are stored.
- A character variable, RANDSTR, holds the variable names as one long string separated by the file delimiter, which in this case is the tab character.
- ARRAY creates as many variables as the data set will contain. In this example there are seven variables, so the ARRAY dimension ranges from 1 to 7.
- The SCAN function reads the string stored in RANDSTR. It picks out each name separated by the delimiter and stores it in its own variable. Here seven different variable names are created.

STEP II: MACRO VARIABLES THAT STORE THE INFORMATION FOR INPUT AND ATTRIB STATEMENTS

This step is similar to STEP II of the CSV file conversion process.

STEP III: READ IN DATA FROM XLS FILE

This part of the code uses the information created in STEP I and STEP II to read in the data.

    *Read in all the data from xls file;
    data shoes;
      infile dbxls dsd pad dlm='09'x missover firstobs=2 lrecl=32767;
      attrib &attb;
      input &inpt;
    run;

This INFILE statement reads the data from the XLS file. The FIRSTOBS= option points to the row where the data begins. The macro variables INPT and ATTB created in the previous step supply the INPUT and ATTRIB statements that create the final SAS data set.

CONVERTING XML FILE TO SAS DATA SET

Extensible Markup Language (XML) is a tag-based language. XML files are gaining popularity as a medium of data exchange between dissimilar platforms. They are designed only to store and transport information and are not used for any kind of data manipulation. Transferring data from an XML file requires identifying the primary keys and the data set variable names in order to create a SAS data set. If the XML file has an arbitrary structure (not ODM), an XMLMap is needed in the LIBNAME reference, and PROC CDISC may not work for such files.

In this paper, an XML file based on the ODM structure is used as an example. Let's take a look at the XML file SHOES.XML as an example, which will later be converted to a SAS data set using PROC CDISC.
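
For the non-ODM case mentioned above, a minimal sketch of a LIBNAME reference that uses an XMLMap might look as follows. It is an illustration only: the map file SHOES.MAP is hypothetical and would have to be written for the actual file structure, the table name SHOES is assumed to be defined in that map, and the librefs are invented for the example. Depending on the SAS release and the XMLMap version, the XMLV2 engine may be required instead of XML.

```sas
/* Hypothetical XMLMap describing how the XML elements map to columns */
filename shoesmap "location\shoes.map";

/* Read-only library that interprets the XML file through the map */
libname shoesxml xml "location\shoes.xml" xmlmap=shoesmap access=readonly;

/* Copy the mapped table into an ordinary SAS data set */
data work.shoes;
    set shoesxml.shoes;   /* table name as defined in the XMLMap (assumed) */
run;

/* Release the XML library when finished */
libname shoesxml clear;
```
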

Here, PROC CDISC is used to produce a SAS data set from the XML file. It creates the columns in the data set and formats them according to the attributes defined in the ODM metadata. The main features of the code are explained below.

    *Using PROC CDISC to get data from XML file;
    filename xmlinput "location\shoes.xml";
    libname fmtlib "location";
    libname outlib "location";
    proc cdisc model=odm
               read=xmlinput
               formatactive=yes
               formatnoreplace=no
               formatlibrary=fmtlib;
      odm odmversion="1.2"
          odmminimumkeyset=no
          longnames=yes;
      clinicaldata out=outlib.outdata
                   sasdatasetname='shoes';
    run;

- The FILENAME statement assigns the file reference to the physical location of the XML file.
- The LIBNAME statements define the location of the format library and the location where the final output data set will be stored.
- The MODEL= parameter names the supported CDISC model (ODM).
- The READ= parameter specifies the source XML file through the file reference assigned in the FILENAME statement.
- The FORMAT parameters define how the formats associated with the variables are handled.
- The ODMVERSION= parameter defines the version of the ODM model. The current version supported is 1.2.
- ODMMINIMUMKEYSET=NO imports all KEYSET fields into the SAS data set.
- LONGNAMES=YES allows variable names to be up to 32 characters long; blanks will be replaced with an underscore (_).
- In the CLINICALDATA statement, OUT= specifies the name of the final SAS data set that is created, and SASDATASETNAME= identifies the data set in the ODM file to read.

CONCLUSION

One of the fundamental strengths of SAS is its ability to convert raw data into analytical information that can be used for further analysis. SAS has the flexibility to read data from practically any format, but receiving data from different sources and combining it can be a time-consuming process. The examples above can simplify this tedious process. They can save a significant amount of time and help automate part of the process, thereby reducing the likelihood of errors.

ACKNOWLEDGEMENT

Our gratitude goes to Carey Smoak and Mario Widel at Roche Molecular Diagnostics for providing us with invaluable comments regarding this paper.

CONTACT INFORMATION

Your comments and questions are highly valued and encouraged. Contact the authors at:

Mansi Singh
MaxisIT Inc.
0 Main Street
Metuchen, NJ 0880
mansi.singh@maxisit.com

Sofia Shamas
MaxisIT Inc.
0 Main Street
Metuchen, NJ 0880
sofia.shamas@maxisit.com

TRADEMARKS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.