STAT 3304/5304 Introduction to Statistical Computing. Understanding DATA Step Processing

Source of Data

The DATA step's general function is to get data into shape for later PROCs and DATA steps. SAS PROCs can only read SAS data sets, but we might have some other type of file to process. You can use a DATA step to read raw data into a SAS data set from multiple sources:

- In-stream data: CARDS / DATALINES with INPUT
- External file: INFILE with INPUT
- Database management system (DBMS): SAS/ACCESS to the DBMS (Oracle, SQL, etc.)
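
As a minimal sketch of the first source, the step below reads in-stream data with DATALINES. The data set name work.bp and the values are made up for illustration.

```sas
/* Read in-stream data: the records follow the DATALINES statement. */
data work.bp;
    input ID SBP DBP;      /* three numeric variables, list input */
    datalines;
1 120 80
2 130 70
;
run;

/* Later PROCs read the resulting SAS data set, not the raw records. */
proc print data=work.bp;
run;
```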

Understanding DATA Step

Understanding how the program operates can help you to anticipate how variables will be created and processed, to plan your modifications, and to interpret and debug program errors. It also gives you useful strategies for preventing and correcting common DATA step errors. To read a raw data file, the DATA step must give the following instructions to the SAS system:

- reference the external text file to be read
- name the SAS data set
- identify the external file
- describe the data values to be read
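
The four instructions could map onto SAS statements roughly as follows. This is a sketch under assumptions: the fileref rawbp, the file path, and the variable layout are hypothetical, and the pairing of each instruction with one statement is an interpretation of the list above.

```sas
/* 1. Reference the external text file (hypothetical path). */
filename rawbp 'C:\mydata\bp.txt';

/* 2. Name the SAS data set to create. */
data work.bp;
    /* 3. Identify the external file through its fileref. */
    infile rawbp;
    /* 4. Describe the data values to be read (assumed layout). */
    input ID SBP DBP SEX $ AGE WT;
run;
```

The step only runs if a file with that layout exists at the given path.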

SAS DATA step

By definition, a SAS data set has a built-in descriptor that keeps track of the names and attributes of each of the data set's columns, so that later steps don't have to remember as many details. In the DATA step, we don't always have well-defined data, and the DATA step gives us the power to read and write virtually any kind of file and to do calculations on a single row of data at a time. When you submit a DATA step, SAS processes it and then creates a new SAS data set. As in many computer languages, the DATA step is first processed by a compiler, and the compiled program is then executed.
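
One way to look at a descriptor is PROC CONTENTS. The sketch below assumes the sample data set SASHELP.CLASS that ships with SAS is available; any existing data set would do.

```sas
/* Print the descriptor portion: variable names, types, lengths, */
/* and any formats or labels, without printing the data values.  */
proc contents data=sashelp.class;
run;
```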

SAS DATA step

A SAS DATA step is processed in two phases: a compilation phase and an execution phase. During compilation, each statement is scanned for syntax errors. Most syntax errors prevent further processing of the DATA step. If the DATA step compiles successfully, then the execution phase begins.

SAS DATA step

A DATA step executes once for each observation in the input data set, unless otherwise directed.

[Diagram: flow of DATA step processing for reading raw data.]

Compilation phase

Input buffer: an area of memory created to hold a record from the external file. The input buffer is created only when raw data is read, not when a SAS data set is read. Then the Program Data Vector (PDV) is created. The PDV is the area of memory where SAS builds a data set, one observation at a time.
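
To see the two memory areas side by side, the sketch below writes both to the log. It relies on the automatic variable _INFILE_, which is not mentioned in the slides; it holds the current contents of the input buffer. The data values are made up.

```sas
data _null_;
    input ID SBP;
    put 'Input buffer: ' _infile_;   /* the raw record as read */
    put 'PDV values:   ' ID= SBP=;   /* the values after INPUT parsed it */
    datalines;
1 120
2 130
;
run;
```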

Compilation phase

During the compile phase, SAS takes the following steps:

1. SAS creates a program data vector (PDV) containing the automatic variables _N_ and _ERROR_.
2. SAS scans each statement in the DATA step looking for syntax errors, such as missing semicolons and invalid statements.
3. When SAS compiles the INPUT statement, SAS adds a position to the PDV for each variable named in it. When reading raw data, the variable names and attributes, such as type and length, come from how each variable is specified in the INPUT statement (or in an earlier statement such as LENGTH); when reading a SAS data set, SAS gets them from the descriptor of the input data set.
4. SAS also adds a position to the PDV for any variables that are created in the DATA step. The attributes of each of these variables are determined by the expression in the statement.
5. SAS completes the compile phase at the bottom of the DATA step, and it is then that SAS makes the descriptor portion of the SAS data set.

The output data set does not yet contain any observations, because SAS has not yet begun executing the program. When the compile phase is complete, SAS starts the execution phase.
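
A common idiom makes the compile-time work visible. In the sketch below (assuming the sample data set SASHELP.CLASS is available; the output name work.shell is made up), STOP ends the step before SET reads anything, yet the output data set still has all of the variables because the descriptor was built during compilation.

```sas
data work.shell;
    stop;                  /* executes first, so no observation is read or written */
    set sashelp.class;     /* never executes, but is compiled: its variables enter the PDV */
run;

/* The log should report 0 observations; PROC CONTENTS lists the variables. */
proc contents data=work.shell;
run;
```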

Logical Program Data Vector (PDV)

The DATA step refines data, and as such, a second memory area is needed for:

- Inputting and input formatting (informatting) desired variables
- Revising existing values
- Computing new variables
- System indicators and flags

This second area in memory is called the Logical Program Data Vector (PDV). When the compiler processes the DATA step, it needs to define a slot for each variable referenced in the program.

Logical Program Data Vector (PDV)

These PDV slots will be defined in the order referenced in the program, and each variable has the following attributes:

- Relative variable number
- Position in the data set
- Name
- Data type
- Length in bytes
- Informat
- Format
- Variable label
- Flags to indicate dropping and retaining of variables
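
Several of these attributes can be set explicitly in the DATA step. The following is a small sketch with made-up names and values; PROC CONTENTS with the VARNUM option then lists the variables in creation order.

```sas
data work.attrs;
    length name $ 12;              /* type and length; name is referenced first */
    input name $ weight;           /* weight is numeric with the default length of 8 */
    wtkg = weight / 2.2;           /* new variable created by the assignment */
    format wtkg 6.1;               /* format attribute */
    label wtkg = 'Weight in kg';   /* label attribute */
    drop weight;                   /* drop flag: stays in the PDV, is not written out */
    datalines;
Alice 115
Bob 180
;
run;

proc contents data=work.attrs varnum;
run;
```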

Logical Program Data Vector (PDV)

The program data vector contains two automatic pseudo-variables that can be used for processing but which are not written to the data set as part of an observation.

- _N_ counts the number of times that the DATA step begins to execute.
- _ERROR_ signals the occurrence of an error that is caused by the data during execution. The default value is 0, which means there is no error. When one or more errors occur, the value is set to 1.
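
A short sketch of both automatic variables in use. The second record is deliberately invalid for a numeric variable, so _ERROR_ should be 1 on that iteration and 0 on the others; the data set name and values are made up.

```sas
data work.flags;
    input x;
    put _n_= _error_= x=;   /* write the iteration counter and error flag to the log */
    datalines;
10
abc
30
;
run;

/* work.flags contains only x; _N_ and _ERROR_ are not written to it. */
proc print data=work.flags;
run;
```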

Syntax Checking

During the compilation phase, SAS also scans each statement in the DATA step, looking for syntax errors. Syntax errors include:

- missing or misspelled keywords
- invalid variable names
- missing or invalid punctuation
- invalid options

Execution phase

Example:

    data temp;
        infile datalines;
        input ID SBP DBP SEX $ AGE WT;
        wtkg = WT/2.2;
        datalines;
    1 120 80 M 15 115
    2 130 70 F 25 180
    3 140 100 M 89 170
    4 120 80 F 30 150
    5 125 80 F 20 110
    ;
    run;

During the execution phase, SAS takes the following steps:

1. The DATA step executes once for each observation in the input data set (here, once for each raw data record).
2. Initializing variables: the value of _N_ is 1 and the value of _ERROR_ is 0 because there are no data errors. The remaining variables are initialized to missing.
3. INFILE statement: the INFILE statement identifies the location of the raw data.
4. INPUT statement: the INPUT statement reads a record into the input buffer. Then the raw data is read and assigned to the program data vector.
5. The assignment statement executes to compute the first value of wtkg.
6. At the end of the first iteration of the DATA step, the values in the program data vector are written to the output data set temp as the first observation.
7. The value of the automatic variable _N_ is increased to 2, and control returns to the top of the DATA step. The automatic variable _ERROR_ retains its value of 0, since SAS has still not encountered an error. All other variable values are reset to missing.
8. As the INPUT statement executes, the values from the second record are written to the program data vector.
9. The assignment statement executes again to compute the value of wtkg for the second observation.
10. At the bottom of the DATA step, the values in the program data vector are written to the output data set temp as the second observation.
11. The DATA step works like a loop, repetitively executing statements to read data values and create observations one by one.
12. The execution phase continues in this manner until the end-of-file marker is reached in the raw data file.
13. When there are no more records in the raw data file to be read, the data portion of the new data set is complete.
14. At the end of the execution phase, the SAS log confirms that the raw data file was read, and it displays the number of observations and variables in the data set.
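
The loop described above can be traced in the log by writing the PDV at the top and bottom of each iteration. This is a sketch that reuses the example data with two added PUT statements; the message text is arbitrary.

```sas
data temp;
    put 'Top of iteration:    ' _all_;   /* computed variables are missing again here */
    input ID SBP DBP SEX $ AGE WT;
    wtkg = WT / 2.2;
    put 'Bottom of iteration: ' _all_;   /* values about to be written to temp */
    datalines;
1 120 80 M 15 115
2 130 70 F 25 180
3 140 100 M 89 170
4 120 80 F 30 150
5 125 80 F 20 110
;
run;
```

The first PUT should run one more time than the second, because the step begins a final iteration before INPUT finds that no records remain.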

Debugging a DATA step

Diagnosing Errors in the Compilation Phase: Many errors are detected during the compilation phase, including

- misspelled keywords and data set names
- missing semicolons
- unbalanced quotation marks
- invalid options.

During the compilation phase, SAS can interpret some syntax errors (such as the keyword DATA misspelled as DAAT).
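
For illustration only, the two steps below contain deliberate mistakes; the data set names are made up. In the first, SAS can usually work out the intended keyword, log a warning, and still run the step. In the second, the statement cannot be interpreted, so an ERROR is expected in the log and the step does not execute.

```sas
/* Deliberate misspelling of DATA: typically a warning, and the step still runs. */
daat work.typo;
    x = 1;
run;

/* Deliberate syntax error: the assignment has no expression on the right. */
data work.bad;
    x = ;
run;
```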

Debugging a DATA step

If SAS cannot interpret the error, SAS

- prints the word ERROR followed by an error message in the log
- compiles but does not execute the step where the error occurred, and prints the following message to warn you:

    NOTE: The SAS System stopped processing this step because of errors.

Some errors are explained fully by the message that SAS prints; other error messages are not as easy to interpret.

Debugging a DATA step

As you have seen, errors can occur in the compilation phase, resulting in a DATA step that is compiled but not executed. Errors can also occur during the execution phase. When SAS detects an error in the execution phase, the following can occur, depending on the type of error:

- A note, warning, or error message is displayed in the log.
- The values that are stored in the program data vector are displayed in the log.
- The processing of the step either continues or stops.

Debugging a DATA step

When no observations are written to the data set, you should check to see whether your DATA step was completely executed. Most likely, a syntax error or another error is being detected at the beginning of the execution phase. An invalid data message indicates that the program executed, but the data is not acceptable. Typically, the message indicates that a variable's type has been incorrectly identified in the INPUT statement, or that the raw data file contains some invalid data value(s).

Testing Your Programs

Writing a _NULL_ Data Set

After you write or edit a DATA step, you can compile and execute your program without creating a data set. This enables you to detect the most common errors and saves you development time. A simple way to test a DATA step is to specify the keyword _NULL_ as the data set name in the DATA statement. When you submit the DATA step, no data set is created, but any compilation or execution errors are written to the log after the values of the variables are read and verified. After correcting any errors, you can replace _NULL_ with the name of the data set that you want to create.

Limiting Observations

Remember that you can use the OBS= option in the INFILE statement to limit the number of observations that are read or created during the execution of the DATA step.
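
Both testing techniques can be combined. The sketch below reuses the example data from the execution phase: _NULL_ suppresses the output data set, and OBS=3 stops reading after the third record.

```sas
data _null_;                      /* no data set is created */
    infile datalines obs=3;       /* read only the first three records */
    input ID SBP DBP SEX $ AGE WT;
    wtkg = WT / 2.2;
    datalines;
1 120 80 M 15 115
2 130 70 F 25 180
3 140 100 M 89 170
4 120 80 F 30 150
5 125 80 F 20 110
;
run;
```

Once the log is clean, replacing _NULL_ with a data set name and removing OBS=3 turns the test into the real step.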

Testing Your Programs

PUT statement

When the source of program errors is not apparent, you can use the PUT statement to examine variable values and to print your own message in the log. For diagnostic purposes, you can use IF-THEN/ELSE statements to conditionally check for values.

General form, simple PUT statement:

    PUT specification(s);

where each specification specifies what is written, how it is written, and where it is written. This can include

- a character string
- one or more data set variables
- the automatic variables _N_ and _ERROR_
- the automatic variable _ALL_
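
A sketch of conditional diagnostics using each kind of specification. The thresholds and message text are arbitrary choices for illustration, and the data is the example used earlier.

```sas
data _null_;
    input ID SBP DBP SEX $ AGE WT;

    /* Character string plus named data set variables */
    if SBP > 130 then put 'High SBP: ' ID= SBP=;
    else if SBP < 90 then put 'Low SBP: ' ID= SBP=;

    /* Automatic variable _N_ identifies the iteration */
    if AGE > 80 then put 'Check age: ' _n_= AGE=;

    /* _ALL_ writes every variable, including _N_ and _ERROR_ */
    if SEX not in ('M', 'F') then put 'Unexpected SEX: ' _all_;
    datalines;
1 120 80 M 15 115
2 130 70 F 25 180
3 140 100 M 89 170
4 120 80 F 30 150
5 125 80 F 20 110
;
run;
```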