LEADS AND LAGS IN SAS


Mark Keintz, Wharton Research Data Services, University of Pennsylvania

ABSTRACT

Analysis of time series data often requires use of lagged (and occasionally lead) values of one or more analysis variables. For the SAS user, the central operational task is typically getting lagged (lead) values for each time point in the data set. While SAS has long provided a LAG function, it has no analogous lead function, an especially significant problem in the case of large data series. This paper will (1) review the LAG function, in particular the non-intuitive implications of its queue-oriented basis, and (2) demonstrate efficient ways to generate leads without the common recourse to data re-sorting.

INTRODUCTION

SAS has a LAG function (and a related DIF function) intended to provide data values from preceding records in a data set. The task may be as simple as retrieving last month's price in a monthly stock price data set (a regular time series), or as variable as finding the most recent sales volume of a product that sold only occasionally (an irregular time series). This presentation shows the benefit of the queue-management orientation of the LAG function in addressing time series that are simple, sorted, or irregular. In the absence of a lead function, it also shows simple SAS scripts to produce lead values under similar conditions.

THE LAG FUNCTION: RETRIEVING HISTORY

The term "lag function" suggests the retrieval of data by looking back some user-specified number of periods or observations. For instance, consider the task of producing 1-month and 3-month price returns for this monthly stock price file (created from the sashelp.stocks data; see the appendix for creation of data set SAMPLE1):

Table 1. Five Rows From SAMPLE1 (re-sorted from sashelp.stocks)

Obs  DATE     STOCK  CLOSE
  1  01AUG86  IBM    $138.75
  2  02SEP86  IBM    $134.50
  3  01OCT86  IBM    $123.62
  4  03NOV86  IBM    $127.12
  5  01DEC86  IBM    $120.00

The program below uses the LAG and LAG3 (3-record lag) functions to compare the current close to its immediate predecessor and its third prior predecessor:

Example 1: Simple Creation of Lagged Values

    data example1;
      set sample1;
      close_1=lag(close);     /* previous record's close        */
      close_3=lag3(close);    /* close from three records back  */
      if close_1 ^=. then return_1 = close/close_1 - 1;
      if close_3 ^=. then return_3 = close/close_3 - 1;
    run;

which yields the following data in the first 5 rows:

Table 2. Example 1 Results from Using LAG and LAG3

Obs  DATE     STOCK  CLOSE    CLOSE_1  CLOSE_3  RETURN_1  RETURN_3
  1  01AUG86  IBM    $138.75        .        .         .         .
  2  02SEP86  IBM    $134.50   138.75        .    -0.031         .
  3  01OCT86  IBM    $123.62   134.50        .    -0.081         .
  4  03NOV86  IBM    $127.12   123.62   138.75     0.028    -0.084
  5  01DEC86  IBM    $120.00   127.12   134.50    -0.056    -0.108

LAGS ARE QUEUES, NOT LOOK BACKS

At this point LAG functions have all the appearance of simply looking back by one (or 3) observations, with the additional feature of imputing missing values when looking back beyond the beginning of the data set. But actually the LAG function instructs SAS to construct a FIFO (first-in/first-out) queue with (1) as many entries as the length of the lag period, and (2) the queue elements initialized to missing values. Every time the LAG function is executed, the oldest entry is retrieved (and removed) from the queue and a new entry is added. The significance of this distinction becomes evident when the LAG function is executed conditionally, as in the treatment of BY groups below.

As an illustration, consider observations 232 through 237 generated by the Example 1 program and presented in Table 3. This shows the last two cases for IBM and the first four for Intel. For the first Intel observation (obs 234), the lagged value of the closing stock price (CLOSE_1=82.20) is taken from the IBM series. Of course, it should be a missing value, as should every lagged value that reaches back from Intel into IBM: CLOSE_1 and RETURN_1 for obs 234, and CLOSE_3 and RETURN_3 for obs 234 through 236.

Table 3. The Problem of BY Groups for Lags

Obs  DATE     STOCK  CLOSE   CLOSE_1  CLOSE_3  RETURN_1  RETURN_3
232  01NOV05  IBM    $88.90    81.88    80.62     0.086     0.103
233  01DEC05  IBM    $82.20    88.90    80.22    -0.075     0.025
234  01AUG86  Intel  $23.00    82.20    81.88    -0.720    -0.719
235  02SEP86  Intel  $19.50    23.00    88.90    -0.152    -0.781
236  01OCT86  Intel  $20.25    19.50    82.20     0.038    -0.754
237  03NOV86  Intel  $23.00    20.25    23.00     0.136     0.000

The natural way to address this problem is to use a BY statement in SAS and avoid executing lags when the observation in hand is the first for a given stock. Example 2 is such a program (dealing with CLOSE_1 only for illustration purposes), and its results are in Table 4.

Example 2: A naïve implementation of lags for BY groups

    data example2;
      set sample1;
      by stock;
      /* LAG runs only when this is NOT the first record of a stock */
      if first.stock=0 then close_1=lag(close);
      else close_1=.;
      if close_1 ^=. then return_1 = close/close_1 - 1;
      format ret: 6.3;
    run;

Table 4. Example 2 Results of "if first.stock=0 then close_1=lag(close)"

Obs  STOCK  DATE     CLOSE   CLOSE_1  RETURN_1
232  IBM    01NOV05  $88.90    81.88     0.086
233  IBM    01DEC05  $82.20    88.90    -0.075
234  Intel  01AUG86  $23.00        .         .
235  Intel  02SEP86  $19.50    82.20    -0.763
236  Intel  01OCT86  $20.25    19.50     0.038
237  Intel  03NOV86  $23.00    20.25     0.136

This fixes the first Intel record, setting both CLOSE_1 and RETURN_1 to missing values. But look at the second Intel record (obs 235). CLOSE_1 has a value of 82.20, taken not from the first Intel record but from the last IBM record, generating an erroneous value for RETURN_1 as well. In other words, CLOSE_1 did not come from the prior record; it came from the queue, whose contents were most recently updated prior to the first Intel record.

LAGS FOR BY GROUPS

The usual fix is to execute the lag unconditionally and then reset the result when necessary. Example 3 shows just such a solution (described by Howard Schreier; see References). It uses the IFN function instead of an IF statement because IFN executes the embedded lag regardless of the status of first.stock (the condition being tested).

IFN keeps the lagged value only when the tested condition is true. Accommodating BY groups for lags longer than one period requires comparing lagged values of the BY-variable to the current values (lag3(stock)=stock).

Example 3: Robust implementation of lags for BY groups

    data example3;
      set sample1;
      by stock;
      /* IFN evaluates every argument, so each LAG queue is updated on every record */
      close_1 = ifn(first.stock=0,lag(close),.);
      close_3 = ifn(lag3(stock)=stock,lag3(close),.);
      if close_1 ^=. then return_1 = close/close_1 - 1;
      if close_3 ^=. then return_3 = close/close_3 - 1;
      format ret: 6.3;
    run;

The data set EXAMPLE3 now has missing values for the appropriate records at the start of the Intel monthly records.

Table 5. Result of Robust Lag Implementation for BY Groups

Obs  STOCK  DATE     CLOSE   CLOSE_1  CLOSE_3  RETURN_1  RETURN_3
232  IBM    01NOV05  $88.90    81.88    80.62     0.086     0.103
233  IBM    01DEC05  $82.20    88.90    80.22    -0.075     0.025
234  Intel  01AUG86  $23.00        .        .         .         .
235  Intel  02SEP86  $19.50    23.00        .    -0.152         .
236  Intel  01OCT86  $20.25    19.50        .     0.038         .
237  Intel  03NOV86  $23.00    20.25    23.00     0.136     0.000

MULTIPLE LAGS MEAN MULTIPLE QUEUES

While BY groups benefit from avoiding conditional execution of lags, there are times when conditional lags are the best solution. The data set SALES below (see the Appendix for generation of SALES) holds monthly sales by product, but a given product-month combination is present only when sales are reported. As a result each month has a varying number of records, depending on which product sales were reported. Table 6 shows 5 records for January, but fewer for February. Unlike the data in Sample 1, this time series is not regular, so comparing (say) sales of product B in January (observation 2) to February (7) would imply a LAG5, while comparing March (10) to February would need a LAG3.
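The mismatch between record distance and product history can be seen on a toy case. The sketch below is not from the paper: the data set MINI_SALES, its four invented records, and the variables PREV_SALES, PREV_PRODUCT and SAME_PRODUCT are all hypothetical, chosen only to show what an ordinary one-record LAG returns on an irregular series.

```sas
/* Hypothetical four-record irregular series (values invented for illustration) */
data mini_sales;
  input month product $ sales;
  datalines;
1 A 10
1 B 20
2 B 25
3 A 12
;
run;

/* An unconditional one-record lag returns the previous record, whatever its product */
data fixed_lag;
  set mini_sales;
  prev_sales   = lag(sales);
  prev_product = lag(product);
  same_product = (prev_product = product);  /* 1 only when the lag happens to match */
run;

proc print data=fixed_lag noobs;
run;
```

In this toy series only the third record is preceded by a record for the same product, so SAME_PRODUCT is 0 on the other three rows. No single fixed lag length pairs every record with its own product's previous record.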

Table 6. Data Set SALES (first 12 obs)

OBS  MONTH  PROD  SALES
(cell values not recoverable from the source)

The solution to this task is to use conditional lags, with one queue for each product. The program to do so is surprisingly simple:

Example 4: LAGs for Irregular Time Series

    data example4;
      set sales;
      /* Each WHEN clause has its own LAG calls, hence its own queues.  */
      /* The numerator is parenthesized so the whole change is divided. */
      select (product);
        when ('A') change_rate=(sales-lag(sales))/(month-lag(month));
        when ('B') change_rate=(sales-lag(sales))/(month-lag(month));
        when ('C') change_rate=(sales-lag(sales))/(month-lag(month));
        when ('D') change_rate=(sales-lag(sales))/(month-lag(month));
        otherwise;
      end;
    run;

Example 4 generates a separate set of queues for each of the products A through D (within each WHEN clause, one LAG queue for SALES and one for MONTH). Because a given queue is updated only when the specified product is in hand (the WHEN clauses), the output of the LAG function must come from an observation having the same PRODUCT as the current observation, no matter how far back it may be in the data set.

LEADS: HOW TO LOOK AHEAD

SAS does not offer a lead function. As a result many SAS programmers sort a data set in descending order and then apply lag functions to create lead values. Often the data are sorted a second time, back to original order, before any analysis is done. In the case of large data sets, this is a costly technique.
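For reference, the sort-based approach just described might look like the sketch below. It is an illustration rather than code from the paper: it assumes the stock data set is named SAMPLE1 with variables STOCK, DATE and CLOSE, and the names SAMPLE_DESC, LEAD_BY_SORT and LEAD1 are hypothetical.

```sas
/* Sketch of the sort / lag / re-sort approach; assumes SAMPLE1 has STOCK, DATE and CLOSE */
proc sort data=sample1 out=sample_desc;
  by stock descending date;
run;

data lead_by_sort;
  set sample_desc;
  /* In descending date order the "previous" record is the following month.    */
  /* IFN evaluates both LAG calls on every record, so the queues stay in step. */
  lead1 = ifn(lag(stock)=stock, lag(close), .);
run;

/* Second sort, back to the original order */
proc sort data=lead_by_sort;
  by stock date;
run;
```

The two PROC SORT steps are where the cost lies: each one reads and rewrites the entire data set just to obtain a single look-ahead column.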

There is a much simpler way, through the use of extra SET statements in combination with the FIRSTOBS parameter and other elements of the SET statement. The following program generates both a one-month and a three-month lead of the data from Sample 1.

Example 5: Simple generation of one-record and three-record leads

    data lead_example5;
      set sample1;
      if eof1=0 then
        set sample1 (firstobs=2 keep=close rename=(close=lead1)) end=eof1;
      else lead1=.;
      if eof3=0 then
        set sample1 (firstobs=4 keep=close rename=(close=lead3)) end=eof3;
      else lead3=.;
    run;

The use of multiple SET statements to generate leads makes use of two features of the SET statement and two data set name parameters:

SET Feature 1: Multiple SET statements reading the same data set do not read interleaved records. Instead they produce separate streams of data. It is as if the three SET statements above were reading from three different data sets. In fact the log from Example 5 displays these notes reporting three incoming streams of data:

    NOTE: There were 699 observations read from the data set WORK.SAMPLE1.
    NOTE: There were 698 observations read from the data set WORK.SAMPLE1.
    NOTE: There were 696 observations read from the data set WORK.SAMPLE1.

Data Set Name Parameter 1 (FIRSTOBS): Using the FIRSTOBS= parameter provides a way to look ahead in a data set. For instance the second SET has FIRSTOBS=2 (the third has FIRSTOBS=4), so that it starts reading from record 2 (4) while the first SET statement is reading from record 1. This provides a way to synchronize leads with any given current record.

Data Set Name Parameter 2 (RENAME and KEEP): All three SET statements read in the same variable (CLOSE), yet a single variable can't have more than one value at a time. In this case the original value would be overwritten by the subsequent SETs. To avoid this problem CLOSE is renamed (to LEAD1 and LEAD3) in the additional SET statements, resulting in three variables: CLOSE (from the current record), LEAD1 (one-period lead) and LEAD3 (three-period lead). To avoid overwriting any other variables, only variables to be renamed should be in the KEEP= parameter.

SET Feature 2 (END=): Using IF EOFx=0 in tandem with the END=EOFx parameter avoids a premature end of the DATA step. Ordinarily a DATA step with three SET statements stops when any one of the SETs attempts to go beyond the end of input. In this case, the third SET (FIRSTOBS=4) would stop the DATA step while the first would have 3 unread records remaining. The way to work around this is to prevent unwanted attempts at reading beyond the end of data. The END= parameter generates a dummy variable indicating whether the record in hand is the last incoming record. The program can test its value and stop reading each data input stream when it is exhausted. That's why the log notes above report differing numbers of observations read.

The last four records in the resulting data set are below, with lead values as expected:

Table 7. Simple Leads

Obs  STOCK      DATE     CLOSE   LEAD1   LEAD3
696  Microsoft  01SEP05  $25.73  $25.70  $26.15
697  Microsoft  03OCT05  $25.70  $27.68       .
698  Microsoft  01NOV05  $27.68  $26.15       .
699  Microsoft  01DEC05  $26.15       .       .

LEADS FOR BY GROUPS

Just as in the case of lags, generating leads in the presence of BY groups requires a little extra code. A single-period lead is relatively easy: if the current record is the last in a BY group, reset the lead to missing. That test is shown after the second SET statement below. But in the case of leads beyond one period, a little extra is needed, namely a test comparing the current value of the BY-variable (STOCK) to its value in the lead period. That's done in the code below for the three-period lead by reading in (and renaming) the STOCK variable in the third SET statement, comparing it to the current STOCK value, and resetting LEAD3 to missing when needed.

Example 6: Generating Leads for BY Groups

    data example6;
      set sample1;
      by stock;
      if eof1=0 then
        set sample1 (firstobs=2 keep=close rename=(close=lead1)) end=eof1;
      if last.stock then lead1=.;
      if eof3=0 then
        set sample1 (firstobs=4 keep=stock close
                     rename=(stock=stock3 close=lead3)) end=eof3;
      /* once the look-ahead stream is exhausted, clear the retained value */
      else lead3=.;
      if stock3 ^= stock then lead3=.;
      drop stock3;
    run;

The results for the last four IBM observations and the first two Intel observations are below, with LEAD1 set to missing for the final IBM observation, and LEAD3 for the last 3 IBM observations.

Table 8. Leads With BY Groups

Obs  STOCK  DATE     CLOSE   LEAD1   LEAD3
230  IBM    01SEP05  $80.22  $81.88  $82.20
231  IBM    03OCT05  $81.88  $88.90       .
232  IBM    01NOV05  $88.90  $82.20       .
233  IBM    01DEC05  $82.20       .       .
234  Intel  01AUG86  $23.00  $19.50  $23.00
235  Intel  02SEP86  $19.50  $20.25  $??.00

GENERATING MULTIPLE LEAD QUEUES

Generating lags for the SALES data above required the use of multiple queues: a LAG function for each product. This resolved the problem of varying distances between successive records for a given product. Developing leads for such irregular time series requires the same approach, one queue for each product. However, instead of depending on several LAG functions to manage separate queues, generating leads requires a collection of filtered SET statements. The relatively simple program below demonstrates:

Example 7: Generating Leads for Irregular Time Series

    data sales_lead;
      set sales;
      /* One filtered look-ahead stream per product; values match the   */
      /* upper-case product codes used when SALES is created.           */
      if product='A' and eofa=0 then
        set sales (where=(product='A') firstobs=2 keep=product sales
                   rename=(sales=lead1)) end=eofa;
      else if product='B' and eofb=0 then
        set sales (where=(product='B') firstobs=2 keep=product sales
                   rename=(sales=lead1)) end=eofb;
      else if product='C' and eofc=0 then
        set sales (where=(product='C') firstobs=2 keep=product sales
                   rename=(sales=lead1)) end=eofc;
      else if product='D' and eofd=0 then
        set sales (where=(product='D') firstobs=2 keep=product sales
                   rename=(sales=lead1)) end=eofd;
      else lead1=.;
    run;

The logic of Example 7 is straightforward. If the current product is A (if PRODUCT='A') and the set of PRODUCT A records is not finished (if eofa=0), then read the next product A record, renaming its SALES variable to LEAD1. The technique for reading only product A records is to use the WHERE= data set name parameter. Most important to this technique is the fact that the WHERE filter is honored prior to the FIRSTOBS parameter, so FIRSTOBS=2 means the second product A record.

The first 8 product A and B records are as follows, with the LEAD1 value always the same as the SALES value for the next record of the same product.

Table 9. Lead Produced by Independent Queues

Obs  MONTH  PRODUCT  SALES  LEAD1
(cell values not recoverable from the source)

CONCLUSIONS

While at first glance the queue-management character of the LAG function may seem counterintuitive, this property offers robust techniques to deal with a variety of situations, including BY groups and irregularly spaced time series. The technique for accommodating those structures is relatively simple. In addition, the use of multiple SET statements produces the equivalent capability in generating leads, all without the need for extra sorting of the data set.

REFERENCES

Matlapudi, Anjan, and J. Daniel Knapp. "Please Don't Lag Behind LAG". In the Proceedings of the North East SAS Users Group.

Schreier, Howard (undated). "Conditional Lags Don't Have to be Treacherous". URL as of 7//0: http://wwwhowlescom/saspapers/ccpdf

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

CONTACT INFORMATION

This is a work in progress. Your comments and questions are valued and encouraged. Please contact the author at:

Author Name: Mark Keintz
Company: Wharton Research Data Services
Address: 05 St Leonard's Court, 89 Chestnut St, Philadelphia, PA 90
Work Phone: 5898-60
Fax: 557607
Email: mkeintz@wharton.upenn.edu


APPENDIX: CREATION OF SAMPLE DATA SETS FROM THE SASHELP LIBRARY

    /* Sample 1: Monthly Stock Data for IBM, Intel, Microsoft for Aug 1986 - Dec 2005 */
    proc sort data=sashelp.stocks out=sample1;
      by stock date;
    run;

    /* Sample 2: Irregular Series: Monthly Sales by Product */
    data SALES;
      do MONTH= to ;
        do PRODUCT='A','B','C','D','X';
          if ranuni(098098)< 09 then do;
            SALES =ceil(0*ranuni(0598067));
            output;
          end;
        end;
      end;
    run;