Subsetting Observations from Large SAS Data Sets




Christopher J. Bost, MDRC, New York, NY

ABSTRACT

This paper reviews four techniques to subset observations from large SAS data sets: MERGE, PROC SQL, user-defined format, and DATA step hash object. Pros and cons of each technique are noted. Test results on a SAS data set with 100 million observations are summarized.

INTRODUCTION

Multiple techniques can be used to subset observations from large SAS data sets. Techniques vary in complexity, efficiency, and run times. Understanding the pros and cons of each can help the programmer determine the best technique for a given situation.

This paper focuses on subsetting sample members from a much larger data set. The number of conditions (i.e., sample member IDs) precludes use of a WHERE statement.

SAMPLE DATA SETS

Three SAS data sets are used in this paper: SMALL, LARGE, and SUBSET.

Data set SMALL

    Obs  ID
      1   2
      2   4
      3   5

Data set SMALL contains one variable and three observations. There is one observation for each ID (i.e., no duplicates). Each observation represents a sample member. Subset all observations with these IDs from data set LARGE.

Data set LARGE

    Obs  ID  N
      1   1  1
      2   2  1   *
      3   2  2   *
      4   3  1
      5   3  2
      6   3  3
      7   4  1   *
      8   4  2   *
      9   4  3   *
     10   5  1   *

Data set LARGE contains two variables and ten observations. There is more than one observation for some IDs (i.e., duplicates). Each observation represents a unique event. Occurrences of each ID are numbered sequentially by the variable N. Observations with ID values that occur in data set SMALL are marked with an asterisk.

Data set SUBSET

    Obs  ID  N
      1   2  1
      2   2  2
      3   4  1
      4   4  2
      5   4  3
      6   5  1

Data set SUBSET contains two variables and six observations. It contains all variables in data set LARGE and all observations in data set LARGE with an ID value that occurs in data set SMALL.
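To try the techniques in this paper, the sample data sets first have to exist. The following is a minimal sketch of DATA steps that would build SMALL and LARGE from the listings above. It assumes ID is a character variable; that is an inference from the character format $KEY used in the format technique, not something the sample listings state.

```sas
/* Sketch only: builds the sample data sets from the listings above. */
/* Assumption: ID is character (inferred from the $KEY format).      */
data small;
  input id $;
  datalines;
2
4
5
;
run;

data large;
  input id $ n;
  datalines;
1 1
2 1
2 2
3 1
3 2
3 3
4 1
4 2
4 3
5 1
;
run;
```

Each technique described in this paper should then produce a data set SUBSET with the six observations listed above.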

SUBSETTING OBSERVATIONS WITH MERGE

Use a match-merge to combine data sets SMALL and LARGE, and keep only observations where the ID occurs in both. For example:

    proc sort data=small; by id; run;
    proc sort data=large; by id; run;
    data subset;
      merge small(in=ins) large(in=inl);
      by id;
      if ins and inl;   /* keep IDs found in both data sets */
    run;

Sort data sets SMALL and LARGE by ID with PROC SORT. Merge the sorted data sets by ID. Use the IN= data set option to create variables that store information about the origin of each observation. SAS creates temporary variables INS and INL that have the value 1 when the corresponding data set contributed to the current observation and the value 0 when it did not. Use a subsetting IF statement to output only those observations with an ID in both data sets SMALL and LARGE to data set SUBSET.

Pros: easy to use; well-known technique

Cons: data set LARGE must be in sort order, sorted with PROC SORT, or indexed

SUBSETTING OBSERVATIONS WITH PROC SQL

Use PROC SQL to perform an inner join of rows (observations) in tables (data sets) LARGE and SMALL where the ID occurs in SMALL. For example:

    proc sql;
      create table subset as
      select large.*
      from large, small
      where large.id = small.id;
    quit;

Use CREATE TABLE to create data set SUBSET. Use SELECT LARGE.* to retrieve all variables from data set LARGE. Specify FROM LARGE, SMALL to read observations from data sets LARGE and SMALL. Specify WHERE LARGE.ID = SMALL.ID to include observations with an ID in data set LARGE that matches an ID in data set SMALL.

Pros: easy to use; well-known technique; data set LARGE does not have to be in sort order, sorted with PROC SORT, or indexed (although an index might improve performance)

Cons: SAS may sort the data in the background
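The join above relies on SMALL having one row per ID; a duplicate ID in SMALL would duplicate the matching rows from LARGE. As a sketch of one way to sidestep that, the same selection can be written with an IN subquery. This variant is not one of the techniques tested in this paper, so its run time relative to the join is unknown.

```sas
/* Sketch only: IN-subquery variant of the PROC SQL technique.      */
/* Not benchmarked in this paper; a duplicate ID in SMALL does not  */
/* multiply rows from LARGE here.                                   */
proc sql;
  create table subset as
  select *
  from large
  where id in (select id from small);
quit;
```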

SUBSETTING OBSERVATIONS WITH A FORMAT

Use PROC FORMAT to create a temporary format that assigns all ID values in data set SMALL the same label. Then use a WHERE statement to subset observations from data set LARGE where the formatted value of ID equals that label. The format serves, in effect, as a lookup table. For example:

    proc format;
      value $key '2','4','5' = 'Y'
                 other       = 'N';
    run;
    data subset;
      set large;
      where put(id,$key.) = 'Y';
    run;

Use PROC FORMAT to create a temporary character format named $KEY. The values 2, 4, and 5 will be formatted as Y. All other values will be formatted as N. Create data set SUBSET with the DATA statement. Read observations from data set LARGE with the SET statement. Use the WHERE statement to load only those observations where the value of ID, formatted with $KEY. via the PUT function, equals Y. Observations with ID values 2, 4, and 5 are written to data set SUBSET.

INPUT CONTROL DATA SET

It is not practical to specify large numbers of ID values when creating a format. The FORMAT procedure, however, can create formats from an input control data set, a SAS data set that stores the information needed to create a format. For example:

Input control data set FMT

    Obs  fmtname  type  label  start  HLO
      1  key      C     Y      2
      2  key      C     Y      4
      3  key      C     Y      5
      4  key      C     N      5      O

Data set FMT has the five variables needed for this input control data set: FMTNAME (the name of the format); TYPE (the type of format: C for character or N for numeric); LABEL (the formatted value); START (the starting value to be formatted); and HLO (a special variable needed for formats that use High, Low, and/or Other).

Data set FMT has four observations, one for each value to be formatted. All observations have the value key for FMTNAME and the value C for TYPE. This input control data set will create one character format named $KEY.

The first observation formats values that start with 2 as Y. (The ending value is not specified; it defaults to the starting value.) The second and third observations format values that start with 4 and 5 as Y as well. The fourth observation has the value O for HLO, indicating that any Other values not specified should be formatted as N. (The value 5 for START is ignored when HLO equals O.)

BUILDING AN INPUT CONTROL DATA SET FROM DATA SET SMALL

Build an input control data set that assigns the label Y to all IDs in data set SMALL. For example:

    proc sort data=small nodupkey out=ids;
      by id;
    run;
    data fmt(rename=(id=start));
      retain fmtname 'key' type 'C' label 'Y';
      set ids end=eof;
      output;
      if eof then do;
        HLO = 'O';     /* format all Other */
        label = 'N';   /* values as 'N'    */
        output;
      end;
    run;

First, sort data set SMALL with PROC SORT. Use the NODUPKEY option to delete any observations with duplicate ID values. (An input control data set should not have any duplicate values, just like a VALUE statement in PROC FORMAT.) Use the OUT= option to store the sorted observations in data set IDS. These observations contain the unique ID values to subset from data set LARGE.

Next, start to create data set FMT, the input control data set, with the DATA statement. Use the RETAIN statement to create the variables FMTNAME, TYPE, and LABEL. The value key is assigned to FMTNAME; the value C is assigned to TYPE; and the value Y is assigned to LABEL. Values are retained across all observations.

Read observations from data set IDS with the SET statement. Create an end-of-file variable named EOF that equals 1 when SAS reaches the last observation and that equals 0 for all other observations. Output each observation that is read to data set FMT.

Use an IF-THEN statement to tell SAS to output an additional observation to data set FMT after reading all observations from data set IDS (i.e., after it reaches the end of the file, or EOF). The variable HLO (High/Low/Other) is created and assigned the value O for Other. The value N is assigned to LABEL so that all other values not specified are labeled N. Output the observation to data set FMT.

At the end of the DATA step, ID is renamed to START (i.e., one of the variables needed in an input control data set) as specified on the RENAME= data set option. These statements create the input control data set FMT shown above.

CREATING THE FORMAT FROM THE INPUT CONTROL DATA SET

Use PROC FORMAT to create the format from the input control data set. For example:

    proc format cntlin=fmt fmtlib;
    run;

Use PROC FORMAT to create formats. Use the CNTLIN= option to specify the input control data set FMT. Use the FMTLIB option to print the format (i.e., for illustrative purposes only). Format $KEY has the following specification:

    FORMAT NAME: $KEY   LENGTH: 1   NUMBER OF VALUES: 4
    MIN LENGTH: 1   MAX LENGTH: 40   DEFAULT LENGTH 1   FUZZ: 0

    START      END        LABEL (VER. V7|V8 05JUL2006:09:44:05)
    2          2          Y
    4          4          Y
    5          5          Y
    **OTHER**  **OTHER**  N

The format $KEY. labels all ID values in data set SMALL as Y, and all other ID values as N.
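Before running the format against a very large data set, it can be worth confirming that it labels values as intended. The following is a minimal sketch, assuming the $KEY format has already been created from the three sample IDs; the test values 1 and 9 and the variable name FLAG are illustrative only.

```sas
/* Sketch only: spot-check the $KEY format on a few test values.    */
/* Assumes $KEY was created from the sample IDs 2, 4, and 5.        */
/* The values '1' and '9' and the name FLAG are illustrative.       */
data _null_;
  do id = '1', '2', '4', '5', '9';
    flag = put(id, $key.);   /* formatted value: Y or N */
    put id= flag=;           /* write both to the SAS log */
  end;
run;
```

With the sample data, the log should show FLAG equal to Y for IDs 2, 4, and 5 and N for the others.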

SUBSETTING OBSERVATIONS WITH THE FORMAT

Finally, subset observations in data set LARGE where the value of ID formatted with $KEY. equals Y:

    data subset;
      set large;
      where put(id,$key.) = 'Y';
    run;

Pros: speed; data set LARGE does not have to be in sort order, sorted with PROC SORT, or indexed

Cons: requires multiple steps; not an intuitive or obvious technique

SUBSETTING OBSERVATIONS WITH THE DATA STEP HASH OBJECT

Use the DATA step hash object, an in-memory lookup table, to store all ID values in data set SMALL. Process data set LARGE and output any observations where the ID value is found in the hash table. For example:

    data subset;
      if _N_ = 1 then do;
        declare hash h(dataset:'small');   /* load SMALL into memory */
        h.definekey('id');
        h.definedone();
      end;
      set large;
      if h.find() = 0 then output;         /* 0 = key was found */
    run;

Create data set SUBSET with the DATA statement. In the first iteration of the DATA step (i.e., _N_ = 1), use the DECLARE statement to create a hash object named H. Use the DATASET: argument tag to load the contents of data set SMALL into the hash object. Use the DEFINEKEY method to define ID as the key variable for the hash object. Use the DEFINEDONE method to indicate that all definitions are complete. Close the DO group with an END; statement.

Read observations from data set LARGE with the SET statement. Use the FIND method to determine whether the key (i.e., the current value of ID from data set LARGE) is stored in the hash object. A return code of 0 indicates that it was found. Output all observations where the ID is found in the hash object.

Pros: speed; efficiency; data set LARGE does not have to be in sort order, sorted with PROC SORT, or indexed

Cons: hash table must fit in memory; keys must be unique

TEST RESULTS

The above techniques were evaluated using SAS 9.1.3 Service Pack 3 on an IBM xSeries 445 server with four Intel Xeon processors running at 3.0 GHz and 16 GB of RAM, under Windows Server 2003, Enterprise Edition. Files were on a NetApp FAS3050c storage system with data files and SASWORK files on separate filer heads.

Data set SMALL contains 10,000 observations and 1 variable (85 KB).

Data set LARGE contains 100,000,000 observations and 101 variables (22 GB with COMPRESS=BINARY).

Techniques, respective run times (i.e., times for the entire SAS program to complete), and additional notes are summarized in the following table:

    Technique     Run Time       Notes
    MERGE         2 hr, 36 min   SORT: 2 hr, 17 min; MERGE: 0 hr, 20 min
    PROC SQL      0 hr, 16 min
    Format        0 hr, 17 min   FORMAT: 0.1 sec; DATA step: 17 min
    Hash object   0 hr, 17 min

The PROC SQL technique, the format technique, and the hash object technique had comparable run times. The MERGE technique was dramatically slower (i.e., due to the PROC SORT step).

EFFICIENCY

SAS reports real time (i.e., the clock time that it takes for a step to run) and CPU time (the amount of time that the processor[s] are being used). The difference between real time and CPU time is the amount of time that the processors are idle, waiting for data from the I/O subsystem.

The MERGE technique (i.e., including the SORT) took 2 hours, 36 minutes, approximately 51 minutes of which were CPU time. The PROC SQL technique took 16 minutes, approximately 7 minutes of which were CPU time. The format technique took 17 minutes, approximately 7 minutes of which were CPU time. The hash object technique took 17 minutes, approximately 8 minutes of which were CPU time. The hash object technique had the highest ratio of CPU time to real time.

CONCLUSIONS

Different techniques to subset observations from large SAS data sets may entail significantly different run times. In our tests, the SQL technique, format technique, and hash technique worked faster than the MERGE technique. Performance may vary based on the operating environment and files used. Test multiple techniques where time permits to see which works best.

REFERENCES

Bilenas, Jonas V. 2005. The Power of PROC FORMAT. Cary, NC: SAS Institute Inc.

Secosky, Jason and Bloom, Janice. 2006. Getting Started with the DATA Step Hash Object. http://support.sas.com/rnd/base/topics/datastep/dot/hash-getting-started.pdf (accessed June 22, 2006).

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

The author appreciates the help of Janice Bloom and Jason Secosky in reviewing this paper.

CONTACT INFORMATION

Christopher J. Bost
Director, Research Technology Unit
MDRC
16 East 34th Street, 19th Floor
New York, NY 10016
(212) 340-8613 telephone
(212) 684-0832 fax
christopher.bost@mdrc.org
www.mdrc.org