Efficient Techniques and Tips in Handling Large Datasets Shilong Kuang, Kelley Blue Book Inc., Irvine, CA




ABSTRACT

When we work with millions of records and hundreds of variables, how we process the data becomes crucial. To make SAS really ROCK, we need to pay more attention to program efficiency, since a single DATA step or SQL query can take a few hours on such large datasets. In this paper we present a few practical techniques and hands-on tips for handling large datasets, including the use of an INDEX, splitting a single step into multiple steps, the classic WHERE vs. IF comparison, and some tips on joining large datasets in PROC SQL. To show how effective each technique is, we provide experimental output for each case: the time consumed in an "apple-to-apple" comparison of the process with and without the technique. With these tips in our large-data practice we can save a lot of space and time. SAS ROCKS!

Keywords: data analysis, large data manipulation, efficient techniques tips, create index, data mining

INTRODUCTION

Efficiency in SAS programming has traditionally been defined as the optimization of space (computer resources etc.) and time (CPU time, data I/O time, programmer time etc.). It has become more and more crucial now that large datasets are everywhere. When we sit in front of a big dataset with millions of records and hundreds of variables, how do we work with it? Every single step may take a few hours to complete if we don't handle it carefully. In particular, during data preparation or model testing, we are torturing ourselves if a single test run takes hours and we have to go back and forth several times. In this paper we provide a few efficient techniques for handling those situations and making our SAS programs as efficient as possible.
WORK ON ONLY WHAT YOU NEED

Example: we want to sort a large dataset with 10 million records and 20 variables, of which we need only 5.

Select only the 5 variables needed:

    proc sort data=data_in(keep=var1-var5);
        by var1-var4;
    run;

    Real Time: 53.59 seconds
    CPU Time:  40.09 seconds

Include some unnecessary variables:

    proc sort data=data_in(keep=var1-var10);
        by var1-var4;
    run;

    Real Time: 8:59.72 minutes
    CPU Time:  1:54.64 minutes

Furthermore, if we include all 20 variables in the sort, there is still no response after waiting 25 minutes. The processing time is clearly not linearly proportional to the number of variables: doubling the variables from 5 to 10 multiplied the real time by about ten.

MULTI-STEP VS. SINGLE STEP

Example: we want to sort a bigger dataset with 50 million records, where all 10 variables (var1-var10) are needed. The sort must use nodupkey on var1-var3. The single-step code is simpler than the multi-step code, in which we first split the big dataset into two smaller parts, sort each part with nodupkey, then combine the parts and sort with nodupkey again.

Method I: Single-step sort with nodupkey

    proc sort data=data_in nodupkey;
        by var1-var3;
    run;

    Real Time: 73:31.08 minutes
    CPU Time:   8:44.89 minutes

Method II: Multi-step sort with nodupkey

    /* Step 1: sort the first half */
    proc sort data=data_in(firstobs=1 obs=25000000) out=out1 nodupkey;
        by var1-var3;
    run;

    /* Step 2: sort the second half */
    proc sort data=data_in(firstobs=25000001) out=out2 nodupkey;
        by var1-var3;
    run;

    /* Step 3: combine the two sorted halves */
    proc append base=out1 data=out2 force;
    run;

    /* Step 4: remove duplicates that span the two halves */
    proc sort data=out1 out=data_out nodupkey;
        by var1-var3;
    run;

    Step                      Real Time           CPU Time
    1. Sort first half        18:20.50 minutes    3:18.21 minutes
    2. Sort second half       20:02.22 minutes    3:51.10 minutes
    3. Append                 37.79 seconds       11.18 seconds
    4. Final sort             1:17.91 minutes     55.95 seconds
    Total                     < 41 minutes        < 9 minutes

The difference is easy to see: instead of waiting 74 minutes for the single step, we finish the same work within 41 minutes in multiple steps. What a difference! We believe an efficient programmer should not be stingy with SAS code; the extra coding work is easily repaid by the time saved.

INDEX & WHERE > WHERE > IF

Example: we use the same dataset as before, with 50 million records and 10 variables, and we want the subset that satisfies a certain condition (var1='key').

IF statement:

    data data_out1;
        set data_in;
        if var1='key';
    run;

    Real Time: 63.40 seconds
    CPU Time:  63.21 seconds

WHERE statement:

    data data_out2;
        set data_in(where=(var1='key'));
    run;

    Real Time: 14.42 seconds
    CPU Time:  14.34 seconds

The WHERE version takes significantly less time. The reason is that with WHERE in a DATA step, a record that does not satisfy the condition (var1='key') is never read into the Program Data Vector (PDV), which saves a lot of unnecessary reading time. With a subsetting IF, every record is read into the PDV first and only then tested.

An INDEX is usually applied to optimize a WHERE statement or a BY statement. We can create an INDEX with simple code such as the following:

    proc sql;
        /* a simple index must have the same name as its key column */
        create index var1 on data_in(var1);
    quit;

To check whether the INDEX was actually used in the optimization, we can set the following option and then check the log output:

    options msglevel=i;

The rule of thumb for when to use an index is that the subset should be only a small portion of the whole dataset. As long as the subset is less than about 20% of the whole dataset, the index will usually improve performance.

WORK ON SMALL SAMPLE FIRST TO TEST THE WHOLE PROCESS

After we have tested each individual step (a DATA step, a procedure or an SQL query), and the whole process contains several of them, we want to test the program as a whole. Take, for instance, the following process flow:

    1. data one; set two(where=(var1='key'));
    2. proc sql
    3. proc glmselect

To test the whole process, we can simply choose a smaller subset for testing. The SAS procedure SURVEYSELECT can give us a well-distributed sample:

    proc surveyselect data=data_in seed=1234567890 method=srs n=100 out=data_out;
    run;

To make life easier, we can just use the OBS= option in the DATA step:

    data one;
        set two(obs=100 where=(var1='key'));
    run;

SAVING SPACE FOR LARGE DATASETS

There are quite a few SUGI papers that investigate in detail how to save data storage space. Since we don't want to get into too much theory, we provide a few practical techniques for saving space, in particular for large datasets.

    options compress=yes;

This trivial option can save us a lot of space, especially for temporary files in the work folder. A dataset that occupies 5 GB when compressed can easily take more than 20 GB uncompressed, and a few such uncompressed files can together take more than 100 GB of disk space. Some people may object that the compress option costs more CPU time. The key is to consider which is more worthwhile: on one side a little more processing time, say from 30 seconds to 1 minute; on the other side the risk of running out of temporary folder space.
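The trade-off described above can be checked on a small scale before changing a global option. The following is a minimal sketch, assuming a made-up test dataset (the names work.demo_plain, work.demo_comp, id and comment are hypothetical and not from the original programs). It uses the COMPRESS= data set option, which applies compression to one dataset at a time instead of the whole session.

```sas
/* Hypothetical test data: a long, mostly blank character column,
   the kind of content that tends to compress well. */
data work.demo_plain (compress=no)
     work.demo_comp  (compress=yes);
    length comment $200;
    do id = 1 to 100000;
        comment = 'short text';
        output work.demo_plain work.demo_comp;
    end;
run;

/* Compare the page counts and compression status of the two copies */
proc contents data=work.demo_plain;
run;

proc contents data=work.demo_comp;
run;
```

When a compressed dataset is created, the SAS log also reports by what percentage compression changed its size, which is a quick way to judge whether the option pays off for a given dataset.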
In fact, we can even set the compress option in the SAS configuration file, sasv9.cfg. The configuration file is usually located at :\Program Files\SAS\SASFoundation\9.2\nls\en\SASV9.CFG. We can open that file in a plain text editor such as Notepad and add one line:

    -COMPRESS=YES

After that, all files generated in the work folder (and also any permanent files) will be stored in compressed format. This way we can easily control the space used by the temporary work folder. It is especially helpful when several SAS programmers work on the same server, because we don't need to check their programs one by one.

Change numerical variables to character variables.

This option is useful if we have a large dataset with numeric codes or large numerical values in some columns. Because numbers are stored in binary floating-point form, a numeric variable occupies 8 bytes by default and at least 2 or 3 bytes even at its shortest length, whereas a character variable needs only 1 byte per character. The conversion is also useful for huge integer values. An 8-byte SAS numeric stores integers exactly only up to 2**53, which is about 16 digits, so longer values such as long ID numbers are rounded. After switching to character variables, this concern disappears.
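As an illustration of the conversion, here is a minimal sketch under stated assumptions: the datasets work.flags_num, work.flags_char and work.ids and the variables flag, flag_c and big_id are hypothetical names invented for this example. It rebuilds a one-digit numeric flag as a 1-byte character variable with the PUT function, and stores a 20-digit identifier as character so that no digit is lost.

```sas
/* Hypothetical data: a one-digit flag held in a default 8-byte numeric */
data work.flags_num;
    do id = 1 to 5;
        flag = mod(id, 2);
        output;
    end;
run;

/* Rebuild the flag as a 1-byte character variable and drop the numeric one */
data work.flags_char (drop=flag);
    set work.flags_num;
    length flag_c $1;
    flag_c = put(flag, 1.);
run;

/* A 20-digit identifier kept as character retains every digit */
data work.ids;
    length big_id $20;
    big_id = '12345678901234567890';
run;
```

The saving applies to short codes and flags. A long identifier stored as character can take more bytes than an 8-byte numeric, so in that case the gain is exactness rather than space.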

SEND EMAIL NOTICE WHEN IT'S DONE

Last but not least, when we have to wait some time for a program to finish, whether during initial data preparation, model testing or the final production run, we can simply put the following SAS code at the end of the program and ask SAS to send us an email when it's done. Checking email is much easier than logging into the SAS server to see whether the program has finished, especially now that email can be checked conveniently on mobile phones.

    filename mymail email "your company email address"
        cc="your gmail address"
        subject="SAS task finished"
        attach="directory\any file";

    data _null_;
        file mymail;
        put "The SAS program has finished.";
    run;

Note: you may need to set up your email account appropriately if the program runs on a remote SAS server. If interested, see the recommended reading for more details.

CONCLUSION

In this paper we provide a few practical techniques for large data analysis. Each technique on its own might seem trivial to many SAS programmers, but combined they can be very powerful! Keeping these techniques in mind, we will have a more enjoyable and less painful experience in our long SAS journey (there is no shortcut to becoming a SAS expert!).

ACKNOWLEDGMENTS

We would like to thank our colleagues Alice Xie, Bisser Roussanov, Richard Umstaetter, Roger Yeh and Shelly Teh et al. for their help in our daily SAS large data practice. We would also like to thank our Vice-President Shawn Hushman for his trust and support of SAS training.

REFERENCES

1. The Use and Abuse of the Program Data Vector, Jim Johnson, Proceedings of the 2003 Conference of the Pharmaceutical Industry SAS Users Group, Cary, NC: SAS Institute, Inc., 601-610.
2. The Basics of Using SAS Indexes, Michael A. Raithel, SAS Users Group International (SUGI), Proceedings 30, Tutorials.
3. Kirk's Korner, Quick & Simple Tips, Kirk Paul Lafler, Software Intelligence Corporation.
4. A Faster Index for Sorted SAS Datasets, Mark Keintz, SAS Global Forum 2009, Applications Development.
5. Using SAS Indexes with Large Databases, Alex Vinokurov, Lawrence Helbers, NESUG 15, Beginning Tutorials.
6. Keeping Your Data in Step - Utilizing Efficiencies, Michael G. Sadof, SUGI 24, Advanced Tutorials.
7. Are Your SAS Programs Running You? Marje Fecht, Larry Stewart, Proceedings of the 2008 SAS Global Forum, Paper 164-2008.

RECOMMENDED READING

SAS Tips I Learnt While at Oxford, Philip Mason, SUGI 26, Advanced Tutorials.

You've Got Mail: E-mailing Messages and Output Using SAS EMAIL Engine, Jeanina Worden, Philip Jones, SUGI 29, Posters. Covers the syntax of the FILENAME and FILE statements for automatically sending custom e-mails and files using the filename email access method.

Step-by-step tutorial on how to send email via SAS: http://www.dumblittledoctor.com/sas_tutorial_home.php, How to send an email in SAS, part I and part II.

SAS Coding Tips and Techniques, http://www.sconsig.com/sastip.htm

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Please contact the author at:

    Name: Dr. Shilong Kuang
    Enterprise: Kelley Blue Book, Inc.
    Address: 195 Technology Drive
    City, State ZIP: Irvine, CA 92618
    E-mail: shilong.kuang@gmail.com
    Web:

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.