Different ways of calculating percentiles using SAS
Arun Akkinapalli, eBay Inc., San Jose, CA



ABSTRACT
Calculating percentiles (quartiles) is a very common practice in data analysis. It can be accomplished with several different methods in SAS, with some variation in the output. This paper compares the various methods and their run times, which in turn gives a programmer good insight into choosing the option best suited to their scenario.

INTRODUCTION
A percentile (or quartile) is the value that represents a percentage position in a range of values. The 25th percentile is referred to as the first quartile, the 50th percentile is the median, and the 75th percentile is the third quartile. Based on where the data resides, the programmer can choose a method of calculating percentiles. Percentiles can be calculated using any one of the following procedures:

1) PROC UNIVARIATE
2) PROC MEANS
3) PROC SUMMARY
4) PROC REPORT

The programmer can also take advantage of SAS In-Database processing with Teradata to get approximate percentile values. We will also evaluate PROC FREQ and explore a method of bucketing the data and deriving percentiles with minimal transfer of data to SAS. The dataset used here for comparison resides in Teradata and has around 80 million records and 100 columns. The SAS code included in this paper was run on SAS 9.3 in a UNIX environment. Please note that output and run times may differ across system environments and settings.

METHODS TO CALCULATE PERCENTILES

Proc Univariate / Proc Stdize
These procedures provide a comprehensive solution for calculating percentiles. The PCTLPTS option specifies the percentile values the user is looking for, and PCTLDEF specifies the definition the procedure uses to calculate them; the default is 5. The advantage of these procedures is the flexibility to calculate a value at any level, which is not the case with most of the other procedures. They do not support SAS In-Database processing with Teradata, so their primary limitation is that the data must be stored locally in SAS. The syntax of PROC STDIZE is quite similar to that of PROC UNIVARIATE. Below is the PROC UNIVARIATE syntax, along with the time taken to transfer the data and calculate the 99.9th percentile of the 80-million-record, 100-column dataset using the default definition (PCTLDEF=5).

proc univariate data=test.otl_chk;
   where metric_1 > 0;
   class DIM1 DIM2;
   var metric_1;
   output out=cap_val pctlpts=99.9 pctlpre=pcap;
run;

Transfer (TD to SAS)    Percentile Capping    Total
42 minutes              27 minutes            69 minutes
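Since the syntax of PROC STDIZE is quite similar, the same request could look roughly like the sketch below. This is a minimal sketch rather than the paper's own code: PCTLMTD=, PCTLPTS=, and OUTSTAT= are standard PROC STDIZE options, the output name cap_val_stdize is illustrative, and because PROC STDIZE has no CLASS statement, grouping by DIM1/DIM2 would instead need a BY statement on sorted data.

proc stdize data=test.otl_chk
            pctlmtd=ord_stat        /* order-statistic percentile method */
            pctlpts=99.9
            outstat=cap_val_stdize; /* requested percentiles land in this output data set */
   where metric_1 > 0;
   var metric_1;
run;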

Proc Means / Proc Summary
PROC MEANS and PROC SUMMARY also support calculating percentile values. A statistical keyword has to be specified to request the percentile values, such as P1, P10, P25, P50, P90, P99, and so on, as per the requirement. QNTLDEF defines the method used to calculate the percentiles; the default value is 5. The advantage of this method is its support for In-Database processing, so the user does not have to transfer data from Teradata to SAS. The limitation is that it can only calculate percentiles at integer levels. A 99.9th percentile value cannot be obtained in one step with this procedure: the user first has to calculate the 99th percentile value, subset the data, and then apply the 90th percentile (the top 10% of the records above the 99th percentile corresponds to the top 0.1% of the full dataset). This requires multiple passes, although the data transfer remains minimal. Below is the syntax to calculate the 99.9th percentile value using PROC MEANS, with its run times. The PROC SUMMARY syntax is similar.

Step 1: Calculate the 99th percentile value on the input dataset.

options SQLGENERATION=DBMS MSGLEVEL=I sastrace=',,,d' sastraceloc=saslog nostsuffix;
libname indb teradata USER=xxxxxxx PASSWORD="xxxxxxx" database=TEST_PRCT_W tdpid="xxxxx";

proc means data=indb.xl_h_0704 noprint;
   class DIM_1 DIM_2;
   var METRIC_1;
   output out=test.cap_val P99=P99;
run;

Step 2: Filter the initial dataset for records greater than the 99th percentile value obtained from the test.cap_val dataset above, and create a new table in Teradata.

proc sql;
   connect to teradata as td (USER=xxxxxxx PASSWORD="xxxxxxx" DATABASE=TEST_PRCT_W
                              logdb=xxxxxx fastexport=yes TDPID="xxxxx" mode=teradata);
   execute (insert into TEST_PRCT_W.XL_H_0705
            select * from TEST_PRCT_W.XL_H_0704
            where METRIC_1 >= 1000) by td;   /* 1000 is the P99 value derived from the SAS dataset in Step 1 */
quit;

Step 3: Calculate the 90th percentile value on the new dataset to obtain the final 99.9th percentile value.

proc means data=indb.xl_h_0705 noprint;
   class DIM_1 DIM_2;
   var METRIC_1;
   output out=test.final_cap P90=P90;
run;

Transfer (TD to SAS)    Percentile Capping (Step 1 - Step 3)    Total
0 minutes               19 minutes                              19 minutes
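Step 2 above hard-codes the cutoff (1000) that was read from test.cap_val. As a minimal sketch, and assuming a single overall cutoff is wanted (the _TYPE_=0 row of the PROC MEANS output; a per-class cutoff would need a join instead), the value could be captured with PROC SQL's INTO: and substituted into the pass-through query. The macro variable name p99_cutoff is illustrative.

proc sql noprint;
   /* read the overall 99th-percentile cutoff computed in Step 1 into a macro variable */
   select P99 into :p99_cutoff trimmed
   from test.cap_val
   where _TYPE_ = 0;   /* overall statistic; class-level rows have _TYPE_ > 0 */
quit;

proc sql;
   connect to teradata as td (USER=xxxxxxx PASSWORD="xxxxxxx" DATABASE=TEST_PRCT_W
                              TDPID="xxxxx" mode=teradata);
   execute (insert into TEST_PRCT_W.XL_H_0705
            select * from TEST_PRCT_W.XL_H_0704
            where METRIC_1 >= &p99_cutoff) by td;
quit;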

Proc Freq
The PROC FREQ procedure can be a good alternative to the methods above when handling big data. Using the cumulative frequency option, the user can get a similar result in a much more efficient way. The idea is to get a cumulative frequency distribution of the initial dataset and filter the output for the specific value. The advantage is its support for In-Database processing, which avoids transferring the detail data from Teradata to SAS and gives the value at decimal levels without multiple passes over the dataset. On the flip side, the value obtained here may not be as accurate as with the two methods above, and there may be issues trying to process very large data (roughly 10-11 billion records) in one go. Below is the syntax to calculate the 99.9th percentile value, with its run times.

Step 1: Get the cumulative frequency distribution of the input dataset.

options SQLGENERATION=DBMS MSGLEVEL=I sastrace=',,,d' sastraceloc=saslog nostsuffix;
libname indb teradata USER=xxxxxxx PASSWORD="xxxxxx" database=TEST_PRCT_W tdpid="xxxxxxx";

proc freq data=indb.xl_h_0704 noprint;
   where METRIC_1 > 0;
   tables METRIC_1 / out=TEST.poc_freq_1 outcum nofreq;
   by DIM_1 DIM_2;
run;

Step 2: Filter for the minimum value at a cumulative frequency of 99.9 or higher.

proc sql;
   create table pert_freq as
   select DIM_1, DIM_2, min(METRIC_1) as prctl_99_9
   from TEST.poc_freq_1
   where cum_pct >= 99.9
   group by DIM_1, DIM_2;
quit;

Transfer (TD to SAS)    Percentile Capping    Total
0 minutes               2 minutes             2 minutes
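A convenient side benefit of this approach is that several percentile levels can be read from the same cumulative-frequency output without another pass over the Teradata table. The sketch below assumes the TEST.poc_freq_1 output created in Step 1; the table name pert_freq_multi and the prctl_* column names are illustrative.

proc sql;
   /* 99.9th and 99.99th percentiles per DIM_1/DIM_2 from one cumulative frequency output */
   create table pert_freq_multi as
   select DIM_1, DIM_2,
          min(case when cum_pct >= 99.9  then METRIC_1 end) as prctl_99_9,
          min(case when cum_pct >= 99.99 then METRIC_1 end) as prctl_99_99
   from TEST.poc_freq_1
   group by DIM_1, DIM_2;
quit;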

Bucketing and Subsetting the Data for Large Datasets
This method uses PROC SQL, the DATA step, and Teradata to calculate the percentile values. It can be an alternative for cases where the approach above (PROC FREQ) is not efficient due to very large volumes of data. The limitation of this method is that it requires multiple passes. The steps are outlined below.

Divide the data into 20 buckets based on a static lookup that defines the starting and ending value of each bucket. The number of buckets may vary based on data skewness. In this case the bucket id is already populated in the source dataset.

Calculate the overall count of the dataset and the count of each bucket id. By dividing each bucket count by the overall count in sorted order and taking a cumulative sum, the user can determine the bucket id that contains the 99.9th percentile value.

proc sql;
   connect to teradata as td (USER=xxxxxxxx PASSWORD="xxxxxx" TDPID="xxxxxx" mode=teradata);
   create table BCKT_CNT as
   select * from connection to td
   (select bckt, count(*) as bckt_cnt
    from TEST.XL_H_0706
    group by 1
    order by 1);
   create table TOTAL_CNT as
   select * from connection to td
   (select count(*) as cnt from TEST.XL_H_0706);
quit;

proc sql;
   create table ttl as
   select a.bckt, (a.bckt_cnt / b.cnt) * 100 as bckt_shr
   from bckt_cnt a, total_cnt b
   order by a.bckt;
quit;

data csum;
   set ttl;
   by bckt;
   retain total;
   total = sum(total, bckt_shr);   /* cumulative percentage share across buckets */
run;

Once the bucket id is determined, the initial dataset can be filtered for only that specific bucket. The dataset csum contains the required bucket id; the boundary value associated with that bucket id here is 4000. We can filter the initial dataset from the source data as follows:

proc sql;
   connect to teradata as td (USER=xxxxxx PASSWORD="xxxxxxx" TDPID="xxxx" mode=teradata);
   execute (insert into TEST.XL_H_0707
            select * from TEST.XL_H_0706
            where METRIC_1 > 4000) by td;
quit;

Get the cumulative frequency distribution of the subset as in the approach above (PROC FREQ), and calculate the modified cumulative frequency that applies to the whole dataset, as shown below.

options SQLGENERATION=DBMS MSGLEVEL=I sastrace=',,,d' sastraceloc=saslog nostsuffix;
libname indb teradata USER=XXXXXX PASSWORD="xxxxxxxx" database=TEST tdpid="XXXXXXX";

proc freq data=indb.xl_h_0707 noprint;
   tables METRIC_1 / out=poc_freq_2 outcum nofreq;
   by DIM_1 DIM_2;
run;

proc sql;
   create table mn_2 as
   select METRIC_1,
          (cum_pct * (100 - 99.375996035)) / 100 as cum_freq_19th
   from poc_freq_2;
quit;

Calculate the cumulative percentage with the starting point of the bucket (99.375996035, the cumulative share of all lower buckets) as the base value, and choose the 99.9th percentile value from the dataset.

data prctl;
   set mn_2;
   /* base value = cumulative share of all buckets below the selected one */
   total = sum(99.375996035, cum_freq_19th);
run;

proc sql;
   create table prctl_mn as
   select min(METRIC_1) as prctl_99_9
   from prctl
   where total > 99.9;
quit;

Transfer (TD to SAS)    Percentile Capping    Total
0 minutes               2.5 minutes           2.5 minutes
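In the steps above, the target bucket and its starting cumulative share (99.375996035) were read off the csum dataset manually, and the boundary value 4000 comes from the static lookup that defines the bucket ranges. As a minimal sketch, assuming csum is built as shown, the bucket id and the base percentage could instead be captured programmatically; tgt_bckt and base_pct are illustrative macro variable names.

proc sql noprint;
   /* first bucket whose cumulative share reaches 99.9: it contains the 99.9th percentile */
   select min(bckt) into :tgt_bckt trimmed
   from csum
   where total >= 99.9;

   /* cumulative share of all earlier buckets: the base value used when rescaling cum_pct */
   select coalesce(sum(bckt_shr), 0) into :base_pct trimmed
   from csum
   where bckt < &tgt_bckt;
quit;

%put Target bucket=&tgt_bckt  Base percentage=&base_pct;

The value of &base_pct could then replace the hard-coded 99.375996035 in the rescaling step, while the corresponding boundary value (4000 here) would still come from the bucket lookup.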

CONCLUSION
Using PROC UNIVARIATE or PROC STDIZE is appropriate for smaller datasets that already reside in SAS, as they provide a comprehensive solution. As data volume increases, and in scenarios where the data resides in another database, choosing one of the In-Database procedures eliminates the data transfer and is an efficient way to calculate percentiles. PROC MEANS is a good alternative for calculating percentiles at integer levels (99, 75, 50, 10, etc.). If the data volume is close to a billion records and percentiles are needed at decimal levels (99.9, 75.8, 0.01), PROC FREQ serves as an effective method. For datasets larger than one to two billion records, bucketing and subsetting may yield the results users are looking for.

REFERENCES
Patricia Guldin and Liping Zhang, 2009. "Quartile Conundrum." Proceedings of the SouthEast SAS Users Group (SESUG), PO-001. Available at http://analytics.ncsu.edu/sesug/2009/po001.guldin.pdf

SAS Institute Inc. (2012). SAS 9.2 Documentation: SAS 9.2 Procedures Guide.
http://support.sas.com/documentation/cdl/en/procstat/63104/html/default/viewer.htm#procstat_univariate_sect008.htm

SAS Institute Inc. (2012). SAS 9.2 Documentation: SAS 9.2 Procedures Guide.
http://support.sas.com/documentation/cdl/en/proc/61895/html/default/viewer.htm#a000146729.htm

ACKNOWLEDGEMENTS
The author would like to thank eBay for allowing the use of the necessary information.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:

Arunkumar Akkinapalli
eBay Inc., 2525 North 1st Street, San Jose, CA 95131
aakkinapalli@ebay.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.