Different ways of calculating percentiles using SAS
Arun Akkinapalli, eBay Inc., San Jose, CA



ABSTRACT
Calculating percentiles (quartiles) is a very common practice in data analysis. It can be accomplished with several different methods in SAS, with some variation in the output. This paper compares the various methods and their run times, which in turn gives a programmer good insight into choosing the option best suited to their scenario.

INTRODUCTION
A percentile (or quartile) is the value that represents a percentage position in a range of values. The 25th percentile is referred to as the first quartile, the 50th percentile is the median, and the 75th percentile is the third quartile. Based on where the data resides, the programmer can choose a method of calculating percentiles. Percentiles can be calculated using any one of the following procedures:

1) PROC UNIVARIATE
2) PROC MEANS
3) PROC SUMMARY
4) PROC REPORT

The programmer can also take advantage of SAS In-Database processing with Teradata to get approximate percentile values. We will also evaluate PROC FREQ and explore a method of bucketing the data and deriving percentiles with minimal transfer of data to SAS. The dataset used here for comparison resides in Teradata and has around 80 million records and 100 columns. The SAS code included in this paper was run on SAS 9.3 in a UNIX environment. Please note that output and run times may differ across system environments and settings.

METHODS TO CALCULATE PERCENTILES

Proc Univariate / Proc Stdize
These procedures provide a comprehensive solution for calculating percentiles. The PCTLPTS option specifies the percentile values the user is looking for, and PCTLDEF specifies the definition the procedure uses to calculate them; the default is 5. The advantage of these procedures is the flexibility to calculate a value at any level, which is not the case with most of the other procedures. They do not support SAS In-Database processing with Teradata, so their primary limitation is that the data must be stored locally in SAS. The syntax of PROC STDIZE is quite similar to that of PROC UNIVARIATE. Below is the PROC UNIVARIATE syntax, along with the time taken to transfer the data and calculate the 99.9th percentile of the 80-million-record, 100-column dataset using the default definition (PCTLDEF=5).

proc univariate data=test.otl_chk;
   where metric_1 > 0;
   class DIM1 DIM2;
   var metric_1;
   output out=cap_val pctlpts=99.9 pctlpre=pcap;
run;

Transfer (TD to SAS)    Percentile Capping    Total
42 minutes              27 minutes            69 minutes
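Since the syntax of PROC STDIZE is quite similar, the same request could look roughly like the sketch below. This is a minimal sketch rather than the paper's own code: PCTLMTD=, PCTLPTS=, and OUTSTAT= are standard PROC STDIZE options, the output name cap_val_stdize is illustrative, and because PROC STDIZE has no CLASS statement, grouping by DIM1/DIM2 would instead need a BY statement on sorted data.

proc stdize data=test.otl_chk
            pctlmtd=ord_stat        /* order-statistic percentile method */
            pctlpts=99.9
            outstat=cap_val_stdize; /* requested percentiles land in this output data set */
   where metric_1 > 0;
   var metric_1;
run;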

Proc Means / Proc Summary
PROC MEANS and PROC SUMMARY also support calculating percentile values. A statistical keyword has to be specified to request the percentile values, such as P1, P10, P25, P50, P90, P99, and so on, as per the requirement. QNTLDEF defines the method used to calculate the percentiles; the default value is 5. The advantage of this method is its support for In-Database processing, so the user does not have to transfer data from Teradata to SAS. The limitation is that it can only calculate percentiles at integer levels. A 99.9th percentile value cannot be obtained in one step with this procedure: the user first has to calculate the 99th percentile value, subset the data, and then apply the 90th percentile (the top 10% of the records above the 99th percentile corresponds to the top 0.1% of the full dataset). This requires multiple passes, although the data transfer remains minimal. Below is the syntax to calculate the 99.9th percentile value using PROC MEANS, with its run times. The PROC SUMMARY syntax is similar.

Step 1: Calculate the 99th percentile value on the input dataset.

options SQLGENERATION=DBMS MSGLEVEL=I sastrace=',,,d' sastraceloc=saslog nostsuffix;
libname indb teradata USER=xxxxxxx PASSWORD="xxxxxxx" database=TEST_PRCT_W tdpid="xxxxx";

proc means data=indb.xl_h_0704 noprint;
   class DIM_1 DIM_2;
   var METRIC_1;
   output out=test.cap_val P99=P99;
run;

Step 2: Filter the initial dataset for records greater than the 99th percentile value obtained from the test.cap_val dataset above, and create a new table in Teradata.

proc sql;
   connect to teradata as td (USER=xxxxxxx PASSWORD="xxxxxxx" DATABASE=TEST_PRCT_W
                              logdb=xxxxxx fastexport=yes TDPID="xxxxx" mode=teradata);
   execute (insert into TEST_PRCT_W.XL_H_0705
            select * from TEST_PRCT_W.XL_H_0704
            where METRIC_1 >= 1000) by td;   /* 1000 is the P99 value derived from the SAS dataset in Step 1 */
quit;

Step 3: Calculate the 90th percentile value on the new dataset to obtain the final 99.9th percentile value.

proc means data=indb.xl_h_0705 noprint;
   class DIM_1 DIM_2;
   var METRIC_1;
   output out=test.final_cap P90=P90;
run;

Transfer (TD to SAS)    Percentile Capping (Step 1 - Step 3)    Total
0 minutes               19 minutes                              19 minutes
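Step 2 above hard-codes the cutoff (1000) that was read from test.cap_val. As a minimal sketch, and assuming a single overall cutoff is wanted (the _TYPE_=0 row of the PROC MEANS output; a per-class cutoff would need a join instead), the value could be captured with PROC SQL's INTO: and substituted into the pass-through query. The macro variable name p99_cutoff is illustrative.

proc sql noprint;
   /* read the overall 99th-percentile cutoff computed in Step 1 into a macro variable */
   select P99 into :p99_cutoff trimmed
   from test.cap_val
   where _TYPE_ = 0;   /* overall statistic; class-level rows have _TYPE_ > 0 */
quit;

proc sql;
   connect to teradata as td (USER=xxxxxxx PASSWORD="xxxxxxx" DATABASE=TEST_PRCT_W
                              TDPID="xxxxx" mode=teradata);
   execute (insert into TEST_PRCT_W.XL_H_0705
            select * from TEST_PRCT_W.XL_H_0704
            where METRIC_1 >= &p99_cutoff) by td;
quit;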

Proc Freq
The PROC FREQ procedure can be a good alternative to the methods above when handling big data. Using the cumulative frequency option, the user can get a similar result in a much more efficient way. The idea is to get a cumulative frequency distribution of the initial dataset and filter the output for the specific value. The advantage is its support for In-Database processing, which avoids transferring the detail data from Teradata to SAS and gives the value at decimal levels without multiple passes over the dataset. On the flip side, the value obtained here may not be as accurate as with the two methods above, and there may be issues trying to process very large data (roughly 10-11 billion records) in one go. Below is the syntax to calculate the 99.9th percentile value, with its run times.

Step 1: Get the cumulative frequency distribution of the input dataset.

options SQLGENERATION=DBMS MSGLEVEL=I sastrace=',,,d' sastraceloc=saslog nostsuffix;
libname indb teradata USER=xxxxxxx PASSWORD="xxxxxx" database=TEST_PRCT_W tdpid="xxxxxxx";

proc freq data=indb.xl_h_0704 noprint;
   where METRIC_1 > 0;
   tables METRIC_1 / out=TEST.poc_freq_1 outcum nofreq;
   by DIM_1 DIM_2;
run;

Step 2: Filter for the minimum value at a cumulative frequency of 99.9 or higher.

proc sql;
   create table pert_freq as
   select DIM_1, DIM_2, min(METRIC_1) as prctl_99_9
   from TEST.poc_freq_1
   where cum_pct >= 99.9
   group by DIM_1, DIM_2;
quit;

Transfer (TD to SAS)    Percentile Capping    Total
0 minutes               2 minutes             2 minutes
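A convenient side benefit of this approach is that several percentile levels can be read from the same cumulative-frequency output without another pass over the Teradata table. The sketch below assumes the TEST.poc_freq_1 output created in Step 1; the table name pert_freq_multi and the prctl_* column names are illustrative.

proc sql;
   /* 99.9th and 99.99th percentiles per DIM_1/DIM_2 from one cumulative frequency output */
   create table pert_freq_multi as
   select DIM_1, DIM_2,
          min(case when cum_pct >= 99.9  then METRIC_1 end) as prctl_99_9,
          min(case when cum_pct >= 99.99 then METRIC_1 end) as prctl_99_99
   from TEST.poc_freq_1
   group by DIM_1, DIM_2;
quit;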

Bucketing and Subsetting the Data for Large Datasets
This method uses PROC SQL, the DATA step, and Teradata to calculate the percentile values. It can be an alternative for cases where the approach above (PROC FREQ) is not efficient due to very large volumes of data. The limitation of this method is that it requires multiple passes. The steps are outlined below.

Divide the data into 20 buckets based on a static lookup that defines the starting and ending value of each bucket. The number of buckets may vary based on data skewness. In this case the bucket id is already populated in the source dataset.

Calculate the overall count of the dataset and the count of each bucket id. By dividing each bucket count by the overall count in sorted order and taking a cumulative sum, the user can determine the bucket id that contains the 99.9th percentile value.

proc sql;
   connect to teradata as td (USER=xxxxxxxx PASSWORD="xxxxxx" TDPID="xxxxxx" mode=teradata);
   create table BCKT_CNT as
   select * from connection to td
   (select bckt, count(*) as bckt_cnt
    from TEST.XL_H_0706
    group by 1
    order by 1);
   create table TOTAL_CNT as
   select * from connection to td
   (select count(*) as cnt from TEST.XL_H_0706);
quit;

proc sql;
   create table ttl as
   select a.bckt, (a.bckt_cnt / b.cnt) * 100 as bckt_shr
   from bckt_cnt a, total_cnt b
   order by a.bckt;
quit;

data csum;
   set ttl;
   by bckt;
   retain total;
   total = sum(total, bckt_shr);   /* cumulative percentage share across buckets */
run;

Once the bucket id is determined, the initial dataset can be filtered for only that specific bucket. The dataset csum contains the required bucket id; the boundary value associated with that bucket id here is 4000. We can filter the initial dataset from the source data as follows:

proc sql;
   connect to teradata as td (USER=xxxxxx PASSWORD="xxxxxxx" TDPID="xxxx" mode=teradata);
   execute (insert into TEST.XL_H_0707
            select * from TEST.XL_H_0706
            where METRIC_1 > 4000) by td;
quit;

Get the cumulative frequency distribution of the subset as in the approach above (PROC FREQ), and calculate the modified cumulative frequency that applies to the whole dataset, as shown below.

options SQLGENERATION=DBMS MSGLEVEL=I sastrace=',,,d' sastraceloc=saslog nostsuffix;
libname indb teradata USER=XXXXXX PASSWORD="xxxxxxxx" database=TEST tdpid="XXXXXXX";

proc freq data=indb.xl_h_0707 noprint;
   tables METRIC_1 / out=poc_freq_2 outcum nofreq;
   by DIM_1 DIM_2;
run;

proc sql;
   create table mn_2 as
   select METRIC_1,
          (cum_pct * (100 - 99.375996035)) / 100 as cum_freq_19th
   from poc_freq_2;
quit;

Calculate the cumulative percentage with the starting point of the bucket (99.375996035, the cumulative share of all lower buckets) as the base value, and choose the 99.9th percentile value from the dataset.

data prctl;
   set mn_2;
   /* base value = cumulative share of all buckets below the selected one */
   total = sum(99.375996035, cum_freq_19th);
run;

proc sql;
   create table prctl_mn as
   select min(METRIC_1) as prctl_99_9
   from prctl
   where total > 99.9;
quit;

Transfer (TD to SAS)    Percentile Capping    Total
0 minutes               2.5 minutes           2.5 minutes
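In the steps above, the target bucket and its starting cumulative share (99.375996035) were read off the csum dataset manually, and the boundary value 4000 comes from the static lookup that defines the bucket ranges. As a minimal sketch, assuming csum is built as shown, the bucket id and the base percentage could instead be captured programmatically; tgt_bckt and base_pct are illustrative macro variable names.

proc sql noprint;
   /* first bucket whose cumulative share reaches 99.9: it contains the 99.9th percentile */
   select min(bckt) into :tgt_bckt trimmed
   from csum
   where total >= 99.9;

   /* cumulative share of all earlier buckets: the base value used when rescaling cum_pct */
   select coalesce(sum(bckt_shr), 0) into :base_pct trimmed
   from csum
   where bckt < &tgt_bckt;
quit;

%put Target bucket=&tgt_bckt  Base percentage=&base_pct;

The value of &base_pct could then replace the hard-coded 99.375996035 in the rescaling step, while the corresponding boundary value (4000 here) would still come from the bucket lookup.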

CONCLUSION
Using PROC UNIVARIATE or PROC STDIZE is appropriate for smaller datasets that already reside in SAS, as they provide a comprehensive solution. As data volume increases, and in scenarios where the data resides in another database, choosing one of the In-Database procedures eliminates the data transfer and is an efficient way to calculate percentiles. PROC MEANS is a good alternative for calculating percentiles at integer levels (99, 75, 50, 10, etc.). If the data volume is close to a billion records and percentiles are needed at decimal levels (99.9, 75.8, 0.01), PROC FREQ serves as an effective method. For datasets larger than one to two billion records, bucketing and subsetting may yield the results users are looking for.

REFERENCES
Patricia Guldin and Liping Zhang, 2009. "Quartile Conundrum." Proceedings of the SouthEast SAS Users Group (SESUG), PO-001. Available at http://analytics.ncsu.edu/sesug/2009/po001.guldin.pdf

SAS Institute Inc. (2012). SAS 9.2 Documentation: SAS 9.2 Procedures Guide.
http://support.sas.com/documentation/cdl/en/procstat/63104/html/default/viewer.htm#procstat_univariate_sect008.htm

SAS Institute Inc. (2012). SAS 9.2 Documentation: SAS 9.2 Procedures Guide.
http://support.sas.com/documentation/cdl/en/proc/61895/html/default/viewer.htm#a000146729.htm

ACKNOWLEDGEMENTS
The author would like to thank eBay for allowing the use of the necessary information.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:

Arunkumar Akkinapalli
eBay Inc., 2525 North 1st Street, San Jose, CA 95131
aakkinapalli@ebay.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.