A PROC SQL Primer
Matt Taylor, Carolina Analytical Consulting, LLC, Charlotte, NC

ABSTRACT
Most SAS programmers use the DATA step to manipulate their datasets. Unless they pull data from a relational database, however, there is a good chance they have not explored an alternative method of data manipulation, namely PROC SQL. Once learned, PROC SQL can be more intuitive than a DATA step solution in some cases, and it offers functionality that is very difficult to obtain through a DATA step. This paper reviews the basics of PROC SQL, including creating and manipulating datasets, combining tables and summarizing data, along with tips to help the beginner code efficiently.

INTRODUCTION
SQL stands for Structured Query Language and is the query language of most relational databases. PROC SQL is a procedure that allows you to use SQL within a SAS program. This helps when pulling datasets down from a relational database. It also adds programming flexibility, as many features of SQL mirror DATA step functionality. Learning SQL can be a good point of entry for analysts coming from other programming languages and can also enhance the toolbox of existing SAS programmers.

BASIC SYNTAX
PROC SQL has a different syntax than virtually every other SAS procedure. A query is a single statement built from several clauses, with one semicolon at the end of the whole statement. The procedure also uses QUIT to terminate instead of the RUN statement that a DATA step would use. The simplest form of a SQL query has three parts: a SELECT statement, a FROM statement and a WHERE statement. Here is what they signify:

SELECT   Chooses which variables you wish to have in your results or resulting dataset.
FROM     Identifies the table or dataset from which you want to select the data.
WHERE    Subsets the results based on one or more conditions.

Now, we will look at an example of the basic PROC SQL syntax.
Our source data for this exercise will look like this:

State  City        Popflag  Balance  Acct_no
VA     Blacksburg  Pop1     5234     1432534
VA     Richmond    Pop1     2353     1432535
VA     Norfolk     Pop1     9427     1432536
VA     Chantilly   Pop1     5423     1432537
VA     Roanoke     Pop1     6342     1432538
NC     Charlotte   Pop1     9769     1432539
NC     Raleigh     Pop1     2313     1432540
NC     Asheville   Pop1     3745     1432541
NC     Burlington  Pop1     1231     1432542
NC     Mebane      Pop1     8568     1432543

Our base data has some geographical information, an identifying flag, balance information and an account number. Let's show the syntax of the most basic of PROC SQL queries:
    proc sql;
       select acct_no, balance, state
       from sample
       where state='VA';
    quit;

The sample query pulls from our dataset, called SAMPLE. It selects three fields and keeps only those accounts in the state of Virginia. The results appear in our OUTPUT window and look like this:

Acct_no  Balance  State
1432534  5234     VA
1432535  2353     VA
1432536  9427     VA
1432537  5423     VA
1432538  6342     VA
1432560  4236     VA
1432561  5234     VA
1432562  2353     VA
1432563  9427     VA
1432564  5423     VA
1432596  8568     VA

You can see that the output window lists the account number, balance and state, the three fields that appeared in our SELECT statement. The variables appear in the order specified in the SELECT statement, which can be helpful if you want to arrange columns in a certain order. The results are only for the state of Virginia because of our WHERE statement.

PROC SQL has its own set of conditional operators which allow you to subset your data. Here is a list of some of the more frequent items:

[Table: frequently used conditional operators]

Note that most conditional operators can be reversed using the keyword NOT, such as NOT IN, IS NOT NULL, IS NOT MISSING, etc. Next, we will look at putting our results into a dataset.

CREATING AND SORTING TABLES
A table, in PROC SQL language, is the same as a SAS dataset. Most of the time, you will want to put your results into a SAS dataset for further use. Let's look at the statement which will help achieve this.

CREATE TABLE   Creates a SAS dataset of your results using PROC SQL.

We will recreate the above example query, this time creating a dataset of our results. Here is how that would look:
    proc sql;
       create table t1 as
       select acct_no, balance, state
       from sample
       where state='VA';
    quit;

This query creates a table called T1 with the results of our above PROC SQL query. Note the syntax, CREATE TABLE dataset AS, which is different from any other procedure in SAS. The log will report that the dataset T1 was created, and its contents mirror the results of our previous query.

Many times you will want to sort your PROC SQL table results, much like using a PROC SORT in the SAS language. The good thing about PROC SQL is that you can sort within the same query, without having to add an extra PROC SORT afterwards. This is achieved using the ORDER BY statement.

ORDER BY   Lists a PROC SQL table in the order specified, sorting your dataset or output. It performs a similar function to a PROC SORT.

Let's look at an example using the ORDER BY statement:

    proc sql;
       select acct_no, balance, state
       from sample
       where state='VA'
       order by balance;
    quit;

The ORDER BY is placed last in a query and appears after any other clauses. The results in the output are sorted ascending by balance. Here are the results:

Obs  Acct_no  Balance  State
  1  1432689  1231     VA
  2  1432649  1231     VA
  3  1432702  1231     VA
  4  1432729  1231     VA
  5  1432687  2313     VA
  6  1432647  2313     VA
  7  1432633  2313     VA
  8  1432700  2313     VA
  9  1432727  2313     VA
 10  1432668  2353     VA

If you want your results to be ordered descending, you can use that keyword, similar to PROC SORT. Note that the keyword appears after the variable in PROC SQL, as opposed to PROC SORT where it appears before:

    proc sql;
       select acct_no, balance, state
       from sample
       where state='VA'
       order by balance descending;
    quit;
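ORDER BY also accepts more than one column, and each column can carry its own sort direction. The sketch below is illustrative only: the dataset name sort_demo and the output table t1_sorted are hypothetical, and the six rows are copied from the source data shown earlier in this paper. It sorts by state and then by balance from highest to lowest within each state.

```sas
/* Hypothetical demo data: six rows copied from the paper's source table */
data sort_demo;
   input state $ city :$12. balance acct_no;
   datalines;
VA Blacksburg 5234 1432534
VA Richmond 2353 1432535
VA Norfolk 9427 1432536
NC Charlotte 9769 1432539
NC Raleigh 2313 1432540
NC Mebane 8568 1432543
;
run;

proc sql;
   /* Sort by state (ascending by default), then by balance, largest first. */
   /* DESC requests descending order for the column it follows.             */
   create table t1_sorted as
   select acct_no, balance, state
   from sort_demo
   order by state, balance desc;
quit;
```

With these six rows, the NC accounts come first (balances 9769, 8568, 2313), followed by the VA accounts (9427, 5234, 2353).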
SUMMARY FUNCTIONS
PROC SQL is able to summarize your base datasets with a variety of summary functions. This can often replicate the results of a PROC FREQ, which is a staple of the traditional SAS programmer, while the additional functions let you go beyond simple counts. There are two main components to summarizing using PROC SQL: you need to use one or more summary functions, and you need to include the GROUP BY statement to complete the query.

GROUP BY   Rolls up a SQL query by the listed variables. It is needed to summarize data by group.

Let's take a look at an example of a summary query:

    proc sql;
       create table t2 as
       select state,
              count(acct_no) as accountcount,
              sum(balance) as sumbalance
       from sample
       group by state;
    quit;

In this query, we are rolling up our data by the variable state, which is why it appears in our GROUP BY statement. You can list more than one field in the GROUP BY statement, in which case the finished dataset has one row for each combination of the variables listed. In this example, we used two summary functions, COUNT and SUM. COUNT(acct_no) gives the number of records with a non-missing account number for each value of the GROUP BY variables. The SUM function totals the variable listed, which means it needs to be a numeric field. Here is a list of some commonly used SQL summary functions:

[Table: commonly used SQL summary functions]

There are a few more functions than this list; consult your SAS documentation for the complete list. The results from our sample query look like this:

Obs  State  accountcount  sumbalance
  1  DC               11       42890
  2  DE               17       70082
  3  FL               20      108729
  4  GA               20      140488
  5  NC               49      276666
  6  PA                2       14022
  7  SC               29      174479
  8  VA               52      269084

The new variables ACCOUNTCOUNT and SUMBALANCE are created by the SELECT statement of the query. You can see the results are a table summarized by the variable state. There is a shortcut for the GROUP BY statement that allows you to indicate fields by position instead of listing them. This can be handy if you are using numerous fields in your statement. Here is an example:
    proc sql;
       create table t2 as
       select state,
              count(acct_no) as accountcount,
              sum(balance) as sumbalance
       from sample
       group by 1;
    quit;

The number one in the GROUP BY statement above refers to the first position in the SELECT statement, in this case state. The above query is interchangeable with the first example in this section.

SHORTCUTS
There are a variety of techniques you can use to make your life easier when programming with PROC SQL. One of the most common is using table aliases in your query. An alias is a letter or short name used to reference a table without having to spell out the entire name. This is especially helpful when joining tables, which we will learn later, because it helps keep different fields and tables straight. Let's take a look at an example:

    proc sql;
       select a.acct_no, a.balance, a.state
       from sample a
       where a.state='VA'
       order by a.balance descending;
    quit;

The alias in this example is a, which appears in the FROM statement after the word sample. This tells SAS that anything with the reference a comes from the table sample. To specify which table each field comes from, we put a. in front of every field. Note that character comparisons are case sensitive, so the value in the WHERE statement must be written as 'VA' to match the data. We will build on this later in the paper to better see the benefits of this process.

The next shortcut allows you to dedup your PROC SQL table much like you would with a PROC SORT. This is done using the DISTINCT keyword, which removes duplicate rows from the results. With a single field selected, it returns only the unique values of that field. Here is an example:

    proc sql;
       select distinct a.state
       from sample a;
    quit;

The keyword DISTINCT is added in front of the field state. In our results, instead of having multiple values of the variable state, we end up with a unique list.

Obs  State
  1  DC
  2  DE
  3  FL
  4  GA
  5  NC
  6  PA
  7  SC
  8  VA
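DISTINCT applies to the whole row of selected fields, not only to the first field listed. A minimal sketch, assuming a hypothetical dataset named distinct_demo that holds five rows in the style of the sample data (the two Richmond accounts both appear in this paper's listings):

```sas
/* Hypothetical demo data: two of the five rows share the same state and city */
data distinct_demo;
   input state $ city :$12. acct_no;
   datalines;
VA Richmond 1432535
VA Richmond 1432718
VA Norfolk 1432536
NC Charlotte 1432539
NC Raleigh 1432540
;
run;

proc sql;
   /* One row per unique state/city combination: 4 rows from the 5 above */
   select distinct a.state, a.city
   from distinct_demo a;

   /* COUNT(DISTINCT ...) counts the unique values instead of listing them */
   select count(distinct a.state) as statecount
   from distinct_demo a;
quit;
```

The first query collapses the two Richmond rows into one, and the second returns a single row with statecount equal to 2.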
One of the problems you can run across with PROC SQL is that in order to choose a field you need to list it in the SELECT statement. This can become quite cumbersome if your dataset has a large number of fields and you want to select them all. There is a shortcut for selecting all the fields in a dataset, called select *. Here is an example:

    proc sql;
       select *
       from sample
       order by balance descending;
    quit;

You can see the * replacing the field list in the SELECT statement. This will choose all variables available in the dataset sample. Alternatively, if you are using aliases for your data you can also use the SELECT * methodology:

    proc sql;
       select a.*
       from sample a
       order by balance descending;
    quit;

This technique is often referred to as a.*.

The final shortcut we will discuss is limiting observations in your PROC SQL query. This is very useful if you want a sample of a table you are going to use and believe it is too large to grab in full. It works similarly to the OPTIONS OBS= logic used in SAS, but it applies only to your SQL query and not to all procedures in your active program. There are two options that achieve these results, INOBS= and OUTOBS=. The difference is that INOBS= limits the rows read from the source data before the query processes them, whereas OUTOBS= limits the rows returned as the query is completed. Let's look at OUTOBS= in an example:

    proc sql outobs=10;
       select a.*
       from sample a
       where a.state='VA'
       order by balance descending;
    quit;

Note that you specify the OUTOBS= or INOBS= option on the PROC SQL line of your query. One important note is that OUTOBS= will issue a warning in your log saying:

    WARNING: Statement terminated early due to OUTOBS=10 option.

This falls under the category of a harmless warning, but you should be aware it will occur. The results from our sample query above look like this:
Obs  State  City        Popflag  Balance  Acct_no
  1  VA     Norfolk     Pop4     9786     1432598
  2  VA     Richmond    Pop3     9786     1432718
  3  VA     Richmond    Pop2     9769     1432646
  4  VA     Chantilly   Pop1     9769     1432632
  5  VA     Richmond    Pop2     9769     1432686
  6  VA     Norfolk     Pop1     9427     1432536
  7  VA     Norfolk     Pop2     9427     1432669
  8  VA     Chantilly   Pop3     9427     1432563
  9  VA     Blacksburg  Pop1     9427     1432629
 10  VA     Blacksburg  Pop4     8568     1432596

CONDITIONAL PROGRAMMING
Often you will be looking to group field values into your own classifications. This is similar to the IF-THEN-ELSE syntax from a SAS DATA step. In order to do this in PROC SQL, you first need to learn how to create your own columns. To create your own variable, you list the expression that calculates it, the keyword AS, and the new variable name. Here is an example:

    proc sql;
       create table t2 as
       select acct_no, balance*.6 as newbalance
       from sample;
    quit;

The results of the above query will have two variables, account number and the created field newbalance. Now that we understand how to create a new variable, we'll review the syntax of conditional programming with PROC SQL. The proper syntax to create a conditional variable is called CASE-WHEN-THEN-ELSE. It works very similarly to the IF-THEN-ELSE statement. Let's look at an example:

    proc sql;
       create table t3 as
       select acct_no,
              case
                 when balance < 3000 then 'Low'
                 when balance >= 3000 and balance < 6000 then 'Medium'
                 else 'High'
              end as flag
       from sample;
    quit;

The syntax of the CASE statement differs somewhat from a normal IF-THEN statement, although it works in a similar way. The statement begins with CASE WHEN followed by the conditions. The WHEN conditions are checked in order, and the first one that is true supplies the value. The statement finishes with END, followed by AS and the name of the new variable. Here are the results from that query:

Obs  Acct_no  flag
  1  1432534  Medium
  2  1432535  Low
  3  1432536  High
  4  1432537  Medium
  5  1432538  High
  6  1432539  High
  7  1432540  Low
  8  1432541  Medium
  9  1432542  Low
 10  1432543  High
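Because the WHEN conditions are evaluated in order and the first true one wins, it is worth checking values that sit exactly on a boundary. The sketch below assumes a hypothetical dataset named case_demo with made-up account numbers and balances chosen to hit each edge; the cut-offs of 3000 and 6000 follow the Low, Medium and High grouping used in this section.

```sas
/* Hypothetical boundary cases: account numbers and balances are made up */
data case_demo;
   input acct_no balance;
   datalines;
1 2999
2 3000
3 5999
4 6000
;
run;

proc sql;
   create table case_check as
   select acct_no,
          balance,
          case
             when balance < 3000 then 'Low'
             when balance < 6000 then 'Medium'  /* reached only when balance >= 3000 */
             else 'High'
          end as flag
   from case_demo;
quit;
```

Accounts 1 and 2 fall on either side of the 3000 cut-off (Low, then Medium), and accounts 3 and 4 fall on either side of 6000 (Medium, then High). Relying on the evaluation order lets the second WHEN omit its lower bound.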
The resulting dataset T3 contains the account number and the new variable flag, which was created by the CASE statement.

JOINING TABLES
The most important feature of PROC SQL in a relational database sense is the ability to join tables. A table join links tables and replicates the functionality of a merge in the DATA step. In order to join properly, the first concept that needs to be discussed is the Cartesian product. The Cartesian product happens when you put two tables together without any join condition. PROC SQL will combine every record in one table with every record in the other, producing an extremely large dataset when the source tables are large; two tables of 1,000 records each would yield 1,000,000 records. In most cases, the Cartesian product is not what you want for your results, which is why it is important to learn to join properly. There are different types of joins which we will learn: the inner join, the directional outer joins (right and left) and the full join.

INNER JOIN
The INNER join returns only those records that match in both tables in question. A diagram of the INNER join looks like this:

[Diagram: INNER join]

During this joining section, we will add a small dataset, called SAMPLE2, to illustrate the various techniques. Our extra dataset will look like this:

Obs  Acct_no  product
  1  1432534  product1
  2  1432535  product2
  3  1432536  product3
  4  1432539  product1
  5  1432542  product1

This type of table is common in a relational database. It houses the account number, which is the key, and a product flag. This means that if we want to add the product flag, we need to join to our second table. The first join we will attempt is the INNER join. Let's look at an example:

    proc sql;
       create table t4 as
       select a.*, b.product
       from sample1 a, sample2 b
       where a.acct_no=b.acct_no;
    quit;

The first key feature of joining is that you need to use your table aliases. Without them, PROC SQL cannot tell which table a common field such as account number should come from.
In our example above, it is clear that we want account number to come from table sample1 because we specify it using the alias a. The INNER join itself occurs in the WHERE statement of the PROC SQL. The results look like this:
Obs  State  City        Popflag  Balance  Acct_no  product
  1  VA     Blacksburg  Pop1     5234     1432534  product1
  2  VA     Richmond    Pop1     2353     1432535  product2
  3  VA     Norfolk     Pop1     9427     1432536  product3
  4  NC     Charlotte   Pop1     9769     1432539  product1
  5  NC     Burlington  Pop1     1231     1432542  product1

The results show that the join returns only matching records and adds the field product to the finished table.

The next join we will look at is a directional join. In this paper, we'll look at the LEFT OUTER join, which will return these results:

[Diagram: LEFT OUTER join]

The LEFT OUTER join returns all the information from the first table listed and the matching information from the second table. Let's take a look at the syntax:

    proc sql;
       create table t4 as
       select a.*, b.product
       from sample1 a LEFT JOIN sample2 b
       ON a.acct_no=b.acct_no;
    quit;

The first thing that you will notice is that the syntax changes for the LEFT OUTER join. The FROM statement is split apart, with the term LEFT JOIN in between the two tables. Instead of a WHERE clause, there is an ON statement that contains the join condition. The resulting table will have all records from sample1 and the matches from sample2, like this:

Obs  State  City        Popflag  Balance  Acct_no  product
  1  VA     Blacksburg  Pop1     5234     1432534  product1
  2  VA     Richmond    Pop1     2353     1432535  product2
  3  VA     Norfolk     Pop1     9427     1432536  product3
  4  VA     Chantilly   Pop1     5423     1432537
  5  VA     Roanoke     Pop1     6342     1432538
  6  NC     Charlotte   Pop1     9769     1432539  product1
  7  NC     Raleigh     Pop1     2313     1432540
  8  NC     Asheville   Pop1     3745     1432541
  9  NC     Burlington  Pop1     1231     1432542  product1
 10  NC     Mebane      Pop1     8568     1432543

The final join we will discuss is the FULL OUTER join. It returns all data from both tables, matching records where a match exists:

[Diagram: FULL OUTER join]
The syntax for the FULL OUTER join is very similar to the LEFT JOIN we showed earlier. Let's take a look at the syntax:

    proc sql;
       create table t4 as
       select coalesce(a.acct_no, b.acct_no) as acct_no,
              a.state, a.city, a.popflag, a.balance,
              b.product
       from sample1 a FULL JOIN sample2 b
       ON a.acct_no=b.acct_no;
    quit;

Essentially the only difference from the LEFT JOIN is the change of the word LEFT to FULL. In this example, the results are also the same as for the LEFT JOIN, because every account in sample2 also appears in sample1. When joining tables, you can add the word OUTER to the join to be more specific. The INNER join can also be specified with syntax similar to the above OUTER joins; for this paper it was done with the shortest syntax possible. Note that we use the COALESCE function to ensure that our key is populated properly. The COALESCE function takes the first non-missing value from its arguments. The fields from sample1 are listed individually rather than with a.*, because a.* would bring in a second acct_no column under the same name and the coalesced key would not be the one kept in the new table.

SQL VIEWS
The VIEW in PROC SQL is a useful tool when dealing with complex relational databases. Often there are WHERE statement elements that are standard and need to be repeated to pull data properly. It can be difficult for a new analyst to remember all of the conditions needed to pull the data successfully. The VIEW allows you to embed a query in the background and have the analyst access the VIEW, which sits behind the scenes. That way, standard processes can be preserved behind an easy user interface. Let's learn how to create a VIEW:

    proc sql;
       create view t1 as
       select a.*
       from sample a
       where a.state='VA'
       order by balance descending;
    quit;

The only noticeable difference in syntax is the replacement of the word TABLE with the word VIEW. However, because we are creating a VIEW, no dataset is created; only the query is stored. In the log you will see the phrase:

    NOTE: SQL view WORK.T1 has been defined.

Now we can use the VIEW just as we would use a table in PROC SQL. Let's look at how we would accomplish this:

    proc sql;
       create table t5 as
       select acct_no, state
       from t1 a;
    quit;
The VIEW is referenced in the FROM statement just like any other SQL table. The resulting dataset t5 now holds all the characteristics built into the VIEW that was created, including the limiting conditions in the WHERE statement and the sort order from the ORDER BY statement. Let's take a look at the results:

Obs  Acct_no  State
  1  1432598  VA
  2  1432718  VA
  3  1432646  VA
  4  1432632  VA
  5  1432686  VA
  6  1432536  VA
  7  1432669  VA
  8  1432563  VA
  9  1432629  VA
 10  1432596  VA

CONCLUSIONS
PROC SQL is a useful tool that can increase programming efficiency and gives an alternative to some DATA step functionality. It can also serve as a gateway to SAS for SQL users coming from other programs. It is a valuable tool for all SAS programmers to be familiar with.

REFERENCES
Ian Whitlock. "PROC SQL: Is It a Required Tool for Good SAS Programming?" Proceedings of SAS User Group International Conference, Paper 60-26, Long Beach, CA, USA, 2001.
Katie Minten Ronk, et al. "An Introduction to PROC SQL." Proceedings of SAS User Group International Conference, Paper 191-27, Orlando, FL, USA, 2002.
Weiming Hu. "Top 10 Reasons to Use PROC SQL." Proceedings of SAS User Group International Conference, Paper 042-29, Montreal, Quebec, Canada, 2004.

ACKNOWLEDGMENTS
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:

Matt Taylor
Carolina Analytical Consulting, LLC
8511 Davis Lake Parkway Ste #C6-285
Charlotte, NC 28269
704-947-8882
matt.taylor@cacanalytics.com
www.cacanalytics.com