SQL Query Performance Tuning: Tips and Best Practices

Pravasini Priyanka, Principal Test Engineer, Progress Software

INTRODUCTION:
In the present-day world, where dozens of complex queries are run on databases holding gigabytes of data, a single inefficient query can have a severe impact on application performance. Testing applications in-house is very different from running them in a real production environment: in-house test data is relatively small, so it may not expose an underlying performance issue right away. That is when a basic knowledge of query execution plans and a few simple tips for tuning SQL queries come in handy.

Understanding and interpreting query execution plans is important not only for database administrators but for any engineer dealing with SQL queries, testers included. A quick look at the query plan can help identify whether a SQL statement is poorly written, and one can then decide how to improve its performance.

Since the underlying design concepts of most relational databases are the same, this paper is not specific to any particular vendor; however, it uses SQL Server examples in a few cases. The paper touches upon the basics of query plans, how to read and inspect a simple query plan and why that matters, and then some commonly followed tips and tricks for tuning the performance of a SQL query. The target audience should have a general understanding of databases and SQL queries.

QUERY EXECUTION PLAN:
A Query Execution Plan, or Execution Plan, is a blueprint or map of what goes on behind the scenes to execute a SQL query and fetch the result set. In other words, it is the series of steps that the Query Optimizer selects, out of several possible plans, as the most optimal way to fetch the data.
It provides a tree view of how the query optimizer executes the query: how it reads the data (scanning the whole table or using an index and, if an index is picked, which one), whether it performs aggregation, sorting or joins on tables (and which types of joins), and more. It also estimates the cost of all of these operations, taking into account the statistical information available, and finally, considering all the factors, selects the most optimal plan.

QUERY PROCESSING:
When a SQL query is submitted for execution, it broadly goes through the following phases:
Parsing and Translation: Checks the query for syntax errors and semantics. The query is then translated into an internal form recognizable by the Optimizer.
Optimizer: Looks at indexes, statistics information etc., does cost measurements and evaluates the optimal plan out of several possible execution plans.
Evaluation: Once the optimal plan is identified, the query-execution engine takes that query-evaluation plan, executes it, and returns the answers to the query.

CACHING EXECUTION PLANS:
When a new query is executed, the Optimizer evaluates the query plan, optimizes and compiles it, and stores it in the plan cache. When a query is executed, the Query Optimizer first checks the plan cache for a query plan that can be reused, which makes execution faster. If there is no query plan that can be reused, a new one has to be created, which takes time and therefore makes the query execution last longer. If the underlying tables, indexes or statistics change between executions, the execution plan is recompiled before being reused.

GENERATING EXECUTION PLANS:
Most databases provide SQL statements or tools to generate the execution plan that the optimizer creates for any SELECT statement. This plan is very useful in fine-tuning SQL queries. However, each database has its own tools or syntax, with multiple options, to generate execution plans. Following are a few of the ways in which one can generate plans in different databases:
Oracle - EXPLAIN PLAN FOR <SQL query>
SQL Server - SET SHOWPLAN_ALL ON <SQL query>
MySQL - EXPLAIN <SQL query>

UNDERSTANDING PLANS: HOW IT HELPS
Case: Say a customer reports a huge performance hit with one or more SQL queries in its production environment. There might be very limited or no access to the customer environment, and it is nearly impossible to simulate it either. The logs also provide little help in pinpointing which particular SQL query is performing badly.

In such cases, query plans are one of the most preferred places to begin. Knowledge of query plans often helps engineers determine whether the SQL query could be further optimized, either by 1) re-writing the query differently so that it possibly produces a different plan, or 2) forcing the Optimizer to use a different set of indexes (using hints). Knowing how the query optimizer works for your particular database can be a plus and help you tune SQL queries.

In-house, when new enhancements or SQL optimizations are added to databases, this knowledge can be used by QA during testing to find out whether there are areas with a possibility of additional optimization, or whether the new optimization has negatively impacted the performance of any other query.

It is also important to keep in mind that understanding complex query plans is not easy. It requires a lot of expertise and a thorough understanding of internals.

INSPECTING A QUERY PLAN:
If you are using SQL Server, you can use the Query Analyzer tool to display the execution plan graphically, instead of using SET SHOWPLAN_TEXT ON. A few simple SQL Server examples are depicted below:

1. Table Scan: In the absence of a proper index, each row in the table is read one by one. For large tables this is hugely time consuming and causes a large performance overhead.

2. Index Scan: An index is a data structure associated with a table or view, built on one or more columns, that speeds up retrieval of rows from the table or view. Indexes undergo constant change with the changing data.

   1. Clustered Index:
      SELECT con.contacttypeid FROM Employee.ContactType AS con WHERE con.ContactTypeId = 10;

   2. Non-Clustered Index:
      SELECT con.contacttypeid FROM Employee.ContactType AS con WHERE Name LIKE 'Own%';

Note: Since an index scan is faster, if indexes are available that can be used for a given query, the optimizer will perform an index scan or index seek; otherwise it performs a table scan.

Other common artefacts or components of an Execution Plan are [1]:
3. Sort - produced by ORDER BY.
4. Compute Scalar - produced by scalar functions or computed columns.
5. Stream Aggregate, Hash Match - produced by aggregate functions and GROUP BY.
6. Filter - restricts or filters the data (WHERE condition).
7. Table Joins - Nested Loops, Hash Match and Merge Joins.
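The difference between a table scan and an index lookup can be observed directly. The following is a minimal sketch, assuming SQLite through Python's standard-library sqlite3 module, chosen only so the example is self-contained. SQLite's EXPLAIN QUERY PLAN statement plays the role of the vendor commands listed earlier, and its wording (SCAN, SEARCH ... USING INDEX) differs from SQL Server's operator names. The index name ix_contacttype_name and the helper plan_for are hypothetical names introduced for this illustration.

```python
import sqlite3


def plan_for(conn, sql):
    """Return the 'detail' text of each step in SQLite's plan for sql."""
    # Every EXPLAIN QUERY PLAN row has four columns; the last is the description.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ContactType (ContactTypeId INTEGER, Name TEXT)")
conn.executemany(
    "INSERT INTO ContactType VALUES (?, ?)",
    [(i, "Type %d" % i) for i in range(1000)],
)

query = "SELECT ContactTypeId FROM ContactType WHERE Name = 'Type 10'"

# No index on Name yet, so every row has to be read (a table scan).
plan_without_index = plan_for(conn, query)

# Hypothetical index for this sketch; matching rows can now be located through it.
conn.execute("CREATE INDEX ix_contacttype_name ON ContactType (Name)")
plan_with_index = plan_for(conn, query)

print(plan_without_index)
print(plan_with_index)
```

Running the same statement before and after creating the index is a simple way to confirm that an index is actually picked up, rather than assuming it is.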

Query plans are also generated for DML operations: Insert, Update and Delete. More complex query plans arise from stored procedures, views, sub-queries or nested queries and complex table expressions; these require advanced learning and are outside the purview of this paper.

SQL QUERY TUNING:
Tweaking a SQL query, or re-writing it with alternative syntax so that it produces the same output but with a better query plan and hence improved overall performance, is called SQL Query Tuning. Before we move to the case-by-case descriptions, here are some general considerations:

Understand the application and data - It is the inherent job of QA to know the application well: how it works, what data it uses, how often it uses that data, etc. The same knowledge is needed for fine-tuning SQL queries. Before we even start tuning SQL statements, it is of utmost importance to know the application well, how it uses and manages the data, and its business use cases.

Test data should be as realistic as possible - When the application is tested in-house, it is important to have data as close as possible to real data, both in volume and in type, in order to find issues that could eventually pop up in production environments.

QUERY TUNING: TIPS AND BEST PRACTICES:
SQL query tuning is a wide area with a lot of factors to take into consideration. Below are some general guidelines to keep in mind:

1. Effective Use of Indexes:
Indexing is an effective way to tune your SQL database that is often neglected during development. One of the important aspects of SQL tuning is how efficiently the indexes are used. Indexes are most useful when the required data is fetched with fewer IO operations and hence less resource usage.

Indexes should also have high selectivity. The lower the ratio of matching rows to total rows, the better the selectivity. A unique index, for instance, has the highest selectivity because it has exactly one matching entry, and hence is the most efficient.

Create multi-component indexes on the combination of the most selective columns. For example, if an index is created on (FirstName, LastName), queries like:
    -- Where FirstName = 'Alex' And LastName = 'Brown'
are likely to benefit. Queries that filter on only one of the columns gain much less. A condition on the leading column alone can typically still use the index but matches more rows, while a condition on the trailing column alone generally cannot use it for lookups:
    -- Where FirstName = 'Alex'
    -- Where LastName = 'Brown'
The same goes for indexed columns in Order By, Group By and Distinct clauses. Queries that benefit are:
    -- Order By/Group By/Distinct FirstName, LastName
But there is typically no benefit when the trailing column is used alone or the column order is reversed:
    -- Order By/Group By/Distinct LastName
    -- Order By/Group By/Distinct LastName, FirstName

Create the right indexes - Indexes are meant to make queries faster, but at the same time they are costly to store and maintain. Hence it is important to create the right kind of indexes, neither too few nor too many.

Create indexes on primary key columns, as they are frequently used in Joins, Where and Order By clauses. They are also referenced by foreign key constraints, which again makes joins on these tables faster.

2. Arithmetic expressions and Scalar functions:
Other than the above, there are a few more situations where an index, though present on a column, may not be picked. For example, an index scan is not done if indexed columns are used in any of the following:
Computed fields
Arithmetic operations or expressions: Where EmpId + 10 < 1000
Concatenation operators on indexed columns: Where FirstName || LastName = 'Allan Brown'
Scalar functions: Concat(FirstName, LastName), Year(JoiningDate) etc.

3. Index Scan or Full Table Scan:
SQL queries doing a table scan are among the most common candidates for tuning. Looking at the indexes defined, one can determine whether there is a way to re-write the query using a condition or predicate such that an index scan is picked instead of a table scan, especially for large tables. As a rule of thumb, an index scan yields better performance than a full table scan, with some exceptions:

Avoid indexing small tables, as this saves the cost of loading and processing index pages. A table scan may be as efficient as an index scan here.

An index scan is faster only if the data fetched through the index is a very small part of the total rows; otherwise a full table scan might be the preferred option. An index scan causes multiple reads per row accessed, whereas a full table scan can read all rows contained in a block in a single logical read operation.

4. Optimizer Hints:
Though the SQL optimizer is smart enough to pick the most optimal plan or index for executing a query, there could be situations where it does not do so, for various reasons. In such cases, if we have found an alternative index that we think could yield better performance than the one picked by the Optimizer, we can override the Optimizer's decision by using hints. Hints are like guidelines for the Optimizer, directing it to pick your specified index instead of the one it chooses on evaluation.

This can be useful when troubleshooting a badly performing query. You can manually try out different query plans using hints to check whether there is a better performing index, and the result may then help figure out why the optimizer is not using that plan.

Each database has its own syntax for specifying hints and its own types of hints, which are more or less similar. For example, SQL Server has the following types of hints: Query, Join and Table hints.

5. Updating Statistics from time to time:
Statistics are information about tables and indexes and the distribution of the data in them. The statistics collected for different tables are crucial to the SQL optimizer, which relies heavily on this information to decide the least expensive path that satisfies a query. Obsolete or missing statistics hamper the optimizer's cost estimation, so it is likely to pick a less optimal path and degrade overall performance. Hence, it is important to keep the statistics updated.

6. Select Selectively, Select Smartly:
One of the most common habits when writing a Select query is to use Select * instead of listing the columns individually, because it is quick and easy to write, even though only a smaller set of columns is actually needed, and without realizing the performance hit it takes.

Case: Consider a table Employee with a large number of columns: EmpId, FirstName, LastName, DeptId, JoiningDate, Address, City, State, Pincode, PhoneNo1, PhoneNo2, EmailId. Indexes have been defined on EmpId, DeptId and JoiningDate, and the table has millions of rows.

Performance consideration: There are two issues when we run just Select * from Employee:
1) When fetching all columns and all rows with a Select *, the whole table has to be loaded from disk to process the query. We are processing more data than we need, resulting in a huge IO cost and hence a massive waste of resources.
2) Without a Where condition or Order By, we are also preventing an index scan, so the Optimizer goes for a table scan.

Even if we select fewer columns, e.g. Select FirstName, LastName from Employee, with no indexed column in the Select list, the Optimizer will still go for a table scan.

Find out exactly which columns your application needs and list those columns individually in the Select query. Remember the mantra: the fewer columns you ask for, the less data must be loaded from disk when processing your query, and hence the faster the performance [2].
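One way to see the effect of asking for fewer columns is to compare plans. The sketch below is a minimal illustration, assuming SQLite through Python's sqlite3 module rather than the databases named in this paper. The Employee table is cut down to a few of the columns from the case above, and the index name ix_employee_deptid and the helper plan_for are hypothetical. When only columns stored in the index are requested, SQLite reports a covering index, meaning the table rows themselves need not be read; with Select * it has to visit the table for the remaining columns.

```python
import sqlite3


def plan_for(conn, sql):
    """Return the 'detail' text of each step in SQLite's plan for sql."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee ("
    " EmpId INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT,"
    " DeptId INTEGER, Address TEXT, City TEXT)"
)
# Hypothetical index name; in SQLite the index also stores the row id (EmpId).
conn.execute("CREATE INDEX ix_employee_deptid ON Employee (DeptId)")

# All columns requested: the index finds the rows, then the table is read.
star_plan = plan_for(conn, "SELECT * FROM Employee WHERE DeptId = 10")

# Only columns available in the index: the table itself is not touched.
narrow_plan = plan_for(
    conn, "SELECT EmpId, DeptId FROM Employee WHERE DeptId = 10"
)

print(star_plan)
print(narrow_plan)
```

Other engines expose the same idea under different names (for example, included columns or index-only scans), so the exact plan text will differ.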

When you pick selected columns and one or more of them is indexed, specifying the indexed columns in Order By may pick up an index scan instead of a table scan. For example:
    Select EmpId, FirstName, DeptId, JoiningDate from Employee Order By EmpId;

Similarly, check whether a Where condition on indexed columns fits your needs. If so, this will pick an index scan instead of a table scan.

For smaller tables, where there is too little data, there may not be any significant performance difference.

7. Union vs. Union All:
Wondering whether to use Union or Union All, and which one is better? In simple terms, both UNION and UNION ALL concatenate the results of two different SQL queries. They differ in the way they handle duplicates.

Case: Consider that Item.Item_Id has (1,2,3,4,5) and Sales.Item_Id has (1,1,2,2,3).
    -- Query with Union All:
    Select Item_Id from Item UNION ALL Select Item_Id from Sales
    produces (1,2,3,4,5,1,1,2,2,3)
    -- Query with Union:
    Select Item_Id from Item UNION Select Item_Id from Sales
    produces (1,2,3,4,5)

Performance consideration: Union All appends the rows from the two queries without eliminating duplicates. Union, on the other hand, does not just append the two sets of rows; it also does a Distinct Sort, which appends the rows and removes the duplicates.

As a rule of thumb, Union All is more performant than Union because it does not do the additional work of eliminating duplicates from the result set, whereas Union performs the expensive Distinct Sort operation. Use UNION ALL instead of UNION when you specifically do want to include the duplicate rows.

Another common rule is that UNION performs better than OR in a WHERE clause. However, this is not always true, especially when there are multiple OR conditions. Writing too many UNIONs instead of ORs may also make the query less readable.

8. Exists or IN: Which Is Preferred?
At first glance, the EXISTS and IN clauses look fairly similar in that they both use a subquery to evaluate rows, but they do it in slightly different ways.

Case: Consider two tables, Item and Sales, each with 1000 rows. Display the ItemId, ItemName and Price of items of which a quantity of more than 50 has been sold.
    -- Query with IN:
    Select i.item_id, i.item_name, i.price from Item i
    Where i.item_id IN (Select s.item_id from Sales s Where s.qty > 50);
    -- Query with Exists:
    Select i.item_id, i.item_name, i.price from Item i
    Where Exists (Select * from Sales s Where s.qty > 50 And i.item_id = s.item_id);

Performance consideration: IN, when used with a subquery, first processes the entire subquery and then processes the overall query as a whole, matching up rows based on the relationship specified for the IN.

EXISTS, on the other hand, does not look for a matching value. It just checks for the existence of a single row and returns true or false. The moment the condition is true, EXISTS bails out of the subquery. Without a correlation predicate, a non-empty subquery would make the condition true for every row, so all rows from the outer table would be returned. Hence, to make EXISTS do the equivalent of IN, there must be a correlation predicate within the subquery, i.item_id = s.item_id, which may or may not be good, depending on the amount of data.

For large tables, EXISTS will yield better performance than IN when there are lots of matches for the subquery. So if the subquery has multiple child records for most of the outer records, EXISTS is most likely to be faster.

If the amount of data returned is small, EXISTS and IN are likely to perform similarly, without a significant performance difference. EXISTS will most likely not perform worse than IN, so it is a safe bet.

Extending the discussion to NOT EXISTS vs. NOT IN: unlike EXISTS and IN, NOT EXISTS and NOT IN are not equivalent in all cases, especially when NULLs are involved. They return different results in such cases.
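The NULL behavior is easy to reproduce. The following is a minimal sketch, assuming SQLite through Python's sqlite3 module and a cut-down version of the Item and Sales tables with made-up rows. Because Sales.item_id contains a NULL, the NOT IN comparison can never be proven true for any item, so it yields no rows at all, while NOT EXISTS returns the items that have no sales.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Item (item_id INTEGER);
    CREATE TABLE Sales (item_id INTEGER);
    INSERT INTO Item VALUES (1), (2), (3), (4), (5);
    INSERT INTO Sales VALUES (1), (2), (3), (NULL);
    """
)

# item_id NOT IN (1, 2, 3, NULL) evaluates to unknown for items 4 and 5.
not_in_rows = conn.execute(
    "SELECT item_id FROM Item "
    "WHERE item_id NOT IN (SELECT item_id FROM Sales) "
    "ORDER BY item_id"
).fetchall()

# NOT EXISTS only asks whether a matching Sales row is present.
not_exists_rows = conn.execute(
    "SELECT i.item_id FROM Item i "
    "WHERE NOT EXISTS (SELECT 1 FROM Sales s WHERE s.item_id = i.item_id) "
    "ORDER BY i.item_id"
).fetchall()

print(not_in_rows)      # empty list: the NULL suppresses every row
print(not_exists_rows)  # items 4 and 5
```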

For non-nullable columns, the behavior and performance of NOT IN and NOT EXISTS are similar, so use whichever one works better for the specific situation.

9. Sub-query or Joins:
Case: Display all the items of which a quantity of more than 50 has been sold. This can be achieved with the following queries:
    -- Solution 1: Subquery with IN
    SELECT Item_Id, Item_Name FROM Item
    WHERE Item_Id IN (SELECT Item_Id FROM Sales WHERE qty > 50)
    -- Solution 2: Subquery with Exists
    SELECT it.Item_Id, it.Item_Name FROM Item it
    WHERE EXISTS (SELECT sl.item_id FROM Sales sl WHERE sl.qty > 50 AND it.item_id = sl.item_id)
    -- Solution 3: Equivalent query with Join
    -- (returns an item once per qualifying Sales row; add DISTINCT if each item should appear only once)
    SELECT it.item_id, it.item_name FROM Item it, Sales sl
    WHERE sl.qty > 50 AND it.item_id = sl.item_id

Performance consideration: As explained above, Exists checks whether the subquery returns rows or not; in short, a true or false. Based on this true or false, the outer query either returns data or an empty result set. Exists is essentially a semi-join.

Joins, on the other hand, return a result set by combining data from two or more tables based on one or more conditions. If the join condition uses indexed columns, the join is likely to be faster.

One of the general tuning practices is to re-write a subquery as a join for better performance. As a rule of thumb, joins perform better than subqueries. For a large table, Solution 1 (subquery with IN) will be slower than Solution 2 (Exists) for the reasons explained above. Similarly, Solution 3 (join) will usually be faster than Solution 1 and most likely faster than Solution 2 as well.

Not every sub-query can be converted to a join. Sometimes the Optimizer itself replaces the sub-query with a join, but not always, and even if it does, it might not pick the best possible plan. There is also the possibility that, in certain cases, subqueries perform better than joins. It depends on various factors, such as how much data the subquery returns, how much data has to be joined, and whether indexed columns are used in the join condition.

10. Making Joins efficient:
For regular joins between two or more tables, performance can be optimized if there are indexed columns on both sides of the join, preferably of a numeric data type. It is also best to avoid columns in the join condition that have too many duplicates or very few unique values. In short, primary indexes are good candidates for the join condition.

Eliminate the extra columns that are not required in the join.

Look at the underlying type of join being selected (Nested Loop Join, Merge Join or Hash Join) and at the join order. Depending on whether the data is small or large and on the presence of indexes, check whether the optimal join is being picked. For example, a Nested Loop is similar to a foreach loop and is suitable when either or both tables are small. It gets very expensive if the data to be joined is too large. A Merge Join works by comparing two sets of data and performing a join on matched rows, such that data is read only once from each set. However, it requires the joined columns to be sorted, and hence can be more effective for larger sets of rows.

11. Do away with Correlated queries:
A correlated subquery is a query that has a reference to the outer, or parent, query.

Case: Fetch the department names of all employees. New SQL users are often caught structuring their queries in this way because it is usually the easy route.
    -- Correlated subquery:
    SELECT e.empid, e.firstname, e.lastname, e.deptid,
        (SELECT DeptName FROM Department d WHERE e.deptid = d.deptid) AS DeptmtName
    FROM Employee e

Performance consideration: A correlated query essentially runs row by row, once for each row returned by the outer query, and thus impacts SQL query performance.
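Some engines make this per-row evaluation visible in the plan. The sketch below is a minimal illustration, assuming SQLite through Python's sqlite3 module, with cut-down Employee and Department tables and made-up rows. It runs the correlated form of the query and collects the plan steps; in SQLite the subquery is reported as a separate correlated step, which is the part evaluated for each outer row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Department (deptid INTEGER PRIMARY KEY, deptname TEXT);
    CREATE TABLE Employee (
        empid INTEGER PRIMARY KEY, firstname TEXT, deptid INTEGER
    );
    INSERT INTO Department VALUES (1, 'Sales'), (2, 'QA');
    INSERT INTO Employee VALUES (10, 'Ann', 1), (11, 'Bob', 2), (12, 'Cy', NULL);
    """
)

correlated_sql = (
    "SELECT e.empid, e.firstname, "
    "(SELECT d.deptname FROM Department d WHERE d.deptid = e.deptid) AS deptname "
    "FROM Employee e ORDER BY e.empid"
)

# One result row per employee; an employee without a department gets None.
rows = conn.execute(correlated_sql).fetchall()

# The 'detail' column of each plan step, including the subquery step.
plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + correlated_sql)]

print(rows)
for step in plan:
    print(step)
```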

Tuning tip: A more efficient technique is to replace a correlated sub-query with a join. Thus, the above solution can be re-written as:
    SELECT e.empid, e.firstname, e.lastname, e.deptid, d.deptname
    FROM Employee e LEFT OUTER JOIN Department d ON e.deptid = d.deptid
Here we use a left outer join because we want to return all records from the Employee table.

12. Count or Exists:
It is common practice to use COUNT to check for the existence of a record without considering the performance factor. When the table is small, using COUNT for this purpose does not impact performance significantly. However, things change when the data is huge.

Case: Return true if there is an Employee with DeptId = 1000.
    -- With Count(*)
    IF (SELECT COUNT(*) FROM Employee WHERE DeptId = 1000) > 0
        PRINT 'True'
    -- With Exists
    IF EXISTS (SELECT * FROM Employee WHERE DeptId = 1000)
        PRINT 'True'

Performance consideration: In the above case, all we want to check is whether there is an Employee record with DeptId = 1000 or not. Hence EXISTS is the better option, because as soon as it gets the first record the condition is true and it exits. COUNT, however, has to find and count all entries matching the condition (scanning the entire table if no suitable index exists) before the comparison can return true. Hence, when the data is large, performance is heavily impacted.

Use Exists instead of Count to check for the presence of a record. Count should be used for retrieving the number of records, not for checking their presence.

13. Where and Having Clause:
The Where and Having clauses perform filter operations in a Select query. Both give the same results when used with Group By and the condition is on a grouping column. Similarly, the ON clause in joins is also a filter operation used to restrict data.
    -- Where with Group By
    SELECT City, COUNT(*) FROM Customer
    WHERE City IN ('London','New York')
    GROUP BY City
    -- Group By with Having
    SELECT City, COUNT(*) FROM Customer
    GROUP BY City
    HAVING City IN ('London','New York')

Performance consideration: The order in which these clauses are evaluated is 1) ON in joins, 2) Where and lastly 3) Having. The earlier a filter is applied in query processing, the more the data is restricted and the better the performance.

The query with Where first eliminates rows as per the condition and then does the grouping. The query with Having first groups and then filters the data. So the former has less data to group than the latter, resulting in better performance.

Tuning tip: The primary filtering criterion, the one that eliminates the most rows, should appear as early as possible in the query. In the case of joins, the condition that filters the most rows should appear in the ON condition. It is even better if these columns are indexed or unique. Similarly, in regular queries not involving joins, the primary restricting condition should appear in the Where clause, followed by the Having clause.

14. Derived tables or Inline Views instead of complex subqueries:
There is some debate on whether derived tables or inline views perform better than complex subqueries. An inline view or derived table is a table created on the fly using a SELECT statement, and it can be referenced like a regular table [3]. Apart from performance, they certainly improve readability when compared to complex subqueries.

Derived tables are the best candidates for queries that require multi-level aggregations or joins on aggregates, in place of complex subqueries or correlated subqueries.
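A join on an aggregate can be sketched as follows. This is a minimal illustration, assuming SQLite through Python's sqlite3 module. The salary column, the sample rows and the derived-table alias avg_by_dept are hypothetical and are not part of the Employee table described earlier. The derived table computes one average per department in a single pass, and the outer query joins each employee to that aggregate to find those paid above their department's average.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Employee (
        empid INTEGER PRIMARY KEY, deptid INTEGER, salary INTEGER
    );
    INSERT INTO Employee VALUES
        (1, 10, 100), (2, 10, 300), (3, 20, 500), (4, 20, 700);
    """
)

rows = conn.execute(
    """
    SELECT e.empid, e.salary, avg_by_dept.avg_salary
    FROM Employee e
    JOIN (
        -- Derived table: one row per department with its average salary.
        SELECT deptid, AVG(salary) AS avg_salary
        FROM Employee
        GROUP BY deptid
    ) AS avg_by_dept ON avg_by_dept.deptid = e.deptid
    WHERE e.salary > avg_by_dept.avg_salary
    ORDER BY e.empid
    """
).fetchall()

print(rows)
```

The same result could be written with a correlated subquery that recomputes the average for every employee row; the derived table states the aggregation once and keeps the outer query readable.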

Case: Display, for each employee, the total bonus awarded to the employee's department in the year 2009. This can be achieved using a derived table:
    SELECT e.empid, e.firstname, e.lastname, m.TotalBonus
    FROM Employee e
    LEFT OUTER JOIN (
        SELECT b.DeptId, SUM(b.Bonus) AS TotalBonus
        FROM Bonuses b
        WHERE YEAR(b.Award_date) = 2009
        GROUP BY b.DeptId
    ) AS m ON e.deptid = m.DeptId

15. Denormalization:
Sometimes table denormalization, or storing duplicate data in another table, may provide faster access to the data, but it also requires more update time and the maintenance of additional objects.

16. Eliminate extra Sorts (Order By):
Sorts usually come at the end of query execution, and sort time is directly dependent on the size of the data to be sorted. Check whether there is a possibility of eliminating duplicate or extra sorts.

17. Parameterized Queries:
If your application has a large number of queries that deal with constants, performance can be improved by using parameterized queries. The performance is better because the query is compiled once and then re-used by executing the compiled plan multiple times.

For example, if we want to return employees from different departments:
    Select EmpId, FirstName, LastName from Employee Where DeptId = ?

The key here is to retain the Command object programmatically. If it is destroyed, the plan has to be compiled again. If there are several parameterized queries running concurrently and we can retain the Command objects, each caching the execution plan for one parameterized query, the re-compilations can be effectively avoided.

CONCLUSION
Database tuning can be an incredibly difficult task, particularly when working with large-scale data. A large number of database performance problems arise from bad SQL. The key to tuning often comes down to how effectively you can tune those single problem queries.

Not only do database administrators and SQL developers need to learn these tuning tips; the tips also help testers and quality engineers identify whether the developer has done everything right or whether some key points are missing. After all, the more a tester knows, the better the chances of identifying defects, and hence the better the quality.

REFERENCES
[1] Grant Fritchey, SQL Server Execution Plans: https://www.simpletalk.com/redgatebooks/grantfritchey/ebook_sqlserverexecutionplans_2ed_g_fritchey.pdf
[2] It is not about Select star: http://use-the-index-luke.com/blog/2013-08/its-not-about-thestar-stupid
[3] http://www.sqlteam.com/article/using-derived-tables-tocalculate-aggregate-values
[4] Collection of several articles on Performance Tuning Tips: http://www.mssqltips.com/sql-server-tipcategory/9/performance-tuning/
[5] http://beginner-sql-tutorial.com/sql-query-tuning.htm
[6] Query Plan Caching: http://www.cubrid.org/wiki_tutorials/entry/performanceimplications-of-shared-query-plan-caching
[7] SQL Database Performance tuning for Developers: http://www.toptal.com/sql/sql-database-tuning-for-developers
[8] http://geekexplains.blogspot.in/2008/06/how-to-tune-sqlqueries-for-better.html