Deploying and Optimizing SQL Server for Virtual Machines
Much has been written over the years about best practices for deploying Microsoft SQL Server (or any database server, for that matter). Surprisingly little, though, has been written about deployment on virtual machines. To be fair, it's only in recent years that Microsoft has even supported the idea of SQL Server on a virtual machine, and perhaps only with the release of Hyper-V 2012 is it a practical reality. For the most part the guidance is fundamentally identical and, sadly, just as universally ignored. However, there are a few considerations unique to the virtual environment that should also be discussed.

In this white paper we're going to dig into the four critical resource areas that impact database performance and reliability in a virtual machine:

- CPU
- Memory
- Network
- Disk (Storage)

CPU

The CPU is a resource that experiences a notable amount of fluctuation. On a database server dedicated to a single application, the CPU may see only occasional spikes to full utilization, such as when a particularly intense query executes or a database maintenance task runs (a table re-index, for example). As hardware becomes more advanced, we have more options for managing CPU resources. Ten years ago the only real choice was to buy the fastest CPU available and hope you didn't run out of capacity. Today's advanced platforms allow us to create physical systems with dozens of processor cores and allocate virtual CPUs to a machine as needed.
CPU Utilization on the Virtual Machine

Standard CPU performance metrics apply to the virtual machine. We want to maximize the utilization of the CPU resources assigned to the virtual machine without reaching sustained high-utilization values. You can quick-check CPU utilization with Task Manager, or do a more in-depth analysis with Performance Monitor. You'll want to look at these five metrics:

- Processor: % Processor Time
- Processor: % Privileged Time
- Processor: % User Time
- System: Context Switches/sec
- System: Processor Queue Length

Watch for sustained values of % Processor Time greater than 80%, a processor queue length greater than two per core, or an unusually high rate of context switches per second. Comparing these values against a baseline is also important.

Fixing these issues can be relatively simple in a virtual environment: it may only require adding vCPUs to the virtual machine. However, the easy availability of those resources also carries a disadvantage that is a manifestation of virtualization itself: placing too many machines on a host can cause the database server to become CPU-starved because the host's CPU resources are oversubscribed. This is a risk unique to a virtualized database server.

CPU Utilization on the Host

When the host's CPU resources are oversubscribed, the result is excessive task switching in the host, which manifests as latency in the virtual machines. In effect, the performance of the database server is slowed down not by the database server itself, or by its own virtual machine, but by the combination of virtual machines and resource load that exist on the host.
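The guest-side counters above can also be cross-checked from inside SQL Server itself. Each SQL Server scheduler maps to one vCPU, and a sustained nonzero runnable task count mirrors a Processor Queue Length backlog in Performance Monitor. A sketch, assuming SQL Server 2005 or later (where sys.dm_os_schedulers is available):

```sql
-- Guest-side cross-check of CPU pressure: a sustained nonzero
-- runnable_tasks_count on a scheduler means tasks are waiting
-- for CPU, the DMV equivalent of a processor queue backlog.
SELECT scheduler_id,
       current_tasks_count,
       runnable_tasks_count
FROM   sys.dm_os_schedulers
WHERE  scheduler_id < 255;  -- user (visible) schedulers only
```

Because the DMV is populated from inside the guest, it cannot distinguish guest CPU pressure from host oversubscription; that is why the host-level counters discussed next still matter.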
There are two performance parameters that should be monitored on the host to ensure this is not happening:

- CPU load average
- Percentage of workload per core

In VMware vSphere you can obtain these values using the esxtop command. In Microsoft Hyper-V you can use Windows Performance Monitor from the host system's console or via a remote connection.

The CPU load average should be less than 1.0. A value of 1.0 indicates that the host's CPU resources are fully utilized; a value greater than 1.0 indicates that the host needs more physical CPU resources. The percentage of workload per core should be below 75%; regular values greater than 75% indicate oversubscription of the cores.

In both cases the addition of CPU resources to the host is indicated; in most cases, however, the practical solution will be to migrate one or more virtual machines to another host.

Memory

Memory utilization is the single most misunderstood aspect of Microsoft SQL Server. I continually hear system administrators complaining about high memory utilization by the sqlservr.exe process. This is absolutely normal! Here are two facts about SQL Server that you should remember:

- The sqlservr.exe process will consume as much memory as is available on the machine. More memory for SQL Server is a good thing!
- The sqlservr.exe process will give memory back to the OS and applications when it is requested, sometimes even to the detriment of the database service itself.

On a physical installation of SQL Server this is rarely an issue, because a database server is almost always dedicated and nothing else needs that memory. On a virtual machine, however, there are many additional variables that affect memory management, so we do need to be more aware of memory consumption and memory management within this realm.
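One pair of knobs we will meet shortly is SQL Server's own memory bounds, which are set with sp_configure. A minimal sketch, assuming a virtual machine with 16 GB (16384 MB) assigned; the figures are illustrative, so size them to your own VM:

```sql
-- Illustrative values for a VM with 16 GB assigned: a 4 GB floor
-- and a cap at roughly 90% of the VM's memory (14745 MB).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

EXEC sp_configure 'min server memory (MB)', 4096;
EXEC sp_configure 'max server memory (MB)', 14745;
RECONFIGURE;
```

Both settings take effect without a service restart, although SQL Server releases memory down to a lowered cap only gradually.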
As a starting point, though, you can never give too much memory to an instance of SQL Server, so give it all you can afford to allocate from the host. You'll probably need more even after that.

Do Not Disable Memory Balloon Drivers

The first thing to be aware of is that on VMware vSphere it is possible to oversubscribe the amount of physical memory installed in the host. This can be a good thing for SQL Server, because it can borrow memory from other virtual machines when necessary. This is handled through the memory balloon driver in vSphere.

Considering that SQL Server is memory hungry, you might be inclined to disable the memory balloon driver to keep memory from being taken back from the virtual machine hosting SQL Server. Don't do that. The negative impact is actually worse than the temporary return of memory to the host. If the host becomes memory-starved, it will start paging its own memory to disk. And what is sitting in those memory pages being swapped out? The memory pages of virtual machines, possibly including the buffer cache of your SQL Server instance!

A better way to handle this concern in a virtual machine is to configure the SQL Server min memory and max memory values. Typically, on a physical server, the guidance has always been to leave these at the defaults (min=0; max=max), because a dedicated server shouldn't have any memory contention from other processes. On a virtual machine, though, some protection from the memory balloon driver's impact can be useful. Determine the absolute minimum amount of memory needed to maintain operation of the database server and set min memory to that value. Set max memory to 90% of the memory assigned to the virtual machine.

Do Not Enable the Lock Pages in Memory Option

Another technique sometimes employed by database administrators, ostensibly to improve performance, is to use the Lock Pages in Memory option.
There's another camp that holds this option should be used only in exceptional circumstances. Its use as a regular practice dates back to a particularly ugly bug in Windows Server 2003 RTM that appeared when Remote Desktop was used to access the SQL Server. The bug was fixed in Service Pack 1.
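As an aside, you can verify whether an instance is actually running with locked pages by querying sys.dm_os_process_memory (available in SQL Server 2008 and later); a nonzero locked page allocation means the option is in effect:

```sql
-- Nonzero locked_page_allocations_kb means the instance is holding
-- buffer pool pages with Lock Pages in Memory enabled.
SELECT physical_memory_in_use_kb,
       locked_page_allocations_kb,
       memory_utilization_percentage
FROM   sys.dm_os_process_memory;
```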
However, our task here is not to debate the pros and cons of this setting, but to emphasize that it should never be used when SQL Server is running in a virtual machine. Enabling this option critically interferes with the memory balloon driver (which, as we just noted, should not be disabled either). If pages are locked in memory, the memory balloon driver cannot return those pages to the host, and the host can be starved for memory. Like disabling the balloon driver, starving the host's memory pool results in unnecessary paging, potentially including the SQL Server buffer cache.

Memory De-Duplication

One of the newer features now available in memory systems, previously introduced in file storage systems, is de-duplication. Memory de-duplication works well for the memory pages holding the operating system, but it is generally worthless for the pages holding SQL Server memory, unless you have multiple instances of the same database running on the same host (probably not a good deployment plan). The advantage of memory de-duplication for the OS pages is that it reduces the amount of host memory needed to support the OS instances of the virtual machines, which frees up more host memory for use by SQL Server (or by applications on other virtual machines).

Networking

There are not many special considerations for networking on a virtual machine instance of SQL Server that don't equally apply to a physical installation:

- Ensure sufficient bandwidth for client connections
- Dedicate a connection for management/monitoring/recovery
- NIC teaming is a must-have for a mission-critical database server
- Each virtual NIC must have its own dedicated physical port on the host
Storage

Disk storage is, and always has been, the single most critical component of a database server, and almost always the resource that is under-provisioned. When considering storage for a physical database server, we need to think in terms of volumes, spindles, and arrays. When we add the virtual environment to the mix, we complicate the situation with the concept of the virtual disk, but we still need to be aware of the physical implementation of volumes, spindles, and arrays sitting underneath those virtual disks.

Why not just one big disk?

Let's first review why we want to split the components of a database server across multiple volumes, spindles, and arrays.

The OS generally sits on a file system that is almost exclusively read-only, although the paging file does see some write activity. Hopefully, on a database server, paging activity is kept to an absolute minimum.

Database files and transaction log files are much more oriented toward heavy write activity, although the types of writes are significantly different: database files see random writes, while transaction log files see sequential writes.

SQL Server uses a temporary database (tempdb) for a significant amount of internal work, including query sorts and index reorganization and rebuilds. This temporary database can account for a significant share of the disk I/O on a database server.

From those premises, the base-level SQL Server implementation has always called for at least three disks. "Disks," here, is a catchall term for volumes, spindles, or arrays. So, taking this a step farther:
- It's almost universally agreed that the OS volume on a database server needs fault tolerance, so we're concerned with an array here.
- A transaction log file, because it's written primarily sequentially, benefits most from a dedicated spindle where the heads move only when the transaction log file is written to (or read from). It also needs fault tolerance.
- The database file mostly just needs to be on a separate file system, but it also benefits from fault tolerance.

Of course, fault tolerance isn't actually an implementation requirement; it's just good common sense. So we could say that a minimal disk configuration for performance would be two physical disks with three volumes: disk 0 holds the OS on volume 0 and the database files (including the temporary database) on volume 1, while disk 1 holds the transaction log files. In reality this would be a pretty dysfunctional implementation for a physical server, but for a virtual machine with the right SAN sitting behind it, this is exactly how a typical database server will be configured.

Fault tolerance

Let's go to the next step and talk about fault tolerance, and the right kind of fault tolerance for each of these scenarios.

The OS needs only a two-spindle mirror (RAID-1). In fact, most operating systems, certainly Windows, can only boot from a mirror array.

The transaction log files need their own array, at minimum a two-spindle mirror. One of the disadvantages of RAID-1, however, is its write performance, so we can significantly improve transaction log performance by placing the log on a multi-spindle stripe that is mirrored (RAID-10). RAID-5 is an acceptable compromise because the write performance is better than a RAID-1 mirror, but if you lose a spindle in the array, read performance will be significantly impacted.
The database files need their own array, and the type of array used is all about performance. RAID-1 suffers from degraded write performance; RAID-5 gives better write performance, but risks degraded read performance if a spindle is lost. Conventional practice is to use RAID-10 arrays for database files.

The temporary database is a unique beast in the SQL Server environment. It can be placed on an array, but multiple files across single spindles may be a better performance strategy than a single file across multiple spindles. Multiple files can take advantage of multiple disk queues, which can be serviced in parallel by multiple cores on a multi-core server.

Translating physical storage requirements to virtual storage

All of that generally relates to a physical server. How do we translate it to a virtual machine? The key is in understanding how the underlying disk structure is implemented.

If you're using direct-attached storage, it will be difficult to optimize the disk subsystem on a virtual host in almost any scenario. If you're using SMB v3, not only can you map the virtual disks for the SQL Server across multiple drives, you may even be able to map them across multiple nodes of a file services cluster, or across multiple file server clusters (i.e., database on cluster1, logs on cluster2, tempdb on cluster3).

The typical scenario for most implementations of SQL Server on a virtual machine, though, will be a storage area network. With a SAN, most of the legacy arguments for separate spindles and file systems become trivial, because the performance of the SAN far exceeds the direct-attached storage characteristics that drove many of these storage practices.
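Stepping back to the temporary database for a moment: the multiple-file layout described above is a straightforward ALTER DATABASE change. A sketch, assuming two additional files on hypothetical T: and U: volumes (the names, paths, and sizes are placeholders; common guidance is to match the file count to the core count, up to eight):

```sql
-- Add tempdb data files so work can spread across multiple disk
-- queues; file names, paths, and sizes here are illustrative only.
ALTER DATABASE tempdb
    ADD FILE (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdev2.ndf',
              SIZE = 4GB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb
    ADD FILE (NAME = tempdev3, FILENAME = 'U:\TempDB\tempdev3.ndf',
              SIZE = 4GB, FILEGROWTH = 512MB);
```

Keeping all tempdb files the same size matters, because SQL Server's proportional-fill algorithm only balances I/O across files of equal size.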
With a SAN, we want one virtual disk per LUN; the LUN needs to be fault tolerant (or you'll need to build the fault tolerance yourself using multiple virtual disks); and the LUN should be spread across the same number of spindles in the storage array that you would expect in a physical installation.
Generally, the number of spindles in a LUN won't be a problem. Consider a 24-drive chassis with 2TB drives. The storage administrator carves out four LUNs for your database server: a 250GB LUN for the OS and a trio of 1TB LUNs for the database, logs, and tempdb. You're getting 3.25TB of the 48TB available in the chassis, but those LUNs are most likely allocated across all 24 spindles of the array.

Monitoring

The final thought I'll offer on implementing SQL Server on a virtual machine is the absolute need for a functional monitoring system: monitoring the host, the virtual machine(s) (you may actually implement multiple VMs as nodes of a SQL Server cluster), and the storage subsystem (or at least the storage performance as it relates to your assigned LUNs). It's absolutely pointless to invest extra effort in designing an optimized database server installation if you're not going to invest the equivalent effort in ensuring that your design performs as intended. Whether it's as simple as setting up data collector sets in Windows Performance Monitor, or something more advanced such as a third-party server monitoring product, make sure you have a plan for post-deployment care.
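For the storage side of that monitoring plan, SQL Server can report I/O latency per database file through sys.dm_io_virtual_file_stats (SQL Server 2005 and later); dividing cumulative stall time by I/O count gives average read and write latency since instance startup:

```sql
-- Average I/O latency per database file since instance startup.
-- High write latency on a log file, or high read latency on data
-- files, points back at the underlying volume or LUN.
SELECT DB_NAME(vfs.database_id)                              AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)   AS avg_read_ms,
       vfs.num_of_writes,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0)  AS avg_write_ms
FROM   sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN   sys.master_files AS mf
       ON  mf.database_id = vfs.database_id
       AND mf.file_id     = vfs.file_id
ORDER  BY avg_write_ms DESC;
```

Because the counters are cumulative, trend them over time (or diff two snapshots) rather than reading a single result in isolation.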