1) Disk performance

When factoring in disk performance, one of the larger impacts on a VM comes from the type of disk you choose for your VMs in Hyper-V Manager/SCVMM, such as fixed vs. dynamic. A great article from Microsoft explaining the performance impact of the various virtual disk options can be found HERE; check it out.

On the physical side, there are certain fundamental truths about drive performance, such as:

- A 15,000 rpm SAS drive is faster in every way than a 5,400 rpm IDE drive.
- A RAID 10 array provides much better random read/write performance than a RAID 5 array.
- The more spindles you have in an array, the faster it generally is.
- An array with less activity will be more responsive than a heavily utilized one.

In reality, budget and availability will likely play a very big role in what disk subsystem you use; the fastest option may not be the best use of your IT budget, or SAS may not be able to provide the storage capacity you need.

In this cluster, we have connected our nodes to a Dell MD3000i iSCSI SAN. iSCSI introduces a few more performance factors to consider, such as network bandwidth, parameters like jumbo frames, and contention from competing network traffic. As with any complex device like this, I strongly recommend reading any and all performance tuning documentation the manufacturer offers for your unit; Dell has a couple of documents for this unit alone. HERE is an excellent Dell tuning document for this unit, and I would suggest starting with it. It is fairly deep, and you could spend a long time sorting out the best configuration for your needs.

The MD3000i we used is fully populated with the maximum of 15 SATA drives. SATA drives are the lowest-performing option available for this SAN, but they provide a massive amount of storage at a very reasonable cost, which fits the needs of this unit very well.
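To put rough numbers on the RAID 10 versus RAID 5 capacity trade-off, here is a quick sketch. It is my own illustration, not vendor math: the drive counts and sizes are examples, and a real array loses additional space to hot spares and formatting overhead.

```python
def usable_capacity_gb(drive_count, drive_size_gb, raid_level):
    """Rough usable capacity for a RAID 10 or RAID 5 array.

    Illustrative only: ignores hot spares, controller overhead,
    and raw-vs-formatted size differences.
    """
    if raid_level == 10:
        # Every drive is mirrored, so half the spindles hold copies.
        return (drive_count // 2) * drive_size_gb
    if raid_level == 5:
        # One drive's worth of space is consumed by parity.
        return (drive_count - 1) * drive_size_gb
    raise ValueError("unsupported RAID level")

# For example, 14 drives of 1000 GB each:
# RAID 10 yields 7000 GB usable, RAID 5 yields 13000 GB --
# the capacity cost of RAID 10's better random read/write performance.
```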
The SAN does have an optional second RAID storage module on the backplane, which gives us twice as many connections out to our iSCSI network. However, by design this particular SAN processes I/O asymmetrically, which means that even though we have two modules and four iSCSI ports, access to any specific virtual disk will always be through just the two iSCSI ports on one of the two modules.

To address our constraints as best as possible, the SAN was configured based on the information gathered by following the Dell tuning guides. We are using numerous drives configured in a RAID 10, providing us with the best possible overall performance at the disk level for our given needs. This is the highest-cost configuration in terms of disk usage, especially compared to a RAID 5 option, but the storage capacity after configuring the array was still well above my forecasted needs. The constraint of asymmetrical processing was offset in two ways: first, by configuring the SAN and all hosts to actively use both iSCSI ports on each RAID controller module; second, by creating two virtual disks on the SAN and assigning each a different module as its owner.

a. The first area related to our cluster that we can monitor and adjust is the load carried by each of the RAID storage modules, since each owns one of the two CSV disks. There are several ways to do this; even PerfMon will provide a fair bit of valid and useful information relating to disk performance. Here are three ways to view statistical information on your MD3000i SAN:

i. View statistics from the iSCSI tab in the MDSM utility
   1. Open the Dell MDSM utility.
   2. Click the iSCSI tab.
   3. Click the View iSCSI Statistics link at the bottom of the page.
   4. From this page you can view a large variety of statistics, set baseline statistics, and save the information to your local workstation in CSV format.
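Once you have saved the statistics in CSV format, a short script can total them per controller module. Below is a minimal sketch; the column names ("Controller", "Bytes Transmitted", "Bytes Received") are assumptions on my part, so match them against the header row your MDSM version actually writes.

```python
import csv
from collections import defaultdict

def bytes_per_module(csv_path):
    """Total transmitted/received bytes per RAID controller module.

    The column names used here are illustrative; adjust them to the
    header row of the CSV your MDSM utility actually produces.
    """
    totals = defaultdict(lambda: {"tx": 0, "rx": 0})
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            module = row["Controller"]
            totals[module]["tx"] += int(row["Bytes Transmitted"])
            totals[module]["rx"] += int(row["Bytes Received"])
    return dict(totals)
```

A lopsided tx/rx total between the two modules is a first hint that the virtual disk ownership (and therefore VM placement) may be worth rebalancing.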
A quick look at the byte count columns alone shows that RAID Controller Module 1 has both transmitted and received over twice as many bytes during the two-week period since the baseline was set. This module is the owner of CSV Volume1. Moving some of the disk-intensive VMs from Volume1 to Volume2, which is owned by Module 0, could help balance these numbers out, but it is always best to take a sampling of several different statistics to get a clearer picture before making any significant changes. A unique event could have caused a large byte count on one module over the other, and factors like backups can influence these statistics as well. From the gathered statistics, find and track the values that are most important to your server needs. These vary greatly depending on server role: transmit counts for a read-only web server, receive counts for an archive file server, and for SQL, overall IOPS are king.

ii. Gather performance information from the MDSM Support tab
   1. Open the Dell MDSM utility.
   2. Click the Support tab.
   3. Click the Gather Support Information link.
   4. Click Save Support Information.
   5. Choose a name and location for the file and click Start.
   6. This process gathers a large amount of information and will take some time.
   7. Once complete, you will have a zip file.
   8. Inside the zip file are two files of interest for performance monitoring:
      a. performancestatistics.csv
      b. statecapturedata.txt
   10. Open and review performancestatistics.csv
      a. The statistics show virtual disk information separated by which module it refers to. This is helpful because you may have more than just your Hyper-V cluster virtual disks running on your SAN, and you need the whole picture to make the best choices.
      b. Take note of the % read requests statistic and the cache read check hits counter.
   11. Open statecapturedata.txt in Excel.
   12. This is a large and detailed report of statistical information on the SAN. Although it does not have the best formatting, you can find most of the information you will need about cache hits, IOPS, reads, and writes within it. The Dell documentation helps decipher this information to a degree, and the forums help further.

iii. Use the CLI to gather specific statistical information over a set period of time
   1. From the command line in the MDSM client directory, run:

      smcli -n MD3000i -c "set session performancemonitorinterval=5 performancemonitoriterations=250; save storagearray performancestats file=\"c:\\md3000iperfstats.csv\";"

      The name after -n (MD3000i here) should equal your SAN's given name. The interval determines how often the information is polled, in seconds. Iterations sets the maximum number of polls before ending the process. In the example, the SAN named MD3000i is polled every 5 seconds, 250 times, before completing (about 20 minutes).
   2. Open the generated file once complete; it will look similar to the example below.
   3. This simple command provides some excellent performance information and gives a clear, straightforward view of the figures you want. Sort and filter the capture iterations to get quick overall averages from each module, as well as from each virtual disk on your SAN. This is my preferred tool for monitoring over a workweek, as it can be scripted and provides good information for most of my assessment needs in an easily adjustable format.
   4. The link HERE covers this command in greater detail, including how to create graphs and charts from the information.
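As a sketch of that sort-and-filter step, the snippet below averages a chosen statistic per storage object across all capture iterations. The column names ("Storage Object" and the "Current IO/second" default) are assumptions for illustration; adjust them to whatever headers your capture file actually contains.

```python
import csv
from collections import defaultdict

def average_by_object(csv_path, value_column="Current IO/second"):
    """Average a numeric statistic per storage object across iterations.

    Column names are placeholders -- check the header row of the
    performance stats file your own smcli run produces.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            name = row["Storage Object"]
            sums[name] += float(row[value_column])
            counts[name] += 1
    return {name: sums[name] / counts[name] for name in sums}
```

Because the capture is already scripted via smcli, a follow-up script like this makes the workweek-long monitoring runs easy to summarize without hand-sorting in Excel.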
2) Network performance

Not all network equipment is created equal. Testing shows that different gigabit switches and different network cards provide different levels of performance, and this holds true regardless of what traffic your network is supporting. On the iSCSI side, we have already mitigated our limitations as much as possible by isolating all iSCSI traffic from LAN traffic on dedicated iSCSI network equipment. We implemented two switches and used two network cards on each host, creating a load-balanced configuration. We then ensured that jumbo frames were enabled and functioning across the iSCSI topology; jumbo frames are noted to work well with the iSCSI protocol and improve performance when enabled. Monitoring the iSCSI switches for load, and the port traffic for anomalies such as high collision counts, is recommended, but beyond this there is not much that can be tuned per se in our configuration other than adjusting which switch ports are used or replacing underperforming equipment. If you are using QoS or VLANs for iSCSI traffic, you may find monitoring and adjusting more valuable, as those options introduce more variables.

On the LAN side, things are a bit different, since hosts and VMs require varying levels of network performance; if you put too many network-hungry VMs on one node, you will have a bottleneck.

a. To tune LAN performance, monitor network utilization on each of the VMs, and if a VM shares a NIC with the host, monitor the host's network utilization as well. There are numerous network utilization tools available; feel free to use whichever you like most. The key here is to monitor all your VMs over a typical work week (and/or month), then use the gathered data to make informed adjustments to VM node placement and balance the network load.
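Whatever monitoring tool you pick, link utilization comes down to the same arithmetic: bytes moved during a sampling interval measured against the line rate. A minimal sketch, assuming a gigabit link by default:

```python
def link_utilization_pct(bytes_start, bytes_end, interval_s,
                         link_bps=1_000_000_000):
    """Percent utilization of a NIC over a sampling interval,
    computed from two readings of its byte counter."""
    bits_moved = (bytes_end - bytes_start) * 8
    return 100.0 * bits_moved / (interval_s * link_bps)

# e.g. 625,000,000 bytes in 10 s on a gigabit link:
# 5,000,000,000 bits / 10,000,000,000 bit-seconds -> 50.0 (% utilized)
```

Sustained figures near the top of the scale on a VM's NIC, or on a host NIC shared by several VMs, are the signal to rebalance VM placement or add NICs.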
You may find that if you run a large number of VMs on all of your hosts, it is prudent to add another NIC or two to each host and create additional virtual NICs for the cluster, sharing the network load that way.

3) Processor performance

I rarely find in our environments that the processors on our Hyper-V hosts are the source of a bottleneck; generally one of the other resources constrains us first. However, every environment is different, and as with any key resource, processor use should be monitored regardless. Monitoring processor use on Hyper-V guests and hosts is not as clear-cut as it is on a standalone physical server. One of the issues is that processor utilization within a VM can be influenced greatly by the virtual processor count that you set for the VM: you can potentially see low CPU usage within a VM guest while the physical CPU is in fact being taxed heavily.

a. To accurately measure the overall processor utilization of the guest operating systems, use the \Hyper-V Hypervisor Logical Processor(_Total)\% Total Run Time performance monitor counter on the Hyper-V host operating system, via a remote PerfMon session and/or a user-defined Data Collector Set. Use the following thresholds to evaluate guest operating system processor utilization with this counter:
   i. Less than 60% consumed = Healthy
   ii. 60% - 89% consumed = Monitor or Caution
   iii. 90% - 100% consumed = Critical; performance will be adversely affected

Live PerfMon view

Data Collector Log review

4) Memory allocation

RAM is king in the VM world: the more RAM your host has, generally the more VMs it can handle.

a. Where you can run into performance degradation is when one of two situations occurs:
   i. You have assigned insufficient RAM to a VM.
      1. This will cause varying issues and performance loss through paging, etc.
   ii. Your host no longer has sufficient RAM free for its own use.
      1. This too can cause performance issues for all of the VMs it is hosting, as well as system stability issues.

b. Monitoring the hosts for free memory is easy enough, and it is even commonly reviewed in SCVMM. What you will also want to pay attention to is that you have sufficient RAM for your running VMs, which can be monitored just as easily. You may even find that some of your VMs have been over-assigned memory, and you can reclaim some for the host to reallocate.

c. Another important point to note, especially in a Hyper-V cluster, is overcommitting RAM. If you overcommit memory across all of the VMs residing on your cluster, you may find yourself with a problem should one of your nodes fail: you could end up with insufficient memory available to continue serving all of the VMs from the failed node.

Summary

You now have a Hyper-V cluster configured and VMs online using SCVMM. From this point forward you can:

- Move running VMs between any nodes in your cluster with no real loss of service.
- Put nodes into maintenance mode to service the host without shutting down VMs.
- Continue to tune overall cluster performance by monitoring the nodes and SAN modules, making adjustments as the data dictates.
- Survive a node failure, possibly a few failures, depending on your level of node commitment.