SolarWinds Comparison of Monitoring Techniques on Both Target Server & Polling Engine
Contents

Executive Summary
Why Should You Keep Reading (i.e., why do I care?)
    SNMP polling (as compared to WMI)
    WMI polling (as compared to SNMP)
Introduction
    Disclaimer
SolarWinds Monitoring Impact on a Single Target
    Premise and Architecture
    Details
    Timeline and Graphical Results
SolarWinds Monitoring Impact on a Polling Engine
    Premise and Architecture
    Details
    Timeline and Graphical Results
Executive Summary

SolarWinds' (relatively) new technique of monitoring Windows servers via WMI instead of SNMP represents a measurable, but manageable, impact on both the target and the polling engine.

On a target server, monitoring with WMI plus a SAM template had no effect whatsoever on RAM or CPU (compared to simple ping monitoring), although it did represent an average increase of 12Kbps in network traffic. The difference between WMI and SNMP polling was even less noticeable, with a 4Kbps bandwidth bump being the only measurable effect.

On the polling engine the impact was more pronounced. Monitoring 300 servers via WMI with a SAM template included (the most aggressive monitoring combination) resulted in the following increases compared to monitoring with simple ping:
- a 16% increase in average CPU utilization
- a 4% increase in average RAM usage
- a 4Mbps increase in incoming bandwidth

The difference between monitoring 300 machines with WMI vs. SNMP was even smaller on average: 6% CPU, 2% RAM, and 2.5Mbps bandwidth received.
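As a rough cross-check (my own back-of-the-envelope arithmetic, not an additional measurement), the aggregate poller figures and the single-target figures are consistent with each other. The short Python snippet below shows the math, using the numbers quoted in the summary above.

# Consistency check between the single-target and polling-engine bandwidth figures.
poller_increase_kbps = 4_000   # ~4Mbps extra inbound bandwidth on the poller (WMI + SAM vs. ping)
node_count = 300               # servers monitored in the polling-engine test

per_node_kbps = poller_increase_kbps / node_count
print(f"Implied per-node increase: {per_node_kbps:.1f} Kbps")   # ~13.3 Kbps

# The single-target test measured roughly a 12Kbps increase for WMI + SAM versus
# ping-only monitoring, so the two tests land in the same ballpark.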
Why Should You Keep Reading (i.e., why do I care?)

If (as the executive summary states) the difference between WMI and SNMP polling is statistically negligible, then why the need for additional hand-wringing? Why not just make the switch and go? The answer is that the choice of polling method has other impacts beyond the physical toll on the machines involved. Functionally, there are some pros and cons to be weighed (the uptime items are illustrated with a short sketch after this list):

SNMP polling (as compared to WMI)
CON: Cannot monitor Windows volume mount points
CON: Challenges with earlier versions of Windows (NT, W2k)
CON: Requires additional non-standard configuration actions (enabling the SNMP agent, etc.)
PRO: Fewer ports to open in enterprise firewall rules
PRO: No single point of failure for access
CON: Changing the SNMP community string requires enterprise-wide changes
CON: Uses the SNMP service start time for uptime metrics
  o Work-around: set up a UnDP for hrSystemUptime
PRO: Extremely efficient use of CPU, RAM and bandwidth (on both target and poller)

WMI polling (as compared to SNMP)
CON: WMI-only devices cannot use custom pollers (UnDP)
  o Work-around: if the machine has EVER been an SNMP-polled device, the SNMP information is retained and custom pollers can be used (at least until the SNMP RO string changes)
PRO: WMI settings are used by SAM automatically
CON: Significantly more firewall ports required
  o Work-around: per-server configuration can nail WMI down to just a couple of ports
CON: Will not work across a NAT-ed WAN connection (VPN, etc.)
CON: One password change in AD can cripple monitoring
CON: Cannot monitor topology
PRO: Uses the REAL reboot time for uptime metrics
CON: Less efficient (vis-a-vis SNMP) use of CPU, RAM and bandwidth on both target and poller
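To make the uptime difference concrete, here is a minimal sketch of reading both values directly from a target. This is my own illustration (not part of the original testing) and assumes Python with the classic synchronous pysnmp hlapi and the wmi package; the host name, community string, and service account below are placeholders.

# Minimal sketch: "uptime" as seen by SNMP vs. WMI.
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)
import wmi

HOST = "targethost"   # placeholder

# SNMP: sysUpTime (1.3.6.1.2.1.1.3.0) resets whenever the SNMP service restarts,
# which is why the default SNMP uptime metric can mislead. hrSystemUptime
# (1.3.6.1.2.1.25.1.1.0) from HOST-RESOURCES-MIB reflects actual OS uptime and is
# what the UnDP work-around polls instead.
for oid in ("1.3.6.1.2.1.1.3.0", "1.3.6.1.2.1.25.1.1.0"):
    err_ind, err_stat, err_idx, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData("public"),             # placeholder RO community string
        UdpTransportTarget((HOST, 161)),
        ContextData(),
        ObjectType(ObjectIdentity(oid))))
    if not err_ind and not err_stat:
        for name, value in var_binds:
            print(name, "=", value)          # value is in TimeTicks (1/100 second)

# WMI: Win32_OperatingSystem.LastBootUpTime is the real reboot time, which is why
# WMI polling reports true uptime without any work-around.
conn = wmi.WMI(HOST, user=r"DOMAIN\svc_monitor", password="********")   # placeholders
for os_info in conn.Win32_OperatingSystem():
    print("LastBootUpTime:", os_info.LastBootUpTime)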
Introduction

As we rolled out SolarWinds monitoring in our environment (about 5,000 servers and 3,000 network devices), the question of load, both on the target devices and on the monitoring infrastructure itself, became increasingly important. Even seemingly small additions, such as a single custom Universal Device Poller, could have wide-ranging impacts when applied to 1,000 devices. We wanted to be able to respond with data to the concerns of both the application owners (who didn't want monitoring to rock the boat) and the monitoring team (who didn't want to turn on an option that looked nice on one or two systems but would crash everything when rolled out enterprise-wide).

Much of what we needed was already documented, either in the technical information or in online forums. However, when we looked for hard numbers regarding WMI we found less. When we asked experienced technical resources, "Can you show me the impact of turning on WMI in a large environment, and how that load compares to SNMP (or nothing)?", we received (at best) vague responses like "WMI is 5 times chattier than SNMP," and (at worst) responses that bordered on snarky: "Since we don't have 4000+ nodes in our test environment it's difficult for me to tell you exactly what will be the impact of moving 4000+ nodes from SNMP to WMI polling."

So we ended up doing it ourselves. We broke the testing into two major areas of focus:
1) The impact of various monitoring methods on a single target server
2) The impact of various monitoring methods on a polling engine, when used on a significant number of target nodes

Because we own both NPM and SAM, the methods we focused on were:
- Ping
- SNMP standard collection
- WMI standard collection
- SAM monitoring

Disclaimer

These tests were designed and executed with exactly one goal: to answer my own curiosity and help me make the right decision for my project. They were not intended to be exhaustive or completely comprehensive. They had to be performed with the hardware we had at hand, in a relatively short timeframe, with minimal impact to both the infrastructure and my real task list. Your mileage may vary, caveat emptor, and don't forget to tip the wait staff.
SolarWinds Monitoring Impact on a Single Target

Premise and Architecture

We set out to answer the question of load on a single node (a Windows server, in this case) using 5 scenarios:
1) Monitoring for ping only
2) Monitoring via ping and SNMP hardware collection on a standard Windows device
3) Monitoring via ping/SNMP plus a SAM template (perfmon, service, and eventlog)
4) Monitoring via ping/WMI plus a SAM template (perfmon, service, and eventlog)
5) Monitoring via ping/WMI for hardware collection only

We wanted to avoid observer bias, where our measurement of the server was itself causing more load than the monitoring we were trying to measure. Therefore, we set up an extremely aggressive collection of hardware statistics via SNMP using one polling engine, and let the server baseline itself at that level. We then performed the real monitoring (i.e., the scenarios described above) using a different polling engine writing to a separate database (on a different server).

[Architecture diagram: Poller A and Database A perform the aggressive polling used to observe changes on the target node; Poller B and Database B perform the normal polling using the standard monitoring techniques under test.]

This allowed us to change the monitoring scenarios while observing the effect of those changes on the target from a separate point of reference.

Summary of Results

Cutting to the chase: in the end, all of the various monitoring options had a negligible impact on the server.
- CPU ranged from 0 to 11% utilization overall, with the high point occurring during WMI + SAM monitoring.
- RAM varied only by 2% (from 22 to 24%).
- Bandwidth used by monitoring ranged between 5 and 35Kbps.
  o The only significant spike was in bandwidth used by WMI+SAM, which was 10Kbps higher than any other monitoring technique.
Details

The diagram below (and associated spreadsheet) shows the change in RAM, CPU and network for a target device when different monitoring is applied.

- The target device was a VMware guest running on ESX 5.0, provisioned as Windows 2008 R2 (Version 6.1.7601 Service Pack 1 Build 7601) with 4 single-core 3.07GHz Intel Xeon CPUs and 16GB of RAM. No other processes were running on the server while this testing was done.
- Poller A (the one doing the heavy collection of metrics) was another VMware guest on ESX 5.0 running Windows 2008 R2 (Version 6.1.7601 Service Pack 1 Build 7601) with 4 single-core 3.07GHz Intel Xeon CPUs and 16GB of RAM.
- Database A was an HP ProLiant BL460c G7 with two 6-core 3.07GHz Intel Xeon CPUs and 16GB RAM, running Windows 2008 R2 and MS SQL 2008 Standard edition.
- Poller B (the one doing the standard monitoring we were measuring) was a VMware guest on ESX 5.0 running Windows 2008 R2 (Version 6.1.7601 Service Pack 1 Build 7601) with 4 dual-core CPUs (2 logical processors per core, for an effective total of 16 logical 3.07GHz Intel Xeon processors) and 12GB RAM.
- Database B was an HP ProLiant BL460c G7 with two 12-core 3.07GHz Intel Xeon CPUs and 192GB RAM, running Windows 2008 R2 and MS SQL 2008 Standard edition.

For the SAM monitoring, we built a template that collected 3 perfmon counters, checked for 3 eventlog messages, and gathered the status of 1 service; a rough sketch of the equivalent WMI queries follows below.
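For readers who have not built a SAM template of this shape before, the sketch below shows roughly what such a template asks the target when it polls over WMI: a perfmon-style counter, an event log check, and a service status. This is my own illustration using Python's wmi package, not SolarWinds' internal implementation, and the counter, event source, and service names are placeholders.

# Rough illustration of the kinds of WMI queries behind a SAM template that reads
# perfmon counters, checks eventlog entries, and polls a service status.
# Not SolarWinds' actual implementation; names below are placeholders.
import wmi

conn = wmi.WMI("targethost")   # add user=/password= arguments for a remote target

# 1) Perfmon-style counter (e.g., processor queue length)
for sys_perf in conn.Win32_PerfFormattedData_PerfOS_System():
    print("ProcessorQueueLength:", sys_perf.ProcessorQueueLength)

# 2) Event log check: error events from a hypothetical application source.
#    (An unfiltered Win32_NTLogEvent query can be slow; keep the WHERE clause tight.)
events = conn.query(
    "SELECT * FROM Win32_NTLogEvent "
    "WHERE Logfile = 'Application' AND Type = 'Error' "
    "AND SourceName = 'MyAppSource'")        # placeholder event source
print("Matching error events:", len(events))

# 3) Service status
for svc in conn.Win32_Service(Name="Spooler"):   # placeholder service name
    print(svc.Name, svc.State)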
Timeline and Graphical Results

The sequence of events (corresponding to the numbered red lines on the diagram) is:
1. 6:45: baselined the target with ping-only monitoring
2. Switched to SNMP monitoring: CPU/RAM, topology, 2 hard drives, RAM/physical memory as disk
3. 8:50: added the SAM template
4. 10:20: changed to WMI polling: CPU/RAM, 2 hard drives, RAM/physical memory as disk, 1 NIC
5. 11:15: removed the SAM template
SolarWinds Monitoring Impact on a Polling Engine

Premise and Architecture

For this test, we wanted to understand how different monitoring types change the overall load on a polling engine when those monitors are applied to a significant number of machines. The test sequence was:
1) Load a number of servers and monitor their hardware via ICMP/SNMP
2) Add a SAM template
3) Convert those servers to ICMP/WMI
4) Remove the SAM template

In this scenario we couldn't reasonably have the monitoring server monitor itself, so we used a second monitoring implementation that aggressively gathered statistics from the real poller as we put it through each test.

[Architecture diagram: the target polling engine (with its database) performs normal polling of the 300 devices using the standard monitoring techniques under test; a second polling engine (with a separate database) performs aggressive polling to observe changes on the target poller.]

Summary
- RAM utilization remained steady throughout the testing, with a high point of 42.3% and a low of 32.7%.
- CPU usage ranged from 2% to 64% overall:
  o Ping-only ranged from 6% to 38%
  o SNMP ranged from 9% to 32%
  o Adding a SAM template to SNMP increased CPU to 12%-50%
  o Switching to WMI polling (still with SAM) took the poller up to 17%-62%
  o WMI polling alone used between 9% and 47% CPU
- Bandwidth usage ranged from 2Mbps up to 55Mbps overall:
  o Ping-only: 2-10Mbps
  o SNMP polling: 2.3-11.4Mbps
  o SNMP+SAM: 5-55Mbps
  o WMI+SAM: 3-51Mbps
  o WMI polling: 3-47Mbps
Details

The diagram below (and associated spreadsheet) shows the change in RAM, CPU and network for a polling engine when different monitoring is applied to approximately 300 nodes.

- The target poller (the one doing the heavy lifting) was a VMware guest on ESX 5.0 running Windows 2008 R2 (Version 6.1.7601 Service Pack 1 Build 7601) with 4 single-core 3.07GHz Intel Xeon CPUs and 16GB of RAM. It should be noted that, besides the 300 nodes in this test scenario, the polling engine was managing another 700 nodes at the same time.
- The database connected to the target poller was an HP ProLiant BL460c G7 with two 6-core 3.07GHz Intel Xeon CPUs and 16GB RAM, running Windows 2008 R2 and MS SQL 2008 Standard edition.
- The monitoring poller (the one watching the stats on the target poller) was a VMware guest on ESX 5.0 running Windows 2008 R2 (Version 6.1.7601 Service Pack 1 Build 7601) with 4 dual-core CPUs (2 logical processors per core, for an effective total of 16 logical 3.07GHz Intel Xeon processors) and 12GB RAM.
- The database connected to the monitoring poller was an HP ProLiant BL460c G7 with two 12-core 3.07GHz Intel Xeon CPUs and 192GB RAM, running Windows 2008 R2 and MS SQL 2008 Standard edition.
Timeline and Graphical Results

The sequence of events and corresponding (red) markers are:
1. 2:23: scanned 400 nodes
2. 2:35: switched to ping-only
3. 2:50: finished switching to ping-only (309 nodes total)
4. 4:05: re-scanned nodes, updated to SNMP
5. 4:15: scan completed (309 nodes, 1008 disks, 1620 NICs)
6. 5:45: added SAM template
7. 9:03: re-scanned nodes, hand-converted some, updated to WMI
8. 9:50: scan and conversion completed (301 nodes, 1226 disks, 299 NICs)
9. 5:45: removed SAM template