Splunk for VMware Virtualization Marco Bizzantino marco.bizzantino@kiratech.it Vmug - 05/10/2011
Collect, index, organize and correlate to gain visibility into all IT data. Using Splunk you can identify problems, patterns, threats and opportunities that help you improve IT and business decisions.
Real Time indexing
Search and Investigate
Interact with search results
Correlate complex events
Analyze and Report
Build custom Dashboards
Deploy IT Apps
Centralizes Data Across the Environment - Universal Forwarder sends data to Splunk from remote systems - Uses minimal system resources, easy to install and deploy - Delivers secure, distributed, real-time universal data collection for tens of thousands of endpoints
Scales to TBs/day and Thousands of Users - Automatic load balancing linearly scales indexing - Distributed search and MapReduce linearly scales search and reporting
Runs Across Datacenters - Distributed search unifies the view across locations - Role-based access controls how far a given user's search will span
Getting Data into Splunk: Agent and Agent-less Approach for Flexibility - Input types: syslog TCP/UDP; Local File Monitoring (log files, config files, dumps and trace files); Mounted File Systems (\\hostname\mount); WMI (Event Logs, Performance, Active Directory); Scripted Inputs (shell scripts, custom parsers, batch loading); Windows Inputs (Event Logs, performance counters, registry monitoring, Active Directory monitoring) - Sources: syslog compatible hosts and network devices; Unix, Linux and Windows hosts; Windows hosts; virtual hosts; custom apps and scripted API connections - Collection modes: Agent-less Data Input and Splunk Forwarder
Universal Data Forwarder - Forward data without negatively impacting production performance - Delivers secure, distributed, real-time universal data collection for tens of thousands of endpoints - Extends Splunk data fabric to large scale private cloud and desktop environments - Uses minimal system resources, easy to install and deploy - Monitor files, changes and the system registry; capture metrics and status - Diagram labels: Universal Forwarder Deployment, Central Deployment Management, Logs, Messages, Configurations, Metrics, Scripts
IT Infrastructures Are Complex, Expensive and Mission Critical - Explosion in the number of servers, their differences and interdependencies - Growing volumes of IT data, captive in silos, all managed separately - Diagnosing and fixing issues is too time-consuming and manual - Relentless focus on IT efficiency
Virtualization Increases Complexity - Virtualization separates applications from hardware - Virtualization allows sharing of infrastructure, but problems also get shared! - Troubleshooting issues with application service levels now needs one more layer of visibility
Why Splunk Is Unique for Complex, Distributed Environments - Universal. Works with ALL IT infrastructure data from a single place, without complex parsers or adapters. Centralized views spanning virtual and physical environments provide visibility and correlation across all layers. - Proactive. Monitor and alert for early warning signs to preempt performance degradation and catch issues before they affect services. - Flexible. Analyze any type of problem with rapid drilldown into source data to precisely pinpoint root cause; adapts quickly to any type of change in your environment. - Massively Scalable. Scale linearly with commodity hardware; scale horizontally across Splunk instances. - Easily Customizable. Create reports, dashboards and views on the fly and in minutes. Customize views depending on the type of reporting needed (example: business-level vs device-level metrics).
Splunk for VMware - Gain deep insights into virtual environments with correlation across the application, hypervisor and hardware tiers - Scalable and extensible monitoring for all elements of the virtual infrastructure
What Splunk Monitors in Virtual Environments - Splunk natively works with data generated by every layer of the stack: VMware vCenter Server, VMware vSphere, VC logs and events, application configurations/logs/metrics, OS logs/metrics, kernel logs, network device logs, storage access logs, host level logs - ESX/vSphere logs can be sent over syslog to Splunk - VC logs and events can be directly tailed by Splunk - Collect from within virtual machines - VMware app to pull cluster, host and virtual machine metrics
Only Splunk Can: - Find errors and issues hidden in host log files and correlate them with application performance slowdowns or outages - Detect hypervisor functionality failure issues (e.g. HA/DRS failures for VMware environments) - Persist all events for compliance and security Coming in the future: - Display critical host and virtual machine metrics, monitor configurations and correlate with application behaviour - Persist VC data and use it for historical analysis and trending
Features Index - Remotely indexes all of the logs, metrics and configurations from all the applications and operating systems, hypervisors and the underlying infrastructure Search - Pre-defined searches accelerate troubleshooting across dynamic virtual environments - Instantaneous free form search across all IT data: apps, guests, VMs, physical host and the network - Find information hidden in logs without having to log in to multiple, individual hosts or virtual machines
Features Alert - Pre-defined alerts notify administrators of common performance and resource contention issues - Root cause investigation searches can be saved as new alerts to improve monitoring coverage over time - Automated actions using management APIs Report - Pre-defined reports and dashboards provide management visibility into workload and service levels within virtualized environments - Custom and ad-hoc reports can be created easily - No schema to maintain. Identify fields and report on identified fields on the fly - Persist transient data and flexibly report on it to meet compliance requirements
Example Scenario 1: Symptom : Application performance slows down Diagnosis : Splunk dashboards for the application show normal CPU/memory availability; Splunk dashboards for ESX indicate SCSI aborts on the host running the application Root Cause : The virtual machine is connected on the backend to a shared storage LUN where many other busy VMs reside; the virtual machine running the application is encountering storage conflicts
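The diagnosis step in Scenario 1 amounts to counting SCSI abort messages per host. A minimal sketch of that idea in Python follows; the sample log lines and the "first token is the host" layout are assumptions made for illustration, not actual ESX log output, and this is not how Splunk implements its dashboards.

```python
from collections import Counter

# Hypothetical log lines in the form "<host> <message>"; real ESX logs differ.
SAMPLE_LINES = [
    "esx01 vmkernel: SCSI abort issued for command on vmhba1",
    "esx01 vmkernel: SCSI abort issued for command on vmhba1",
    "esx02 vmkernel: link up on vmnic0",
]


def count_scsi_aborts(lines):
    """Count lines that mention a SCSI abort, keyed by the first token (host)."""
    counts = Counter()
    for line in lines:
        host, _, message = line.partition(" ")
        lowered = message.lower()
        if "scsi" in lowered and "abort" in lowered:
            counts[host] += 1
    return counts


abort_counts = count_scsi_aborts(SAMPLE_LINES)
```

A host whose count stands out while application CPU and memory look normal is the one to check for shared-LUN contention.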
Example Scenario 2: HA Heartbeat network Symptom : Applications on a particular host suddenly get powered off Diagnosis : Splunk shows syslog entries for the particular IP address as missing for more than 12 seconds but less than 15 seconds Root cause : Sometimes VMware features (ironically, High Availability) will cause applications to get powered off. The HA heartbeat network goes down; 12 seconds later the VMs get powered off; if the network resumes within 3 seconds after that, they don't get restarted.
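The diagnosis in Scenario 2 is a search for silences in a host's syslog stream that last more than 12 but less than 15 seconds. The following Python sketch shows that gap check on a list of event timestamps; the function name and the sample timestamps are hypothetical, and only the 12 and 15 second bounds come from the slide.

```python
from datetime import datetime, timedelta


def find_heartbeat_gaps(timestamps, min_gap=12.0, max_gap=15.0):
    """Return (start, end, seconds) for gaps longer than min_gap and shorter than max_gap."""
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        seconds = (current - previous).total_seconds()
        if min_gap < seconds < max_gap:
            gaps.append((previous, current, seconds))
    return gaps


# Hypothetical event times for one host: a 13 second silence between +10 s and +23 s.
base = datetime(2011, 10, 5, 9, 0, 0)
events = [base + timedelta(seconds=s) for s in (0, 5, 10, 23, 28)]
gaps = find_heartbeat_gaps(events)
```

Here `gaps` holds the single 13 second window, which matches the pattern described for the HA power-off case.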
Example Scenario 3: HA Heartbeat network Symptom : Storage accesses by virtual machines are very slow Diagnosis : Splunk dashboard indicates a huge increase in entries in vmware-x log files Root cause : VMware HA wrongly tries to restart some virtual machines. Logs are flooded with warning messages like the following: WARNING: Swap: vm 13001: 1480: Failed to create swap file '/volumes/datastorename/vmfolder/vmname-6bc43c2b.vswp': Out of resources
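To spot this kind of log flood outside a dashboard, the warning can be matched with a regular expression and counted per virtual machine. The Python sketch below uses the warning text from the slide; the group names (`vm_id`, `path`, `reason`) are labels chosen here for illustration, and the reading of `13001` as a VM identifier is an assumption.

```python
import re
from collections import Counter

# Pattern built from the warning shown on the slide; the group names are illustrative.
SWAP_FAILURE = re.compile(
    r"WARNING: Swap: vm (?P<vm_id>\d+): \d+: "
    r"Failed to create swap file '(?P<path>[^']+)': (?P<reason>.+)"
)

WARNING_LINE = (
    "WARNING: Swap: vm 13001: 1480: Failed to create swap file "
    "'/volumes/datastorename/vmfolder/vmname-6bc43c2b.vswp': Out of resources"
)


def count_swap_failures(lines):
    """Count swap-file creation failures per VM identifier."""
    counts = Counter()
    for line in lines:
        match = SWAP_FAILURE.search(line)
        if match:
            counts[match.group("vm_id")] += 1
    return counts


match = SWAP_FAILURE.search(WARNING_LINE)
failures = count_swap_failures([WARNING_LINE, "unrelated line", WARNING_LINE])
```

A sudden rise in the per-VM count is the signal that corresponds to the "huge increase in entries" seen in the diagnosis.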
Other Nuggets in Hypervisor Logs: Host level logs contain critical events like: - VMotion failed due to virtual hardware misconfiguration - connectivity lost : for networking and storage (vSphere events have vprob preceding them) - vmfs volume locked : possibly because the host crashed while accessing a volume - APD : all paths are dead - vprob.net.migrate.vmknic : Migration failures due to NIC misconfiguration - Storage misconfiguration issues
Coming in the Future : Metrics in Splunk! Example Uses: - View CPU utilization or CPU ready time by virtual machines on the same host, by virtual machines in the cluster and other views - View memory swap rates to determine if memory is being allocated appropriately - View network and storage stats to proactively discover contention - Persist your metrics for historical views or analyses without overwhelming vCenter Server
Getting Data In - ESX logs: forward host logs through syslog to Splunk. Edit the ESX host config file and restart the syslog server; details at http://www.splunk.com/wiki/community:vmwareesxsyslog - Pulling Virtual Center logs and events: Splunk can directly index the VC log files at C:\Documents and Settings\All users\application Data\VMware\VMware VirtualCenter\Logs
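For readers unfamiliar with what travels over the syslog path, here is a minimal Python sketch of a BSD-style (RFC 3164) syslog message and a UDP sender. It illustrates the wire format only; it is not the ESX syslog configuration, and the host name, tag and receiver address used with it are hypothetical.

```python
import socket
from datetime import datetime


def format_syslog(facility, severity, hostname, tag, message, when):
    """Build a BSD-style syslog line: <PRI>Mmm dd hh:mm:ss host tag: message."""
    priority = facility * 8 + severity
    # RFC 3164 pads single-digit days with a space rather than a zero.
    stamp = f"{when.strftime('%b')} {when.day:2d} {when.strftime('%H:%M:%S')}"
    return f"<{priority}>{stamp} {hostname} {tag}: {message}"


def send_udp(line, host, port=514):
    """Send one syslog line over UDP to a receiver such as a Splunk UDP input."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


# Facility 16 (local0) and severity 4 (warning) give priority 132.
sample = format_syslog(16, 4, "esx01", "vmkernel", "test message",
                       datetime(2011, 10, 5, 9, 0, 0))
```

Calling `send_udp(sample, "splunk.example.com")` would deliver the line to a listener at that placeholder address, assuming a UDP input is open on port 514.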
Getting Data In: The new App - The Splunk for VMware solution collects and harnesses data from the virtualization layer to enable true end-to-end visibility in virtualized environments 1. Splunk App for VMware. This app has the views, dashboards and saved searches that provide insights into your virtualization layer 2. Splunk Forwarder Virtual Appliance for VMware. This VM image (.ova) is a data collector that you deploy using vCenter (VC) 3. Splunk Add-on for vCenter. This is used to collect vCenter log data and is installed into Splunk Forwarders running on vCenter machines 4. Perl API package
Virtualization management using Splunk Index metrics, configurations, status and logs from the hypervisor via the VMware ESX, Xen and other APIs as well as logs and other data from the guest OS and applications. This indexed data repository will survive guest power-down, critical for compliance-mandated log retention.
Systems administrators and developers will initially use Splunk to troubleshoot problems with apps deployed in the virtual infrastructure, with the ability to navigate from the application tier to the guest OS and the underlying hypervisor, and to identify cross-guest issues. Security analysts will use Splunk to investigate incidents and identify zero-day attack footprints across both running and powered-down systems.
Over time, everyone will enrich the indexed data with knowledge of the environment and the data it produces, breaking down the even more severe silos of knowledge endemic to virtualized environments.
Virtualized infrastructure managers will come to automate Splunk searches with alert triggers to easily monitor for new problems they identify as they adopt virtualization and lack adequate monitoring coverage.
Splunk reports and dashboards will provide a quick way to build visibility into utilization, performance and faults across all tiers of the virtualized environment, even where existing management tools haven't kept up. Ultimately, operations staff will realize that proactively searching and visualizing machine data across the stack with Splunk is one of the best approaches to navigating the unknowns of new virtualized infrastructure.
Typical Saved Searches/Alerts Alerts for: - Host reboots - Hardware or machine check errors - Predict HA failures by watching for a memory leak - Hosts exceeding soft memory limits - SCSI aborts Ad hoc reports such as: - Quick scan to see where VMware tools are out of date - What percentage of Win 64 bit vs. Win 32 bit? - How many Red Hat Linux VMs do we have? - Who logged into the VM environment? What did they do?
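An ad hoc report such as "What percentage of Win 64 bit vs. Win 32 bit?" is a simple group-and-count over inventory data. The Python sketch below shows that calculation; the inventory records and the `guest_os` field name are hypothetical and do not reflect the fields the Splunk App for VMware actually extracts.

```python
from collections import Counter

# Hypothetical VM inventory records; names and field values are illustrative only.
INVENTORY = [
    {"name": "vm-a", "guest_os": "Windows 64-bit"},
    {"name": "vm-b", "guest_os": "Windows 64-bit"},
    {"name": "vm-c", "guest_os": "Windows 32-bit"},
    {"name": "vm-d", "guest_os": "Red Hat Linux"},
]


def guest_os_share(records):
    """Return the percentage of VMs per guest OS."""
    counts = Counter(record["guest_os"] for record in records)
    total = sum(counts.values())
    return {guest_os: 100.0 * n / total for guest_os, n in counts.items()}


shares = guest_os_share(INVENTORY)
red_hat_count = sum(1 for r in INVENTORY if r["guest_os"] == "Red Hat Linux")
```

The same grouping answers the Red Hat question by reading the count instead of the share.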
Questions? Useful links: www.splunk.com http://splunk-base.splunk.com/apps/ http://splunk-base.splunk.com/answers/ http://docs.splunk.com/documentation http://splunkninja.ning.com/