Better Integration of Systems Management Hardware with Linux
LinuxCon North America, Aug 2014
Charles Rose, Engineer, Dell Inc.
Agenda
- Introduction
  - Systems Management Hardware/Software
  - Information Available to the Service Processor
- The Need for Better Integration
  - Integration of the Service Processor with Linux
  - Managing Servers In-band and Out-of-band
- Current State
  - IPMI
  - Exchange of information between OS and Service Processor
  - System Recovery/Debug
  - SNMP Redirection
  - USB NIC Pass-through
  - Server Health
- Future Features
  - OS Event logging in Service Processor
  - Aid with Diagnostic/Debugging
  - Automatic Configuration of console redirection
Introduction
Systems Management Hardware/Software
- Systems management hardware on server systems:
  - Helps manage, monitor, update and deploy servers.
  - Provides remote management and configuration options.
  - Operates independently of the presence and status of the operating system.
  - Referred to as the Service Processor or Baseboard Management Controller (BMC).
- Interfaces/APIs: IPMI, CIM, WSMAN, SSH, SNMP, Telnet, VNC, Web UI
Information Available in the Service Processor
- Server Hardware: CPU, RAM, Storage/RAID Controller, NIC, Converged Network Adapter/Fibre Channel
- Server Firmware: BIOS, Service Processor, NIC, Storage Controller
- Server Software: NIC IP, drivers
The Need for Better Integration
Integration of the Service Processor with Linux
- Servers can be managed:
  - Out-of-band: over the systems management interface (IPMI, CIM, SNMP).
  - In-band: over the OS's network interface (SNMP, CIM, etc.).
- Managing in-band or out-of-band should not result in loss of information or functionality.
  - OS information should be available in the Service Processor.
  - Service Processor information should be available in the OS.
- Eliminate the need for any proprietary agents on the OS.
- Utilize the OS to Service Processor pass-through network.
  - LAN On Motherboard.
  - Virtual USB NIC.
  - Security considerations.
[Diagram: Operating System and Service Processor on the Server Hardware, with in-band and out-of-band paths]
Managing Servers In-band and Out-of-band
[Diagram: a Management Console reaches each managed server in-band through its Operating System and out-of-band through its Service Processor]
Current Status
IPMI
- IPMI kernel module autoload
  - Older systems required OpenIPMI's startup script to load the IPMI kernel modules.
  - Kernel 3.10 and later autoload the IPMI modules:
    - ipmi_devintf
    - ipmi_si
    - ipmi_msghandler
  - Simplifies IPMI's use in installation/live CD environments.
  - ipmi_watchdog does not yet load automatically.
  - TODO: autoload ipmi_watchdog
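To check which of the modules named above are actually loaded on a host, a small sketch like the following could be used. It assumes the standard /proc/modules layout (module name as the first field on each line); the helper names are illustrative only. Drivers built into the kernel do not appear in /proc/modules, so a "missing" result is only a hint.

```python
from typing import List, Set

# Modules named on the slide; ipmi_watchdog is the one not yet autoloaded.
IPMI_MODULES = ("ipmi_devintf", "ipmi_si", "ipmi_msghandler", "ipmi_watchdog")


def loaded_modules(proc_modules_text: str) -> Set[str]:
    """Return module names from /proc/modules-style text (first field per line)."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def missing_ipmi_modules(proc_modules_text: str) -> List[str]:
    """List the modules in IPMI_MODULES that do not appear as loaded."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in IPMI_MODULES if name not in loaded]


# Typical use on a live system:
#   with open("/proc/modules") as f:
#       print(missing_ipmi_modules(f.read()))
```

On a kernel that autoloads the three core modules, the expected result would be a list containing only ipmi_watchdog.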
Exchange Information between OS and Service Processor
- What OS is running on a server? What is the Service Processor's IP/URL?
- OS information is set in the Service Processor:
  - System Host Name
  - Operating System
  - Operating System Version
- The Service Processor's IP/URL is exported to the OS.
- /etc/init.d/exchange-bmc-os-info (ipmitool/contrib)
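As an illustration of the host side of this exchange, the sketch below collects the three fields listed above. It assumes the distribution provides an os-release file with NAME and VERSION_ID keys. How the exchange-bmc-os-info script itself gathers or sends the data is not shown on the slide, so the function names are hypothetical and the step that writes the values to the Service Processor is left out.

```python
import socket
from typing import Dict, Optional


def parse_os_release(text: str) -> Dict[str, str]:
    """Parse KEY=value lines in the os-release format, dropping surrounding quotes."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def host_info_for_bmc(os_release_text: str, hostname: Optional[str] = None) -> Dict[str, str]:
    """Collect the host name, OS name and OS version for the Service Processor."""
    fields = parse_os_release(os_release_text)
    return {
        "hostname": hostname if hostname is not None else socket.gethostname(),
        "os_name": fields.get("NAME", "unknown"),
        "os_version": fields.get("VERSION_ID", "unknown"),
    }


# Typical use:
#   with open("/etc/os-release") as f:
#       info = host_info_for_bmc(f.read())
```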
System Recovery/Debug
- On OS lock-up, capture information that can aid with debugging.
- Watchdog timer facility provided by the Service Processor.
- Unlike the chipset watchdog (iTCO), it does more than just reset the system:
  - Record the failure in the System Event Log.
  - Send alerts over SNMP/SMS/Phone, etc.
  - Capture VGA as a JPEG, capture video.
System Recovery/Debug (continued)
- The IPMI driver has supported detecting and logging kernel panic events for years.
- Linux Watchdog API: ipmi_watchdog.ko
  - Provides a /dev/watchdog interface to the Service Processor.
  - Watchdog pings are converted to KCS messages to the BMC.
  - Traditionally, agents in the OS were required to send KCS messages to the BMC.
  - watchdogd or systemd can act as the watchdog daemon in the OS.
- Can co-exist with or supplement kdump/kexec, though this requires some guesswork.
- TODO: Update ipmi_watchdog.ko to support multi-watchdog.
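The /dev/watchdog interface mentioned above follows the generic Linux watchdog convention: any write to the device counts as a ping, and writing the character V before closing asks the driver to stop the timer ("magic close"). The sketch below is a minimal illustration with hypothetical function names. Whether the disarm step is honoured depends on the driver and on its nowayout setting.

```python
import time
from typing import BinaryIO


def pet_watchdog(dev: BinaryIO, pings: int, interval_s: float) -> None:
    """Write one byte per ping to a watchdog device; each write resets the timer."""
    for _ in range(pings):
        dev.write(b"\0")
        dev.flush()
        time.sleep(interval_s)


def disarm_watchdog(dev: BinaryIO) -> None:
    """Send the 'magic close' character so that closing the device stops the timer.

    Only effective when the driver supports magic close and nowayout is not set.
    """
    dev.write(b"V")
    dev.flush()


# Typical use (requires root; the system resets if the pings stop):
#   with open("/dev/watchdog", "wb", buffering=0) as dev:
#       pet_watchdog(dev, pings=10, interval_s=5.0)
#       disarm_watchdog(dev)
```

A real daemon would ping only while its own health checks pass, which is what makes the timeout meaningful.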
SNMP Redirection
- The Service Processor has exhaustive hardware information.
- The OS contains information for the resources it manages.
- Many management consoles communicate with the OS's SNMP agent.
- Hardware health/inventory information available to the OS is limited/non-exhaustive.
- The Service Processor's OID is grafted as part of the OS's SNMP MIB.
- Traps from the Service Processor can be configured to reach the network's trap sink.
- Hardware health is now available to the management console.
- Supports SNMP v2 and v3.
[Diagram: the Management Console sends SNMP get/set requests to the Operating System, which proxies them to the Service Processor; traps from the Service Processor are forwarded through the Operating System to the Management Console]
SNMP Redirection Operation
- Get/Set
  - Enable SNMP on the Service Processor.
  - Proxy get/set SNMP requests to the Service Processor's IP for a subset of OIDs: SNMPv2-SMI::enterprises.674.10892
- Trap
  - Enable snmptrapd to accept traps from the Service Processor's IP.
  - Forward traps to the sink configured on the host.
  - Enable SNMP alerting on the Service Processor.
- ipmitool-1.8.15: contrib/bmc-snmp-proxy
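The get/set and trap steps above map onto two Net-SNMP directives: proxy in snmpd.conf and forward in snmptrapd.conf. The sketch below only renders those lines. The community string and addresses are placeholders, SNMP v2c is assumed, and the configuration that the contrib/bmc-snmp-proxy script actually generates may differ.

```python
# Numeric form of SNMPv2-SMI::enterprises.674.10892 (enterprises = .1.3.6.1.4.1).
DELL_SUBTREE = ".1.3.6.1.4.1.674.10892"


def snmpd_proxy_line(bmc_ip: str, community: str, oid: str = DELL_SUBTREE) -> str:
    """Render an snmpd.conf line that proxies one OID subtree to the Service Processor."""
    return f"proxy -v 2c -c {community} {bmc_ip} {oid}"


def snmptrapd_forward_line(sink: str) -> str:
    """Render an snmptrapd.conf line that forwards all received traps to the trap sink."""
    return f"forward default {sink}"


# Example with placeholder values:
#   snmpd_proxy_line("169.254.0.1", "public")
#   snmptrapd_forward_line("192.0.2.10")
```

snmptrapd also needs an authorization entry that permits forwarding for the traps it accepts; that part is omitted here.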
USB NIC Pass-Through
- Dedicated channel for communication between the OS and the Service Processor.
- Service Processor at 169.254.0.1 (default). Non-routable.
- Automatic configuration with Avahi and nss-mdns or NetworkManager.
- The Service Processor can be reached as idrac.local:
  - http://idrac.local
  - # ipmitool -I lan -H idrac.local
  - # snmpget idrac.local
[Diagram: Operating System connected to the Service Processor through a USB NIC on the Server Hardware]
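The default address is non-routable because it falls in the IPv4 link-local range 169.254.0.0/16. A quick check with Python's standard ipaddress module, using a hypothetical helper name, illustrates this:

```python
import ipaddress


def is_link_local(addr: str) -> bool:
    """Return True when addr is in a link-local range (169.254.0.0/16 for IPv4)."""
    return ipaddress.ip_address(addr).is_link_local


# The default Service Processor address from the slide:
#   is_link_local("169.254.0.1")
```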
System Health
- Health of CPU, fan, temperature, voltages, etc. is already available.
- Aggregate the above into a machine-readable System Health value.
- Available in-band and/or out-of-band.
- Can be used by cluster software, virtualization managers and cloud compute managers to make workload migration decisions.
- Available over SNMP or IPMI.
- SNMP redirection can make health available in-band.
[Diagram: Health is exposed by both the Operating System and the Service Processor on the Server Hardware]
System Health over IPMI and SNMP
- IPMI: raw 0x30 0x51
  - Byte 5: Global and Storage status
    - Bit 0 set = Storage status Normal
    - Bit 1 set = Storage status Error (non-critical)
    - Bit 2 set = Storage status Failed (critical)
    - Bit 3 set = Storage status Unknown
    - Bit 4 set = Global status Normal
    - Bit 5 set = Global status Error (non-critical)
    - Bit 6 set = Global status Failed (critical)
    - Bit 7 set = Global status Unknown
- SNMP: SNMPv2-SMI::enterprises.674.10892.5.2.2.0
  - 1: other -- the status is not one of the below.
  - 2: unknown -- not known or monitored.
  - 3: ok -- the status is ok.
  - 4: noncritical -- the status is warning, noncritical.
  - 5: critical -- the status is critical (failure).
  - 6: nonrecoverable -- the status is nonrecoverable (dead).
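The bit layout and status codes above can be decoded with a few lines of code. The sketch below takes the value of byte 5 as an integer; extracting that byte from the raw command's response is left to the caller, and the function and table names are illustrative.

```python
from typing import Dict, List

# Bit positions within byte 5 of the "raw 0x30 0x51" response, as listed on the slide.
STORAGE_BITS = {0: "normal", 1: "error (non-critical)", 2: "failed (critical)", 3: "unknown"}
GLOBAL_BITS = {4: "normal", 5: "error (non-critical)", 6: "failed (critical)", 7: "unknown"}

# Values of SNMPv2-SMI::enterprises.674.10892.5.2.2.0, as listed on the slide.
SNMP_STATUS = {
    1: "other",
    2: "unknown",
    3: "ok",
    4: "noncritical",
    5: "critical",
    6: "nonrecoverable",
}


def _set_bits(value: int, table: Dict[int, str]) -> List[str]:
    """Return the labels of all bits in table that are set in value."""
    return [label for bit, label in sorted(table.items()) if value & (1 << bit)]


def decode_health_byte(byte5: int) -> Dict[str, List[str]]:
    """Decode the storage and global status flags from byte 5."""
    if not 0 <= byte5 <= 0xFF:
        raise ValueError("byte5 must fit in one byte")
    return {
        "storage": _set_bits(byte5, STORAGE_BITS),
        "global": _set_bits(byte5, GLOBAL_BITS),
    }
```

For example, a byte value of 0x11 has bits 0 and 4 set, which decodes to a normal storage status and a normal global status.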
Opportunities
OS Event Logging in Service Processor
- Log OS events to the Service Processor to give a better understanding of the host OS:
  - OS Started
  - OS Stopped
  - OS Install Started
  - OS Install Stopped
  - OS Install Aborted
  - OS Install Failed
- Standard IPMI sensor events.
- Combined with OS Name, OS Version and Power Status information, this helps administrators and console software determine server state.
- SUSE's YaST2 hooks.
Aid with Debugging
- OS configuration and logs are crucial for debugging.
- Logs might be unavailable if the system has locked up or there was a kernel panic.
- On application/kernel error:
  - Collect relevant configuration and logs.
  - Store them in the Service Processor.
  - Accessible out-of-band even with the host OS down.
Automatic Configuration of Console Redirection
- Most headless servers use IPMI Serial Over LAN to access the remote server's console.
- BIOS contains options to set up redirection to the serial console.
- The administrator has to duplicate the BIOS setup information on the kernel command line: console=ttyS0,115200
- Overhead can be reduced if the kernel can read the BIOS serial port information.
- ACPI already has SPCR (Serial Port Console Redirection).
- Linux support was introduced in 2.4 and removed in 2.5.
- It would be nice to have something similar.
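To make the duplication concrete, the sketch below builds the kernel console= argument from serial settings that an administrator would otherwise copy by hand from BIOS setup. The function name and parameters are hypothetical. Reading the values from firmware, which is the feature this slide asks for, is not attempted.

```python
from typing import Optional


def console_param(port_index: int, baud: int,
                  parity: Optional[str] = None, bits: Optional[int] = None) -> str:
    """Build a kernel 'console=' argument for a ttyS serial port.

    Parity and data bits are appended only when both are given, e.g. 'n' and 8.
    """
    options = str(baud)
    if parity is not None and bits is not None:
        options += f"{parity}{bits}"
    return f"console=ttyS{port_index},{options}"


# The setting shown on the slide:
#   console_param(0, 115200)
```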
References
- IPMI on Linux
  - http://openipmi.sourceforge.net/ipmi.pdf
  - http://ipmitool.sourceforge.net/
  - http://www.gnu.org/software/freeipmi/
- Related Projects
  - http://www.openlmi.org/
  - https://github.com/abrt/abrt/wiki/abrt-project
- Scripts
  - Exchange Information: http://sourceforge.net/p/ipmitool/source/ci/master/tree/contrib/exchange-bmc-os-info.init.redhat
  - SNMP Redirection: http://sourceforge.net/p/ipmitool/source/ci/master/tree/contrib/bmc-snmp-proxy
  - Installer Status Event logging: http://sourceforge.net/p/ipmitool/patches/97/
- Fedora Feature Page: http://fedoraproject.org/wiki/features/agentfreemanagement
- Dell iDRAC: http://en.community.dell.com/techcenter/systems-management/w/wiki/3204.dell-remote-access-controller-drac-idrac.aspx
Thank You!
charles_rose@dell.com
linux-poweredge@dell.com
Backup
Server Block Diagram
Automated System Recovery with Systemd Watchdog Daemon
- Set RuntimeWatchdogSec.
- Set the ipmi_watchdog timeout to the same value.
- Blacklist the chipset watchdog.
- Load ipmi_watchdog.
- Re-execute systemd: systemctl daemon-reexec
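The first three steps above amount to two small configuration fragments. The sketch below renders them as text. It assumes RuntimeWatchdogSec goes in the [Manager] section of /etc/systemd/system.conf, that the module options go in a file under /etc/modprobe.d/, and that the chipset watchdog driver is iTCO_wdt. The 30-second value in the example is an arbitrary placeholder.

```python
def systemd_manager_fragment(timeout_s: int) -> str:
    """Render the [Manager] setting that makes systemd ping the hardware watchdog."""
    return f"[Manager]\nRuntimeWatchdogSec={timeout_s}\n"


def modprobe_fragment(timeout_s: int, chipset_module: str = "iTCO_wdt") -> str:
    """Render modprobe.d lines: blacklist the chipset watchdog and set the
    ipmi_watchdog timeout (in seconds) to match the systemd setting."""
    return (
        f"blacklist {chipset_module}\n"
        f"options ipmi_watchdog timeout={timeout_s}\n"
    )


# Example with a placeholder timeout:
#   print(systemd_manager_fragment(30))
#   print(modprobe_fragment(30))
```

After writing both fragments, the remaining steps are to load ipmi_watchdog and re-execute systemd as described above.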