Ongoing evolution of Linux x86 machine check handling

Similar documents
Machine check handling on Linux

MCA Enhancements in Future Intel Xeon Processors June 2013

How to Configure Intel X520 Ethernet Server Adapter Based Virtual Functions on Citrix* XenServer 6.0*

Intel RAID Volume Recovery Procedures

Intel Embedded Virtualization Manager

Intel RAID Controller Troubleshooting Guide

How to Configure Intel Ethernet Converged Network Adapter-Enabled Virtual Functions on VMware* ESXi* 5.1

Configuring RAID for Optimal Performance

XID ERRORS. vr352 May XID Errors

Intel RAID Controllers

RAID Basics Training Guide

BIOS Update Release Notes

Intel N440BX Server System Event Log (SEL) Error Messages

ASF: Standards-based Systems Management. Providing remote access and manageability in OS-absent environments

Intel Matrix Storage Console

White Paper. ACPI Based Platform Communication Channel (PCC) Mechanism. InSarathy Jayakumar Intel Corporation

PARALLELS SERVER 4 BARE METAL README

AirWave 7.7. Server Sizing Guide

Intel Entry Storage System SS4000-E

Intel Data Direct I/O Technology (Intel DDIO): A Primer >

KVM: A Hypervisor for All Seasons. Avi Kivity avi@qumranet.com

21152 PCI-to-PCI Bridge

Intel Ethernet and Configuring Single Root I/O Virtualization (SR-IOV) on Microsoft* Windows* Server 2012 Hyper-V. Technical Brief v1.

Deploying Citrix MetaFrame on IBM eserver BladeCenter with FAStT Storage Planning / Implementation

COLO: COarse-grain LOck-stepping Virtual Machine for Non-stop Service

COS 318: Operating Systems. Virtual Machine Monitors

Intel RAID Software v6.x (and newer) Upgrade/Installation Procedures

COLO: COarse-grain LOck-stepping Virtual Machine for Non-stop Service. Eddie Dong, Tao Hong, Xiaowei Yang

Introduction to Windows Server 2016 Nested Virtualization

Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family

CSC 2405: Computer Systems II

Intel 810 and 815 Chipset Family Dynamic Video Memory Technology

COS 318: Operating Systems. I/O Device and Drivers. Input and Output. Definitions and General Method. Revisit Hardware

Intel Cloud Builder Guide: Cloud Design and Deployment on Intel Platforms

AKIPS Network Monitor Installation, Configuration & Upgrade Guide Version 16. AKIPS Pty Ltd

Intel RAID Basic Troubleshooting Guide

Enterprise-class versus Desktopclass

C440GX+ System Event Log (SEL) Messages

Intel Storage System SSR212CC Enclosure Management Software Installation Guide For Red Hat* Enterprise Linux

Intel Service Assurance Administrator. Product Overview

Intel Itanium Processor Family Error Handling Guide February 2010

Express5800/320Lb System Release Notes

Intel Desktop Board DP35DP. MLP Report. Motherboard Logo Program (MLP) 6/17/2008

BIOS Update Release Notes

Intel Simple Network Management Protocol (SNMP) Subagent v6.0

Hybrid Virtualization The Next Generation of XenLinux

BIOS Update Release Notes

I/O Device and Drivers

Intel Extreme Memory Profile (Intel XMP) DDR3 Technology

LANDesk Management Suite 8.7 Extended Device Discovery

EMC RepliStor for Microsoft Windows ERROR MESSAGE AND CODE GUIDE P/N REV A02

CS161: Operating Systems

Intel RAID Software User s Guide:

Processor Reorder Buffer (ROB) Timeout

IOS110. Virtualization 5/27/2014 1

QNAP in vsphere Environment

IBM Europe Announcement ZG , dated March 11, 2008

AKIPS Network Monitor Installation, Configuration & Upgrade Guide Version 15. AKIPS Pty Ltd

Full and Para Virtualization

GRID VGPU FOR VMWARE VSPHERE

Best Practices for Installing and Configuring the Hyper-V Role on the LSI CTS2600 Storage System for Windows 2008

NVIDIA GRID 2.0 ENTERPRISE SOFTWARE

KVM: Kernel-based Virtualization Driver

Cloud based Holdfast Electronic Sports Game Platform

Intel Desktop Board D925XECV2 Specification Update

Difference between Enterprise SATA HDDs and Desktop HDDs. Difference between Enterprise Class HDD & Desktop HDD

MySQL and Virtualization Guide

HRG Assessment: Stratus everrun Enterprise

An Oracle White Paper November Oracle Real Application Clusters One Node: The Always On Single-Instance Database

Intel Desktop Board DG31GL

Audit & Tune Deliverables

OpenBSD s New Suspend and Resume Framework

Enhanced Diagnostics Improve Performance, Configurability, and Usability

Bandwidth Calculations for SA-1100 Processor LCD Displays

Intel RAID Software User s Guide:

EMC DATA DOMAIN DATA INVULNERABILITY ARCHITECTURE: ENHANCING DATA INTEGRITY AND RECOVERABILITY

System Management Interrupt Free Hardware

Running Windows 8 on top of Android with KVM. 21 October Zhi Wang, Jun Nakajima, Jack Ren

Using VMware VMotion with Oracle Database and EMC CLARiiON Storage Systems

Quest vworkspace Virtual Desktop Extensions for Linux

Enterprise-class versus Desktop-class Hard Drives

Hardware Based Virtualization Technologies. Elsie Wahlig Platform Software Architect

IBM Software Information Management Creating an Integrated, Optimized, and Secure Enterprise Data Platform:

High Availability Storage

Monthly Specification Update

Novell SUSE Linux Enterprise Virtual Machine Driver Pack

Virtualization. Pradipta De

kvm: Kernel-based Virtual Machine for Linux

Exadata Database Machine Administration Workshop NEW

Intel Server Board S5000PALR Intel Server System SR1500ALR

COS 318: Operating Systems

Answering the Requirements of Flash-Based SSDs in the Virtualized Data Center

Intel Server S3200SHL

Intel RAID Software User s Guide:

UEFI Driver Development Guide for All Hardware Device Classes

Intel architecture. Platform Basics. White Paper Todd Langley Systems Engineer/ Architect Intel Corporation. September 2010

The Microsoft Windows Hypervisor High Level Architecture

MapGuide Open Source Repository Management Back up, restore, and recover your resource repository.

Transcription:

Ongoing evolution of Linux x86 machine check handling Sept. 2009 Andi Kleen LinuxCon 2009

What's a good error? User has to see it, of course That can be surprisingly difficult Also psychological barriers ( users don't read errors ) High level classification Software error versus hardware error Don't want a hardware error reported as a kernel bug Still have low level details for experts (ideally separated) Identify affected component Do not require low level knowledge to process Works out of the box How serious is the error? Do not upset the user unnecessarily 2

Error sources Machine checks from the CPU NMI PCI-Express Advanced Error Handling (AER) Chipset ACPI4 (APEI) Drivers SATA errors Ethernet... Presentation only covers machine check errors from the CPU This includes memory on modern systems 3

What is a machine check? Machine check is a hardware error reported by the CPU Not a software problem! Hardware corrects most problems, but sometimes it can fail Memory, Interconnect, Cache, Internal errors Uncorrected errors raise exceptions ( MCEs ) Better to stop than to continue with corrupted data Otherwise the corrupted data could hit disk or give wrong results If you aren't sure where it is, stop the machine... Corrected errors are reported in the background Either using a poll handler or with interrupts ( CMCI ) Didn't cause corruption (yet) 4

Why are machine checks important? They report memory errors on modern systems Memory error rate scaling roughly with memory sizes Memory sizes are increasing quickly More cores need more memory Virtualization needs a lot of memory This also means more memory errors So good error handling is important On large clusters, errors are common What's uncommon on a single system becomes common when you have a hundred of them and even very rare events become common on thousands of systems In general, good error diagnosis is useful If you ever searched manually for a bad DIMM Saves time and hassle 5

MCE errors in practice today Error flows for uncorrected and corrected errors Assuming 64 bit or 2.6.31+ with CONFIG_X86_NEW_MCE Mcelog has to be installed Some of these flows are still somewhat clumsy The future will be brighter, hopefully 6

Classic unrecoverable MCE error today System detects uncorrected error Requires ECC for memory Machine check exception happens Machine check handler collects error and prints it out CPU x: Machine check exception... System panics on unrecoverable error On auto reboot (panic=30) can be logged after reboot on many systems Available then decoded in /var/log/mcelog some time after boot Trap: doesn't work with double reboot or with power switch You will see the panic on the console (not in X) or on serial/netconsole If you don't have auto reboot a logging console is very useful Console output can be run through mcelog -ascii to decode Analyze error, based on decoded output For example, map to DIMM ( Memory DIMM ID of error:... ) If common, take corrective action 7

Corrected error flow today A data bit flips Hardware detects error, using checksum, and corrects it and reports event CMCI interrupt happens or poll timer picks it up Kernel logs it to internal buffer accessible through /dev/mcelog Note this buffer can overflow Mcelog picks it up Either as cronjob or in daemon mode or as trigger In cronjob mode up to 10 minutes delay, worst case with polling Mcelog decodes Logged in /var/log/mcelog or syslog Analysis of the log entry Identify component Take corrective action if common 8

Low level MCE handler improvements General overhaul after comprehensive audit A lot of small improvements, too many to list Monarch support: synchronization over all CPUs Collect errors from all CPUs Synchronize all CPUs Process the most serious error first to avoid data corruption When the hardware didn't contain the error shut down Bank sharing Handle shared machine check banks correctly Corrected Machine Check Interrupt Support ( CMCI ) Injector support for testing and a comprehensive test suite 9

MCA recovery New CPU feature in upcoming Nehalem-EX CPU Recovery from some memory uncorrected errors For example, Patrol scrub memory error in the background Required a lot of changes in the MCE handler to do reliable When recovering it's much more important to handle all corner cases OS finds out what the corrupted page does And attempts to get rid of it Machine check architecture has new status bits for recovery Signalled, Action-Required Different types of errors: Action-Optional, Action-Required, UCNA, Corrected 10

HWPoison handling in the VM VM finds out who owns a page and stops using it Pages with copy on disk can be just dropped Or application is killed, if data has no copy IO error for dirty file cache pages Free pages will be ignored on allocation Difficult because error can come in at any time Can disrupt normal page livecycle Error code has to be very careful Also testing is difficult Put page on bad page list and never reuse it again Page table entry of any mappings is poisoned Early kill versus late kill mode 11

Virtualization Virtualization requires a lot of memory Many eggs in a single basket ( servers on a single machine ) Error handling and error containment is important If there's a problem only kill single guest, not all KVM guests act like processes So uses the per process infrastructure in standard hwpoison Uncorrected recoverable errors can be forwarded to a guest Guest can inject the error as machine check (or KVM kills) Guest with MCA recovery support can recover (or panic) Similar code developed for Xen No forwarding for Xen currently 12

Memory error application interface Applications can catch memory errors, which are signals Was needed for KVM, but can be used by others too Applications have often cache, which they can drop Expect only a few specialized applications to use it SIGBUS with an address can be caught BUS_MCEERR_AO Action-Optional error in process Get rid of page specified by si_addr, si_addr_lsb Take out of free list or similar BUS_MCEERR_AR Action-Required error in current execution context Need to abort right now (siglongjmp etc.) Prctl to set early kill vs late kill for the process Late kill is typically better for error aware applications 13

Mcelog User space backend that decodes and processes MCE errors Also identifies components with some firmware help Traditionally on 64 bit x86, now on 32 bit too Traditionally cronjob every 5 minutes, future daemon Daemon mode allows to keep state about errors in memory with query interface and triggers Some old attempts using an on-disk database proved difficult Can run shell scripts on specific events ( triggers ) Notify administrator, offline component,... Long term goal: high level errors in syslog Some steps into this direction 14

Mcelog error flow 15

Error accounting Possible in mcelog daemon mode Often most interesting is which component the error affects DIMM, memory, PCI card, etc. To see trends and replace the right component quickly if needed Individual errors are often not that interesting Errors often come in bursts and individual errors in a burst are not interesting Large clusters can generate a lot of data, which is difficult to process Mcelog moving towards accounting errors per component Only reports n errors in last x hours on component k Triggers when thresholds are exceeded Discovers component names with firmware help Can disable individual error logging for less data 16

Open issues Crashdump handling More testing is always useful Stress test suite under way Any contributions welcome Better error reporting in general More high level errors better presented More error sources in mcelog Intelligent error handling in mcelog If you have ideas, feel free to contact me 17

Resources http://halobates.de/mce.pdf Old paper about Linux machine checks Intel Software Developer's Manual: 3A/B System Programming Guide Description of the x86 Machine check architecture git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 MCE development tree (hwpoison, mce4) git://git.kernel.org/pub/scm/utils/cpu/mcelog.git Mcelog development repository ftp://ftp.kernel.org/pub/linux/utils/cpu/mce/ Mcelog releases git://git.kernel.org/pub/scm/utils/cpu/mce-test.git MCE test suite (or now part of LTP to (http://ltp.sourceforge.net) git://git.kernel.org/pub/scm/utils/cpu/mce-inject.git MCE injector 18

Backup 19

Mcelog versus EDAC EDAC old style driver for chipset memory controllers Exposes a lot of low level details New model: memory controller in CPU Memory errors integrated with machine checks Handled by standard MCE subsystem EDAC needs driver for each platform And often accesses non stable registers that could change even with steppings Mcelog uses standardized interfaces No integration with software Requires special configuration for each board to identify components Cannot do a lot of things that user space (mcelog) can do 20

Testing Testing machine checks is difficult Normal operation doesn't have enough errors So standard Linux community testing model doesn't work very Needs special injection support and test suites Various injectors on software level (without hardware support) Low level machine check injector Page error injector for process and for arbitrary page New injectors, test suite mce-test for testing MCEs Testing low level handler with mce-inject Testing hwpoison VM code in process context Bring pages into specific states and test to see if they can be poisoned Ongoing work to get the best test coverage 21

Legal Information INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel may make changes to specifications, product descriptions, and plans at any time, without notice. All dates provided are subject to change without notice. Intel is a trademark of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright 2009, Intel Corporation. All rights are protected. 22

23