Oak Ridge National Laboratory Computing and Computational Sciences Directorate. Lustre Crash Dumps And Log Files



Similar documents
Oak Ridge National Laboratory Computing and Computational Sciences Directorate. File System Administration and Monitoring

Troubleshooting. System History Log. System History Log Overview CHAPTER

Oracle Linux Advanced Administration

Copyright 2015 SolarWinds Worldwide, LLC. All rights reserved worldwide. No part of this document may be reproduced by any means nor modified,

Virtual Private Systems for FreeBSD

Enterprise Manager Performance Tips

Manage the Endpoints. Palo Alto Networks. Advanced Endpoint Protection Administrator s Guide Version 3.1. Copyright Palo Alto Networks

McAfee Web Gateway 7.4.1

Guidelines for using Microsoft System Center Virtual Machine Manager with HP StorageWorks Storage Mirroring

What is new in Zorp Professional 6

Tech Tip: Understanding Server Memory Counters

Configuring System Message Logging

An Oracle White Paper September Advanced Java Diagnostics and Monitoring Without Performance Overhead

Monitoring Tools for Large Scale Systems

IMF Tune v7.0 Backup, Restore, Replication

Trend Micro Incorporated reserves the right to make changes to this document and to the products described herein without notice.

Tuning WebSphere Application Server ND 7.0. Royal Cyber Inc.

Fiery Clone Tool For Embedded Servers User Guide

Module 12. Configuring and Managing Storage Technologies. Contents:

Informatica Master Data Management Multi Domain Hub API: Performance and Scalability Diagnostics Checklist

How To Install An Aneka Cloud On A Windows 7 Computer (For Free)

Hands-On Microsoft Windows Server 2008

System Monitoring and Diagnostics Guide for Siebel Business Applications. Version 7.8 April 2005

WINDOWS PROCESSES AND SERVICES

Cray Lustre File System Monitoring

PostgreSQL Backup Strategies

IBM Tivoli Composite Application Manager for Microsoft Applications: Microsoft Hyper-V Server Agent Version Fix Pack 2.

EView/400i Management Pack for Systems Center Operations Manager (SCOM)

Configuring Load Balancing

COMMANDS 1 Overview... 1 Default Commands... 2 Creating a Script from a Command Document Revision History... 10

FINFISHER: FinFly ISP 2.0 Infrastructure Product Training

Manage Log Collection. Panorama Administrator s Guide. Version 7.0

Deploying a Virtual Machine (Instance) using a Template via CloudStack UI in v4.5.x (procedure valid until Oct 2015)

Performing Administrative Tasks

WEBTITAN CLOUD. User Identification Guide BLOCK WEB THREATS BOOST PRODUCTIVITY REDUCE LIABILITIES

Caché Data Integrity Guide

RedHat (RHEL) System Administration Course Summary

Review from last time. CS 537 Lecture 3 OS Structure. OS structure. What you should learn from this lecture

Operating Systems. Design and Implementation. Andrew S. Tanenbaum Melanie Rieback Arno Bakker. Vrije Universiteit Amsterdam

Outline. Operating Systems Design and Implementation. Chap 1 - Overview. What is an OS? 28/10/2014. Introduction

XenMobile Logs Collection Guide

Running a Workflow on a PowerCenter Grid

NetBackup Performance Tuning on Windows

LOCKSS on LINUX. Installation Manual and the OpenBSD Transition 02/17/2011

POSIX and Object Distributed Storage Systems

Using Debug Commands

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

Troubleshooting This document outlines some of the potential issues which you may encouter while administering an atech Telecoms installation.

Chapter 14 Virtual Machines

SonicWALL SRA Virtual Appliance Getting Started Guide

Red Hat Linux Internals

Audit Trail Administration

Deploying LoGS to analyze console logs on an IBM JS20

Performance Tuning Guidelines for PowerExchange for Microsoft Dynamics CRM

System Log Setup (RTA1025W Rev2)

Machine check handling on Linux

Caché Data Integrity Guide

EVault Software. Course 361 Protecting Linux and UNIX with EVault

Configuring System Message Logging

Percona Server features for OpenStack and Trove Ops

Veritas Cluster Server

INUVIKA TECHNICAL GUIDE

StarWind iscsi SAN Software: Using an existing SAN for configuring High Availability storage with Windows Server 2003 and 2008

NETWRIX EVENT LOG MANAGER

Cloud.com CloudStack Community Edition 2.1 Beta Installation Guide

S2A6620 Technical Service Bulletin. Cache Protect Battery Replacement

VMware Consolidated Backup: Best Practices and Deployment Considerations. Sizing Considerations for VCB...5. Factors That Affect Performance...

Using Debug Commands

Reinhard Stadler Customer Support Consultant HP Services April Analyzing data and displaying results

How To Configure Syslog over VPN

Example of Standard API

WebSphere Application Server security auditing

VX 9000E WiNG Express Manager INSTALLATION GUIDE

IBRIX Fusion 3.1 Release Notes

Tue Apr 19 11:03:19 PDT 2005 by Andrew Gristina thanks to Luca Deri and the ntop team

How To Install Project Photon On Vsphere 5.5 & 6.0 (Vmware Vspher) With Docker (Virtual) On Linux (Amd64) On A Ubuntu Vspheon Vspheres 5.4

SUSE Manager in the Public Cloud. SUSE Manager Server in the Public Cloud

High Availability of the Polarion Server

StarWind iscsi SAN Software: Implementation of Enhanced Data Protection Using StarWind Continuous Data Protection

Using Process Monitor

ClearPass Policy Manager 6.3

LOCKSS on LINUX. CentOS6 Installation Manual 08/22/2013

AQA GCSE in Computer Science Computer Science Microsoft IT Academy Mapping

Parallels Virtuozzo Containers 4.7 for Linux

Basic System. Vyatta System. REFERENCE GUIDE Using the CLI Working with Configuration System Management User Management Logging VYATTA, INC.

Veritas Storage Foundation UMI error messages for HP-UX

CONFIGURING MICROSOFT SQL SERVER REPORTING SERVICES

2014 Electrical Server Installation Guide

Zing Vision. Answering your toughest production Java performance questions

GRID VGPU FOR VMWARE VSPHERE

The Evolution of Cray Management Services

ZXUN USPP. Configuration Management Description. Universal Subscriber Profile Platform. Version: V

Installing and Configuring a SQL Server 2014 Multi-Subnet Cluster on Windows Server 2012 R2

GL-250: Red Hat Linux Systems Administration. Course Outline. Course Length: 5 days

Cache Configuration Reference

Asterisk SIP Trunk Settings - Vestalink

Wavelink Client License Server Version 4.0 Reference Guide

10 STEPS TO YOUR FIRST QNX PROGRAM. QUICKSTART GUIDE Second Edition

Overview Customer Login Main Page VM Management Creation... 4 Editing a Virtual Machine... 6

Analyzing cluster log files using Logsurfer

Transcription:

Oak Ridge National Laboratory Computing and Computational Sciences Directorate Lustre Crash Dumps And Log Files Jesse Hanley Rick Mohr Sarp Oral Michael Brim Nathan Grodowitz Gregory Koenig Jason Hill Neena Imam November 2015 ORNL is managed by UT-Battelle for the US Department of Energy

Overview Crash Dumps What is a crash dump How does it fit into the Lustre flow How to configure and generate crash dumps How to use crash dumps System and Lustre debug log files Standard messages Tuning messages

Crash Dumps Definition: A portion of the contents of volatile memory (RAM) that is copied to disk whenever the execution of the kernel is disrupted Potential Causes: Kernel Panic* Non Maskable Interrupts (NMI) Machine Check Exceptions (MCE) Hardware failure Manual intervention * Some automatic, others only when explicitly activated

Relevance to Lustre With both clients and servers, Lustre kernel modules start multiple threads to process I/O requests Problems encountered during kernel execution can generate exceptions (known as LBUGs) System reboots are necessary to recover from LBUGs (the processing threads lock to help prevent further damage to the file system) Crash dumps provide a snapshot at the moment of the LBUG, allowing for analysis Analysis can help pinpoint code problems

Choices for dealing with LBUGs The default behavior is to lock the calling thread when encountering LBUGs, requiring a node reboot to resume normal operation Pass information about the LBUG to a custom tool lctl set_param upcall=/path/to/binary Called with 4 parameters: the string LBUG, the file where the LBUG occurred, the function name, and the line number in the file The calling thread remains locked in this scenario This method is not commonly used Create a kernel panic lctl set_param panic_on_lbug=1

Configuring Crash Kernel Need a separate kernel (the crash kernel ) to use when a kernel panic is detected Enabling panic_on_lbug will generate kernel panics when an LBUG is encountered Using kexec, this crash kernel executes, allowing access to the original kernel s memory. Otherwise, this information would be inaccessible RHEL/CentOS/SUSE kernels support creating this kernel by adding a parameter to the boot kernel crashkerneloption Ex: crashkernel=192m Ex: crashkernel=auto

Configuring Kdump The kdump tool handles automatically calling kexec, switching to the crash kernel Install the kexec-toolspackage Configure kdump through the /etc/kdump.conf file By default, core dumps are stored in /var/crash/ Can specify an alternate location, such as an nfs mount: nfs my.nfsserver.example.org:/path/to/export Or an scp target (ssh must be configured): ssh user@my.server.example.org:/dest/path By default, uses sshkey at /root/.ssh/kdump_id_rsa Will create a directory that includes the hostname and date, then write vmcore and vmcore-dmesg.txt files

Configure Kdump to use makedumpfile The makedumpfile utility dumps a kernel s contents into a file As part of the crash workflow, kdump can call makedumpfile to generate a file with the original kernel s contents. This file can then be debugged at some later point. Included within kexec-toolspackage Configured within /etc/kdump.conf: core_collector makedumpfile <options>

makedumpfile Options Make sure to pass the c option (to enable compression) The dump_level (-d) option: Allows omission of certain pages from the dump Is the sum of the chosen filtering levels: The message_level param allows the user to configure which outputs are included Follows a similar sum/bitmask setup as the dump_level parameter Option Description 1 Zero Pages 2 Cache Pages 4 Cache Private 8 User Pages 16 Free Pages Option Description 1 Progress Indicators 2 Common Messages 4 Error Messages 8 Debug Messages 16 Report Messages For example, to try including everything but free pages, but fall back to only collecting cache information: -d 16,25 Defaults to 7: progress indicators, common messages, and error messages

Full kdump configuration example /etc/kdump.conf: nfs my.server.example.org:/data/kdump/dumps core_collector makedumpfile d 16 c message_level 15 This will cause makedumpfile to generate a dump that contains the user, cache, and zero pages (all memory except free pages). Additionally, the kernel dump will include progress indicators, common messages, error messages, and debug messages.

Starting Kdump Ensure the kdump service is enabled: chkconfig kdump on (RHEL/CentOS 6) systemctl enable kdump (RHEL/CentOS 7) Start the service if not started already: service kdump start (RHEL/CentOS 6) systemctl start kdump (RHEL/CentOS 7) Verify that kdump is operational service kdump status (RHEL/CentOS 6) systemctl status kdump (RHEL/CentOS 7)

Using Crash The crash utility allows a user to interact with the contents of a kernel dump via a prompt Install crashpackage Use the appropriate kernel debuginfo and call the crash command: crash /usr/lib/debug/modules/kernel-${version}/vmlinux /path/to/vmcore If unsure of the kernel version of the vmcore: strings vmcore head

Crash commands Within the crash prompt, there are several useful commands for debugging: sys Gather system information:

Crash Commands bt Display a kernel stack backtrace log Dumps the kernel log_buf contents in chronological order ps Displays process status for selected, or all, processes in the system files Displays information about open files of a context dev Dumps character and block device data

Making sense of crash data Information from crash can be hard to digest Not always immediately helpful If a reproducer is found, multiple crash dumps can be compared to find a common pattern (usually through backtraces) Often most useful to vendors/developers

Lustre Syslog Messages As Lustre software code runs on the kernel, singledigit error codes display to the application; these error codes are an indication of the problem. These messages are sent to syslog and are available in /var/log/messages by default. Messages that contain LustreError, LBUG, or assertion failure can be indicative of problems. Also, check the console log (dmesg)

Lustre Debugging Logs Increasing verbosity can help when a problem occurs The overhead of enabling debugging can also hide problems related to race conditions Increasing verbosity can also decrease performance Controlled with: lctl {set,get}_param debug or sysctl lnet.debug Use + and - syntax to append/remove individual items instead of overloading entire setting The default log size is 5MB per CPU, but can be increased: lctl set_param debug_mb=1024 Logs use an in-memory ring buffer, so when they fill up, the oldest records will be discarded

Using lctl for debug logs lctl dk <filepath> will dump the contents of the debug logs to <filepath> Sample lctl run: [root@host ~]# lctl lctl > debug_kernel /tmp/lustre_logs/log_all Debug log: 324 lines, 324 kept, 0 dropped. lctl > filter trace Disabling output of type "trace" lctl > debug_kernel /tmp/lustre_logs/log_notrace Debug log: 324 lines, 282 kept, 42 dropped. lctl > show trace Enabling output of type "trace" lctl > filter portals Disabling output from subsystem "portals" lctl > debug_kernel /tmp/lustre_logs/log_noportals Debug log: 324 lines, 258 kept, 66 dropped.

Debug daemon and lctl The debug daemon periodically dumps the ring buffer contents to disk in binary form The contents need to be converted to human readable Example of using debug_daemon and interactive lctl to dump debug files to a 40 MB file: lctl lctl > debug_daemon start /var/log/lustre.40.bin 40 <run filesystem operation(s) to debug (outside of the lctl session)> lctl > debug_daemon lctl > debug_file stop /var/log/lustre.40.bin /var/log/lustre.log Optionally, start a daemon with an unlimited file size: lctl > debug_daemon start /var/log/lustre.bin

Conclusion Crash Cycle Debug information Exception encountered during execution (LBUG) Kernel Panic Generated Kdump launches crash kernel Kernel dump saved to file (makedumpfile) System logs Lustre debugging logs Can specify debug targets Can specify how much data to collect Analysis crash utility

Acknowledgements This work was supported by the United States Department of Defense (DoD) and used resources of the DoD-HPC at Oak Ridge National Laboratory.