Zen and The Art of High-Availability Clustering Lars Marowsky-Brée


Zen and The Art of High-Availability Clustering Lars Marowsky-Brée Architect Storage and High-Availability, Principal Engineer lmb@novell.com

Agenda
Introduction
Summary of cluster architecture
Common configuration issues
Gathering cluster-wide support information
Exploring effects of cluster events
Self-written resource agents
Understanding log files 2

Introduction

Data Center Challenges
Minimize unplanned downtime
Ensure quality of service
Contain costs
Utilize resources
Effectively manage multiple vendors
Minimize risk 4

Pacemaker and corosync/openais-based Stacks Key Features
Service Availability 24/7: Policy-driven clustering
> OpenAIS messaging and membership layer
> Pacemaker cluster resource manager
Sharing and Scaling Data Access by Multiple Nodes: Cluster file system
> OCFS2
> Clustered logical volume manager
Disaster Tolerance: Data replication via IP
> Distributed Replicated Block Device (DRBD)
Scale Network Services: IP load-balancing
User-friendly Tools: Graphical user interface, unified command line interface 5

Cluster Architecture

3-Node Cluster Overview [Slide shows an architecture diagram: clients connect over redundant network links to three nodes; each node's kernel runs corosync + openais with Pacemaker on top, managing resources such as a Xen VM with a LAMP stack (Apache, IP address, ext3), a second Xen VM, and clvm2 + OCFS2 with the DLM, all backed by shared storage.] 7

Detailed View of Components [Slide shows the full per-node stack: resource agents (SAP, MySQL, libvirt/Xen, Apache, iSCSI, filesystems, IP address, DRBD), LSB init scripts, and STONITH plug-ins (DRAC, iLO, SBD) driven by the LRM; Pacemaker with the CIB and Policy Engine; management front-ends (web GUI, Python GUI, CRM shell, YaST2); membership and messaging via OpenAIS/corosync with clvmd, ocfs2_controld, and dlm_controld; the storage stack with ext3/XFS, DRBD, clvm2, OCFS2, DLM, and multipath I/O over SAN FC(oE)/iSCSI or local disks; and kernel networking via SCTP/TCP/UDP multicast over bonded Ethernet or InfiniBand.] 8

Why Is This Talk Necessary?
Can't you just make the software stack easy to understand?
Why is a multi-node setup more complicated than a single node?
Gosh, this is awfully complicated!
Why is this stuff so powerful? I don't need those other features!
This session will address most of these questions. 9

Design and Architecture Considerations

General Considerations
Consider the support level requirements of your mission-critical systems.
Your staff is your key asset! Invest in training, processes, and knowledge sharing. A good administrator will provide higher availability than a mediocre cluster setup.
Get expert help for the initial setup, and write concise operation manuals that make sense at 3am on a Saturday ;-)
Thoroughly test the cluster at deployment, and regularly thereafter! Use a staging system before deploying large changes! 11

Manage Expectations Properly
Clustering improves reliability, but never achieves 100%.
Fail-over clusters reduce service outages, but do not eliminate them.
HA protects the data before the service.
Clusters are more complex than single nodes.
Clustering broken applications will not fix them.
Clusters do not replace backups, RAID, or good hardware. 12

Complexity Versus Reliability
Every component has a failure probability.
Good complexity: redundant components. Undesirable complexity: chained components; a choke point is a single point of failure.
Also consider administrative complexity: use as few components (features) as feasible. Our extensive feature list is not a mandatory checklist for your deployment ;-)
What is your fall-back in case the cluster breaks? Backups, non-clustered operation. Architect your system so that this is feasible! 13

Cluster Size Considerations
More nodes: increased absolute redundancy and capacity, but decreased relative redundancy.
One cluster is one failure and security domain.
HA is not HPC: does your work-load scale well to larger node counts?
Choose odd node counts: 4- and 3-node clusters both lose majority after 2 node failures.
Question: 5 cheaper servers, or 3 higher-quality servers with more capacity each? 14
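The majority rule behind the odd-node-count advice can be checked with a one-line helper (a sketch; the `quorum` function name is ours, not a cluster tool):

```shell
# Majority quorum: a partition needs more than half of all votes,
# i.e. floor(n/2) + 1 for n single-vote nodes.
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 3   # -> 2: a 3-node cluster survives 1 failure
quorum 4   # -> 3: a 4-node cluster also survives only 1 failure
quorum 5   # -> 3: a 5-node cluster survives 2 failures
```

The fourth node buys capacity, but no additional failure tolerance.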

Common Setup Issues

General Software Stack
Please avoid chasing already-solved problems: apply all available software updates as soon as feasible.
Consider a pre-built distribution. 16

Networking: L2
Switches must support multicast properly; IGMP snooping sometimes works and sometimes doesn't.
Consider isolating cluster traffic on separate VLANs.
Keep switch firmware up to date!
Bonding is preferable to using multiple rings:
> Reduces complexity
> Exposes redundancy to all cluster services and clients
Firewall rules are not your friend. 17
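On SUSE Linux Enterprise, a bonded interface is a minimal config file; a sketch only, assuming eth0/eth1 as slaves and an example address (adjust mode and monitoring to your switches):

```
# /etc/sysconfig/network/ifcfg-bond0
STARTMODE='auto'
BOOTPROTO='static'
IPADDR='192.168.1.10/24'
BONDING_MASTER='yes'
BONDING_SLAVE0='eth0'
BONDING_SLAVE1='eth1'
BONDING_MODULE_OPTS='mode=active-backup miimon=100'
```

Because the bond lives below corosync, all cluster services and clients inherit the link redundancy automatically.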

Networking
Make NIC names identical on all nodes.
Local hostname resolution versus DNS: prefer local resolution so the cluster does not depend on an external service.
Address assignment via DHCP: make sure it is static!
Set up NTP for time synchronization. 18

From One to Many Nodes
Error: configuration files not identical across nodes (/etc/drbd.conf, /etc/corosync/corosync.conf, /etc/ais/openais.conf, resource-specific configurations, ...).
Symptoms: weird misbehavior, works on one but not on other systems, interoperability issues, and possibly others.
Solution: make sure they are synchronized. csync2 does this automatically for you.
> You can add your own files to this list as needed. 19
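A quick way to spot drift before it bites is to compare checksums of the same file as collected from each node (a sketch; `configs_identical` is our illustrative helper, and in practice you would fetch the copies via scp or simply let csync2 do the job):

```shell
# Compare any number of local copies of a config file; succeeds only
# if every copy has the same md5 checksum.
configs_identical() {
    [ "$(md5sum "$@" | awk '{print $1}' | sort -u | wc -l)" -eq 1 ]
}
```

Run it against copies pulled from every node; a non-zero exit means at least one node has drifted.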

Fencing (STONITH)
Error: not configuring STONITH at all. It defaults to enabled; resource start-up will block and the cluster will simply do nothing. This is for your own protection.
Warning: with STONITH disabled, DLM/OCFS2 will block forever waiting for a fence that is never going to happen.
Error: using external/ssh, ssh, or null in production. These plug-ins are for testing; they will not work in production! Use a real fencing device or external/sbd.
Error: configuring several power switches in parallel.
Error: trying to use external/sbd on DRBD. 20
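For illustration, an SBD-based fencing setup in crm shell syntax might look like this (a sketch only; the device path is a placeholder, and the small shared partition must have been initialized with the sbd tool beforehand):

```
# crm configure
primitive stonith-sbd stonith:external/sbd \
    params sbd_device="/dev/disk/by-id/example-shared-lun-part1"
property stonith-enabled="true"
```

Because SBD fences through the shared disk itself, it needs no management network or power switch, but it is unsuitable on top of DRBD, as noted above.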

Virtualization and Clustering
Hypervisor cluster managing VMs, or clustering VMs internally?
Hide bonding at the hypervisor layer.
The cluster will happily live-migrate a crashed kernel...
You can mix physical and virtual nodes in a cluster, but don't host the virtual nodes on the physical ones! 21

CIB Configuration Issues: Placement
Even though I ordered them, they are starting on different nodes! Use (anti-)colocation.
> You can specify who wins using priority.
Resources are starting up on the wrong nodes; my location constraints are too complex! Use utilization.
Resources move around when something unrelated changes:
# crm configure property default-resource-stickiness=1000 22
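As a sketch (the resource names apache and virtual-ip are invented for illustration), colocation and ordering are expressed as two separate constraints:

```
# crm configure
colocation web-with-ip inf: apache virtual-ip
order ip-before-web inf: virtual-ip apache
```

The colocation keeps both resources on the same node; the order constraint only controls start sequence. Neither implies the other, which is exactly the trap the slide describes.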

CIB Configuration Issues: Quorum
2-node clusters cannot have majority with 1 node failed:
# crm configure property no-quorum-policy=ignore
no-quorum-policy=stop can make OCFS2/DLM misbehave.
no-quorum-policy=freeze is a safe default for 2+ nodes. 23

CIB Configuration Issues: Misc
Resources are starting up in random order, even though I colocated them! Order != placement.
Clones most frequently need to have interleave=true.
Use simple constructs where possible!
# crm_verify -L ; ptest -L -VVVV
will point out some basic issues.
The CIB is easily the most important bit of your configuration, and the most easily misunderstood... We'll get back to that... 24
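For example (a sketch; the names are illustrative, with the DLM controller as a typical clone candidate):

```
# crm configure
primitive dlm ocf:pacemaker:controld op monitor interval="60s"
clone dlm-clone dlm meta interleave="true"
```

With interleave=true, dependent clone instances on a node only wait for the peer instance on that same node, instead of for the entire clone set cluster-wide.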

Configuring Cluster Resources
Symptom: on start of one or more nodes, the cluster restarts resources!
Cause: resources under cluster control are also started via the init sequence. The cluster probes all resources on start-up of a node, and when it finds resources active where they should not be (possibly even more than once in the cluster), the recovery protocol is to stop them all (including all dependencies) and start them cleanly again.
Solution: remove them from the init sequence. 25
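On SUSE Linux Enterprise this means disabling the service's init script (apache2 here is just an example service name):

```
# insserv -r apache2    # remove from the boot sequence (or: chkconfig apache2 off)
```

After this, only the cluster starts and stops the service, so the start-up probe finds a clean state.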

Setting Resource Time-outs
Belief: shorter time-outs make the cluster respond faster.
Fact: too-short time-outs cause resource operations to fail erroneously, making the cluster unstable and unpredictable. A somewhat too-long time-out will cause a fail-over delay; a slightly too-short time-out will cause an unnecessary service outage.
Consider that a loaded cluster node may be slower than during deployment testing.
Check crm_mon -t1 output for the actual run-times of resources. 26
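Time-outs are set per operation in the CIB; a sketch with illustrative values (the resource name, address, and numbers are examples only, and the right values come from the measured run-times mentioned above, plus generous headroom):

```
# crm configure
primitive vip ocf:heartbeat:IPaddr2 \
    params ip="192.168.1.100" \
    op start timeout="90s" \
    op stop timeout="100s" \
    op monitor interval="10s" timeout="60s"
```

Note that a failed stop (including a stop time-out) typically escalates to fencing, so the stop time-out deserves the most headroom.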

Storage in General
Activating non-battery-backed caches for performance causes data corruption.
iSCSI over unreliable networks.
Lack of multipath for storage.
Believing that RAID replaces backups: RAID and replication immediately propagate logical errors!
Please ensure that device names are identical on all nodes. 27

OCFS2
Using ocfs2-tools-o2cb (legacy mode): O2CB only works with Oracle RAC; the full features of OCFS2 are only available in combination with Pacemaker.
# zypper rm ocfs2-tools-o2cb
Forget about /etc/ocfs2/cluster.conf, /etc/init.d/ocfs2, /etc/init.d/o2cb and /etc/sysconfig/ocfs2.
Nodes crash on shutdown: if you have active OCFS2 mounts, you need to umount them before shutdown. If openais is part of the boot sequence:
> # insserv openais
Consider: do you really need OCFS2? Can your application really run concurrently? 28

Host-based RAID
Active/Active: possible with clvm2 and cmirrord. Disk failures are costly; consider MPIO. Consider centralizing your storage infrastructure!
Active/Passive: works fine with the md / Raid1 resource agent. Use recent versions to get reassembly on switch-over and monitoring. Preferable if you do not require active/active. 29

Distributed Replicated Block Device
Myth: DRBD has no shared state, thus no STONITH is needed. Fact: DRBD still needs fencing!
Active/Active: does not magically make ext3 or applications concurrency-safe; ext3 can still only be mounted once. With OCFS2, split-brain is still fatal, as data diverges!
Active/Passive: ensures only one side can modify data, which adds protection. It does not magically make applications crash-safe.
Issue: replication traffic during reads; use the noatime mount option. 30
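Hooking DRBD's fencing into the cluster is typically done in drbd.conf; a sketch (the resource name r0 is an example, the handler scripts ship with DRBD, and the section that holds the fencing option varies between DRBD versions):

```
resource r0 {
  disk {
    fencing resource-only;
  }
  handlers {
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

The fence-peer handler places a constraint in the CIB that keeps the outdated side from being promoted until resynchronization completes.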

Exploring the Effect of Events

What Are Events?
All state changes to the cluster are events, and they cause an update of the CIB: configuration changes by the administrator, nodes going up or down, resource monitoring failures.
The response to events is configured using the CIB policies and computed by the Policy Engine. This can be simulated using ptest, available comfortably through the crm shell. 32

Simulating Node Failure
hex-0:~ # crm
crm(live)# cib new sandbox
INFO: sandbox shadow CIB created
crm(sandbox)# cib cibstatus node hex-0 unclean
crm(sandbox)# ptest 33

Simulating Node Failure 34

Simulating Resource Failure
crm(sandbox)# cib cibstatus load live
crm(sandbox)# cib cibstatus op
usage: op <operation> <resource> <exit_code> [<op_status>] [<node>]
crm(sandbox)# cib cibstatus op start dummy1 not_running done hex-0
crm(sandbox)# cib cibstatus op start dummy1 unknown timeout hex-0
crm(sandbox)# configure ptest
ptest[4918]: 2010/02/17_12:44:17 WARN: unpack_rsc_op: Processing failed op dummy1_start_0 on hex-0: unknown error (1) 35

Simulating Resource Failure 36

Exploring Configuration Changes
crm(sandbox)# cib cibstatus load live
crm(sandbox)# configure primitive dummy42 ocf:heartbeat:Dummy
crm(sandbox)# ptest 37

Configuration changes - Woah! 38

Exploring Configuration Changes
crm(sandbox)# configure rsc_defaults resource-stickiness=1000
crm(sandbox)# ptest
crm(sandbox)# configure colocation order-42 inf: dummy42 dummy1
crm(sandbox)# ptest 39

Configuration changes Almost... 40

Configuration changes - done 41

Log Files and Their Meaning

hb_report Is the Silver Support Bullet
Compiles cluster-wide log files, package state, DLM/OCFS2 state, system information, CIB history, and parsed core dump reports (install the debuginfo packages!) into a single tarball for all support needs.
# hb_report -n node1 node2 node3 -f 12:00 /tmp/hb_report_example1 43

Logging
The cluster generates too many log messages! Alas, customers are even more upset when asked to reproduce a problem on their production system ;-) Incidentally, all command line invocations are logged.
System-wide logs: /var/log/messages
CIB history: /var/lib/pengine/*
All cluster events are logged here and can be analyzed in hindsight (Python GUI, ptest, and the crm shell). 44

Where Is the Real Cause? The answer is always in the logs. Even though the logs on the DC may print a reference to the error, the real cause may be on another node. Most errors are caused by resource agent misconfiguration. 45

Correlating Messages to Their Cause
Feb 17 13:06:57 hex-8 pengine: [7717]: WARN: unpack_rsc_op: Processing failed op ocfs2-1:2_monitor_20000 on hex-0: not running (7)
This is not the failure, just the Policy Engine reporting on the CIB state! The real messages are on hex-0; grep for the operation key:
Feb 17 13:06:57 hex-0 Filesystem[24825]: [24861]: INFO: /filer is unmounted (stopped)
Feb 17 13:06:57 hex-0 crmd: [7334]: info: process_lrm_event: LRM operation ocfs2-1:2_monitor_20000 (call=37, rc=7, cib-update=55, confirmed=false) not running
Look for the error messages from the resource agent before the lrmd/pengine lines! 46

Debugging Resource Agents

Common Resource Agent Issues
Operations must succeed if the resource is already in the requested state.
monitor must distinguish between at least running/ok, running/failed, and stopped.
Probes deserve special attention.
Meta-data must conform to the DTD.
Third-party resource agents do not belong under /usr/lib/ocf/resource.d/heartbeat; choose your own provider name!
Use ocf-tester to validate your resource agent. 48
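A minimal agent skeleton honoring these rules might look as follows (a sketch under assumptions: the state-file path, agent name, and function names are invented, and the meta-data is trimmed to the bare minimum):

```shell
#!/bin/sh
# Sketch of an OCF resource agent that tracks a dummy service via a
# state file. Exit codes follow the OCF convention:
# 0 = success / running, 7 = not running, 3 = unimplemented action.
STATE="${OCF_RESKEY_state:-/tmp/example-ra.state}"

ra_monitor() {
    # Must report "stopped" (7) cleanly, including for probes on
    # nodes where the resource has never run.
    [ -f "$STATE" ] && return 0
    return 7
}

ra_start() {
    # Starting an already-running resource must succeed.
    ra_monitor && return 0
    touch "$STATE"
}

ra_stop() {
    # Stopping a stopped resource must also succeed.
    rm -f "$STATE"
    return 0
}

ra_meta_data() {
    cat <<'EOF'
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="example">
  <version>0.1</version>
  <longdesc lang="en">Minimal example agent.</longdesc>
  <shortdesc lang="en">Example</shortdesc>
  <parameters/>
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="20s" interval="10s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
}

# Dispatch only when invoked with an action argument.
if [ $# -gt 0 ]; then
    case "$1" in
        start)     ra_start ;;
        stop)      ra_stop ;;
        monitor)   ra_monitor ;;
        meta-data) ra_meta_data ;;
        *)         exit 3 ;;
    esac
fi
```

Installed under your own provider directory (never under .../heartbeat), such an agent can then be exercised with ocf-tester as shown on the next slide.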

ocf-tester Example Output hex-0:~ # ocf-tester -n Example /usr/lib/ocf/resource.d/bs2010/dummy Beginning tests for /usr/lib/ocf/resource.d/bs2010/dummy... * Your agent does not support the notify action (optional) * Your agent does not support the demote action (optional) * Your agent does not support the promote action (optional) * Your agent does not support master/slave (optional) * rc=7: Stopping a stopped resource is required to succeed Tests failed: /usr/lib/ocf/resource.d/bs2010/dummy failed 1 tests 49

Questions and Answers

Community Resources
Documentation is excellent:
http://www.novell.com/documentation/sle_ha/
http://www.clusterlabs.org/wiki/documentation
> Full reference documentation, step-by-step guides, pointers to more documentation
Very active community & discussions: http://www.clusterlabs.org/wiki/mailing_lists
Bug reports: http://developerbugs.linux-foundation.org/
> Remember to include hb_report! 51

Unpublished Work of Novell, Inc. All Rights Reserved. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of Novell, Inc. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability. General Disclaimer This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. Novell, Inc. makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for Novell products remains at the sole discretion of Novell. Further, Novell, Inc. reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All Novell marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.