Open Source High Availability Writing Resource Agents for your own services. Lars Marowsky-Brée Team Lead SUSE Labs lmb@suse.de



Similar documents
High Availability Storage

Running SAP HANA One on SoftLayer Bare Metal with SUSE Linux Enterprise Server CAS19256

High Availability and Disaster Recovery for SAP HANA with SUSE Linux Enterprise Server for SAP Applications

Big Data, SAP HANA. SUSE Linux Enterprise Server for SAP Applications. Kim Aaltonen

Challenges Implementing a Generic Backup-Restore API for Linux

Build Platform as a Service (PaaS) with SUSE Studio, WSO2 Middleware, and EC2 Chris Haddad

Oracle Products on SUSE Linux Enterprise Server 11

SUSE Customer Center Roadmap

TUT8155 Best Practices: Linux High Availability with VMware Virtual Machines

HO15982 Deploy OpenStack. The SUSE OpenStack Cloud Experience. Alejandro Bonilla. Michael Echavarria. Cameron Seader. Sales Engineer

Ceph Distributed Storage for the Cloud An update of enterprise use-cases at BMW

Implementing the SUSE Linux Enterprise High Availability Extension on System z

SUSE OpenStack Cloud 4 Private Cloud Platform based on OpenStack. Gábor Nyers Sales gnyers@suse.com

Deploying Hadoop with Manager

Software Defined Everything

Installing, Tuning, and Deploying Oracle Database on SUSE Linux Enterprise Server 12 Technical Introduction

Based on Geo Clustering for SUSE Linux Enterprise Server High Availability Extension

Advanced Systems Management with Machinery

Operating System Security Hardening for SAP HANA

A new Cluster Resource Manager for heartbeat

We are watching SUSE

Data Center Automation with SUSE Manager Federal Deployment Agency Bundesagentur für Arbeit Data Center Automation Project

How To Make A Cloud Work For You

SUSE Storage. FUT7537 Software Defined Storage Introduction and Roadmap: Getting your tentacles around data growth. Larry Morris

Using SUSE Cloud to Orchestrate Multiple Hypervisors and Storage at ADP

TUT5605: Deploying an elastic Hadoop cluster Alejandro Bonilla

SUSE Linux uutuudet - kuulumiset SUSECon:sta

SUSE Cloud 5 Private Cloud based on OpenStack

Configuration Management in SUSE Manager 3

Public Cloud. Build, Use, Manage. Robert Schweikert. Public Cloud Architect

Implementing Linux Authentication and Authorisation Using SSSD

SUSE Enterprise Storage Highly Scalable Software Defined Storage. Gábor Nyers Sales

Workflow und Identity Management - Genehmigungsprozesse, Role Mining, Role Design und Compliance Management

Pacemaker. A Scalable Cluster Resource Manager for Highly Available Services. Owen Le Blanc. I T Services University of Manchester

Relax-and-Recover. Johannes Meixner. on SUSE Linux Enterprise 12.

Wicked A Network Manager Olaf Kirch

Zen and The Art of High-Availability Clustering Lars Marowsky-Brée

Achieving High Availability on Linux for System z with Linux-HA Release 2

SUSE Linux Enterprise High Availability Extension. Piotr Szewczuk konsultant

Novell Collaboration Vibe OnPrem

Wicked Trip into Wicked Network Management

Using btrfs Snapshots for Full System Rollback

How an Open Source Cloud Will Help Keep Your Cloud Strategy Options Open

Btrfs and Rollback How It Works and How to Avoid Pitfalls

Using SUSE Linux Enterprise to "Focus In" on Retail Optical Sales

Highly Available NFS Storage with DRBD and Pacemaker

HO5604 Deploying MongoDB. A Scalable, Distributed Database with SUSE Cloud. Alejandro Bonilla. Sales Engineer abonilla@suse.com

How SUSE Is Helping You Rock The Public Cloud

Best Current Practices for SUSE Linux Enterprise High Availability 12

SUSE Linux Enterprise 12 Security Certifications Common Criteria, EAL, FIPS, PCI DSS,... What's All This About?

OpenLDAP in High Availability Environments

Implementing the SUSE Linux Enterprise High Availability Extension on System z Mike Friesenegger

High-Availability Using Open Source Software

DevOps and SUSE From check-in to deployment

Kangaroot SUSE TechUpdate Interoperability SUSE Linux Enterprise and Windows

Novell PlateSpin Orchestrate

PVFS High Availability Clustering using Heartbeat 2.0

Overview: Clustering MySQL with DRBD & Pacemaker

SA Server 2.0. Application Note : Evidian SafeKit 7.0.4, Failover

kgraft Live patching of the Linux kernel

Leveraging Wikis to Manage SCP Documentation TWiki Novell Technical Services

Linux w chmurze publicznej SUSE na platformie Microsoft Azure

MySQL and Virtualization Guide

Securing Your System: Security Hardening Techniques for SUSE Linux Enterprise Server

SUSE Linux Enterprise Server

MySQL High Availability Solutions. Lenz Grimmer OpenSQL Camp St. Augustin Germany

CAS18543 Migration from a Windows Environment to a SUSE Linux Enterprise based Infrastructure Liberty Christian School

Building a High-Availability PostgreSQL Cluster

KVM, OpenStack and the Open Cloud SUSECon November 2015

EMC Documentum Interactive Delivery Services Accelerated: Step-by-Step Setup Guide

SUSE Linux Enterprise 12 Security Certifications

Siebel Installation Guide for UNIX. Siebel Innovation Pack 2013 Version 8.1/8.2, Rev. A April 2014

NetIQ Identity Manager Setup Guide

Automatic Service Migration in WebLogic Server An Oracle White Paper July 2008

Oracle Communications WebRTC Session Controller: Basic Admin. Student Guide

Of Pets and Cattle and Hearts

PRM and DRBD tutorial. Yves Trudeau October 2012

Introducing Director 11

High Availability Solutions for the MariaDB and MySQL Database

HyperFS PC Client Tools

IBM WebSphere Partner Gateway V6.2.1 Advanced and Enterprise Editions

Backup and Recovery for SAP Environments using EMC Avamar 7

Netezza PureData System Administration Course

FAQ: Data Services Real Time Set Up

Novell Nsure Audit Installation Guide. novdocx (ENU) 01 February Novell NsureTM Audit INSTALLATION GUIDE.

NIST ITL July 2012 CA Compromise

Advanced Service Design

From Idea to Working Deployment:

Hadoop Basics with InfoSphere BigInsights

File Management Suite. Novell. Intelligently Manage File Storage for Maximum Business Benefit. Sophia Germanides

IBM XIV Storage System. MSCS Guide. Version 1.0.x

Backup & Restore with SAP BPC (MS SQL 2005)

HA for Enterprise Clouds: Oracle Solaris Cluster & OpenStack

Transcription:

Open Source High Availability Writing Resource Agents for your own services Lars Marowsky-Brée Team Lead SUSE Labs lmb@suse.de

Agenda Introduction Resource Agents in context Basic Resource Agents (+ code) Extended Resource Agents (+ code) Testing Questions and answers 2

Context

Where do resource agents fit in? The goal of the cluster is to: Protect the data To keep the services up > fail- or switch-over The cluster manages many different services. The operations needed to perform are quite similar. Resource Agents provide the glue between the cluster stack and the actual service: Provide a common, parameterized interface to manage the services 4

Objects: Primitive resources Class: LSB, heartbeat,or OCF Resource Agent Provider (for OCF only) Type: Filesystem, Apache, Database... Instance parameters:(global / per-node) Supplied to the RA to identify the resource Supported operations and/or special timeouts 5

Heartbeat 2.0.x already provides... OCF Resource Agents for SAP - Developed together with SAP LinuxLab Oracle, MySQL, IBM DB2, Postgres Xen IBM Web Application Server Filesystems, Webservers, IP addresses, FTP, drbd,... Easily extended with own plugins 6

Major System Components Heartbeat master control process spawns: Local Resource Manager (LRM), which spawns: > Resource Agents, which manage the services > STONITH Agents tie-in here Cluster Consensus Membership Layer (CCM) Cluster Resource Manager (CRM), which spawns: > Cluster Information Base (CIB) (r/w on the DC node only) > Policy Engine (PE) (on the DC node only) > Transition Engine (TE) (on the DC node only) 7

Architecture CRM PE CIB TE (DC only) (DC only) LRM Heartbeat child CRM child init child CCM STONITH daemon * apphbd child = everybody calls this logd* heartbeat apphbd Recovery Manager Infrastucture* clplumbing utility library IPC library PILS 8

Basic Flow of Events Heartbeat reports node liveness event Consensus Membership synchronizes across cluster CRM (re-)elects a new Designated Coordinator (DC) Elected Designated Coordinator orchestrates response to events: Nodes coming and going Resources reporting monitor failures Adminstrative changes to the configuration Policy Engine computes needed operations Transitioner puts plan into action 9 LRM calls resource agents to (re-)start services Novell Final Inc. All cluster rights reserved state: Idle.

Basic operations

OCF Resource Agent fundamentals Installed under /usr/lib/ocf/resource.d/provider/type Use your own provider name to not clash with heartbeat's own Configured parameters (instance_attributes): Passed in environment with prefix $OCF_RESKEY_name=Value meta_attributes -> $OCF_RESKEY_crm_meta_... One commandline argument: The requested operation Must return well defined exit codes Shell library provides common functions: ocf_log (crit err warn info debug) Log message 11

Basic goals of a RA Provide common abstraction layer Start, stop, monitor the service Provide meta-data about the service for configuration Validate configuration Support some extended operations going forward 12

Deciding on good resource parameters Must clearly identify one resource instance Should allow several instances to be run on one node Should be flexible enough to customize the RA Should not overload the cluster configuration If the configuration is very complex, point to a configuration file IP address, for example, is small enough to carry all its configuration in the attributes 13

start After this operation has completed, service should be started Starting an already started service must not be an error! Service must be functional after start returns. 14

stop After this operation has completed, service must be fully stopped! Stopping an already started service must not be an error! Might want to implement shutdown escalation If after 50% of $OCF_RESKEY_crm_meta_timeout the service isn't stopped, kill it by force. A stop failure is considered fatal: The resource cannot be started anywhere else The cluster must resort to extended measures to clean up! 15

monitor Return the status of the service: 0 ($OCF_SUCCESS): Service running fine 7 ($OCF_NOT_RUNNING): Service NOT running at all 1 ($OCF_ERR_GENERIC): Considered active but failed This three-way distinction is NOT optional! 16

Startup monitor call (probes) CRM checking for active resources during startup Why? Failures and administrative accidents. Called everywhere, on all nodes Be careful to not return errors for resources which are stopped, otherwise recovery will begin 17

meta-data Print an XML chunk to stdout, describing The RA itself The parameters it accepts > Name > Description > Type (Integer, string) > Suggested default Operations supported > Name > Suggested timeouts and repeat intervals Descriptions can be localized 18

meta-data meta_data() { cat <<END <?xml version="1.0"?> <!DOCTYPE resource agent SYSTEM "ra api 1.dtd"> <resource agent name="clustermon"> <version>1.0</version> <longdesc lang="en"> This is a ClusterMon Resource Agent. It outputs current cluster status to the html. </longdesc> <shortdesc lang="en">clustermon resource agent</shortdesc> 19

meta-data (parameters) <parameters> <parameter name="user" unique="0"> <longdesc lang="en">the user we want to run crm_mon as</longdesc> <shortdesc lang="en">the user we want to run crm_mon as</shortdesc> <content type="string" default="root" /> </parameter>... </parameters> 20

validate-all Validate the instance attributes Syntax checking Be careful: This is called independently from start/stop state Not all prerequisites may be present Not actually called by anyone right now ;-) 21

Advanced Resources

Beyond start/stop Wouldn't it be nice if the cluster could move resources without shutting them down? meta_attribute: allow_migrate If the source and the target are healthy, migrate_to will be invoked to push the resource to OCF_RESKEY_CRM_meta_migrate_target migrate_from will be invoked to pull the resource and complete the migration If either one fails, reverts to stop/start Used to live migrate Xen guests. 23

What are Clones? A convenient way to start N copies of a resource on M nodes Types: Anonymous, Globally Unique Two extra environment variables: OCF_RESKEY_crm_meta_clone, OCF_RESKEY_crm_meta_clone_max Currently used to support OCFS2 and fencing Must use an OCF type resource agent See http://www.linux-ha.org/v2/concepts/clones 24

Probes & Clones The CRM will send a probe for each clone instance that may be active on the machine Anonymous Clones: Nodes must respond with ${OCF_SUCCESS} to exactly N of these probes. N is the number of active clone instances on the node All other probes: ${OCF_NOT_RUNNING} 25

Probes & Clones (cont'd) The CRM will send a probe for each clone instance that may be active on the machine Globally Unique Clones: Nodes must respond with ${OCF_SUCCESS} only if the particular clone instance is active. Check OCF_RESKEY_clone and OCF_RESKEY_clone_max to be sure All other probes: ${OCF_NOT_RUNNING} 26

Clone: receiving peer information Optional extra operation: notify meta_attribute: notify set to 1 Two types of notifications: pre & post Extra environment variables: OCF_RESKEY_crm_meta_globally_unique (default: true) OCF_RESKEY_crm_meta_notify_type OCF_RESKEY_crm_meta_notify_operation OCF_RESKEY_crm_meta_notify_{x}_resource OCF_RESKEY_crm_meta_notify_ {x}_uname {x} : start, stop, active, inactive 27

Multi-state resources Superset of a clone Can be either anonymous or unique Currently supports two states: After start always in slave mode promotion/demote are separate operations Used mostly for active/passive replication scenarios (drbd) 28

Multi-State RA Extensions Operations promote, demote, optional: notify Additional monitor return values $OCF_RUNNING_MASTER $OCF_FAILED_MASTER Extra environment variables: OCF_RESKEY_crm_meta_notify_{x}_resource OCF_RESKEY_crm_meta_notify_ {x}_uname {x} : promote, demote, master, slave 29

Multi-state: defining master ability During start or post-notify for start: crm_master -v X must be called X = 0 (or not called): cannot become master X > 0: Highest value node will be choosen as master monitor can adjust the values, if needed Example: drbd resource agent implements this 30

Testing...

Setup the environment Make sure all dependencies are started locally Setup the environment variables Call the agent manually with the desired operations Or use our automated test harness 32

Testing Your Agent # /usr/sbin/ocf-tester -n Test -o ip=172.16.1.1 -o netmask=255.255.255.0 /usr/lib/ocf/resource.d/heartbeat/ipaddr Beginning tests for /usr/lib/ocf/resource.d/heartbeat/ipaddr... * Your agent does not support the notify action (optional) * Your agent does not support the demote action (optional) * Your agent does not support the promote action (optional) * Your agent does not support master/slave (optional) * Your agent does not support the reload action (optional) /usr/lib/ocf/resource.d/heartbeat/ipaddr passed all tests 33

Curious...?

Where to go From Here: See also IO125, TUT234 Test using Xen! Build your highly available media archive at home using opensuse Book recommendations: Blueprints for High Availability by Evan Marcus and Hal Stern The Linux Enterprise Cluster by Karl Kopper Online resources: http://www.linux-ha.org/ http://www.lcic.org/ 35

Questions? Feedback?

Unpublished Work of Novell, Inc. All Rights Reserved. This work is an unpublished work and contains confidential, proprietary, and trade secret information of Novell, Inc. Access to this work is restricted to Novell employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of Novell, Inc. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability. General Disclaimer This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. Novell, Inc. makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for Novell products remains at the sole discretion of Novell. Further, Novell, Inc. reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All Novell marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.