High Availability and Backup Strategies for the Lustre MDS Server




HA and Backup Methods for Lustre Hepix May '08 K.Miers@gsi.de

High Availability and Backup Strategies for the Lustre MDS Server
Spring 2008
Karin Miers / GSI

HA and Backup Methods for Lustre

Lustre: What is necessary for the production cluster?
or: What will we do to make our new file system reliable?
- high availability setup
- backup of important parts (if high availability setup fails...)

Lustre Components

Main parts of a Lustre file system:
- MGT: system management, info about OSTs and clients
- MDT: metadata (where is which file?)
- OST-1: data files: aaa, bbb.txt, ccc...
- OST-2: data files: hij, klm.txt, nop...
- OST-3: data files: xyz, yzx.txt, zxy...
- OST..: data files: ...

In Case of Failure...

What happens if...
- ... an OST breaks down?
  all data on this OST are no longer available; Lustre continues
- ... the MDT breaks down?
  all data become inaccessible and are probably lost forever
- ... the MGT breaks down?
  no data loss, but Lustre becomes inoperable

HA Design (...based on budget restrictions...)

- OSTs are set up single, without backup...
  - ... means data loss is accepted
  - same situation as it is now for experiment data
- MDT and MGT (=MGS) are set up in a cluster
  - 2 nodes, master / slave, which can take over
- MDT is written to the backup
  - otherwise, in case of failure, ALL data could be lost
- MGT is not written to the backup
  - ...no need: it can be set up anew very fast

Cluster Tools (Software)

Software tools are Open Source (GPL or similar).
Main components:
- Heartbeat V2 (2.1.3-5, Debian package) for cluster connection, management and monitoring
- DRBD for the redundant data partition

Linux Heartbeat Package

Heartbeat-2...
- ... controls and checks the communication between both (or more) nodes, connected by ethernet and / or serial line
- ... checks connectivity to the local network
- ... monitors the resources (are MDT/MGT mounted...?)
- ... (not implemented) can fulfil complicated conditions
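The basic idea behind this kind of node monitoring can be illustrated with a few lines of Python. This is a toy model only, not how Heartbeat itself is implemented: the class name `PeerMonitor`, the 30 s timeout and the hand-fed timestamps are all assumptions made for the example.

```python
class PeerMonitor:
    """Toy model: the peer counts as dead once it has been silent for more than `deadtime` seconds."""

    def __init__(self, deadtime, now=0.0):
        self.deadtime = deadtime   # tolerated silence in seconds (assumed value)
        self.last_seen = now       # time of the last heartbeat received

    def heartbeat(self, now):
        """Record a heartbeat received from the peer at time `now`."""
        self.last_seen = now

    def peer_is_dead(self, now):
        """True if the peer has been silent for longer than `deadtime`."""
        return now - self.last_seen > self.deadtime


monitor = PeerMonitor(deadtime=30.0)
monitor.heartbeat(now=10.0)
print(monitor.peer_is_dead(now=35.0))  # 25 s of silence -> False
print(monitor.peer_is_dead(now=41.0))  # 31 s of silence -> True
```

In a real cluster the surviving node would only take over the resources after such a timeout and after the failed node has been fenced.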

DRBD: Distributed Replicated Block Device

In principle: RAID-1 over network

[Diagram: server1 and server2 each stack file system, DRBD, disk driver and hard disk; the two DRBD layers are linked through TCP/IP and the NIC drivers over a network connection]

- data exist twice
- real time update on slave
- consistency guaranteed
- fast recovery after failover
- no load balancing
- overhead of drbd:
  - needs cpu power
  - write performance is reduced
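The "RAID-1 over network" principle can be sketched as a small Python model. It is an illustration under simplifying assumptions, not DRBD code: the class `MirroredDevice` and its fields are hypothetical, dictionaries stand in for the two hard disks, and a plain assignment stands in for the TCP/IP transfer.

```python
class MirroredDevice:
    """Toy model of a block device mirrored to a second node."""

    def __init__(self):
        self.local = {}        # block number -> data on the master's disk
        self.remote = {}       # block number -> data on the slave's disk
        self.connected = True
        self.dirty = set()     # blocks written while the slave was disconnected

    def write(self, block, data):
        """Write locally and, while connected, to the slave as well."""
        self.local[block] = data
        if self.connected:
            self.remote[block] = data   # stands in for the network transfer
        else:
            self.dirty.add(block)       # remember what the slave is missing

    def disconnect(self):
        self.connected = False

    def reconnect(self):
        """Bring the slave up to date by copying only the blocks it missed."""
        for block in self.dirty:
            self.remote[block] = self.local[block]
        self.dirty.clear()
        self.connected = True


dev = MirroredDevice()
dev.write(0, b"inode table")
dev.disconnect()
dev.write(1, b"new entry")
print(dev.remote == dev.local)   # False: the slave copy is frozen
dev.reconnect()
print(dev.remote == dev.local)   # True: the copies match again
```

The frozen slave copy during the disconnected phase is what makes it usable as a backup source, at the price of having no redundancy until the resync has finished.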

MDS Cluster

[Diagram of the two-node MDS cluster; recoverable labels:]
- 2 nodes: master (mds1) and slave (mds2, hot stand-by)
- each node: Raid 10, 2 drbd volumes (MGT/MDT)
- ttys0 serial line: heartbeat
- SM/IPMI: stonith
- links between the nodes, labelled heartbeat, heartbeat and drbd:
  - eth3: 10.1.0.1 (master), 10.1.0.2 (slave)
  - eth2: 10.2.0.1 (master), 10.2.0.2 (slave)
- eth1: 140.181.x.x (master), 140.181.y.y (slave)
- eth0 on both nodes; eth0:0 140.181.z.z: virtual service mds
- heartbeat network connectivity: PingNode (nameserver, gateway)

Lustre has to be told to use eth0:0 instead of eth0!

Failover

Failover tests: master switched off -> slave takes over automatically; Lustre is fully operable after a few minutes:
- heartbeat/drbd: ~20-30 s, according to configuration
- Lustre: ~ few minutes (< 5 min)

MDT Backup Strategy

...just in case... if cluster fails...

Problem:
- permanent write processes on MDT
- backup on active MDT must fail
- no possibility to stop write access for backup duration

DRBD... there is a copy of the MDT!

Backup Procedure

[Diagram: in all three states master and slave each hold drbd0 (MGS) and drbd1 (MDT)]
- normal state: connected, drbd synced, HA
- backup state: unconnected, drbd snapshot MDT, no HA
- after backup: connected, drbd syncing

Backup Steps in Detail

- drbdadm disconnect / detach mdt
- mount mdt device as ext3 fs
- save extended attributes with getfattr
- make a tar archive of the directory and save it
- umount mdt device
- reconnect drbd

Time factor: ~25 s for 0.5 GB MDT space (Lustre test system with approx. 200 000 files / 800 GB), but it will depend mainly on the size of the MDT.
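The steps above could be scripted roughly as follows. This is a hedged sketch, not the procedure actually used at GSI: the function only builds the command strings and executes nothing, and the DRBD resource name, the device path `/dev/drbd1`, the mount point, the output directory and the file names are assumptions. Any DRBD role change that may be needed before the slave copy can be mounted is not described in the slides and is left out.

```python
import shlex


def backup_commands(resource="mdt", device="/dev/drbd1",
                    mountpoint="/mnt/mdt_backup", outdir="/backup"):
    """Return the backup steps as shell command strings, in order (nothing is executed)."""
    q = shlex.quote
    ea_file = f"{outdir}/mdt_ea.bak"        # dump of the extended attributes
    tar_file = f"{outdir}/mdt_files.tgz"    # archive of the MDT directory tree
    return [
        # 1. cut the replication link so the slave copy stops changing
        f"drbdadm disconnect {q(resource)}",
        # 2. mount the frozen copy as a plain ext3 file system
        f"mount -t ext3 {q(device)} {q(mountpoint)}",
        # 3. save all extended attributes (recursive, hex encoded, no symlink following)
        f"cd {q(mountpoint)} && getfattr -R -d -m '.*' -e hex -P . > {q(ea_file)}",
        # 4. archive the directory tree
        f"cd {q(mountpoint)} && tar czf {q(tar_file)} --sparse .",
        # 5. release the copy and let DRBD resynchronise
        f"umount {q(mountpoint)}",
        f"drbdadm connect {q(resource)}",
    ]


for command in backup_commands():
    print(command)
```

Printing the commands first makes it easy to review the sequence on a test system before running anything against a production MDT.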

Restore Procedure
(... worst case scenario, hopefully will never happen...)

destruction of MDT with dd...
- umount all OSTs
- umount mdt device
- format and tune mdt
- mount mdt with ldiskfs
- restore tar archive and extended attributes
- umount mdt device
- activate mdt (mount -t lustre)

Lustre recovers soon (approx. 5 min, time needed to restore sessions); no files lost since backup!
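A matching sketch for the restore side is given below, again as a list of command strings that is only printed. The device path, mount points and archive names are assumptions, and the format and tune step is left as a placeholder because the slides do not give its parameters.

```python
def restore_commands(device="/dev/drbd1", work_mount="/mnt/mdt_restore",
                     lustre_mount="/mnt/mdt", backup_dir="/backup"):
    """Return the restore steps as shell command strings, in order (nothing is executed)."""
    return [
        "# umount all OSTs first (on the OSS nodes)",
        f"umount {lustre_mount}",
        "# format and tune the MDT here (parameters are site-specific, not given in the slides)",
        # mount the fresh file system directly, bypassing Lustre
        f"mount -t ldiskfs {device} {work_mount}",
        # unpack the files, then put the extended attributes back
        f"cd {work_mount} && tar xzpf {backup_dir}/mdt_files.tgz",
        f"cd {work_mount} && setfattr --restore={backup_dir}/mdt_ea.bak",
        f"umount {work_mount}",
        # activate the MDT again
        f"mount -t lustre {device} {lustre_mount}",
    ]


for command in restore_commands():
    print(command)
```

The order matters: the extended attributes are restored only after the files exist again, and the MDT is mounted as type lustre only after the ldiskfs mount has been released.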

Backup Problems

No data loss with successful backup, but...
... approx. every third backup fails with error: inconsistent file system
- drbd is a layer between hardware and file system and neither cares about nor sees the file system

2 possibilities:
- deactivate MDT shortly before DRBD is disconnected
  - no idea how much disturbance this causes on a heavily used Lustre?
- file system check on slave copy of MDT
  - seems to help and produces correct backups - always?
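If the second possibility is scripted, the result of the file system check has to be evaluated before the copy is archived. The sketch below assumes that the check is done with `e2fsck` (the slides do not name the tool) and uses the exit-status bits documented in its man page; the function name and the acceptance policy are hypothetical.

```python
# e2fsck reports its result as a bit mask in the exit status.
FSCK_ERRORS_CORRECTED = 1     # errors were found and fixed
FSCK_REBOOT_NEEDED = 2        # errors fixed, system should be rebooted
FSCK_ERRORS_UNCORRECTED = 4   # errors were left uncorrected
FSCK_OPERATIONAL_ERROR = 8    # e2fsck itself could not run properly


def copy_is_safe_to_archive(exit_status):
    """Accept the copy only if e2fsck found nothing or repaired everything it found."""
    return (exit_status & ~FSCK_ERRORS_CORRECTED) == 0


for status in (0, 1, 4, 8):
    print(status, copy_is_safe_to_archive(status))
```

A backup script could skip or repeat the run whenever this returns False, instead of silently producing an archive of an inconsistent file system.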

Open Questions and Improvements

HA:
- setup well established and used successfully for other services for years
- improvement of monitoring scripts and integration in heartbeat-2

Backup / Restore:
- test of backup / restore procedure on a heavily used Lustre system (until now: test system)
- no HA during backup procedure; 3 nodes?
- 7zip instead of tar...?

Questions?