HA and Backup Methods for Lustre Hepix May '08 K.Miers@gsi.de 1 High Availability and Backup Strategies for the Lustre MDS Server Spring 2008 Karin Miers / GSI
HA and Backup Methods for Lustre
Lustre: what is necessary for the production cluster? Or: what will we do to make our new file system reliable?
- high availability setup
- backup of important parts (in case the high availability setup fails...)

Lustre Components
Main parts of a Lustre file system:
- MGT: system management, info about OSTs and clients
- MDT: metadata (where is which file?)
- OST-1: data files aaa, bbb.txt, ccc...
- OST-2: data files hij, klm.txt, nop...
- OST-3: data files xyz, yzx.txt, zxy...
- OST..: data files ...

In Case of Failure...
What happens if...
- ... an OST breaks down? All data on this OST are no longer available; Lustre continues.
- ... the MDT breaks down? All data become inaccessible and are probably lost forever.
- ... the MGT breaks down? No data loss, but Lustre becomes inoperable.

HA Design (... based on budget restrictions ...)
- OSTs are set up singly, without backup... this means data loss is accepted. Same situation as it is now for experiment data.
- MDT and MGT (=MGS) are set up in a cluster: 2 nodes, master / slave, where the slave can take over.
- MDT is written to the backup; otherwise, in case of failure, ALL data could be lost.
- MGT is not written to the backup... no need, it can be set up anew very quickly.

Cluster Tools (Software)
Software tools are Open Source (GPL or similar).
Main components:
- Heartbeat V2 (2.1.3-5, Debian package) for cluster connection, management and monitoring
- DRBD for the redundant data partition

Linux Heartbeat Package
Heartbeat-2...
- ... controls and checks the communication between both (or more) nodes, connected by Ethernet and / or serial line
- ... checks connectivity to the local network
- ... monitors the resources (are MDT/MGT mounted...?)
- ... (not implemented) can fulfil complicated conditions

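The supervision idea on this slide can be illustrated with a toy model. This is only a sketch, not Heartbeat-2 code: the class `PeerMonitor`, the function `should_take_over` and the `deadtime` value are hypothetical names and numbers chosen for illustration. It shows the two checks combined: the peer has been silent for too long, and the surviving node itself still reaches the local network.

```python
from dataclasses import dataclass


@dataclass
class PeerMonitor:
    """Toy model of heartbeat-style peer supervision (not Heartbeat-2 itself)."""
    deadtime: float          # seconds of silence before the peer counts as dead (assumed)
    last_seen: float = 0.0   # timestamp of the most recent heartbeat packet

    def heartbeat_received(self, now: float) -> None:
        """Record that a heartbeat packet arrived at time `now`."""
        self.last_seen = now

    def peer_is_dead(self, now: float) -> bool:
        """True if no heartbeat arrived within the last `deadtime` seconds."""
        return (now - self.last_seen) > self.deadtime


def should_take_over(monitor: PeerMonitor, now: float, network_reachable: bool) -> bool:
    """The slave takes over only if the master is silent AND the slave
    itself can still reach the local network."""
    return monitor.peer_is_dead(now) and network_reachable
```

With `deadtime=30.0` and a last heartbeat at t=100, the peer counts as alive at t=120 and as dead at t=131; a node that has lost its own network connectivity does not take over in either case.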
DRBD: Distributed Replicated Block Device
In principle: RAID-1 over network
[Diagram: server1 and server2 each run the stack file system, DRBD, disk driver, hard disk; the two DRBD layers are linked via TCP/IP, NIC driver and the network connection]
- data exist twice
- real-time update on slave
- consistency guaranteed
- fast recovery after failover
- no load balancing
- overhead of DRBD:
  - needs CPU power
  - write performance is reduced

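The "RAID-1 over network" principle can be sketched with two in-memory block lists standing in for the hard disks of server1 and server2. This is a toy model under stated assumptions, not DRBD code: the class `MirroredDevice` is hypothetical, and it assumes fully synchronous replication (a write returns only after both copies are updated), which the slides do not specify.

```python
class MirroredDevice:
    """Toy RAID-1-over-network model; two lists stand in for the two hard disks."""

    def __init__(self, n_blocks):
        self.disks = {
            "server1": [b""] * n_blocks,
            "server2": [b""] * n_blocks,
        }
        self.primary = "server1"

    def write(self, block, data):
        # Assumed synchronous mirroring: both copies are updated before returning,
        # which is why write performance is reduced.
        for disk in self.disks.values():
            disk[block] = data

    def read(self, block):
        # No load balancing: reads are always served from the primary's disk.
        return self.disks[self.primary][block]

    def failover(self):
        # The other node becomes primary and already holds identical data.
        self.primary = "server2" if self.primary == "server1" else "server1"
```

After `dev.write(2, b"mdt")` followed by `dev.failover()`, `dev.read(2)` still returns the written block, now served by server2.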
MDS Cluster
2 nodes, master / slave; the slave (mds2) is a hot stand-by.
[Diagram: MDS cluster]
- master (mds1): ttys0 (serial), SM/IPMI, eth3 10.1.0.1, eth2 10.2.0.1, eth1 140.181.x.x, eth0 with eth0:0 140.181.z.z; Raid 10 with 2 drbd volumes (MGT/MDT)
- slave (mds2): ttys0 (serial), SM/IPMI, eth3 10.1.0.2, eth2 10.2.0.2, eth1 140.181.y.y, eth0; Raid 10 with 2 drbd volumes (MGT/MDT)
- links between the nodes: heartbeat (serial), stonith (SM/IPMI), heartbeat and drbd (eth3 / eth2)
- heartbeat network connectivity check: PingNode (nameserver, gateway)
- eth0:0 carries the virtual service mds
Lustre has to be told to use eth0:0 instead of eth0!

Failover
Failover tests: master switched off -> slave takes over automatically; Lustre is fully operable after a few minutes:
- heartbeat/drbd: ~ 20-30 s, depending on configuration
- lustre: ~ few minutes (< 5 min)

MDT Backup Strategy
... just in case... if the cluster fails...
Problem:
- permanent write processes on the MDT
- a backup taken on the active MDT must fail
- no possibility to stop write access for the duration of the backup
DRBD... there is a copy of the MDT!

Backup Procedure
[Diagram: master and slave, each with drbd0 (MGS) and drbd1 (MDT), shown in three states]
- normal state: drbd connected and synced, HA
- backup state: drbd unconnected, snapshot of MDT, no HA
- after backup: drbd connected, syncing

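The three states of the backup procedure form a small cycle, which can be written down as a state machine. This is an illustrative sketch only: the event names (`disconnect`, `reconnect`, `sync_done`) are hypothetical, and treating the syncing phase as "no HA" is a conservative assumption that the slides do not state.

```python
from enum import Enum


class DrbdState(Enum):
    NORMAL = "connected, synced"
    BACKUP = "unconnected, snapshot of MDT"
    AFTER_BACKUP = "connected, syncing"


# Order of the states follows the slide; the event names are hypothetical.
TRANSITIONS = {
    (DrbdState.NORMAL, "disconnect"): DrbdState.BACKUP,
    (DrbdState.BACKUP, "reconnect"): DrbdState.AFTER_BACKUP,
    (DrbdState.AFTER_BACKUP, "sync_done"): DrbdState.NORMAL,
}


def next_state(state, event):
    """Return the follow-up state, or raise ValueError for a disallowed event."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state.name}")


def ha_available(state):
    # Slide: HA in the normal state, no HA in the backup state.
    # Assumption: no HA while the slave is still syncing either.
    return state is DrbdState.NORMAL
```

Walking through `disconnect`, `reconnect` and `sync_done` from `DrbdState.NORMAL` returns to `DrbdState.NORMAL`; `ha_available` is false for the two intermediate states.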
Backup Steps in Detail
- drbdadm disconnect / detach mdt
- mount mdt device as ext3 fs
- save extended attributes with getfattr
- make a tar archive of the directory and save it
- umount mdt device
- reconnect drbd
Time factor: ~25 s for 0.5 GB of MDT space (Lustre test system with approx. 200 000 files / 800 GB), but it will depend mainly on the size of the MDT.

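The backup steps can be sketched as a dry-run plan that only builds the command strings and executes nothing. Everything beyond the step names on the slide is an assumption: the function names, the device and file paths in the usage note, the exact `getfattr` and `tar` options, the `attach` before `connect` on reconnection, and the linear scaling of the time estimate from the single measurement of ~25 s for 0.5 GB. The commands are meant for the slave node.

```python
import shlex


def backup_plan(resource, device, mountpoint, archive, ea_file):
    """Return the shell commands for one MDT backup as strings (dry run only)."""
    q = shlex.quote
    return [
        f"drbdadm disconnect {q(resource)}",
        f"drbdadm detach {q(resource)}",
        f"mount -t ext3 {q(device)} {q(mountpoint)}",
        # Dump all extended attributes recursively (options are an assumption).
        f"cd {q(mountpoint)} && getfattr -R -d -m '.*' -P . > {q(ea_file)}",
        f"tar czf {q(archive)} --sparse -C {q(mountpoint)} .",
        f"umount {q(mountpoint)}",
        # Assumed: the backing device must be attached again before connecting.
        f"drbdadm attach {q(resource)}",
        f"drbdadm connect {q(resource)}",
    ]


def estimate_backup_seconds(mdt_gb):
    """Linear extrapolation from ~25 s per 0.5 GB (an assumption, one data point)."""
    return 25.0 / 0.5 * mdt_gb
```

For example, `backup_plan("mdt", "/dev/sdb1", "/mnt/mdt", "/backup/mdt.tgz", "/backup/ea.bak")` yields eight commands, starting with `drbdadm disconnect mdt` and ending with `drbdadm connect mdt`; the device and paths here are placeholders.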
Restore Procedure (... worst case scenario, hopefully will never happen ...)
Destruction of MDT with dd...
- umount all OSTs
- umount mdt device
- format and tune mdt
- mount mdt with ldiskfs
- restore tar archive and extended attributes
- umount mdt device
- activate mdt (mount -t lustre)
Lustre recovers soon (approx. 5 min, time needed to restore sessions); no files lost since backup!

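The restore sequence can be written as a dry-run plan in the same spirit. This is a sketch under assumptions: the function name, the paths in the usage note and the `tar` / `setfattr` options are illustrative. The steps "umount all OSTs" and "format and tune mdt" carry no command because the source gives no details for them; they are site-specific.

```python
import shlex


def restore_plan(mdt_device, mountpoint, archive, ea_file):
    """Return (step, command) pairs for an MDT restore; nothing is executed.

    A command of None marks a step whose details are not given in the source.
    """
    q = shlex.quote
    return [
        ("umount all OSTs", None),          # done on the OSS nodes
        ("umount mdt device", f"umount {q(mountpoint)}"),
        ("format and tune mdt", None),      # options are site-specific
        ("mount mdt with ldiskfs",
         f"mount -t ldiskfs {q(mdt_device)} {q(mountpoint)}"),
        ("restore tar archive",
         f"tar xzpf {q(archive)} -C {q(mountpoint)}"),
        ("restore extended attributes",
         f"cd {q(mountpoint)} && setfattr --restore={q(ea_file)}"),
        ("umount mdt device", f"umount {q(mountpoint)}"),
        ("activate mdt", f"mount -t lustre {q(mdt_device)} {q(mountpoint)}"),
    ]
```

Calling `restore_plan("/dev/drbd1", "/mnt/mdt", "/backup/mdt.tgz", "/backup/ea.bak")` with placeholder paths gives eight steps, ending with the `mount -t lustre` call that activates the MDT.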
Backup Problems
No data loss with a successful backup, but...
- ... approx. every third backup fails with the error: inconsistent file system
- drbd is a layer between hardware and file system and neither sees nor cares about the file system
2 possibilities:
- deactivate the MDT shortly before DRBD is disconnected; no idea how much disturbance this causes on a heavily used Lustre?
- a file system check on the slave copy of the MDT seems to help and produces correct backups - always?

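For the second possibility, the result of the file system check can decide whether the slave copy is archived at all. The sketch below decodes the exit status of `e2fsck`, which is a sum of the flag values listed in its manual page. The function names and the policy (archive only on status 0 or 1) are assumptions for illustration, not the procedure used at GSI.

```python
# Exit status bits as documented in the e2fsck manual page.
E2FSCK_FLAGS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "e2fsck canceled by user request",
    128: "shared library error",
}


def decode_fsck_status(code):
    """Return the messages for all flag bits set in an e2fsck exit status."""
    return [msg for bit, msg in E2FSCK_FLAGS.items() if code & bit]


def copy_is_safe_to_archive(code):
    # Assumed policy: 0 = clean, 1 = errors found and corrected.
    # Any other status means this backup run should be skipped or retried.
    return code in (0, 1)
```

A status of 0 decodes to an empty list, and a status of 4 (errors left uncorrected) would cause the backup run to be skipped under this policy.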
Open Questions and Improvements
HA:
- setup well established and used successfully for other services for years
- improvement of monitoring scripts and integration in heartbeat-2
Backup / Restore:
- test of backup / restore procedure on a heavily used Lustre system (until now: test system)
- no HA during backup procedure: 3 nodes?
- 7zip instead of tar...?

Questions?