REM-Rocks: A Runtime Environment Migration Scheme for Rocks-based Linux HPC Clusters

Tong Liu, Saeed Iqbal, Yung-Chin Fang, Onur Celebioglu, Victor Masheyakhi and Reza Rooholamini
Dell Inc.
{Tong_Liu, Saeed_Iqbal, Yung-Chin_Fang, Onur_Celebioglu, Victor_Masheyakhi, Reza_Rooholamini}@dell.com

Chokchai (Box) Leangsuksun
Louisiana Tech University
box@latech.edu

Abstract. Commodity Beowulf clusters are now an established parallel and distributed computing paradigm due to their attractive price/performance. Beowulf clusters are increasingly adopted in downtime-sensitive environments requiring improved fault tolerance and high availability (HA). From a fault tolerance perspective, the traditional Beowulf cluster architecture consists of a single master node, which is a potential single point of failure (SPOF) in the system. Hence, to meet minimal downtime requirements, HA enhancements to the cluster management system are critical. In this paper we propose such enhancements, called Runtime Environment Migration Rocks (REM-Rocks), based on the commonly used Rocks management system. Our previous experience [4][14] with the HA-OSCAR release suggests a significant HA improvement. REM-Rocks is resilient to multiple levels of failures and provides mechanisms for graceful recovery to a standby master node. We also discuss the architecture and failover algorithm of REM-Rocks. Finally, we evaluate failover time under REM-Rocks.

1. Introduction

In the past decade, the raw computational power of commodity Beowulf clusters [1] has dramatically increased. This trend has propelled clusters to become the architecture of choice for parallel and distributed computing in academic, research and industrial environments. The availability of mature and stable open source operating systems, such as Linux, has also helped this trend immensely. Currently, several open source high performance Linux-based clustering solutions are available; Rocks [2] and OSCAR [3] are among the most popular open source tools for building clusters.

The contemporary Beowulf cluster architecture has a single master node, which creates a single point of failure (SPOF) [4]. In the event of a master node failure, the whole cluster is out of service. This scenario is unacceptable for many mission-critical applications that guarantee their users a certain quality of service (QoS).

The SPOF is one of the main impediments to clusters being commonly deployed for applications requiring high availability. Fault tolerance improvement in Beowulf clusters is therefore urgently required.

High Availability (HA) [6] implies that system resources and services must be operational with high probability. HA in servers can be achieved by adding redundant hardware subsystems; however, such servers are much more expensive. An economical alternative is to use redundant machines built from commodity components, with special software to transfer control between machines. Such techniques are gaining popularity and are increasingly in demand. This latter approach can be employed to eliminate the SPOF in clusters at a much lower cost. Furthermore, this class of HA solutions improves cluster survivability by triggering a failover [4] to redundant nodes when failures occur in the system.

In this paper we develop and evaluate an economical high availability enhancement to the traditional Beowulf architecture, called REM-Rocks. Our early lessons learned from the dual-headed HA-OSCAR demonstrate that this is a cost-effective approach. REM-Rocks enables two master nodes, called the primary master node and the standby master node. REM-Rocks executes a failure detection module on the standby node which can quickly detect failures on the primary master node and immediately initiate migration of services from the primary master node to the standby node.

This paper is organized as follows: Section 1 gives an introduction. Section 2 gives further details about the SPOF and other relevant background. Section 3 gives details of the REM-Rocks architecture and failure detection algorithm. Section 4 gives details of the hardware setup and implementation of the algorithm. Section 5 outlines some related projects and future enhancements. Section 6 gives a summary of the paper.

2. Background: SPOF in a Rocks Beowulf Cluster

Rocks, developed by the San Diego Supercomputer Center, is an open source package which has been widely used to build Beowulf clusters varying from a few to several hundred heterogeneous compute nodes. Installing Rocks is a simple process which requires minimal expertise in Linux system administration or cluster architecture. After simple cluster configuration data is provided on the front-end node, all the compute nodes are installed automatically. To allow software components to be added more easily after the initial cluster deployment, Rocks provides a roll CD scheme that can be customized to meet software distribution requirements. Four clusters built with Rocks were reported in the November 2004 Top500 list [5].

Figure 1 shows the architecture of a traditional Rocks Beowulf cluster. The main components of the cluster are a single master node, compute nodes, a communication network switch and storage disks. The master node controls all job assignments to compute nodes.

The master node receives requests from users, through submit nodes, and distributes jobs to specific compute nodes based on decisions made by a job scheduler and resource manager. In the event of a failure at the master node, due to outages such as defective hard drives or failed services, the master node is unable to perform operations such as scheduling and communication with the public network. For example, if the public network interface fails at the master node, job requests cannot reach the job scheduler daemon and therefore cannot be distributed; in addition, running jobs may also crash. Such failures are usually catastrophic and leave the whole cluster out of service. Thus, a common root cause of such downtime is the single point of failure (SPOF) at the master node.

Fig. 1. Architecture of a Rocks Beowulf cluster

3. REM-Rocks: Architecture and Fault-Tolerance Algorithm

In REM-Rocks, as in HA-OSCAR, the SPOF is eliminated by adding a redundant master node. Multiple master nodes with proper software support ensure that, in case of a master node failure, all of its functionality can be taken over by a standby master node. Figure 2 shows the REM-Rocks architecture with a primary master node and a standby master node. Both master nodes have network connections to the same private and public networks. In addition, they have access to the same network-attached storage, so that either can take control of the cluster after a failover or failback [8]. Either master node can provide the same application data and environment settings, which makes system failure recovery almost transparent to users.

In REM-Rocks, the standby master node is designed to monitor the health of the primary master node at all times. If the standby master node detects a failure at the primary master node, it triggers a failover operation. Furthermore, dual master nodes not only improve system availability but also help the cluster maximize its performance through load balancing. The primary master node in a Beowulf cluster can become overloaded, since many of the applications and daemons used to distribute jobs and manage the cluster execute on it. Introducing a secondary server offloads some of this work from the primary master node; in our architecture, the standby master node serves as a code-compilation and cluster-monitoring node to take some load off the primary master node. As an important feature, REM-Rocks supports a highly available network file system to ensure that both master nodes can access the same application data after a failover or failback. Hence, no job resubmission is required after the master nodes are swapped.

Fig. 2. Architecture of a REM-Rocks Beowulf cluster

3.1. Failure Detection, Failover and Failback Algorithm

A key issue is accurate failure detection that minimizes false detections. Existing failure detection mechanisms such as Linux-HA [9], Kimberlite [10] and Linux FailSafe [11] have a tendency to report false failures. We have improved upon these detection algorithms and developed a scheme which minimizes or avoids false failure detections (the details are given below). In order to handle the various kinds of failures that might occur at the master node, we employ a multi-level detection algorithm. To detect failures, a REM-Rocks daemon monitors a predefined list of services and devices. Any failure detected in these components is reported to the REM-Rocks management module.

The management module then triggers an appropriate alert operation. To further explain the operation of our algorithm, consider the following example scenarios.

Scenario 1: Network Failure

Compute nodes in a Beowulf cluster frequently communicate with the master node via the network. Obviously, if the master node's network goes down, all users will be unable to access the cluster and all current jobs will fail. Figure 3 shows a flow chart of our failover algorithm for a failure at the master node.

Fig. 3. Failover and failback algorithm based on network failure

The REM-Rocks monitor daemon on the standby master node periodically sends out a health query message, and it assumes a failure has happened on the network interface of the primary master node if a successful response is not returned within a time limit. However, a missing response can also be caused by a bad internal network interface on the standby master node itself. To avoid this false alarm, we add a local network interface self-checking function prior to all other operations. If the local self-check fails and reports an error, the REM-Rocks management module restarts the local network interface, and all further actions are suspended until a successful local self-check is obtained. This prevents an unnecessary failover operation from being triggered by a local failure.

Once an actual failure is detected, our solution first checks the previous result. If the preceding return value is OK, this indicates a new failure at the primary master node, and REM-Rocks starts a failover function that converts the critical network configuration on the standby master node to the primary master node's settings and starts the cluster services accordingly. Conversely, if the previous result is Error, REM-Rocks does not perform a failover, since one must already have been executed.

Meanwhile, if the standby master node receives a successful reply from the primary master node, REM-Rocks records the return value OK. This value is compared with the previous return value in the same way. If OK is received right after Error, REM-Rocks performs a failback operation on the standby master node, changing the network configuration back to its original values and stopping the running cluster services; at that point, the primary master node takes back all of its responsibilities for the whole cluster. If OK follows OK, REM-Rocks triggers no operation.
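For illustration only, the following Python sketch captures the shape of this detection and state-comparison loop on the standby master node; it is a minimal sketch, not the REM-Rocks implementation. The health-query interval, the primary node's address and the failover.sh/failback.sh hook scripts are assumptions made for the example.

    import subprocess
    import time

    CHECK_INTERVAL = 5                  # seconds between health queries (assumed value)
    PRIMARY_PRIVATE_IP = "10.1.1.1"     # hypothetical private address of the primary master
    LOCAL_IFACE = "eth0"                # interface used for the health queries

    def local_interface_ok(iface=LOCAL_IFACE):
        # Self-check: confirm the standby's own interface is up before judging the primary.
        result = subprocess.run(["ip", "link", "show", iface],
                                capture_output=True, text=True)
        return result.returncode == 0 and "state UP" in result.stdout

    def primary_alive():
        # Health query: a single ICMP echo with a short timeout.
        return subprocess.run(["ping", "-c", "1", "-W", "2", PRIMARY_PRIVATE_IP],
                              stdout=subprocess.DEVNULL).returncode == 0

    def monitor():
        previous = "OK"
        while True:
            # Local self-check first: a dead local NIC must not trigger a failover.
            if not local_interface_ok():
                subprocess.run(["ifup", LOCAL_IFACE])   # restart the local interface
                time.sleep(CHECK_INTERVAL)
                continue                                # hold all actions until the self-check passes
            current = "OK" if primary_alive() else "Error"
            if current == "Error" and previous == "OK":
                # New failure at the primary: take over its network settings and start services.
                subprocess.run(["/opt/rem-rocks/failover.sh"])      # hypothetical hook
            elif current == "OK" and previous == "Error":
                # Primary is back: restore the standby's settings and stop its cluster services.
                subprocess.run(["/opt/rem-rocks/failback.sh"])      # hypothetical hook
            # Error after Error, or OK after OK: no action.
            previous = current
            time.sleep(CHECK_INTERVAL)

    if __name__ == "__main__":
        monitor()

Tracking only the previous OK/Error value is what keeps the failover and failback actions from being re-triggered on every polling cycle.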

Scenario 2: Cluster Service Failure

To enable the master node to survive a resource failure, we implement a cluster service failure detection module deployed on the primary master node; its flow is illustrated in Figure 4. On the primary master node, the detection module keeps checking the status of the daemons of the cluster programs and services selected by the administrator. If a failure status is returned, REM-Rocks refreshes the failed daemon by restarting it. This feature further improves cluster fault resilience. If REM-Rocks is unable to restart the daemon, it deactivates all the network interfaces and the running cluster services on the primary master node. This in turn prevents any response from the stopped services reaching the standby master node, so the standby master node initiates a failover operation as in Scenario 1.

Fig. 4. Cluster service failure detection algorithm
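A comparable minimal sketch of the service-level check that runs on the primary master node is shown below. It assumes pgrep is available for process checks and reuses the SGE daemons and the rcsge init script mentioned in Section 4.3.2 as the administrator-selected example; the restart and isolation commands are illustrative stand-ins, and the escalation path simply mirrors the interface-deactivation step described above.

    import subprocess
    import time

    # Daemons selected by the administrator; the SGE daemons from Section 4.3.2 serve as the example.
    MONITORED_DAEMONS = ["sge_qmaster", "sge_commd"]
    RESTART_COMMAND = ["service", "rcsge", "start"]   # illustrative restart command for the SGE daemons
    STOP_COMMAND = ["service", "rcsge", "stop"]

    def daemon_running(name):
        # pgrep returns 0 when a process with this exact name exists.
        return subprocess.run(["pgrep", "-x", name],
                              stdout=subprocess.DEVNULL).returncode == 0

    def isolate_primary():
        # Unrecoverable failure: stop cluster services and bring the interfaces down,
        # so the standby's health queries fail and it initiates a failover as in Scenario 1.
        subprocess.run(STOP_COMMAND)
        for iface in ("eth0", "eth1"):
            subprocess.run(["ifdown", iface])

    def check_services():
        for name in MONITORED_DAEMONS:
            if daemon_running(name):
                continue
            subprocess.run(RESTART_COMMAND)   # first try to refresh the failed daemon
            time.sleep(5)                     # give the daemon a moment to come back up
            if not daemon_running(name):
                isolate_primary()             # escalate: hand control to the standby
                break

    if __name__ == "__main__":
        check_services()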

4. Implementation and Experimental Evaluation

4.1. Hardware Setup

Our testing environment is based on six Dell PowerEdge PE2650 servers (two master nodes and four compute nodes) with the following specification on each node: two IA32-based Intel Xeon processors running at 3.2 GHz (1 MB L2 cache, 533 MHz front side bus (FSB)) and 4 GB of memory (266 MHz DDR SDRAM). Each node has two integrated Intel Gigabit Ethernet adapters. Hyper-Threading was turned off during all experiments.

4.2. Software Installation

First, we built our testing cluster with Rocks 3.3. After system initialization, we installed the REM-Rocks package and duplicated the original master node. One node was chosen as the primary master node and the other as the standby master node. The primary master node held all job management responsibilities, and the standby master node was set to monitor the running status of all the nodes. The Sun Grid Engine (SGE) was used as the resource manager.

4.3. Approach: Simulated Failure Scenarios

To evaluate the performance of the failover algorithm, our approach is to simulate failures on the experimental setup and measure the failover time. Consider the following simulated failures.

4.3.1. Simulated Network Failure

In this scenario, we performed two test cases. First, we directly unplugged the network cable from the primary master node's private network port, so that the standby master node lost communication within the detection interval. Once the unsuccessful communication was reported, the REM-Rocks management module triggered a failover operation immediately. Consequently, the standby master node took control of the cluster network file system and all the compute nodes and resumed providing access for external users.
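As a side note on methodology, the failover interval in such a test can be observed from an external client by polling the cluster's front-end address and timing how long responses are absent. The short sketch below is one hypothetical way to do this and is not part of REM-Rocks; the front-end address and the one-second polling interval are assumptions.

    import subprocess
    import time

    FRONTEND_IP = "192.168.0.10"    # hypothetical public address of the cluster front end

    def frontend_reachable():
        # One ICMP echo with a one-second timeout.
        return subprocess.run(["ping", "-c", "1", "-W", "1", FRONTEND_IP],
                              stdout=subprocess.DEVNULL).returncode == 0

    def measure_outage():
        # Wait for the injected failure, then time how long the front end stays unreachable.
        while frontend_reachable():
            time.sleep(1)
        down_at = time.time()
        while not frontend_reachable():
            time.sleep(1)
        return time.time() - down_at

    if __name__ == "__main__":
        print("observed outage: %.1f seconds" % measure_outage())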

Figure 5 shows the time measurements of the corresponding failover operation.

Fig. 5. Timing of failover due to the simulated network failure (which master node controlled the cluster over time)

We started this test at 7:50 am. Initially the primary master node worked correctly. At 11:50:22, we pulled the network cable out of eth0 on the primary master node. After 5 seconds, the standby master node was functioning as the front-end node for our test cluster. At 14:40:15, we plugged the network cable back into eth0 of the primary master node, which resumed its role within 8 seconds, at 14:40:23. This experimental result suggests that REM-Rocks can reduce system downtime from hours to a few seconds.

In order to verify the self-checking function of REM-Rocks, we manually disabled a network interface on the standby master node. Shortly afterwards, that network interface was successfully reactivated: REM-Rocks performed the self-checking function and brought the interface back up without producing an improper failover operation.

4.3.2. Simulated Cluster Service Failure

To simulate this failure, we chose to monitor a job scheduler management tool, the Sun Grid Engine (SGE) [13]. First we added the names of the SGE server daemons, sge_qmaster and sge_commd, to the configuration file of the REM-Rocks resource monitoring module. Then we ran the command "service rcsge stop" to shut down the SGE scheduler. After several seconds, we checked the SGE running status and found that the server daemons, sge_qmaster and sge_commd, had resumed working, since REM-Rocks restarted the SGE scheduler automatically right after the failure was detected.

In addition to recoverable cluster services, we conducted another test case to verify the REM-Rocks failover function under an unrecoverable resource failure. In this scenario, we deleted all SGE files, so that REM-Rocks reported an error after attempting to restart sge_qmaster and sge_commd. Following the error, eth0 and eth1 on the primary master node were disabled, which subsequently caused the standby master node to trigger a failover operation. When we restored all of those files to the same directory on the primary master node, REM-Rocks successfully restarted the SGE daemons and reactivated eth0 and eth1, which in turn let the standby master node re-establish its connection with the primary master node and return system control to it.

5. Related and Future Work

There are several High Availability Linux cluster projects dedicated to providing failover mechanisms for common applications such as Samba, Apache and databases. Kimberlite is an open source cluster technology developed by Mission Critical Linux. It provides data integrity and application availability suitable for NFS servers, web servers and database applications. SteelEye's LifeKeeper for Linux is a software solution that allows applications to fail over to other servers in the cluster; similar to Kimberlite, it only provides application availability for common web servers and database clusters. Linux-HA has a widely used package called Heartbeat, which performs a failover by taking over an alias IP address on Linux systems. Its default configuration is applicable only to 2-node clusters and supports web servers, mail servers, file servers, database servers, DHCP servers, etc. However, none of these solutions may be suitable for large-scale Linux Beowulf clusters.

HA-OSCAR [4] is an open source software solution built on the OSCAR package which provides system and service monitoring and alert management capabilities for Linux Beowulf cluster systems. It runs MON [7] on one master node to periodically check the health of the other master node. If there is no response to a service access request or ICMP ECHO_REQUEST from a node within a specified time, its service monitor considers the node dead and reports the failure to the alert management module. Consequently, user-customized operations are triggered, such as a failover or sending email to the system administrator. When the failed node is recovered, the alert management module is notified and triggers a failback operation accordingly.

In order to make a Linux Beowulf cluster a fully fault-tolerant system, we are considering an automatic job checkpointing and restarting mechanism to enable transparent system recovery when a failover takes place on the master node. In addition, building an active-active master node architecture is also in demand: if two or more master nodes can handle user requests, cluster performance will be further improved.

6. Conclusion

We have developed and evaluated an economical HA architecture for Beowulf clusters. REM-Rocks can be used to improve the availability of the traditional Beowulf cluster, where a system outage normally occurs when there is a failure on the master node. Our experiments indicate that REM-Rocks ensures a successful failover operation, enabling the standby master node to function as the master node. Our solution improves the cluster's total uptime and lessens the load on the single master node of the traditional Beowulf cluster.

7. References

[1] Beowulf Cluster, http://www.beowulf.org.
[2] NPACI Rocks, http://www.rocksclusters.org.
[3] OSCAR, http://oscar.openclustergroup.org.
[4] C. Leangsuksun, L. Shen, T. Liu, H. Song, S. Scott, "Availability Prediction and Modeling of High Availability OSCAR Cluster," IEEE International Conference on Cluster Computing (Cluster 2003), Hong Kong, December 2-4, 2003.
[5] Top500, http://www.top500.org.
[6] E. Marcus, H. Stern, Blueprints for High Availability: Designing Resilient Distributed Systems, John Wiley & Sons, 1st edition, pp. 15-25, January 31, 2000.
[7] MON, http://www.kernel.org/software/mon.
[8] Failback, http://www.faqs.org/docs/evms/x2912.html.
[9] Linux-HA, http://www.linux-ha.org.
[10] Kimberlite, http://oss.missioncriticallinux.com/projects/kimberlite.
[11] Linux FailSafe, http://www.sgi.com/products/software/failsafe.
[12] PMB, http://www.pallas.com/e/products/pmb.
[13] SGE, http://gridengine.sunsource.net.
[14] HA-OSCAR, http://xcr.cenit.latech.edu/ha-oscar.