HAVmS: Highly Available Virtual machine Computer System Fault Tolerant with Automatic Failback and close to zero downtime

Memmo Federici (INAF - IAPS), Bruno Martino (CNR - IASI)

The basics
Highly available (HA) computer systems:
- aim at continuity of service: in a fault-tolerant system, service provision must be restored after a fault in a very short time (tending to 0);
- a properly designed Hw and Sw architecture can reduce the probability of faults and their effects.

System design
Design requirements for our HA computing system:
- general purpose;
- multitask;
- multiuser;
- cheap;
- easy to manage;
- highly recyclable.
In case of faults, the system guarantees:
- a very short time (tending to 0) to restore service provisioning;
- preservation of service functionality and consistency of the data;
- automatic failback (after the fault is repaired).

The context
Acquisition and processing of satellite data. We need to:
- manage uninterrupted data acquisition and analysis processes;
- keep the intermediate results of the computation;
- keep the results of the computation for long periods and share them with the relevant scientific community;
- minimize the human resources needed to manage the system.

The alternatives
Windows Server Failover Clustering:
- expensive Sw licenses are required (about $900 per processor);
- twin Hw and Sw are required for the primary and the secondary servers;
- potential loss of data due to discrete-time mirroring every 5 to 10 minutes;
- automatic failback is not supported.
VMware vSphere:
- expensive Sw licenses are required (starting from $4000);
- requires certified Hw to run;
- the management node requires a dedicated server;
- it is not easy to set up.

The alternatives
Red Hat Cluster Manager (Red Hat Enterprise Linux Server):
- Sw licenses are required ($500 per year);
- in case of faults, the services are restored from snapshots;
- the system has to be restarted in case of faults;
- automatic failback is not supported.

Our solution: automatic failover and failback
HAVmS is an HA system that guarantees continuity of service through effective automatic failover and failback mechanisms.
It is a two-server system based on:
- XEN VMs synchronized with each other through DRBD (protocol D) and Remus;
- applications, e.g. Matlab, IDL, compilers, Web server, etc.
From the users' point of view, the two servers are seen as a single server; each VM has its own IP.

Server Sw/Hw architecture
Software (totally open source):
- Linux Ubuntu Server 10.04 64-bit (kernel customized to support XEN);
- XEN Hypervisor;
- DRBD (protocol D);
- Remus;
- our custom scripts for automatic failback.
Hardware:
- Intel i7 processor (8 cores);
- 8 GB DDR3 RAM;
- 2 TB hard disk;
- 2 network interfaces.
Stack diagram layers: Services and Applications; Virtual Machines; Remus; XEN; DRBD; Ubuntu Server.

Xen Hypervisor
XEN:
- a widespread virtualization platform;
- packaged in all major Linux distributions;
- also adopted by commercial Sw.

Distributed Replicated Block Device (DRBD)
What it is: a distributed mass storage system for the GNU/Linux platform.
Among other things, it makes possible the implementation of HA storage systems consisting of two (or more) servers connected by a dedicated link.
Operating logic:
- assigns the role of primary to one of the two servers and the role of secondary to the other;
- the primary server is the only one authorized to write to the disks;
- DRBD synchronizes the data between the active server and the sleeper server at every checkpoint (40 ms).
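The operating logic above can be illustrated with a toy model. This is only a sketch of the primary/secondary idea under simplifying assumptions: the classes `Node` and `ReplicatedDevice` are hypothetical names invented for illustration, disks are plain dictionaries, and nothing here is the DRBD interface.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One server of the pair; `disk` maps block number -> data (toy model)."""
    name: str
    role: str  # "primary" or "secondary"
    disk: dict = field(default_factory=dict)


class ReplicatedDevice:
    """Toy primary/secondary block device (hypothetical, not the DRBD API)."""

    def __init__(self, primary: Node, secondary: Node):
        self.primary, self.secondary = primary, secondary
        self.pending = {}  # writes not yet replicated to the secondary

    def write(self, node: Node, block: int, data: bytes) -> None:
        # Only the primary is authorized to write to the disks.
        if node.role != "primary":
            raise PermissionError(f"{node.name} is {node.role}: write refused")
        node.disk[block] = data
        self.pending[block] = data

    def checkpoint(self) -> int:
        """Replicate pending writes to the secondary; return how many were sent."""
        sent = len(self.pending)
        self.secondary.disk.update(self.pending)
        self.pending.clear()
        return sent
```

In this model the two disks are guaranteed to be identical only immediately after `checkpoint()` returns, which is why the checkpoint interval bounds how much recent work can be lost.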

Remus
What it does:
- realizes an HA system by controlling the virtual machines (VMs) managed by Xen;
- performs a checkpoint between the two machines every 40 ms and triggers the DRBD sync.
The primary server keeps running its active VMs; the sleeper server is ready to take over in case of failure. The VMs on the sleeper server are an exact copy of the VMs running on the active server.
In case of failure of the active server, the VMs continue to run on the sleeper server as if the failure had never occurred (thanks to the TCP protocol, the data flow is kept alive without losses).
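A minimal sketch of the checkpointing idea follows. It is a toy model, not Remus code: the class `CheckpointedVM` and its fields are hypothetical. The slide does not mention output buffering; the sketch assumes, following the published Remus design, that network output produced during an epoch is held back until the corresponding checkpoint is committed, so the outside world never sees state that the sleeper server could not reproduce.

```python
import copy

CHECKPOINT_INTERVAL_S = 0.040  # 40 ms epoch, as stated on the slide


class CheckpointedVM:
    """Toy model of epoch-based checkpointing (hypothetical, not Remus code)."""

    def __init__(self):
        self.state = {"counter": 0}                    # live state, active server
        self.backup_state = copy.deepcopy(self.state)  # copy on the sleeper server
        self.buffered_output = []   # output produced in the current epoch
        self.released_output = []   # output visible to the outside world

    def run(self, steps: int) -> None:
        """Execute speculatively; output is held back until the next checkpoint."""
        for _ in range(steps):
            self.state["counter"] += 1
            self.buffered_output.append(f"packet-{self.state['counter']}")

    def checkpoint(self) -> None:
        """Commit the epoch: copy state to the sleeper, then release output."""
        self.backup_state = copy.deepcopy(self.state)
        self.released_output.extend(self.buffered_output)
        self.buffered_output.clear()

    def fail_over(self) -> dict:
        """Active server lost: resume from the last committed checkpoint."""
        self.buffered_output.clear()  # unreleased output was never seen outside
        self.state = copy.deepcopy(self.backup_state)
        return self.state
```

Work done after the last checkpoint is discarded on failover, but because its output was never released, external clients observe a consistent history; TCP retransmission then covers any packets that were in flight.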

Fault

Failover
On both servers: DRBD stops the synchronization of the storage systems.
On the active server: the execution of Remus stops.
On the sleeping server: the VMs resume execution in a state, and with storage, consistent with the last committed checkpoint.
As a result of the failover:
- all applications that were running on the VMs on the active server continue to run on the sleeping server without any interruption;
- all active connections from the outside to the VMs are preserved.

Failback
The problem: failback is not automatically managed by Remus.
The critical point: avoiding potential inconsistencies between the state of the VMs and their storage.
The solution: after the fault is repaired, a set of custom scripts creates, without any action by the system manager, the conditions for a live migration of the VMs back to the active server and re-establishes the conditions prior to the fault.
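The slides do not show the custom scripts, so the following is only a sketch of how such an orchestration might be ordered. It assumes that the repaired server must finish resynchronizing its storage before any VM is migrated back, which is one way to avoid the inconsistency named above. The function `fail_back` and its four callable parameters are hypothetical hooks; they are not Xen, DRBD, or Remus APIs.

```python
import time
from typing import Callable, Iterable, List


def fail_back(
    vms: Iterable[str],
    storage_in_sync: Callable[[], bool],
    live_migrate: Callable[[str], None],
    restart_protection: Callable[[str], None],
    poll_s: float = 1.0,
    timeout_s: float = 3600.0,
) -> List[str]:
    """Return VMs to the repaired server once its storage has caught up.

    All callables are hypothetical hooks supplied by the caller; they stand
    in for whatever the real scripts do.
    """
    deadline = time.monotonic() + timeout_s
    # 1. Never migrate onto stale storage: wait for the resync to finish.
    while not storage_in_sync():
        if time.monotonic() > deadline:
            raise TimeoutError("storage resync did not complete")
        time.sleep(poll_s)
    moved = []
    for vm in vms:
        live_migrate(vm)        # 2. move the running VM back
        restart_protection(vm)  # 3. resume checkpointing to the sleeper
        moved.append(vm)
    return moved
```

Passing the actions in as callables keeps the ordering logic testable without a real cluster; in a deployment each hook would wrap the site's own tooling.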

The data storage of INTEGRAL in HA
INTEGRAL (INTErnational Gamma-Ray Astrophysics Laboratory) is a European satellite dedicated to observation from space in the energy range between 15 keV and 10 MeV.
AVES: cluster for INTEGRAL data analysis (IAPS Distributed Computing Laboratory).

The data storage of INTEGRAL in HA
Data storage subsystem (DSS):
- downloads data from ISDC;
- backs up these data in real time;
- provides these data to AVES.
In case of fault:
- the download of data is interrupted;
- the backup is no longer synchronized;
- the running scientific analyses are interrupted.
HAVmS guarantees users the continuous availability of the data (16 TB).

Continuity of service at IASI
The services provided by the systems of IASI:
- attendance control;
- centralized computing;
- centralized storage;
- centralized backup;
- printing services management.
The solution:
- the management of VMs is provided by HAVmS;
- HAVmS guarantees users the continuity of IASI's services;
- each service runs on a dedicated VM.

Potentiality: high-end solution
Cost: 1500 $ motherboard; 1600 $ processors; RAM 8 $ per GB.
Hypothesis: 32 cores, 120 GB RAM: about 5000 Euro.
With 10000 $ it is possible to buy the computing power necessary to provide highly reliable services for the operation of a hospital.
Supermicro motherboard, MP series, Xeon, X9Qxxx.

Further work
- Implementing a 3-node HA system: one active, one sleeping, and a third reactivable via Wake on LAN. In case of fault, the reactivable server is automatically awakened (Wake on LAN) and replaces the damaged server. This allows the repair to be carried out on a more relaxed schedule.
- Migrating the system to Ubuntu 12.04 LTS: the recent versions of Xen and DRBD (Xen 4.1.4, DRBD 8.3.11) do not require recompiling the kernel of the OS.
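Waking the third node relies on the standard Wake-on-LAN magic packet: six 0xFF bytes followed by the target MAC address repeated 16 times, usually sent as a UDP broadcast. The sketch below shows how that packet could be built and sent; the function names, the default broadcast address, and the use of port 9 are illustrative assumptions, and the slides do not say how HAVmS would issue the wake-up.

```python
import socket


def magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet: 6 x 0xFF, then the MAC 16 times."""
    digits = mac.replace(":", "").replace("-", "")
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return b"\xff" * 6 + bytes.fromhex(digits) * 16


def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Send the magic packet as a UDP broadcast (port 9 is a common choice)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))
```

For example, `wake("00:11:22:33:44:55")` (a placeholder address) would broadcast the packet on the local network; the target machine's network card and firmware must have Wake on LAN enabled for it to respond.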

A little gem: HAVmS 3-node system
A low-power (3 W), unattended measurement system for extreme environments, using a three-server configuration.
Nodes: Raspberry Pi* (700 MHz low-power ARM1176JZ-F applications processor, dual-core GPU).
Diagram labels: Solar PU; Wake on LAN; Sensors; RF unit; Network switch; active; sleeper.
*Single unit cost: 35 $.

Conclusions
HAVmS:
- provides active redundancy without interruption of services, with completely automatic failover and failback and intervention times close to zero;
- is very cheap: to carry out its HA functions it only needs the hardware to be doubled;
- is extremely versatile: it supports any operating system installable on XEN (including Mac OS X);
- supports the execution of multiple VMs: the limit is imposed only by the number of cores and the RAM size;
- is general purpose and recyclable.

Thanks to
Franco Giovannelli, Pietro Ubertini (IAPS), Giuliano Sabatino (IAPS), Fabio Guglietta (IASI), Carlo Gaibisso (IASI)