Fault Tolerance in the Internet: Servers and Routers



Similar documents
High Availability and Clustering

DHCP Failover. Necessary for a secure and stable network. DHCP Failover White Paper Page 1

High Availability Solutions for the MariaDB and MySQL Database

Cloud Computing Disaster Recovery (DR)

Configuring Redundancy

A SURVEY OF POPULAR CLUSTERING TECHNOLOGIES

Informix Dynamic Server May Availability Solutions with Informix Dynamic Server 11

How To Handle A Power Outage On A Server Farm (For A Small Business)

Fault-Tolerant Framework for Load Balancing System

DATA CENTER. Best Practices for High Availability Deployment for the Brocade ADX Switch

BME CLEARING s Business Continuity Policy

Recommended IP Telephony Architecture

HRG Assessment: Stratus everrun Enterprise

Building Reliable, Scalable AR System Solutions. High-Availability. White Paper

Integrated Application and Data Protection. NEC ExpressCluster White Paper

A SWOT ANALYSIS ON CISCO HIGH AVAILABILITY VIRTUALIZATION CLUSTERS DISASTER RECOVERY PLAN

SCALABILITY AND AVAILABILITY

High Availability and Disaster Recovery Solutions for Perforce

Creating Web Farms with Linux (Linux High Availability and Scalability)

Cisco Disaster Recovery: Best Practices White Paper

NEC Corporation of America Intro to High Availability / Fault Tolerant Solutions

Stateful Network Address Translators (NAT) Xiaohu Xu Dean Cheng IETF75, Stockholm

ADVANCED NETWORK CONFIGURATION GUIDE

Disaster Recovery (DR) Planning with the Cloud Desktop

Module: Business Continuity

Westek Technology Snapshot and HA iscsi Replication Suite

Proactive, Resource-Aware, Tunable Real-time Fault-tolerant Middleware

Chapter 14: Recovery System

Network Attached Storage. Jinfeng Yang Oct/19/2015

Executive Summary WHAT IS DRIVING THE PUSH FOR HIGH AVAILABILITY?

IT Service Management

ACHIEVING 100% UPTIME WITH A CLOUD-BASED CONTACT CENTER

Fault Tolerant Servers: The Choice for Continuous Availability on Microsoft Windows Server Platform

Pervasive PSQL Meets Critical Business Requirements

Review: The ACID properties

Fault Tolerant Servers: The Choice for Continuous Availability

Availability and Disaster Recovery: Basic Principles

GE Intelligent Platforms. PACSystems High Availability Solutions

DNS ROUND ROBIN HIGH-AVAILABILITY LOAD SHARING

- Redundancy and Load Balancing -

Cisco PIX vs. Checkpoint Firewall

Fault Tolerance & Reliability CDA Chapter 3 RAID & Sample Commercial FT Systems

High Availability with Postgres Plus Advanced Server. An EnterpriseDB White Paper

Replication on Virtual Machines

SAN Conceptual and Design Basics

OVERVIEW. CEP Cluster Server is Ideal For: First-time users who want to make applications highly available

Achieving High Availability

IP Telephony Deployment Models

Computer Network. Interconnected collection of autonomous computers that are able to exchange information

Astaro Deployment Guide High Availability Options Clustering and Hot Standby

Availability Digest. Redundant Load Balancing for High Availability July 2013

Eloquence Training What s new in Eloquence B.08.00

Administrator Guide VMware vcenter Server Heartbeat 6.3 Update 1

A New Approach to Developing High-Availability Server

Availability Digest. Stratus Avance Brings Availability to the Edge February 2009

short introduction to linux high availability description of problem and solution possibilities linux tools

MOC 5047B: Intro to Installing & Managing Microsoft Exchange Server 2007 SP1

Solaris For The Modern Data Center. Taking Advantage of Solaris 11 Features

Introduction to ServerIron ADX Application Switching and Load Balancing. Module 5: Server Load Balancing (SLB) Revision 0310

NAS 307 Link Aggregation

SY system so that an unauthorized individual can take over an authorized session, or to disrupt service to authorized users.

Software Architecture Case Study. Air Traffic Control - Designing for High Availability

Efficient and low cost Internet backup to Primary Video lines

FioranoMQ 9. High Availability Guide

CSE 3461 / 5461: Computer Networking & Internet Technologies

TechBrief Introduction

Traditionally, a typical SAN topology uses fibre channel switch wiring while a typical NAS topology uses TCP/IP protocol over common networking

Troubleshooting and Maintaining Cisco IP Networks Volume 1

Canadian Securities Exchange enhances Trading Network by adding a FIX Protocol Router Appliance

VXLAN: Scaling Data Center Capacity. White Paper

High Availability and Load Balancing Cluster for Linux Terminal Services

High Availability Essentials

High Availability for Citrix XenApp

Fully Managed Secure Data Sharing (a cloud service)

High Availability of VistA EHR in Cloud. ViSolve Inc. White Paper February

Security+ Guide to Network Security Fundamentals, Fourth Edition. Chapter 13 Business Continuity

McAfee Endpoint Encryption Hot Backup Implementation

The functionality and advantages of a high-availability file server system

Firewall Load Balancing

Oracle Database 10g: Backup and Recovery 1-2

Virtual PortChannels: Building Networks without Spanning Tree Protocol

Brian LaGoe, Systems Administrator Benjamin Jellema, Systems Administrator Eastern Michigan University

Disaster Recovery for MESSAGEmanager

Geographic redundant VoIP systems with Kamailio

Fault Tolerant Solutions Overview

STRATO Load Balancing Product description Version: May 2015

Feature Comparison. Windows Server 2008 R2 Hyper-V and Windows Server 2012 Hyper-V

DISASTER RECOVERY WITH AWS

KeyLock Solutions Security and Privacy Protection Practices

Blackboard Collaborate Web Conferencing Hosted Environment Technical Infrastructure and Security

Data Sheet. V-Net Link 700 C Series Link Load Balancer. V-NetLink:Link Load Balancing Solution from VIAEDGE

Availability Digest. MySQL Clusters Go Active/Active. December 2006

Transcription:

Fault Tolerance in the Internet: Servers and Routers Sana Naveed Khawaja, Tariq Mahmood Research Associates Department of Computer Science Lahore University of Management Sciences

Motivation Client Link Failures Server Outages Router Crashes Server

Fault Tolerance Ability to consistently deliver service: Even in the presence of any unexpected errors Recover from and restore normal operation after any failure Operational Requirements: Scalability Inter-operability No loss of data Transparency

Fault Tolerant Configurations A fault tolerant system can be implemented using one of the following three fault tolerant models or configurations: Hot Standby Cold Standby Warm Standby

Hot Standby Configuration Fault tolerance ensured by: hardware redundancy software redundancy Duplication function Active Backup Complete Synchronization between the active and the backup

Merits/Demerits of Hot Standby Efficient and quick response time of the backup module The strict synchronization requirement creates an overhead Wastage of resources: the backup also has to be online with the primary.

Cold Standby Configuration The active processes the incoming traffic updating the internal state changes to a reliable storage medium. When the active crashes, the backup is updated with the state information in the repository. Storage State updates Active State updates Standby/Backup

Merits/Demerits of Cold Standby Completely asynchronous solution The backup does not have to be online and could be doing something else as long as the active is up Relatively long recovery time resulting in a noticeable time window May not be completely transparent to the client May not be suitable for routers where a delay will result in the neighboring routers becoming aware of the fault Extra non-volatile storage requirements

Warm Standby Configuration The backup is not in sync with the active (as in the hot backup scheme) and it is also not completely ignorant of what the active is doing (as in the case of cold backup) Check-pointing Periodic State updates Active Standby/Backup

Merits/Demerits of Warm Standby The backup can be kept offline until the primary fails Suitable for applications where state change is not very frequent (e.g. routers) Approximate restoration: when the active crashes, the backup takes over and starts providing the service from the last check point The transaction between the last check point and the crash (no matter how small) is lost

Case Study FTTCP: Challenges Scalability Inter-operability No loss of data Client Transparency TCP session replication and synchronization Successful determination of application failures Switching of roles within a certain time window No change in existing protocols at the client end (ideally)

Case Study: FTTCP For cold and warm backup, we need to investigate the minimal and optimal (TCP and application level) state information that is sufficient to provide fault tolerance for TCP connections. In case of the hot backup, this is automatically taken care of. We developed FTTCP based on the hot-standby configuration.

FTTCP Modules Lock-Step Synchronization Session Duplication Data duplication Non-determinism Commit Protocol Failure detection Process Health Monitoring Heartbeats Switchover IP Spoofing Connection Takeover

Implementation (Linux Kernel 2.4) Lock-Step Synchronization Session duplication: TCP sequence numbers, window size Data duplication: packet duplication at the MAC layer Non-determinism: lock step synchronization of system calls Commit protocol: control (IP) packets Failure detection Process health monitoring at the local machine Heartbeats to monitor the health of the other server Switchover IP Spoofing: incoming and outgoing Connection takeover: backup packets are no longer suppressed Application System Call TCP IP MAC

Application Modes Active Only crashes Active AND Backup Crash Error Active, Backup Active XOR Backup crashes Transient updated with state info. Active Only Crashed server back online Active, Transient Active crashes Transient crashes

Applications Fault tolerant servers TFTP, echo server (for testing) Web servers Email servers Fault tolerant routing Soft routers running BGP Client FT Server

Issues Health monitoring Deadlock detection Multithreading, signal handling Non-deterministic scheduling Inherent non-determinism Entails modification of applications Generic solution to fault tolerance is still an open challenge

Questions? Sana Naveed Khawaja, Tariq Mahmood Research Associate LUMS