On Reliability of COTS Hardware



Similar documents
Managing Availability and Failure Avoidance

Introduction to IT Infrastructure Components and Their Operation. Balázs Kuti

10/15/2015. Tadiran Telecom SCTC 2015

Optimizing IT costs with Cloud. Tarmo Tikerpäe Datacenter SSP Microsoft Corporation

Cloud Computing Continued. Jan Šedivý

BICSI 002 / TIA 942 / Uptime

Introduction to Virtualization. Paul A. Strassmann George Mason University October 29, 2008, 7:20 to 10:00 PM

Designed for uptime. Defining data center availability via a tier classification system

Scality RING High performance Storage So7ware for pla:orms, StaaS and Cloud ApplicaAons

C24 - Inside the Data Center Andrew J. Luca

System Availability and Data Protection of Infortrend s ESVA Storage Solution

The Service Provider will monitor the VM for the Customer and provide notifications on an opt in basis, which is strongly recommended.

COST-BENEFIT ANALYSIS: HIGH AVAILABILITY IN THE CLOUD AVI FREEDMAN, TECHNICAL ADVISOR. a white paper by

Implementing High-Availability (HA) Solutions for Siebel ebusiness Applications

Best Practices for the Installation and Operation of an Automation Change Management Software (CMS) System

Data Centre Issues supported in TIA-942 : "Telecommunications Infrastructure for Data Centre s"

Data Center Commissioning: What you need to know

Transforming Control System to a Virtualized Platform, including On Process Migration. Anneke Vemer ExxonMobil

Designing Fault-Tolerant Power Infrastructure

Creating A Highly Available Database Solution

Specialty Environment Design Mission Critical Facilities

i. Maintenance of the operating system, applications, content on the server, or fault tolerant network connections

Toolbox.com Live Chat

Executive Summary WHAT IS DRIVING THE PUSH FOR HIGH AVAILABILITY?

Validating Cloud. June 2012 Merry Danley

Modular architecture for high power UPS systems. A White Paper from the experts in Business-Critical Continuity

COMMODITIZING THE DATACENTER. Exploring the Impacts of the Shift to Virtualization and Cloud Computing

Power protection for data centers

High Availability Solutions with MySQL

SQL Server High Availability: After Virtualization SQL PASS Virtualization Virtual Chapter September 11, 2013


TIBCO StreamBase High Availability Deploy Mission-Critical TIBCO StreamBase Applications in a Fault Tolerant Configuration

plantemoran.com {Data center. } Design standards

Westek Technology Snapshot and HA iscsi Replication Suite

NEC Corporation of America Intro to High Availability / Fault Tolerant Solutions

Cloud Based Application Architectures using Smart Computing

INDIA September 2011 virtual techdays

Simplifying the Transition to Virtualization TS17

Design and Implementation of High Availability OSPF Router *

SharePoint 2013 Best Practices

UTC3100 and 3170 POS RAID Information

Building a Tier 4 Data Center on a Tier 1 Budget

Security. Environments. Dave Shackleford. John Wiley &. Sons, Inc. s j}! '**»* t i j. l:i. in: i««;

SSC2016: SharePoint 2016 Administrator s Survival Camp

How Microsoft Designs its Cloud-Scale Servers

Creating the Conceptual Design by Gathering and Analyzing Business and Technical Requirements

IT White Paper. N + 1 Become Too Many + 1?

Lecture 26 Enterprise Internet Computing 1. Enterprise computing 2. Enterprise Internet computing 3. Natures of enterprise computing 4.

Business Requirements Document Template. SR OASDI Employee Rate Change

TOPOLOGY-INDEPENDENT IN-SERVICE SOFTWARE UPGRADES ON THE QFX5100

SCALABILITY AND AVAILABILITY

Opportunity of Energy Saving in electrical path is lesser as compared to other areas

Nutanix Tech Note. Failure Analysis All Rights Reserved, Nutanix Corporation

Five Essential Components for Highly Reliable Data Centers

Disk Array Data Organizations and RAID

Community Anchor Institution Service Level Agreement

Rob Zoeteweij Zoeteweij Consulting

Infrastructure. Oren Barshishat IT Support Manager. Adi Finkels Customer Support Manager. Sparta Systems

Solving Data Loss in Massive Storage Systems Jason Resch Cleversafe

System Requirements and Server Configuration

Data Center Services. Uncovering Colocation & Managed Hosting Opportunities

PATCHING WINDOWS SERVER 2012 DOMAIN CONTROLLERS. Prepared By: Sainath K.E.V MVP Directory Services

How To Design A Data Center

High Availability Solution

The Aspect Unified IP Five 9s Environment

Increased power protection with parallel UPS configurations

California Department of Technology, Office of Technology Services AIX/LINUX PLATFORM GUIDELINE Issued: 6/27/2013 Tech.Ref No

INFRASTRUCTURE AS A SERVICE (IAAS) SERVICE SCHEDULE Australia

Data Center Designs and Hospital Operations

Site Preparation Management Co., Ltd. March 1st 2013 By Nara Nonnapha, ATD UPTIME

TPS Virtualization and Future Virtual Developments. Paul Hodge

LinuxWorld Conference & Expo Server Farms and XML Web Services

Zero Downtime In Multi tenant Software as a Service Systems

Step-By-Step Guidelines to Formulate a High Availability Strategy for Your Business Objects Landscape Eric Vallo EV Technologies

Dat Da a t Cen t Cen er t St andar St d andar s d R o R undup o BICSI, TIA, CENELEC CENELE, ISO Steve Kepekci, RCDD

Mail-SeCure Load Balancing

Microsoft SharePoint 2010 Administration

Fax Server Cluster Configuration

FAULT TOLERANCE FOR MULTIPROCESSOR SYSTEMS VIA TIME REDUNDANT TASK SCHEDULING

Implementing Microsoft Windows 2000 Clustering

OPNFV Summit 2015 Presentation. Coexistence of Commercial Solutions with OpenSource OPNFV Platform

The Importance of Software License Server Monitoring White Paper

How To Make A Cloud Based System A Successful Business Model

Microsoft s Open CloudServer

TÓPICOS AVANÇADOS EM REDES ADVANCED TOPICS IN NETWORKS

Core and Pod Data Center Design

System Aware Cyber Security Architecture

Managing Cloud Infrastructure

Transcription:

On Reliability of COTS Hardware Dr. Li Mo Chief Architect, CTO Group

Agenda Differences between Telecom Hardware and COTS Hardware Analysis Framework Enhancing ApplicaAon Reliability via Backups TheoreAcal With one backup (1+1) With Different Types of Data Centers (type 1 type 4) Remarks

Comparing Telecom Hardware and COTS (Commercial of the Shelf) Hardware Telecom Hardware COTS Hardware Strong fault detecaon and fault isolaaon capabiliaes at hardware level Well established tradiaons on sooware upgrade, patching, and maintenance Reliably Central Office assumed May have smaller mean Ame between failure (MTBF) RelaAve smaller mean Ame to repair (MTTR) COTS procedures for sooware upgrade, patching, and maintenance contribute more to scheduled down Ame Different grade of reliability for data centers Page 3

New Item to Consider for COTS Site Down Time COTS Hardware Site downame (scheduled, non- scheduled) with varying duraaon and varying intervals Types of DC Type 1 Type 2 Type 3 Type 4 Reliability 99.671% 99.741% 99.982% 99.995% May have smaller mean Ame between failure (MTBF) RelaAve smaller mean Ame to repair (MTTR) COTS procedures for sooware upgrade, patching, and maintenance contribute more to scheduled down Ame Different grade of reliability for data centers Page 4

Common Mechanisms to Improve Reliability ApplicaKon Level Backup ZTE Proprietary Various Failures Hardwar Failure Hypervisor/OS Failure ApplicaAon Failure Master Server Data Synchronization Slave Server (Backup Server) Benefits of ApplicaAon Level Backup: Against failures Facilitate maintenance and upgrade Page 5

Agenda Differences between Telecom Hardware and COTS Hardware Analysis Framework Enhancing ApplicaAon Reliability via Backups TheoreAcal With one backup (1+1) With Different Types of Data Centers (type 1 type 4) Remarks

System availability P1 * P2 Availability: P1 VM VM VM VM VM COTS Server Availability: P2 vswitch 1 vswitch 2 VM VM VM VM VM COTS Server Page 7

Introducing COTS Focusing on Server Part of Reliability Providing 5 9s reliability with 4 9s availability per networking equipment Page 8

Various Backup Schemes Master Server Data Synchronization Slave Server Master Server Data Synchronization Slave Server (Backup 1) Data Synchronization Slave Server (Backup 2) Apps Apps Apps Apps Apps One Backup Two Backups Master Server 1 Slave Server 2 Data SynchronizaAon Master Server 2 Slave Server 1 Master Server 1 Master Server N Data Synchronization N Backup Servers Applica Aons 1:1 Load sharing ApplicaA ons Apps Apps Load Sharing Apps hdp://wwwen.zte.com.cn/endata/magazine/ztecommunicaaons/2014/3/ Page 9

Agenda Differences between Telecom Hardware and COTS Hardware Analysis Framework Enhancing ApplicaAon Reliability via Backups TheoreAcal With one backup (1+1) With Different Types of Data Centers (type 1 type 4) Remarks

A Simple Example Markov State TransiKon Model λ 1-λ M S0 S1 M 1- µ µ Legend: M Server is good M Current Master M Server is bad Chapman Kolmogorov EquaAon The Result or Page 11

1+1 System Markov State TransiKon Model Silent Error Probability Inverse of Server MTTR S 1 S 2 (1-λ)λ(1-s) S0 (1-λ)λ µ µ S 1 S 2 S 1 S 2 S1 S2 Legend: λ 2 +λ(1- λ)s µ S S Server is good Server is bad λ S 1 S3 S 2 λ Inverse of Server MTBF S Current Master Page 12

Solving the Global Balancing EquaKon for GeXng overall System Availability (1+1) ZTE Proprietary Page 13

System Unavailable Probability for various MTTR and server MTBF when s=0 in LOG scale ZTE Proprietary 0 µ MTTR=12 min MTTR=6 min 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 6.5 7 7.5 8 8.5 9 9.5 10 10.5 11 11.5 12 12.5 13 13.5 14 14.5 15-2 - 4-6 - 8-10 - 12 1/100 1/1000 1/10000 1/100000 Page 14

Differences in Availability between TheoreKcal Data and SimulaKon for 1+1 Backup Case ZTE Proprietary MTTR=6 minutes Page 15

Improvement of 1+2 (dual backup) v.s. 1+1 (single backup) Defining the percentage of improvement as Page 16

ReverKve Maintenance Before Maintenance Start Maintenance when No Fault DC A Master Server Apps Slave Server (Backup) DC A Apps Data Synchronization Slave Server DC B DC B In Maintenance Apps AJer Maintenance Start Rever%ng when No Fault Page 17

The Impact of Site Maintenance is Negligible (ReverKve) Page 18

Agenda Differences between Telecom Hardware and COTS Hardware Analysis Framework Enhancing ApplicaAon Reliability via Backups TheoreAcal With one backup (1+1) With Different Types of Data Centers (type 1 type 4) Remarks

Four Types of Data Centers (ANSI/TIA- 942) Type 1 Type 2 Type 3 Type 4 Single non- redundant distribuaon path serving the IT equipment. Non- redundant capacity components. Basic site infrastructure with expected availability of 99.671%. Meets or exceeds all type 1 requirements. Redundant site infrastructure capacity components with expected availability of 99.741%. Meets or exceeds all type 2 requirements. MulAple independent distribuaon paths serving the IT equipment. All IT equipment must be dual- powered and fully compaable with the topology of a site's architecture. Concurrently maintainable site infrastructure with expected availability of 99.982%. Meets or exceeds all type 3 requirements. All cooling equipment is independently dual- powered, including chillers and heaang, venalaang and air- condiaoning (HVAC) systems Fault- tolerant site infrastructure with electrical power storage and distribuaon faciliaes with expected availability of 99.995% Page 20

1+1 System with Site Error Markov State TransiKon Model (ReverKve) ZTE Proprietary Page 21

Service Impact Error Probability for Various Data Centers 1x10-5 1/(DC MTBF) Page 22

Agenda Differences between Telecom Hardware and COTS Hardware Analysis Framework Enhancing ApplicaAon Reliability via Backups TheoreAcal With one backup (1+1) With Different Types of Data Centers (type 1 type 4) Remarks

Remarks It is possible to provide 5 9 availability with COTS hardware with applicaaon level backup COTS Hardware The Impact of MTTR is not significant if it is reasonably small (e.g. less than 10 minutes) for typical hardware MTBF The impact of data center scheduled maintenance is negligible 5 9 availability with can only be achieved via type 3 and type 4 data centers May have smaller mean Ame between failure (MTBF) RelaAve smaller mean Ame to repair (MTTR) COTS procedures for sooware upgrade, patching, and maintenance contribute more to scheduled down Ame Different grade of reliability for data centers Page 24 mo.li@zte.com.cn

Thanks!