On Reliability of COTS Hardware Dr. Li Mo Chief Architect, CTO Group
Agenda Differences between Telecom Hardware and COTS Hardware Analysis Framework Enhancing ApplicaAon Reliability via Backups TheoreAcal With one backup (1+1) With Different Types of Data Centers (type 1 type 4) Remarks
Comparing Telecom Hardware and COTS (Commercial of the Shelf) Hardware Telecom Hardware COTS Hardware Strong fault detecaon and fault isolaaon capabiliaes at hardware level Well established tradiaons on sooware upgrade, patching, and maintenance Reliably Central Office assumed May have smaller mean Ame between failure (MTBF) RelaAve smaller mean Ame to repair (MTTR) COTS procedures for sooware upgrade, patching, and maintenance contribute more to scheduled down Ame Different grade of reliability for data centers Page 3
New Item to Consider for COTS Site Down Time COTS Hardware Site downame (scheduled, non- scheduled) with varying duraaon and varying intervals Types of DC Type 1 Type 2 Type 3 Type 4 Reliability 99.671% 99.741% 99.982% 99.995% May have smaller mean Ame between failure (MTBF) RelaAve smaller mean Ame to repair (MTTR) COTS procedures for sooware upgrade, patching, and maintenance contribute more to scheduled down Ame Different grade of reliability for data centers Page 4
Common Mechanisms to Improve Reliability ApplicaKon Level Backup ZTE Proprietary Various Failures Hardwar Failure Hypervisor/OS Failure ApplicaAon Failure Master Server Data Synchronization Slave Server (Backup Server) Benefits of ApplicaAon Level Backup: Against failures Facilitate maintenance and upgrade Page 5
Agenda Differences between Telecom Hardware and COTS Hardware Analysis Framework Enhancing ApplicaAon Reliability via Backups TheoreAcal With one backup (1+1) With Different Types of Data Centers (type 1 type 4) Remarks
System availability P1 * P2 Availability: P1 VM VM VM VM VM COTS Server Availability: P2 vswitch 1 vswitch 2 VM VM VM VM VM COTS Server Page 7
Introducing COTS Focusing on Server Part of Reliability Providing 5 9s reliability with 4 9s availability per networking equipment Page 8
Various Backup Schemes Master Server Data Synchronization Slave Server Master Server Data Synchronization Slave Server (Backup 1) Data Synchronization Slave Server (Backup 2) Apps Apps Apps Apps Apps One Backup Two Backups Master Server 1 Slave Server 2 Data SynchronizaAon Master Server 2 Slave Server 1 Master Server 1 Master Server N Data Synchronization N Backup Servers Applica Aons 1:1 Load sharing ApplicaA ons Apps Apps Load Sharing Apps hdp://wwwen.zte.com.cn/endata/magazine/ztecommunicaaons/2014/3/ Page 9
Agenda Differences between Telecom Hardware and COTS Hardware Analysis Framework Enhancing ApplicaAon Reliability via Backups TheoreAcal With one backup (1+1) With Different Types of Data Centers (type 1 type 4) Remarks
A Simple Example Markov State TransiKon Model λ 1-λ M S0 S1 M 1- µ µ Legend: M Server is good M Current Master M Server is bad Chapman Kolmogorov EquaAon The Result or Page 11
1+1 System Markov State TransiKon Model Silent Error Probability Inverse of Server MTTR S 1 S 2 (1-λ)λ(1-s) S0 (1-λ)λ µ µ S 1 S 2 S 1 S 2 S1 S2 Legend: λ 2 +λ(1- λ)s µ S S Server is good Server is bad λ S 1 S3 S 2 λ Inverse of Server MTBF S Current Master Page 12
Solving the Global Balancing EquaKon for GeXng overall System Availability (1+1) ZTE Proprietary Page 13
System Unavailable Probability for various MTTR and server MTBF when s=0 in LOG scale ZTE Proprietary 0 µ MTTR=12 min MTTR=6 min 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 6.5 7 7.5 8 8.5 9 9.5 10 10.5 11 11.5 12 12.5 13 13.5 14 14.5 15-2 - 4-6 - 8-10 - 12 1/100 1/1000 1/10000 1/100000 Page 14
Differences in Availability between TheoreKcal Data and SimulaKon for 1+1 Backup Case ZTE Proprietary MTTR=6 minutes Page 15
Improvement of 1+2 (dual backup) v.s. 1+1 (single backup) Defining the percentage of improvement as Page 16
ReverKve Maintenance Before Maintenance Start Maintenance when No Fault DC A Master Server Apps Slave Server (Backup) DC A Apps Data Synchronization Slave Server DC B DC B In Maintenance Apps AJer Maintenance Start Rever%ng when No Fault Page 17
The Impact of Site Maintenance is Negligible (ReverKve) Page 18
Agenda Differences between Telecom Hardware and COTS Hardware Analysis Framework Enhancing ApplicaAon Reliability via Backups TheoreAcal With one backup (1+1) With Different Types of Data Centers (type 1 type 4) Remarks
Four Types of Data Centers (ANSI/TIA- 942) Type 1 Type 2 Type 3 Type 4 Single non- redundant distribuaon path serving the IT equipment. Non- redundant capacity components. Basic site infrastructure with expected availability of 99.671%. Meets or exceeds all type 1 requirements. Redundant site infrastructure capacity components with expected availability of 99.741%. Meets or exceeds all type 2 requirements. MulAple independent distribuaon paths serving the IT equipment. All IT equipment must be dual- powered and fully compaable with the topology of a site's architecture. Concurrently maintainable site infrastructure with expected availability of 99.982%. Meets or exceeds all type 3 requirements. All cooling equipment is independently dual- powered, including chillers and heaang, venalaang and air- condiaoning (HVAC) systems Fault- tolerant site infrastructure with electrical power storage and distribuaon faciliaes with expected availability of 99.995% Page 20
1+1 System with Site Error Markov State TransiKon Model (ReverKve) ZTE Proprietary Page 21
Service Impact Error Probability for Various Data Centers 1x10-5 1/(DC MTBF) Page 22
Agenda Differences between Telecom Hardware and COTS Hardware Analysis Framework Enhancing ApplicaAon Reliability via Backups TheoreAcal With one backup (1+1) With Different Types of Data Centers (type 1 type 4) Remarks
Remarks It is possible to provide 5 9 availability with COTS hardware with applicaaon level backup COTS Hardware The Impact of MTTR is not significant if it is reasonably small (e.g. less than 10 minutes) for typical hardware MTBF The impact of data center scheduled maintenance is negligible 5 9 availability with can only be achieved via type 3 and type 4 data centers May have smaller mean Ame between failure (MTBF) RelaAve smaller mean Ame to repair (MTTR) COTS procedures for sooware upgrade, patching, and maintenance contribute more to scheduled down Ame Different grade of reliability for data centers Page 24 mo.li@zte.com.cn
Thanks!