Innovate Integrate Transform
Interaction Information Networks
Managing Availability and Failure Avoidance
Phil Edholm, President and Principal

Business Continuity is Availability
- Most organizations have some level of requirement to understand and document:
  - Business Continuity
  - System availability
  - Disaster recovery
  - Operational risk
- Availability of a VoIP/UC system is a critical part of Business Continuity planning and documentation
- Many organizations EXPECT high availability of their communications solutions
- Often availability is defined by "Five Nines"
9/25/2015 Consulting 2015

Understanding Nines
- Five Nines is 99.999% availability
- Or 0.001% unavailability (a fraction of 0.00001)
- Or about 5.26 minutes per year of outage (24/365)

What the Nines Mean
Equivalent average unavailability per year (525,600 minutes per year)

Availability (%)   Nines   Unavailability (minutes)   Unavailability (hours)
99.9999%           6       0.5256                     0.00876
99.999%            5       5.256                      0.0876
99.99%             4       52.56                      0.876
99.9%              3       525.6                      8.76
99.0%              2       5,256                      87.6

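The conversion behind this table is simple enough to sketch in a few lines. The function names below are illustrative, not taken from the presenter's spreadsheet, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def availability_from_nines(nines: int) -> float:
    """Availability in percent for a number of nines, e.g. 5 -> 99.999."""
    return 100.0 * (1.0 - 10.0 ** -nines)


def unavailable_minutes_per_year(availability_percent: float) -> float:
    """Average minutes of outage per year implied by an availability figure."""
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR


# Print one row per "nines" level, from two nines up to six nines.
for nines in range(2, 7):
    availability = availability_from_nines(nines)
    minutes = unavailable_minutes_per_year(availability)
    print(f"{nines} nines: {availability:.4f}% -> "
          f"{minutes:,.4f} min/year ({minutes / 60:.5f} hours/year)")
```

Each additional nine divides the allowed outage time by ten, which is why the step from three nines to five nines is so much harder than it sounds.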
VoIP/UC is Major Change from TDM
[Diagram: a TDM deployment (TDM trunks, communications system, UPS/generator, power) compared with a VoIP/UC deployment (SIP trunks, TDM trunks, SBCs and gateways, communications system core, servers/VMs, data center and data center UPS/generator, IP access (WAN, ISP, MPLS, etc.), core and distribution data network, wiring closet with UPS/generator, devices, power). Diagram labels: 16, 169, 10 Years.]

Achieving Availability
- Availability is the sum of the parts
- Multiple items can cause a failure
- Redundancy avoids the impact of a failure
[Chart: availability of the Communications Application and Equipment combined with the Data Network. Data Network scale: 99.999% (5 Mins), 99.99% (53 Mins), 99.9% (9 hours), 99% (88 Hrs) of outage per year. Combined values shown: 98% (200 Hrs), 99.8% (20 Hrs), 99.98% (2 Hrs). Highlighted result: 99.9989, 99.985, 4-8 Hours Per Year.]

Achieving Availability
- Even if the VoIP core system exceeds 5 nines, the data network will define the availability
[Chart: availability of the Communications Application and Equipment combined with the Data Network. Data Network scale: 99.999% (5 Mins), 99.99% (53 Mins), 99.9% (9 hours), 99% (88 Hrs) of outage per year. Combined values shown: 98% (200 Hrs), 99.8% (20 Hrs), 99.98% (2 Hrs). Highlighted result: 99.99, 99.85, 1-4 Hours Per Year.]

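The point of these two charts can be checked numerically. The sketch below assumes the elements sit in series (every one must be up for service to work) and fail independently, so their availabilities multiply. The input figures are illustrative, not measurements from any real system.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a 365-day year


def series_availability(availabilities_percent):
    """Availability in percent of a chain in which every element must work."""
    return 100.0 * prod(a / 100.0 for a in availabilities_percent)


def outage_hours_per_year(availability_percent):
    """Average hours of outage per year implied by an availability figure."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


# Illustrative case: a five-nines VoIP core running over a three-nines network.
voip_core = 99.999
data_network = 99.9
combined = series_availability([voip_core, data_network])
print(f"combined availability: {combined:.4f}%")
print(f"outage per year: {outage_hours_per_year(combined):.2f} hours")
```

The combined figure can never be better than the weakest element in the chain, which is why a five-nines core on a three-nines network still behaves like a three-nines service.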
Availability is Built In
[Diagram: Availability is built from System Design, Reliability & Redundancy, and Operations.]

Impact of Redundancy
Example of the impact of redundancy, shown with 75% availability (25% unavailability) per element for illustration.
- Three redundant elements: Element 1, Element 2, Element 3
- 25% probability that Element 2 fails while Element 1 has failed
- 25% probability that Element 3 fails while Elements 1 and 2 have failed
- The resulting unavailability is a failure probability of 25% x 25% x 25%, or 1.56%
- The resulting availability is 98.44%, just under two nines

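The redundancy arithmetic generalizes to any number of copies. This sketch assumes the redundant elements fail independently, which real deployments only approximate (shared power, shared software faults and operator error all break that assumption). The function name is illustrative.

```python
def redundant_availability(element_availability_percent: float, copies: int) -> float:
    """Availability in percent of N redundant copies, any one of which suffices.

    Service is lost only when every copy is down at the same time, so the
    per-element unavailability is raised to the power of the copy count.
    """
    unavailability = 1.0 - element_availability_percent / 100.0
    return 100.0 * (1.0 - unavailability ** copies)


# The slide's illustration: 75% available elements, one to three copies.
for copies in (1, 2, 3):
    print(copies, "copies:", redundant_availability(75.0, copies), "%")
```

With three copies at 75% each, the combined unavailability is 0.25 cubed, or 1.5625%, leaving 98.4375% availability.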
Example of Process
Mid-sized system with 4,000 users
- 3,600 in remote sites
- 300 Contact Center agents
Dual Data Centers
- VM clusters
- UPS, no generators
Comms System
- Redundant Comm Mgrs in DC1, backup (BU) in DC2
- Redundant Session Manager in DC1 and DC2
- Gateways in both DCs
- Trunking is SIP, with TDM as backup
Data Network
- Dual paths between sites (MPLS and fiber)
- Remote sites on MPLS, with cable as backup
- Single Ethernet switches and remote site routers
[Table: Average System Failure by System Area (Power, Core Comms, Data Network, Voice Trunking, Contact Center), with Availability Overall.]

Calculating Availability in a Complex System
Step 1 - Define Elements of the System
[Diagram elements: Power; Data Center (Servers/VMs, Network); Communications System (Core, Branches, Devices, Trunking); Data Network (Core, Edges, Building, WAN/MPLS, Internet Access).]
- Define all of the Elements that are required to deliver services
- Define all of the redundancy Elements that are built into the architecture
- Define potential inter-element interactions (for example, power fails in DC1 and a server fails in DC2)

Calculating Availability in a Complex System
Step 2 - Define Element MTBF/MTTR
- Use MTBF data to define the expected failure rate for each Element class
  - Vendor data
  - Industry data
  - Operational reports (good for power)
- Add in a factor for operator error to hardware and software MTBFs
  - Typically in Comms Systems 30% of outages are caused by operator error
- Use MTTR data to define repair timing
  - Maximum and average
  - Operational plans and commitments
  - Experience
  - Industry data
  - Operational reports (good for power)
- Combine to generate Availability and Unavailability data for each Element
  - Unavailability is Minutes per Year Unavailable: the average number of minutes per year the Element will be out of service

Calculating Availability in a Complex System
Step 3 - Calculating Element Unavailability and Availability
- Add in the Operator factor for Elements whose MTBF does not already account for operator error (typically hardware, software, or other Elements such as servers, Ethernet switches, or apps)

Calculated MTBF = Base MTBF x (1 - Operational factor), where the Operational factor is generally 30%
Unavailable minutes per year = Average MTTR (Hours) x 60 / Calculated MTBF (Years)
Unavailability (%) = 100 x Unavailable minutes per year / Total minutes per year (365 x 24 x 60 = 525,600)
Calculated Availability (%) = 100% - Unavailability (%)

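A minimal sketch of the Step 3 arithmetic follows. The function names and the example figures (a 10-year base MTBF and a 4-hour average MTTR) are hypothetical, chosen only to show the calculation, and the 30% operator factor is the typical value quoted in Step 2.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def calculated_mtbf_years(base_mtbf_years: float, operator_factor: float = 0.30) -> float:
    """Shorten the base MTBF to account for outages caused by operator error."""
    return base_mtbf_years * (1.0 - operator_factor)


def unavailable_minutes_per_year(mttr_hours: float, mtbf_years: float) -> float:
    """Average outage minutes per year: one repair of mttr_hours every mtbf_years."""
    return mttr_hours * 60.0 / mtbf_years


def availability_percent(mttr_hours: float, mtbf_years: float) -> float:
    """Calculated availability = 100% - unavailability %."""
    minutes = unavailable_minutes_per_year(mttr_hours, mtbf_years)
    return 100.0 - 100.0 * minutes / MINUTES_PER_YEAR


# Hypothetical element: vendor MTBF of 10 years, average repair time of 4 hours.
mtbf = calculated_mtbf_years(10.0)             # about 7 years after the operator factor
minutes = unavailable_minutes_per_year(4.0, mtbf)
print(f"calculated MTBF: {mtbf:.2f} years")
print(f"unavailable: {minutes:.2f} minutes/year")
print(f"availability: {availability_percent(4.0, mtbf):.5f}%")
```

Working in minutes per year, rather than percentages, keeps the later steps simple, because outage minutes from different failure sequences can be added directly.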
Calculating Availability in a Complex System
Step 4 - Define Failure Sequences
- For each Element area, define the potential failure sequences and calculate unavailability and availability for each

Calculating Availability in a Complex System
Step 5 - Define Impacts
- For each failure type, define the impacts of the failure based on the redundancy
  - System Failure: RED
  - Reduced Capacity: ORANGE
  - No Impact: GREEN

Calculating Availability in a Complex System
Step 6 - Total Impacts for Area
- Sum up the minutes of impact per year for each type of failure and generate availability for that system area

Calculating Availability in a Complex System
Step 7 - Sum the Impacts for All Areas
- Sum up the minutes of impact per year across all system areas and generate the overall system availability

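Steps 5 through 7 amount to tagging each failure sequence with an impact, summing the outage minutes per area, and then summing across areas. The sketch below is one way to express that. The data structure, field names, and every figure in the sample list are hypothetical, and adding minutes across areas assumes outages in different areas do not overlap in time.

```python
from collections import defaultdict
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


@dataclass
class FailureSequence:
    area: str                # system area, e.g. "Power" or "Data Network"
    description: str
    minutes_per_year: float  # expected minutes per year in this failed state
    impact: str              # "RED" (system failure), "ORANGE" (reduced capacity), "GREEN" (no impact)


def outage_minutes_by_area(sequences):
    """Sum expected minutes per year of RED (system failure) impact per area."""
    totals = defaultdict(float)
    for seq in sequences:
        if seq.impact == "RED":
            totals[seq.area] += seq.minutes_per_year
    return dict(totals)


def availability_percent(outage_minutes: float) -> float:
    return 100.0 * (1.0 - outage_minutes / MINUTES_PER_YEAR)


# Hypothetical failure sequences, for illustration only.
sequences = [
    FailureSequence("Power", "Utility outage outlasts UPS, no generator", 30.0, "RED"),
    FailureSequence("Data Network", "Single wiring closet switch fails", 120.0, "RED"),
    FailureSequence("Data Network", "MPLS path fails, fiber path takes over", 45.0, "GREEN"),
    FailureSequence("Core Comms", "Primary Comm Mgr fails, backup in DC2 active", 20.0, "ORANGE"),
]

by_area = outage_minutes_by_area(sequences)
total = sum(by_area.values())
for area, minutes in by_area.items():
    print(f"{area}: {minutes:.0f} min/year, "
          f"{availability_percent(minutes):.4f}% available, "
          f"{100.0 * minutes / total:.0f}% of all outage minutes")
print(f"Overall: {total:.0f} min/year, {availability_percent(total):.4f}% available")
```

The per-area share of total outage minutes is what points the analysis toward the areas worth fixing first.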
Calculating Availability in a Complex System
Step 8 - Show Impact and Percentages

Calculating Availability in a Complex System
Step 9 - Analyze and Recommend
- Evaluate key failure areas
- Recommend changes
  - Architectural
  - Structural
  - Operational
- Mitigation steps for network
  - Cellular redundancy
  - Wireless PC redundancy
  - Wireless multiplicity
  - Multiple data channels
- Create Survivability Tool

Spreadsheet Demo and Review

Survivability Tool
- Enables support organizations to rapidly understand what to do in a failure situation
- Shows what steps to take
- Red tag critical operational elements
- Steps to take for survivability
- Easy to understand
- Avoid cascading failures

Cascading Choices
- Top level location: Data Center(s), Core Network, Building
- System: Communications Core, Video Core, Network
- Failed Element: Switch, Server, App, Gateway, ..

Structured Choices
Selection: DC 1, DC 2, Net Core, Building; Power, Network, Video, VoIP; Server, Comm App, SM App, Gateway, ..
Normal
- Primary: normal operational Element
- Current redundancy Element
- Failure protection
- Protection actions (to keep running)
- Restoration actions (to get operational)
- User impact
First Redundancy
- Current redundancy Element
- Failure protection
- Protection actions (to keep running)
- Restoration actions (to get back to the primary)
- User impact
Second Redundancy
- Current redundancy Element
- Failure protection
- Protection actions (to keep running)
- Restoration actions (to get back to the primary)
- User impact

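The structure on this slide maps naturally onto a small lookup record per element. The sketch below is not the presenter's tool. The class names, fields, and the sample entry are hypothetical and only show how the Normal, First Redundancy and Second Redundancy states might be stored and retrieved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RedundancyState:
    level: str                      # "Normal", "First Redundancy", "Second Redundancy"
    active_element: str             # the element carrying service in this state
    protection_actions: List[str]   # steps to keep running
    restoration_actions: List[str]  # steps to get back to the primary
    user_impact: str


@dataclass
class ElementPlan:
    location: str                   # e.g. "DC 1"
    system: str                     # e.g. "VoIP"
    element: str                    # e.g. "Comm App"
    states: List[RedundancyState]   # ordered: index = number of failures so far

    def state_after(self, failures: int) -> Optional[RedundancyState]:
        """Return the state to operate in after N failures, or None if out of redundancy."""
        if 0 <= failures < len(self.states):
            return self.states[failures]
        return None


# Hypothetical entry, for illustration only.
plan = ElementPlan(
    location="DC 1",
    system="VoIP",
    element="Comm App",
    states=[
        RedundancyState("Normal", "Primary Comm Mgr in DC1", [], [], "None"),
        RedundancyState("First Redundancy", "Redundant Comm Mgr in DC1",
                        ["Freeze configuration changes"],
                        ["Repair and restart the primary"], "None"),
        RedundancyState("Second Redundancy", "Backup Comm Mgr in DC2",
                        ["Red tag the DC2 backup server"],
                        ["Restore a DC1 server, then fail back"], "Reduced capacity"),
    ],
)

state = plan.state_after(1)
print(state.level, "->", state.active_element, "| actions:", state.protection_actions)
```

Returning `None` once the redundancy is exhausted makes the "nothing left to fail over to" case explicit, which is the situation a survivability tool most needs to flag.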
Tool Demo

Summary
- There is a major disconnect between expectations and reality in Communications System Availability
- Achieving five nines of availability is hard
- Analyzing your customers' networks for availability will lead to design and operational choices
- An availability audit will reduce the chance of blame if issues occur

Tools and Partnering
- Send me an email and I will send either or both of the spreadsheets
- Contact me to partner for a Business Continuity Analysis for any of your clients

Innovate Integrate Transform
Interaction Information Networks
Thank You and Questions