NFV: THE MYTH OF APPLICATION-LEVEL HIGH AVAILABILITY





EXECUTIVE SUMMARY

Telecom operators want to offer their customers services that perform as expected whenever they are requested. In fact, detailed service level agreements (SLAs) make it a business imperative to do so. But achieving a high degree of service availability is difficult, requiring significant investment and rigorous, methodical attention to detail. This paper serves as a primer on a best-practices solution for high availability (HA). Using relevant industry references as a guideline, it details why cutting corners is a bad idea and suggests a more comprehensive way to achieve a high degree of service availability with Network Functions Virtualization (NFV).

TABLE OF CONTENTS

Executive Summary
Introduction
Terms and Definitions
Availability Challenges
Prevention
Layered Approach
Conclusion

INTRODUCTION

There is a growing misconception in the software industry that high reliability can be achieved entirely through application-level redundancy schemes (load balancing, checkpointing, journaling, etc.). While such methods protect against certain failure scenarios, application-level solutions alone are insufficient to meet the demanding high availability expectations of the competitive, SLA-driven telecommunications market. Simple, stateless applications (e.g., many web servers) may benefit from simple approaches to availability, but modern, stateful services require more comprehensive frameworks. Network service providers can expect to face numerous problems and failures, and no single "one size fits all" approach can address all of them. A holistic, multilayered solution is required.

TERMS AND DEFINITIONS

To avoid confusion or ambiguity, we begin with the meaning of key terms used in the study of mission-critical systems. For the purposes of this paper, the following industry-standard definitions have been adopted:

Availability: The ability of an item to be in a state to perform a required function at a given instant of time or at any instant of time within a given time interval, assuming that the external resources, if required, are provided [1]

Reliability: The probability that an item can perform a required function under stated conditions for a given time interval [1]

Fault coverage: The proportion of the system's failure rate that is successfully detected and recovered [2]

Resiliency: The ability of the NFV framework to limit disruption and return to normal or, at a minimum, acceptable service delivery level in the face of a fault, failure, or an event that disrupts the normal operation [3]

AVAILABILITY CHALLENGES

Customer-facing, end user applications in modern networks are not monolithic, single-layered entities. Rather, they are composed of a series of carefully architected and managed components, working in close harmony with one another. This is especially true of applications deployed in an NFV environment. Whenever the operation of one component within that environment (a virtual network function [VNF], the NFV infrastructure [NFVI], a virtualization infrastructure manager [VIM], etc.) is disturbed in some way, there is a potential for a customer-visible service impact. The number of customers impacted and the duration of that impact are key variables that both solution providers and operators seek to minimize. The following table provides a few examples of the types of problems that can occur during system operation.

Table 1: Disruptive Events

Class of Event                                 | Possible Impact                                 | Impact Severity
---------------------------------------------- | ----------------------------------------------- | ---------------------------------------------------------------
Management system failure                      | Loss of application oversight                   | Low
Single server or network node failure          | Loss of specific applications or services       | Low-medium (assuming some degree of redundancy)
Network link failure                           | Service disruption                              | Low-medium (assuming some degree of redundancy)
Server maintenance (software patch or upgrade) | Service disruption                              | Low-medium (assuming in-service software maintenance supported)
Application software error or crash            | Service disruption or outage                    | Medium-high
Malicious network attack                       | Service disruption or outage                    | Medium-high
Network congestion or overload                 | Service disruption or outage                    | Medium-high
Catastrophic loss of site or point of presence | Loss of all underlying service provided by site | Highest

This list may be overly simplified, but it demonstrates a need for some form of failure mitigation to limit or, preferably, eliminate service impacts when failures occur. Unsurprisingly, industry evidence reveals that service impacts are highly costly, both in lost productivity and revenue and in operator reputation. According to a 2013 Heavy Reading study [5], service providers are spending $15 billion per year dealing with network outages.

Further, such events are the third largest cause of subscriber churn. Clearly, all possible efforts should be made to prevent outages.

Failure mitigation through application action is one technique commonly used to limit the impact of failures, and it does provide some measure of resiliency when certain failures occur. A simple example is the traditional application-level active/standby model. In this model, there are two instances of the application (or VNF, in the case of NFV): one that is actively providing service, and one that is not providing service but can take over rapidly should the active instance fail.

This model makes one critical assumption, which may not always hold: that the active and standby instances are hosted on different physical infrastructure, i.e., not on the same server. What if the underlying platform on which the application is running (e.g., OpenStack) has no knowledge that there is an active and a standby instance, and deploys both on the same physical server? Should that server suffer a sudden outage (e.g., power failure), both instances are lost at once, and a customer-visible outage will almost certainly follow.

While this set of circumstances may seem unlikely, a well-known network operator related just such an incident to the author of this paper during a conversation at an industry event in Europe in 2014. In that case, a critical network function was deployed on an enterprise cloud implementation. Both instances (active and standby) were on a shared physical server. A single fault on that server resulted in a complete outage of the application and the service it provided, with predictably unhappy consequences for the operator. To avoid such scenarios, system-level protection mechanisms are required.
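The shared-server incident described above is the kind of condition a platform-level check can catch before it causes an outage. The following is a minimal Python sketch of such a check (a rule often called anti-affinity), not a description of any particular platform's scheduler. The `Instance` type, the function names, and the VNF and server names are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    vnf: str   # logical function, e.g. "session-controller" (hypothetical name)
    role: str  # "active" or "standby"
    host: str  # physical server the instance was scheduled onto


def shared_host_violations(instances):
    """Return {(vnf, host): [roles]} for every VNF with 2+ instances on one host."""
    placed = defaultdict(list)
    for inst in instances:
        placed[(inst.vnf, inst.host)].append(inst.role)
    return {key: roles for key, roles in placed.items() if len(roles) > 1}


def place_on_distinct_hosts(vnf, roles, hosts):
    """Give each role its own host; refuse to place if there are too few hosts."""
    distinct = list(dict.fromkeys(hosts))  # de-duplicate while keeping order
    if len(distinct) < len(roles):
        raise ValueError(
            f"{vnf}: need {len(roles)} distinct hosts, have {len(distinct)}"
        )
    return [Instance(vnf, role, host) for role, host in zip(roles, distinct)]


# The incident pattern: active and standby both landed on the same server.
incident = [
    Instance("session-controller", "active", "server-1"),
    Instance("session-controller", "standby", "server-1"),
]
print(shared_host_violations(incident))

# A placement that respects the rule reports no violations.
safe = place_on_distinct_hosts(
    "session-controller", ["active", "standby"], ["server-1", "server-2"]
)
print(shared_host_violations(safe))
```

The point of the sketch is where the knowledge lives: the application cannot see which physical server it runs on, so the rule has to be enforced by the layer that does the placement.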
To entrench this thinking, the ETSI NFV Expert Group on Availability and Resiliency stipulated the following requirement in its comprehensive NFV specification:

"[Req.9.5.2] The VNFs with the same functionality should be deployed in independent NFVI fault domains to prevent a single point of failure for the VNF. In order to support disaster recovery for a certain critical functionality, the NFVI resources needed by the VNF should be located in different geographic locations; therefore, the implementation of NFV should allow a geographically redundant deployment." [6]

Underscoring the importance of this requirement, the same underlying need is repeated elsewhere in the same specification:

"It shall be assured that mechanisms which contribute to reliability and availability have the same effect when virtualised. For instance, the availability gained by running two server instances in a load sharing cluster can only be preserved if the virtualisation layer runs the two virtualised instances on two unique underlying host server instances (i.e. anti-affinity)."

However well intentioned, application-level HA solutions alone are, as this example demonstrates, insufficient to protect the system in all circumstances; additional protection schemes are necessary.

PREVENTION

Another important method of maximizing availability is to proactively seek out and identify faults as they occur within a system. Once a fault is identified, recovery and self-healing mechanisms can be initiated, preventing small problems from growing unchecked into large ones. For this reason, it is important to deploy a system with a broad, comprehensive fault coverage framework. Every component within a system contributes faults, not just the application layer. Network interface cards, hard drives, cooling units, power supplies, database management systems, and critical operating system processes all experience faults that may ultimately result in service outages if left unchecked.
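To make the fault coverage definition concrete, the sketch below computes coverage as the share of the total failure rate that falls in fault classes some mechanism detects and recovers. It is a simplified illustration under stated assumptions: each fault class is treated as either fully covered or not covered at all, and every failure rate in the example is hypothetical, chosen for easy arithmetic rather than taken from any measurement.

```python
def fault_coverage(failure_rates, covered):
    """Fraction of the total failure rate that falls in covered fault classes."""
    total = sum(failure_rates.values())
    if total <= 0:
        raise ValueError("total failure rate must be positive")
    covered_rate = sum(
        rate for name, rate in failure_rates.items() if name in covered
    )
    return covered_rate / total


# Hypothetical failure rates (failures per year); illustrative only, not measured data.
rates = {
    "application": 4.0,
    "os_process": 2.0,
    "database": 1.0,
    "nic": 1.0,
    "disk": 1.0,
    "power_supply": 1.0,
}

# Coverage when only the application monitors itself.
app_only = fault_coverage(rates, {"application"})

# Coverage when other layers also detect and recover their own faults.
layered = fault_coverage(
    rates, {"application", "os_process", "database", "nic", "disk"}
)

print(f"application-level only: {app_only:.0%}")
print(f"layered monitoring:     {layered:.0%}")
```

With these made-up numbers, application-level detection alone leaves most of the failure rate uncovered, which is the gap that detection at the other layers is meant to close.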

What guidance does the ETSI NFV Resiliency Specification provide on this topic?

"The basis for a resilient system is a set of mechanisms that reduce the probability of a fault leading to a failure (fault tolerance) and reduce the impact of an adverse event on service delivery." [4]

In other words, if the end goal is a resilient system, multiple mechanisms are required to meet that expectation. While some faults may be detected at the application level, applications running in a virtualized NFV environment are simply unable to detect all problems themselves. If there are no other protection mechanisms in place, availability will suffer. A more comprehensive, multidimensional solution is needed: one in which application-layer protections are reinforced by protections in other layers.

LAYERED APPROACH

Figure 1 depicts one of the standard NFV architectural references published by ETSI [7]. Among the main functional blocks listed are the NFVI, the NFV Management and Orchestration block (commonly referred to as MANO), and the block containing the VNFs themselves, with their respective element managers (EMs). Each of these blocks has distinct responsibilities and interfaces, including various forms of fault detection, reporting, and remediation. In fact, the ETSI NFV Expert Group on Availability and Resiliency devoted an entire chapter, "Failure Detection and Remediation," to this important topic [8]. Areas covered include the role of the hypervisor in detecting hardware faults; the role of the VIM in detecting NFVI faults; the role of MANO in detecting VIM faults; and, of course, the role of the VNF in detecting its own faults (application-level HA). There is even a section detailing how liveness checking [9] can be implemented to detect and react to failures immediately, rather than waiting for an application to detect a problem. In total, this chapter contains an impressive 25 requirements on the subject.
The message from ETSI NFV is clear: a comprehensive, layer-by-layer structured approach is required to achieve the HA and resiliency targets expected by operators.

[Figure 1: NFV architectural framework. Blocks shown: OSS/BSS; VNF 1-3 with their element managers (EM 1-3); the NFVI (virtual computing, storage, and network over a virtualisation layer and hardware resources); and NFV Management and Orchestration (NFV Orchestrator, VNF Manager(s), Virtualised Infrastructure Manager(s), and the service, VNF, and infrastructure description). Reference points shown: Os-Ma, Or-Vnfm, Ve-Vnfm, Vn-Nf, Vi-Vnfm, Vl-Ha, Nf-Vi, Or-Vi.]
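Liveness checking, mentioned above as a way to detect failures without waiting for the application to notice them, is commonly built on heartbeats: each monitored component reports in periodically, and a supervising layer flags any component that has been silent for too long. The sketch below is one minimal way to express that idea in Python. It is not taken from the ETSI specification; the `HeartbeatMonitor` class, its method names, and the component names are hypothetical, and the clock is injectable only so the behavior can be exercised without real delays.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per component and reports the silent ones."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock      # injectable so tests can control time
        self._last_seen = {}

    def beat(self, component):
        """Record that a component has just reported in."""
        self._last_seen[component] = self._clock()

    def expired(self):
        """Return the sorted names of components silent for longer than the timeout."""
        now = self._clock()
        return sorted(
            name
            for name, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )


# Example with a fake clock: "vnf-1" stops reporting, "vim" keeps going.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: fake_now[0])
monitor.beat("vnf-1")
monitor.beat("vim")

fake_now[0] = 2.0
monitor.beat("vim")

fake_now[0] = 4.0
print(monitor.expired())  # "vnf-1" has been silent for 4 s, beyond the 3 s timeout
```

In a layered design, a monitor of this kind would sit outside the component it watches (for example, the VIM watching NFVI hosts), so that a hung or crashed component is detected even though it can no longer report its own failure.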

The Wind River Titanium Server NFVI software platform was purpose-built to achieve these goals. Titanium Server implements a multilayered, proactive fault detection and recovery system. Through policies and metadata, it understands and enforces application deployment models, and it constantly monitors system operations at the NFVI, VIM, VNF, and VNFM levels. Upon failure detection, Titanium Server reacts autonomously, taking self-healing and service preservation actions. Standard telco alarms are generated, informing operators of system status and of remediation in progress or completed. Performance metrics are pegged and available for off-board retrieval and analysis, if so desired.

CONCLUSION

By implementing a layered approach to HA and following recommendations from ETSI, Titanium Server has set a high bar with a platform availability target of six nines (99.9999%), or about 30 seconds of planned plus unplanned downtime per year. This enables services deployed on Titanium Server to achieve five nines (99.999%) availability, or about five minutes of downtime per year (planned plus unplanned). Achieving such targets without a comprehensive, multilayered approach to HA is not possible; it is simply a myth.

REFERENCES

1. Network Functions Virtualisation (NFV); Resiliency Requirements, ETSI GS NFV-REL 001 V1.1.1, page 9.
2. Wind River Titanium Server Reliability, Availability, Maintainability (RAM) Modeling Analysis, Version 4.0, KerNett Consulting Inc., page 7.
3. Network Functions Virtualisation (NFV); Terminology for Main Concepts in NFV, ETSI GS NFV 003 V1.2.1, page 8.
4. Network Functions Virtualisation (NFV); Resiliency Requirements, ETSI GS NFV-REL 001 V1.1.1, page 33.
5. Heavy Reading, Mobile Network Outages and Degradations, October 2013.
6. Network Functions Virtualisation (NFV); Resiliency Requirements, ETSI GS NFV-REL 001 V1.1.1, page 42.
7. Network Functions Virtualisation (NFV); Architectural Framework, ETSI GS NFV 002 V1.2.1, page 14.
8. Network Functions Virtualisation (NFV); Resiliency Requirements, ETSI GS NFV-REL 001 V1.1.1, page 42.
9. Network Functions Virtualisation (NFV); Resiliency Requirements, ETSI GS NFV-REL 001 V1.1.1, page 46.

Wind River is a world leader in embedded software for intelligent connected systems. The company has been pioneering computing inside embedded devices since 1981, and its technology is found in nearly 2 billion products. To learn more, visit Wind River at www.windriver.com.

© 2015 Wind River Systems, Inc. The Wind River logo is a trademark of Wind River Systems, Inc., and Wind River and VxWorks are registered trademarks of Wind River Systems, Inc. Rev. 05/2015