Service Level Agreement Monitoring for Resilience in Computer Networks

Noor-ul-hassan Shirazi, Alberto Schaeffer-Filho and David Hutchison
School of Computing and Communications, InfoLab21, Lancaster University, Lancaster, UK
Email: n.shirazi@lancaster.ac.uk, asf@comp.lancs.ac.uk, dh@comp.lancs.ac.uk

Abstract
Today's communication networks are migrating from resource-based networks to service-based networks. These services could include, for example, financial, trading and other corporate services. The failure of such services would have a significant impact, so it is important to ensure the resilience of a service as specified in its Service Level Agreement. In this paper we propose the design of a test suite that stakeholders can use to investigate the resilience and security requirements defined in a Service Level Agreement. Service providers can also use this design to check compliance before delivering a service, and thus improve their performance. This paper is also intended to stimulate discussion of the need to categorize resilience parameters in existing SLAs and to integrate SLAs with policy-based control for better management.

Keywords: Service Level Agreement; policy based control; resilience; computer networks

I. INTRODUCTION
Service Level Agreements (SLAs) are the main drivers in today's service-oriented networks and act as a binding contract between service provider and customer. SLAs are part of a wider process called Service Level Management (SLM). In the UK, during the late 1980s, important work was carried out by the Central Computer and Telecommunications Agency (CCTA), now fully subsumed into the Cabinet Office¹, which resulted in the collection of best practices called the Information Technology Infrastructure Library² (ITIL). The ITIL framework defines various aspects of service management and introduces the service level agreement as one of its main concepts. ITIL is the most widely accepted approach to IT service management today.
However, it does not address the monitoring and auditing of service level agreements. SLA management and specification are becoming a key differentiator in service providers' offerings [1], because the goal of an SLA is shifting from being a technical contract towards being a mechanism for managing customers' expectations. Typically, SLAs state the contracted quality or performance of the service being sold to customers in plain language. They spell out performance details in terms of service levels such as serviceability, performance, operation and other attributes of the service. With the growing number of service-oriented networks such as cloud computing, whose key strength lies in shared resources, there is growing commercial interest in contracts that cover not only performance but also resilience and security parameters such as availability (MTBF, Mean Time Between Failures; MTTR, Mean Time To Repair), cost to repair, lost or affected traffic, resilience to single or multiple failures, and similar measurable attributes.

Resilience is the ability of the network to maintain an acceptable level of operation in the face of challenges such as malicious attacks, operational overload, misconfigurations or equipment failures [2]. Assuring resilience in a heterogeneous environment is a challenging task in terms of monitoring SLA compliance and effectively measuring the service levels detailed in the contract. We consider this problem in the context of a general two-phase, high-level network resilience strategy called D²R² + DR: Defend, Detect, Remediate, Recover + Diagnose, Refine [2]. The first phase comprises the use of defensive measures to protect the network from foreseeable challenges, the ability to detect challenges in real time and subsequently remediate their effects before network operation is compromised, and finally the recovery procedures that return the network to normal operation. The second phase primarily involves improving service levels through the diagnosis and refinement of operations.
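As a brief aside on the availability parameter named above: steady-state availability is conventionally computed as MTBF / (MTBF + MTTR). The following minimal Python sketch shows how an SLA availability target could be checked against those two figures. The function names and the sample MTBF and MTTR values are illustrative assumptions, and the 0.96 target is borrowed from the use case in Section IV; none of this is part of the proposed design itself.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def meets_target(mtbf_hours: float, mttr_hours: float, target: float) -> bool:
    """Return True if the computed availability reaches the SLA target."""
    return availability(mtbf_hours, mttr_hours) >= target


# Hypothetical figures: one failure every 480 h on average, 20 h to repair.
print(availability(480.0, 20.0))        # 0.96
print(meets_target(480.0, 20.0, 0.96))  # True
```

With these assumed figures the service just meets a 96% target; a longer repair time with the same failure rate would fall below it.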
The D²R² + DR strategy can serve as a design blueprint for resilient networks. In [3] the authors used this strategy to explore a policy framework for defining configurations of mechanisms that can ensure the resilience of multi-service networks. They used policies to realize high-level requirements defined in an SLA in order to enforce the resilience of a network under DDoS attack. A similar strategy was used in [4], which presented a novel solution that enables the progressive multi-stage deployment of resilience strategies based on incomplete challenge and context information. The authors demonstrated their approach via simulation of a resource starvation attack on an Internet Service Provider infrastructure.

In the context of SLAs, the problem therefore has two parts: first, analyzing how the service levels provided by service providers can be measured efficiently; and second, auditing the service for compliance with its service level objectives. This paper presents an architecture for SLA monitoring based on the above-mentioned conceptual control loop for network resilience (see Figure 1).

¹ Cabinet Office: http://www.cabinetoffice.gov.uk/resource-library/best-management-practice-portfolio
² Information Technology Infrastructure Library (ITIL): http://www.itil-officialsite.com/
ISBN: 978-1-902560-26-7, 2012 PGNet

Moreover, each network has its own SLA, which extends only as far as that network's borders; the Verizon service level agreement for private IP satellite access [8] is one example. From a resilience perspective, it is fundamental to have a framework that binds service providers to upgraded protection over the entire service duration.

Figure 1. Resilience Strategy

This paper is structured as follows. Section II presents related work on SLAs. Section III explains the proposed architecture, and a use case is presented in Section IV. Future work and some foreseeable issues for the design are discussed in Section V, and Section VI summarizes the paper.

II. RELATED WORK

Service Level Agreement Specifications
SLAs provide the basis for the contract and set out the minimum acceptable standards for the service and the customer requirements that have to be met. In practice an SLA contains service level specifications (SLSs), which quantify the minimum acceptable level of service required by the customer. They are therefore equally important to customers and service providers. The extent of detail in an SLS depends on the importance and complexity of the service, and SLSs can be of different types, such as SLSs for security services or for QoS guarantees. In current SLA specification approaches [5], SLAs are no more than formal written agreements between customer and service provider. In [5], the author explains a stepwise procedure for developing an SLA. However, the process of service level specification is described only in very general terms. The Web Service Level Agreement (WSLA) project addresses service level management issues for SLA specification in web services environments [6]. However, WSLA covers only the agreed common view of a service between the parties involved. In [7], the authors explain some underlying concepts in SLA specification based on a pre-understandings approach, but do not specify operational SLA concepts.
Until now, no SLA has been defined that is adapted to the specific needs of resilience, such as lost or affected traffic, mean time to recover, and resilience to single or multiple failures. Current SLA specifications therefore do not meet the requirements of service subscribers, for example in cloud computing, where the complex nature of the environment makes it difficult to determine the root cause of a service interruption with respect to the SLA.

Service Level Agreement Monitoring
SLAs are offered by several ISPs (e.g., [9] and [10]) and contain different service level objectives such as round-trip delay, packet loss and throughput. However, their scope is limited and they are valid only under pre-defined conditions. In [11] the authors analyze how the service levels provided by ISPs can be measured and why SLA monitoring is important. In [12, 13], the authors developed a component for SLA monitoring, but its main focus is monitoring service availability, the success of user registration, and session setup. It does not measure the performance of a service in a challenged environment. In [14] the policy-based network management (PBNM) model is discussed, which translates service priorities into QoS policies and enforces them. However, PBNM lacks a concrete implementation and so does not solve the problem of mapping a high-level SLA to the network level.

Service Level Agreement Compliance and Auditing
There are also auditing systems that check service behaviour by comparing nominal and actual values, where the nominal values are provided by SLAs; for example, Cisco IOS IP Service Level Agreements [15] observe IP performance between two routers. In [16], the authors present an auditing system for QoS-enabled networks that is closely coupled with policy-controlled QoS provisioning in the network. They introduce online test cases that detect service level specification violations.
However, a detailed explanation of how these test cases should be constructed and how the service levels should be measured is missing. A generic framework for automated auditing called AURIC is presented in [17], but actual fact finding is ignored. SLAs are now a central focus because networks are transforming from network-centric to service-centric, and we therefore believe there is a need for an SLA monitoring and compliance system based on a high-level resilience strategy. This paper presents a conceptual architecture for SLA monitoring and compliance based on the network resilience control loop.

III. PROPOSED ARCHITECTURE
In this section we present the overall design of the architecture, which is illustrated in Figure 2, and map it to the network resilience control loop strategy that served as the blueprint for our design. Many business situations can be identified in which service providers need to monitor SLAs. Sometimes there is a need to monitor service levels for services that are already being delivered; in other situations a service provider may need to check compliance for a service that is still in its design phase and has yet to be delivered. We will restrict ourselves to the latter situation, to emphasize that incorporating this model at an early stage, when the SLA still has to be written, can lead to better service level objectives.

Figure 2. Service Level Agreement Monitoring Architecture

1) Defend: To support business objectives, service providers rely on IT services that are built upon the technical infrastructure. These services are designed with a customer-focused approach and pass through pre-stages such as specification, design and development. These pre-stages are important indicators for defining service levels and are usually part of a broader framework for service management. They give insight before service level agreements are drawn up, such as information about the environment in which the service has to run, other dependent processes, and resource utilization. Once the service is designed, the SLA is written with clear service level objectives; the SLA information is then translated into SLA metrics, which are stored in agents. This SLA will also contain a behaviour specification, such as the environment details described above, which will be used to derive test cases for finding the root cause of a service failure during diagnosis and refinement.

2) Detect: The core of the proposed architecture is the decision module, which controls and communicates with all agents. There may be multiple agents, depending on the underlying infrastructure of the enterprise, such as servers, network devices, systems and applications. These agents store local SLA data derived from the SLA metrics. The decision module consists of two sub-modules: Monitoring and the Policy Decision Point (PDP). The monitoring sub-module requests service level reports from the agents and performs the appropriate change detection. If immediate policy control is required, it invokes the policy management module, and policies are pushed via the Policy Enforcement Point (PEP).
We assume policies will realize high-level requirements for resilience, e.g., in terms of the availability of a server farm and the services it provides.

3) Remediate and Recover: To allow a high degree of flexibility in policy actions, policy enforcement is isolated from the policy decision point. Policy analysis can help to ensure the correct specification of resilience strategies [3]. Policies that result from non-compliance are pushed onto components via the policy enforcement point. The policy management module also stores the definitions of all policies extracted from the SLA, together with other policies that are specific to the underlying architecture or infrastructure on which the service has to run.
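The separation between decision and enforcement described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names, the POLICY_MAP table, the violation labels and the action strings are all hypothetical. The decision point only selects policy actions from a mapping, and the enforcement point is the only part that applies them to a component.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from a detected violation to policy actions.
POLICY_MAP = {
    "dos_attack": ["block_source_ip"],
    "low_availability": ["reroute_traffic", "restart_service"],
}


@dataclass
class PolicyEnforcementPoint:
    """Applies actions to components; in this sketch it only records them."""
    applied: list = field(default_factory=list)

    def enforce(self, component: str, action: str) -> None:
        self.applied.append((component, action))


@dataclass
class PolicyDecisionPoint:
    """Chooses actions for a violation and hands them to the PEP."""
    pep: PolicyEnforcementPoint

    def decide(self, component: str, violation: str) -> list:
        # Unknown violations map to no action rather than raising.
        actions = POLICY_MAP.get(violation, [])
        for action in actions:
            self.pep.enforce(component, action)
        return actions


pep = PolicyEnforcementPoint()
pdp = PolicyDecisionPoint(pep)
pdp.decide("core-router", "dos_attack")
print(pep.applied)  # [('core-router', 'block_source_ip')]
```

Keeping the mapping in data rather than in the decision logic means that new policies can be added without touching either class, which is one way to obtain the flexibility the separation is meant to provide.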

4) Diagnose and Refine: To find the root cause of a failure, certain test cases need to be run, such as a connectivity check if a server is unavailable due to high load on the network. These test cases are derived from the behaviour specification extracted from the service level agreement.

The detailed functionality of each module is as follows.

A. Decision Module
The decision module verifies that the state of the system complies with the rules agreed upon by the customer and the service provider. It consists of two sub-modules, monitoring and the policy decision point, as explained above. The monitoring module communicates with the agents in the enterprise environment and requests event statistics and service level reports. If service levels are not in accordance with the SLA, it triggers the policy management module for refinements such as the reconfiguration of a component in the enterprise environment. All events and statistics from the environment, at component and service level, are collected by this module via the agents, and the appropriate change detection is performed. If immediate policy control is required, it is sent to the enterprise via the policy enforcement sub-module. For example, in the case of a DoS attack an IP address may need to be blocked on the core router; this action is requested from the existing policy mapping definition and pushed to the component level.

B. Agents
Agents are the operational heart of the overall architecture. There may be multiple agents, as required by the underlying topology, and they logically reside at both the customer and the service provider end. Their functionality is as follows. Agents store local SLA data derived from the SLA metrics and are embedded in the environment in which the service has to run. Multiple agents may run in the enterprise at various levels. They send events and statistics to the monitoring module of the detection unit, and are triggered only when a deviation occurs.
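A minimal Python sketch of such an agent is given below, assuming hypothetical metric names and thresholds. The agent holds its local SLA thresholds, compares each measurement against them, and reports only the deviations; the returned list stands in for the events sent to the monitoring module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaMetric:
    """One locally stored SLA metric with its agreed bound."""
    name: str
    threshold: float
    higher_is_better: bool  # True for bandwidth, False for delay or jitter

    def violated_by(self, value: float) -> bool:
        if self.higher_is_better:
            return value < self.threshold
        return value > self.threshold


class Agent:
    """Stores SLA metrics and reports only measurements that deviate."""

    def __init__(self, metrics):
        self.metrics = {m.name: m for m in metrics}

    def check(self, measurements: dict) -> list:
        alarms = []
        for name, value in measurements.items():
            metric = self.metrics.get(name)
            # Measurements with no stored SLA metric are ignored.
            if metric is not None and metric.violated_by(value):
                alarms.append((name, value, metric.threshold))
        return alarms


# Hypothetical thresholds and readings, for illustration only.
agent = Agent([
    SlaMetric("bandwidth_mbps", 10.0, higher_is_better=True),
    SlaMetric("jitter_ms", 30.0, higher_is_better=False),
])
print(agent.check({"bandwidth_mbps": 12.0, "jitter_ms": 45.0}))
# [('jitter_ms', 45.0, 30.0)]
```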
Agents monitor resilience parameters such as bandwidth, compare them against the stored SLA, and send an alarm to the decision module. Since management policies for the enterprise are derived from business goals, the SLA will contain service management policies. The policy management module stores these policies and therefore provides a clear semantic representation of them.

C. Test Cases
To meet service levels such as delay, packet size, jitter and bandwidth, the environment is pre-set for the service to run. If a deviation from the service level agreement is detected as a result of challenges faced by the environment, certain test cases need to be performed in order to find the root cause. For example, a client may be running a video conferencing application that requires priority and a certain level of CPU utilization. During an attack such resources will not be available, and the test cases module will then invoke test cases to measure the service against expectations as it changes over time. Any violation detected by a test case identifies a possible cause of the fault and indicates non-compliance. Policy control may then invoke corrective actions to repair the fault so that the system returns to compliance with the existing SLA.

D. Behaviour Specifications
To derive test cases we require knowledge of the network environment, which is drawn from the underlying architecture of that environment. Normally, such information can also be drawn from the pre-development stages of the service, such as the specification and design phases. Deriving such information is particularly challenging for widely heterogeneous environments. We will use the behaviour specification to draw this knowledge base from the existing SLA. This will enable us to define more refined test cases.

E. Policy Management Module
This module contains the specifications of the service-to-policy mapping.
It also stores policy definitions and provides the Policy Enforcement Point, which issues control instructions to invoke corrective actions to repair a fault so that the system returns to compliance with the existing SLA.

IV. USE CASE
Service providers offer many real-time services to keep up with the high demands of service-oriented networks, such as VoIP, video conferencing (VC) and e-commerce. To illustrate the proposed architecture we use an example in which a service provider offers a VC service and needs to test compliance with the service level agreement before delivering the service. The service level agreement could state service levels such as 96% availability, 100 frames per second, CPU cycles and bandwidth requirements. To meet the service levels customers expect, service providers make sure that services running on their infrastructure are subject to resource management. Developers therefore make fair estimates during the pre-deployment phases, which we have referred to as pre-stages: specification, design and development. These stages are important for deriving the behaviour specification and test cases for our monitoring architecture. The service levels are thus translated into metrics, which are stored in agents running in the service provider's network. Agents can be placed at various levels, such as the component, server and client side, because non-compliance can also arise when a client runs too many processes and therefore does not get the required service level (frames per second, in our scenario). Once the service goes live, the decision module requests events and statistics from the agents in the form of a current SLA report. Based on the information received, the appropriate change is calculated to return to compliance. Suppose the server offering VC comes under a DoS attack; in that case the agent running on the server side will stop responding because of the unexpected load on the network. In such a scenario some corrective action is required, and the policy decision point invokes policy management; certain policies can then be used to remediate in real time, such as blocking an IP address on the core router or firewall, so that the system returns to compliance. Moreover, based on the symptoms, certain test cases can be executed to identify the exact location of the violation, such as checks of current network load, ICMP messages for connectivity, and round-trip time. As we can see, compliance with the service level agreement can be observed and verified using agents and the decision module, based on effective policy management.

V. DISCUSSION AND FUTURE WORK
The proposed architecture currently serves the needs of corporate customers; its usefulness for domestic customers will be explored in future work. Some other foreseeable issues with the proposed architecture are as follows.

SLA metrics are required to describe the desired performance standard and thus form the basis of an architecture for monitoring service delivery. These metrics need to be developed in a way that provides actionable information; currently they are too general to be meaningful. Accurately translating an SLA into such metrics will be challenging.

Current and existing means of collecting and analyzing such metrics need to be evaluated.

SLA mapping and its configuration for agents need further research.

A simulation model needs to be designed that can provide actual states, i.e., measurements such as jitter, throughput, round-trip time and availability.

Deriving test cases from the behaviour specification and the characterization of the SLA is integral to the design and will require further research.

VI. CONCLUSION
SLAs are a central focus for service-oriented industries, yet there is no framework for SLA monitoring and compliance during service usage that makes it possible to react quickly when non-compliance with agreed service levels is detected and, by means of auditing, to anticipate any degradation of service levels as early as possible.
The problem is compounded by the scale of the Internet, where service levels are harder to specify. It is therefore complex to construct a model that can quantify the resilience parameters derived from an SLA and monitor compliance for heterogeneous networks across different levels. In this paper we have presented a conceptual architecture for such a model, based on the network resilience strategy control loop, to stimulate discussion of its importance for the next-generation Internet, where policy-based SLA compliance, monitoring and management are integral.

REFERENCES
[1] Marilly, E., Martionot, O., Papini, H., Goderis, D., "Service Level Agreements: A main challenge for Next Generation Networks", Adaptive Networks and Services (ADANET), May 2003.
[2] J.P.G. Sterbenz, D. Hutchison, E.K. Cetinkaya, A. Jabbar, J.P. Rohrer, M. Scholler, and P. Smith, "Resilience and survivability in communication networks: Strategies, principles, and survey of disciplines", Computer Networks: Special Issue on Resilient and Survivable Networks (COMNET), vol. 54, no. 8, pp. 1245-1265, June 2010.
[3] P. Smith, A. Schaeffer-Filho, A. Ali, M. Schöller, N. Kheir, A. Mauthe, D. Hutchison, "Strategies for Network Resilience: Capitalising on Policies", in Proceedings of the 4th International Conference on Autonomous Infrastructure, Management and Security (AIMS 2010), ser. LNCS, Zurich, Switzerland, Springer, June 2010, pp. 118-122.
[4] Yu, M. Fry, A. Schaeffer-Filho, P. Smith, and D. Hutchison, "An Adaptive Approach to Network Resilience: Evolving Challenge Detection and Mitigation", in Proceedings of the 8th International Workshop on Design of Reliable Communication Networks (DRCN 2011), Krakow, Poland, October 2011, pp. 172-179.
[5] Facilities Societies, "Service Level Agreements and Service Level Specifications", http://www.facilities.ac.uk/j/cpd/62-facilitymanagement/119-slas-and-service-specifications.
[6] A. Dan, et al., "Web Services on demand: WSLA-driven Automated Management", IBM Systems Journal, Special Issue on Utility Computing, Volume 43, Number 1, pages 136-158, IBM Corporation, March 2004.
[7] Bouman, J., Trienekens, J., "Specification of Service Level Agreements, clarifying concepts on the basis of practical research", Software Technology and Engineering Practice, 1999, pp. 169-178.
[8] Verizon, "Private IP Satellite Access Service Level Agreement", http://www.verizonbusiness.com/terms/us/products/satellite_services/private_ip/.
[9] BYTEMark Hosting, "Service Level Agreements", http://www.bytemark.co.uk/company/sla.
[10] British Telecom, "BTNet Schedule Service Level Agreement", http://www.productsandservices.bt.com/btbusiness/btbusinessproducts/pdfs/prd200022_service_level.pdf.
[11] "Mobility and Differentiated Services in a Future IP Network", Fifth European Union Framework Programme (FP5), Information Society Technologies (IST) Project, IST-2000-25394.
[12] Kurtansky, P., et al., "Extensions of AAA for Future IP Networks", Wireless Communications and Networking Conference (WCNC), 2004.
[13] Din, G., Hayakawa, A., Schieferdecker, I., Deussen, P., "An Auditing System for QoS-Enabled Networks", 23rd International Conference on Distributed Computing Systems Workshops (ICDSW'03), 2003.
[14] IETF (Internet Engineering Task Force), "PBNM: Policy based Network Management Model", http://www.ietf.org/proceedings/46/46th-99novietf-92.html.
[15] "Cisco IOS Service Level Agreement Data Sheet", http://www.cisco.com/en/us/technologies/tk648/tk362/tk920/technologies_white_paper0900aecd8017531d.html.
[16] Eyermann, F., Stiller, B., "A Protocol to Support Multi-domain Auditing of Internet-based Transport Services", Second International Conference on Internet Monitoring and Protection (ICIMP 2007), 2007.
[17] Hasan, H., Stiller, B., "A Generic Model and Architecture for Automated Auditing", 16th IFIP/IEEE International Workshop on Distributed Systems: Operation and Management (DSOM 2005), Barcelona, Catalunya, Spain, October 24-26, 2005, pp. 121-129.