SEcure Cloud computing for CRitical infrastructure IT

Contract No.
Deliverable 4.2: Resilient Cloud Management

Partners: AIT Austrian Institute of Technology; ETRA Investigación y Desarrollo; Fraunhofer Institute for Experimental Software Engineering IESE; Karlsruhe Institute of Technology; NEC Europe Ltd.; Lancaster University; Mirasys; Hellenic Telecommunications Organization OTE; Ayuntamiento de Valencia; Amaris

Document Control Information

Title: Deliverable 4.2, Resilient Cloud Management
Editor: Steven Simpson (ULANC)
Author(s): Steven Simpson (ULANC), Noor-ul-hassan Shirazi (ULANC), Simon Oechsner (NEC), Michael Watson (ULANC), Andreas Mauthe (ULANC), David Hutchison (ULANC), Markus Tauber (AIT), Christian Jung (IESE), Manuel Rudolph (IESE), Reinhard Schwarz (IESE), Matthias Flittner (KIT), Roland Bless (KIT)
Classification: Red (highly sensitive information, limited access), Yellow (restricted, limited access), Green (restricted to consortium members), White (public)
Internal reviewer(s): Paul Smith (AIT), Thomas Puschacher (AMARIS)
Review status: Draft; WP manager accepted; Coordinator accepted
Requested deadline: 2014/12/19

Versions (Version / Date / Change / Comment, Editor):
/12/08: Version for review (Steven Simpson, ULANC)
/12/18: Incorporated comments from review (Steven Simpson, ULANC)
/12/19: Updated header (Steven Simpson, ULANC)

Summary

Cloud environments make resilience more challenging due to the sharing of non-virtualized resources, frequent reconfigurations, cyber-attacks and legal aspects. We present a Cloud Resilience Management Framework (CRMF), which models and then applies an existing resilience strategy to a cloud operating context to diagnose anomalies. The framework uses an end-to-end feedback loop that allows remediation to be integrated with existing cloud management systems. We demonstrate the applicability of the framework with use cases for effective cloud resilience management. The framework will be further refined according to results from a prototype implementation and from experiments in the context of the demonstration scenarios considered in the project.

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Purpose of the deliverable
  1.4 Legal Issues
2 Background
  2.1 Project architecture
  2.2 Network resilience
  2.3 Resilience in the cloud
  2.4 Policy-based Resilience Management
  2.5 Related Work
3 Cloud Resilience Management Framework (CRMF)
  3.1 Architecture (Design requirements)
  3.2 Deployment function (SLA parser; Placement function; Deployment template generator)
  3.3 Anomaly detection (Data collection engine; Network analysis engine; System analysis engine; Resilience policies (RP))
  3.4 Policy Engine (IND²UCE Framework; Data Usage Control Policies; Policy Engine in CRMF)
4 CRMF component detail
  4.1 Deployment function (SLA parser; Placement function; Deployment template generator; Interaction with tools for Audit Trails and Root-Cause Analysis (AT-RCA))
  Online anomaly detection (Data collection engine; One-class SVM for SAE; Recursive density estimation)
  Resilience policies
  Data-usage policies
5 Quantitative evaluation
  Test-bed; Software tools; Evaluation metrics
  Tuning the SAE detection algorithm; Proposed RDE algorithm for NAE
  Evaluation of SAE with malware samples (Detecting Kelihos; Detecting Zeus)
  Evaluation of NAE with network-level attacks
6 Qualitative evaluation
  Failure recovery of a virtual machine with minimum interruption to a service
  Anti-affinity
  Service of traffic management system is not available due to Denial-of-Service attack
7 Conclusions and Future Work
Appendices
  A Linkage to other project results
  B Detection results of portscan using RDE
  C Detection results of netscan using RDE

List of Figures

Objective mapping to CRMF
Architectural Framework
Components mapping to architecture
The D²R²+DR resilience strategy
D²R² components
A Resilience-oriented View
Collaborative interface for resilience management
An overview of the CRMF architecture
The flow of information and hierarchy of engines within a CRMF
IND²UCE Framework
CRMF system architecture
Internal structure of the deployment function
Parameter Classes
A monitoring-oriented view identifying different interfaces
Overview of Data Collection Engine
An Overview of Policy Engine
Experimental setup test-bed
Time taken to train the classifier vs. training data-set size
Time taken to output a class vs. training data-set size
Original RDE
Modified RDE
The performance metric comparison of original vs. modified algorithm
Results of detection for Kelihos-5 using end-system features and a variety of kernel parameters
Detection of Zeus-1433 using end-system features with a tuned classifier
Detection of various Zeus samples using end-system data and a tuned classifier
Depiction of migration during normal period, with anomaly directed at static host (NM-MT0)
Depiction of migration during anomalous period, with anomaly directed at static host (AM-MT0)
Depiction of migration during normal period, with anomaly directed at migrating host (NM-MT1)
Depiction of migration during anomalous period, with anomaly directed at migrating host (AM-MT1)
Legend for depictions of anomaly/migration interaction
Detection results of high-intensity DoS attack when migration happens during an anomalous period
Detection results of high-intensity DoS attack when migration happens during a normal period
The Mirasys test case for VM failure recovery
Organisations involved in use case
Results from simulations that implement an incremental detection and remediation approach to a DDoS attack [YFSF+11]

List of Tables

Summary of characterisation
Detection results of DoS attack with MT0 under high and low intensity
Detection results of DoS attack with MT1 under high and low intensity
Linkage to Deliverables

1 Introduction

Due to the recent trend of using cloud computing for Critical Infrastructure (CI) services, it is becoming increasingly important for clouds and cloud-based services to be resilient, i.e., able to maintain operations and services even in the presence of challenges. To achieve this, a number of resilience-supporting mechanisms are needed at various locations and layers (tenant and physical infrastructures), such as provisioning of redundant components, traffic capturing, anomaly detection and classification, and network reconfiguration. Moreover, these mechanisms require appropriate coordination to identify the occurrence of a challenge. The characteristics of the challenges to be detected determine which mechanisms can then be used, and where they should be placed. More than one detection algorithm could be used, for example an always-on, low-cost anomaly detection algorithm alongside an offline, more computationally expensive classifier to diagnose the root cause of an anomaly. Coordinating such mechanisms is therefore non-trivial, for several reasons:

- They might work on different timescales. For example, the deployment function of cloud orchestration might foresee the need for scalable components and react to load changes on a longer timescale by scaling out a service, whereas reacting to an anomaly requires short-term decision-making to mitigate a potential attack.
- Distinct mechanisms might also have different objectives, and attempt to perform conflicting actions. A reaction to an anomaly might be to migrate VMs away from an affected node, but this migration might conflict with standing policies to keep some VMs apart (because they belong to rival tenants) or together (to reduce latency). Reconfiguration of the mechanisms could unexpectedly produce such conflicts and discrepancies.

This deliverable introduces a Cloud Resilience Management Framework (CRMF), whose goal is to facilitate the interactions of components that implement resilience mechanisms, as well as the (re-)configuration of such a system. The components of this framework aim to detect challenges and events in the system, such as an attack or a network overload, and to react to these events by reconfiguration, in order to maintain the operation of the system and, therefore, the critical services running on it. Many of the decisions to react to events are expressed as Event-Condition-Action (ECA) policies, which bind the components and mechanisms together. Policies sit idle until associated events occur and their conditions are met. Their actions may cause the reconfiguration of the system, including the deployment or activation of new components, which can in turn generate new events and trigger further policies. This continuous process constitutes a policy-driven resilience feedback control loop, which allows (for example) an initial network anomaly to activate closer inspection of traffic, eventually leading to reconfiguration of a firewall. Such a process thus automatically handles a challenge to the system, without having expensive mechanisms (such as the closer traffic inspection) active at all times. It is of paramount importance that cloud administrators are able to effectively manage the underlying technologies which enable configuration, elasticity, dynamic invocation of services, and monitoring activities, such that the security and resilience requirements of the critical service users and providers are met.

The set of policies defining the possible reconfiguration actions is not fixed, and different policies may be loaded or unloaded over time to reflect better resilience practices or a better understanding of the challenges. The expression of resilience mechanisms and strategies as policies allows new policies to be studied and analysed off-line before deployment, to determine correct interaction with existing policies.

1.1 Motivation

The motivation for the work presented here comes from previous work [SFSM11, SHS+11a] and from objectives of Work Packages 3 and 4 of the project, which state:

- To develop a suitable policy vocabulary, policy mechanisms, and a policy editor that supports the user in specifying his security requirements and in deriving the resulting, machine-enforceable security policy.
- To develop concepts for policy deployment and re-deployment to the cloud in a secure manner.
- Anomaly-based detection techniques to discover deviations in system and network behaviour. This is essential to disclose cyber-attacks and predict their impact.

Moreover, the high, stringent security requirements of critical infrastructures demand reliable detection of any deviation from normal behaviour. Anomaly-based detection techniques are used to detect such deviations in, e.g., system and network behaviour at run time. Deviations can, for example, be caused by cyber-attacks or component failures. However, in this work our focus is not on specific anomaly detection techniques, but on exploring the concept of resilience in clouds, i.e., timely (possibly on-line) detection and identification of anomalies, and remediation actions to ensure continuous operation. This analysis will help further research into more robust cloud management systems for critical infrastructure provisioning, and help identify how and where in cloud environments the components of resilience management interfaces can be operationally applied.

1.2 Objectives

The overall objectives of the project include the following, which are directly relevant for this deliverable:

1. Understand Cloud-computing risks.
2. Understand Cloud behaviour.
3. Establish best practices for secure Cloud service implementations.

From the Description of Work (DOW) for Work Package 4: for understanding and managing risks associated with cloud environments at operation time, the project will provide anomaly-based detection techniques to discover deviations in system behaviour. This is essential to disclose cyber-attacks and predict their impact. Figure 1 below shows the mapping of objectives to the proposed resilience management framework.

Figure 1: Objective mapping to CRMF. (The figure maps the project objectives (definition of legal guidelines; understand cloud behaviour; understand cloud-computing risks; establish best practices for secure cloud service implementations; demonstration in real use cases) and the Work Package 4 objectives (anomaly-based detection techniques for system and network behaviour; technologies for implementing policy enforcement; concepts for policy deployment and re-deployment) to the frameworks AEFCC (Anomaly Evaluation Framework for Cloud Computing), CRMF (Cloud Resilience Management Framework) and IND²UCE (Integrated Distributed Data Usage Control Enforcement), arranged as a detection and remediation feedback loop over the physical infrastructure.)

1.3 Purpose of the deliverable

This deliverable describes the Cloud Resilience Management Framework (CRMF). It introduces management interfaces, logical functions, and a description of the data flow of the various sub-components to provide complete end-to-end resilience. The framework supports security requirement specifications in terms of resilience policies and is composed of different components which take input from various monitoring and detection mechanisms to ensure that high-level security and resilience requirements are being met. The cloud resilience architecture we describe satisfies the roles of the inner-loop elements by realising them as software engines that together form a Cloud Resilience Manager.

This deliverable has the following structure. Background and related work are discussed in Section 2, with reference to the project architecture. Section 3 presents the resilience management framework, and Section 4 provides more specific detail of its components, focusing on online anomaly-detection techniques for providing end-to-end resilience. Section 5 presents a quantitative evaluation of anomaly detection, while Section 6 evaluates the framework qualitatively against project use cases. Section 7 concludes the deliverable.

1.4 Legal Issues

This brief section summarizes the legal issues that have been addressed in the preparation of this deliverable. It mainly focuses on the anomaly detection component, since the deployment function and the policy framework have no direct contact with user data.

Collection and storage of personal data: The anomaly-detection techniques described herein use data acquired by observing network traffic, including any sensitive data contained in headers or payload, even if those aspects are not needed.

Reason for collection: This data is collected for statistical analysis.

Minimization of quantity of personal data collected: Most information from any observed packet is immediately discarded (e.g., payload) or condensed to inferred information (e.g., packet size). The remaining information (such as IP addresses) is aggregated into entropy measurements, from which the original data cannot be recovered.

Prevention of unauthorized access: The collection and processing of data does not need to leave the physical network of a data centre.

Real-time tracing of data: There is no personal data to be traced.

Assured deletion of data: Most personal data is effectively deleted, masked or reduced at the point of collection, or kept within the private physical networks from where the data was collected. Derived data does not need to be retained beyond a time span of (say) 20 minutes. (This does not eliminate the possibility that organizational policies with a broader scope than anomaly detection would keep records for longer for administrative purposes such as forensics and logging.)

Configuration of non-functional requirements: Anomaly-detection systems could be misconfigured such that unprocessed records containing personal data are exported from the networks where they have been captured, and exposed to misuse. Care should be taken to be aware of this potential if requirements appear to dictate such unusual configurations. Ordinarily, there should be no reason not to eliminate personal data before it leaves the private network. The configuration of database servers holding personal data (for example) forms part of the deployment function; therefore, non-functional requirements, such as the location of such servers within specific countries, are included in the functionality of this function.

Tracing of failure: No personal data is exposed in a recoverable form.

Forensic value/credibility of data: No personal data is exposed in a recoverable form. However, there may be indirect actions related to anomaly detection that lead to other personal data being stored. For example, if an anomaly is detected, more detailed data might be collected for other purposes, e.g., to counter the anomaly or to log the issue. However, this is beyond the scope of this document.
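To make the minimization step above concrete, the following sketch shows how a set of observed addresses can be reduced to a single entropy value before anything leaves the collection point. It is an illustration only; the function and variable names are not taken from the project's implementation.

# Illustrative sketch: per-packet addresses are condensed to one aggregate
# entropy value, from which the original addresses cannot be recovered.
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Source IP addresses observed in one monitoring interval (synthetic data).
observed_src_ips = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"]

# Only the aggregate is kept; the addresses themselves can be discarded.
feature = shannon_entropy(observed_src_ips)
print(f"src-IP entropy for this interval: {feature:.3f} bits")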

2 Background

This section considers the positioning of the CRMF in an overall cloud architecture (2.1), defines resilience for networking (2.2) and for clouds (2.3), discusses the use of policies for resilience (2.4), and finally discusses related work in this area (2.5).

2.1 Project architecture

The consortium has developed an architectural framework for deploying critical infrastructure services in the cloud [BFH+14], which provides a basis for the development of technical solutions such as cloud resilience management. The reference architecture used in the project (Figure 2) divides activities in a cloud environment into four layers [sec13]. The term tenant refers to cloud customers who use virtual resources provided by the physical infrastructure provider. Tenants can use them to provide services to their own users; we will refer to users as those who use the services of tenants. In this project, we use the more specific term cloud infrastructure provider instead of cloud provider, because we are assuming that the latter offers Infrastructure as a Service (IaaS). Several tenants may host critical services on a single physical infrastructure, with varying security and resilience requirements. For example, not all components of a single tenant might have the same importance, and components of different tenants might be mixed and matched in the physical infrastructure. The security and resilience mechanisms offered by critical service providers need to ensure continuous system operation even in the presence of these challenges.

Figure 2: Architectural Framework (four abstraction levels: user level, critical infrastructure service level, tenant infrastructure level and physical cloud infrastructure level, with the corresponding stakeholders: CI service user, CI service provider, tenant infrastructure provider and cloud infrastructure provider).

Deployment function / Orchestration. With respect to security and resilience requirements, the deployment function, as part of the orchestration component (and a component of the CRMF), translates high-level service descriptions and Service Level Agreements (SLAs) into automatically deployable descriptions such as Heat templates in the OpenStack environment. Subsequently, the anomaly-detection engine of the CRMF uses input from network and system activity to detect challenges, which are then remediated using the policy engine. This activity is measurable at the physical (cloud-infrastructure provider) layer, which has physical networks and machines and an external view of system activity in VMs. Moreover, network activity can be measured in the tenant-infrastructure layer by observing traffic on virtual networks, which could be performed by the tenant by running anomaly detectors on VMs that have access to these networks. A mapping of components to the architecture can be seen in Figure 3 below.

Figure 3: Components mapping to architecture (the policy framework maps to the tenant infrastructure, and anomaly detection to the cloud infrastructure / data centre).

2.2 Network resilience

Network resilience [CMH+07] is a wide-ranging concern, and can be defined as the ability of a network to provide an acceptable level of service in light of various challenges [SHÇ+10]. A target level of service can be defined in a Service Level Agreement (SLA) using resilience metrics, such as performability and dependability metrics.

The challenges to normal operation considered under the umbrella of network resilience transcend those typically considered within any one discipline, and include attacks by malicious adversaries, large-scale disasters, software and hardware faults, environmental concerns and human mistakes. Based on the work in [SHÇ+10], the ResumeNet project devised a framework whereby a number of resilience principles are defined, including the resilience strategy D²R²+DR: Defend, Detect, Remediate, Recover, Diagnose and Refine, which is outlined in Figure 4.

Figure 4: The D²R²+DR resilience strategy.

At its core is a control loop comprising a number of phases that realise the real-time aspect of the D²R²+DR strategy and consequently implement cloud resilience. Based on the resilience control loop, the other necessary elements of our framework are derived, namely the deployment function, anomaly detection and the policy engine, which aim to build situational awareness, and multilevel information sharing and control mechanisms. Under the D²R²+DR framework, there must exist components capable of reconfiguring the underlying infrastructure in response to challenges using policies (Fig. 5). Reconfiguration need not apply to the same components on which the detection was based. A policy engine is responsible for mapping detection events to reconfigurations, accepting a resilience strategy as a collection of policies [SHS+11a, SHS+11b]. We can apply D²R² to the architectural framework (described in 2.1) to provide a resilience view of it (Fig. 6). At the physical layer, the cloud-infrastructure operator has access to the physical nodes and network, which can be monitored to inform the detection process. The operator can also reconfigure these devices in response to detected challenges, using policies. Cloud-infrastructure D²R² may exist as monitoring and reconfiguration points on physical hosts and networks, and on some virtual components.

Figure 5: D²R² components.

Figure 6: A Resilience-oriented View.

Resilience managers and detectors need not exist on any physical equipment used directly to provide virtual resources to the layer above. At the tenant-infrastructure layer, the tenant has access to VMs, and possibly to virtual taps on virtual networks, which can inform detection. In response to challenges, the tenant may reconfigure the hosted machines, and some functionality of the virtual networks might also be exposed. Thus, tenant-infrastructure D²R² is spread across components visible to this layer. Within the inner D²R² loop, some interaction between these layers may exist in the form of events and reconfigurability exposed by the lower layer. For details on the policy and resilience viewpoints, we refer the reader to the architecture white paper [sec13].

2.3 Resilience in the cloud

We reinterpret and define cloud resilience as the ability to maintain an acceptable level of system operation and service even in the presence of challenges. Resilience is already supposed to be a fundamental property of Cloud service provisioning platforms. However, a number of significant outages have occurred that demonstrate Cloud services are not as resilient as one would hope, particularly for providing critical infrastructure services [Nea11]. In a number of cases, these outages are caused by problems in the network infrastructure. As a recent report on the security and resilience of Governmental Clouds by ENISA (the European Union Agency for Network and Information Security) suggests: the availability of a cloud service is often dependent on the network used to access it... and measures should be taken to ensure the resilience of access networks [Cat11]. To the best of our knowledge, there is limited or no understanding of how to provide a resilient Cloud infrastructure that can collectively address challenges in a coordinated manner; at present, these are treated as separate concerns.

In general, one of the benefits of deploying services in the Cloud is the relative simplicity gained by obviating the direct purchase and maintenance of physical infrastructure. However, this comes at the cost of having to rely on commercial off-the-shelf (COTS) components and losing control over parts of the service deployment process. Meanwhile, economies of scale brought about by a unified service offering built on top of a virtualised infrastructure enable Cloud service providers to exist and make a profit. However, the deployment of critical services in the Cloud potentially breaks this economic model by requiring additional configurations to address security and resilience concerns related to a particular organisation and service. To help mitigate this problem, we propose to extend the concept of resilience patterns [SFSM+12] for critical service provisioning in the Cloud. Resilience patterns are specifications of reusable strategies, or configurations, for commonly understood challenges; e.g., they specify the relationship between various detection and remediation mechanisms in order to mitigate a challenge. A pattern can be evaluated using off-line tools, e.g., a simulator, for its suitability to mitigate a particular challenge, such as a DDoS attack, and subsequently deployed and configured for a specific deployment at run-time. Specifying patterns and sharing those that are known to be effective can reduce the overhead of addressing security and resilience concerns, maintaining one of the benefits of utilising the Cloud, i.e., that of reduced deployment costs and complexity.
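As an illustration of the idea, a resilience pattern could be captured as a small, declarative structure that links a challenge to detection and remediation mechanisms and can be evaluated off-line before deployment. The sketch below is an assumption about what such a specification might contain; the field names and values are not a format defined by the project.

# Illustrative sketch: a resilience pattern as a reusable, declarative mapping
# from a commonly understood challenge to detection and remediation mechanisms.
ddos_pattern = {
    "challenge": "volumetric-ddos",
    "detection": [
        {"mechanism": "network-anomaly-detection",
         "metric": "dst-ip-entropy", "interval_s": 60},
    ],
    "remediation": [
        {"action": "rate-limit", "target": "ingress-router"},
        {"action": "scale-out", "target": "affected-service", "max_instances": 4},
    ],
    "evaluation": {"tool": "offline-simulation"},
}

def remediations_for(challenge, patterns):
    """Return the remediation steps of every pattern matching `challenge`."""
    return [step for p in patterns if p["challenge"] == challenge
            for step in p["remediation"]]

print(remediations_for("volumetric-ddos", [ddos_pattern]))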

2.4 Policy-based Resilience Management

Management and resilience of cloud environments are closely linked. The extra layer of resource virtualization makes it difficult to plan effective management, due to varying user demands, co-hosted VMs and the arbitrary deployment of multiple applications. Generally, management policies are used to govern the behaviour of a system. Management policies can be viewed as constraints and preferences on the state, or the state transitions, of a system, and as a guide to achieving the overall objective, which is itself represented by a desired system state [Goh97]. When using policy-based management, it is critical that the rules being specified actually stem from the higher-level requirements and that they are implementable.

Challenges to the operation of a Cloud infrastructure can occur rapidly and with little warning, requiring a fast response in order to maintain acceptable service levels. In order to mitigate a challenge, complex multi-phase strategies are required, which combine various monitoring and detection mechanisms that influence the behaviour of remediation mechanisms. To address these issues, the project advocates the use of policy-based network management techniques for the configuration of resilience strategies. These techniques allow descriptions of real-time adaptation strategies which are separate from the implementation of the mechanisms that realise the strategy. This separation allows changes to be made to strategies without the need to take resilience mechanisms off-line. Two forms of policy are supported: authorisation (or access-control) policies and obligation policies; both are supported by IND²UCE, the policy environment used by our resilience framework.

Obligation policies specify a management operation that must be performed when a particular event occurs, given that some supplementary conditions hold. These policies follow the Event-Condition-Action (ECA) paradigm and are of the form:

on <event> if <conditions> do <target> <action>;

Therefore, the occurrence of the specified event is a necessary condition for the mandated operation to be performed. The event is a term of the form e(a1, ..., an), where e is the name of the event and a1, ..., an are the names of its attributes. The condition is a boolean expression that may check local properties of the nodes and the attributes of the event. The target is the name of a role (i.e., a placeholder) where the action will be executed, and so the service or resource assigned to the target role must support an implementation of the action. The action is a term of the form a(a1, ..., am), where a is the name of the action and a1, ..., am are the names of its attributes. To simplify notation, an obligation policy can have a list of target-action pairs, all evaluated when the event occurs and the condition holds. The attributes of an event may be used for evaluating the condition (to decide whether to invoke the action or not), or they may be passed as arguments to the action itself. Implicitly, the role to which the obligation policy belongs is the subject of the obligation, i.e., the entity enforcing the policy, and the action is invoked on a target role. Note that the target may be the same as the subject, i.e., a role may perform actions on itself.
Obligations can also be used to load other policies (obligations or authorisations) into the system, or existing policies may be enabled or disabled to change the management strategy at run-time.
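A minimal sketch of how an obligation policy of the on/if/do form above might be evaluated is given below. The event, role and action names are hypothetical and serve only to illustrate the ECA semantics described in this section.

# Minimal sketch of evaluating obligation policies of the form
# "on <event> if <conditions> do <target> <action>". All names are illustrative.

def make_policy(event_name, condition, target_actions):
    """condition: callable over the event's attributes; target_actions:
    list of (target_role, action) pairs, where action(service, attrs)."""
    return {"event": event_name, "condition": condition,
            "target_actions": target_actions}

def handle_event(event, policies, roles):
    """Invoke the actions of every policy whose event matches and whose
    condition holds; `roles` maps role names to concrete services."""
    for p in policies:
        if p["event"] == event["name"] and p["condition"](event["attrs"]):
            for target, action in p["target_actions"]:
                action(roles[target], event["attrs"])

class CloudManagerStub:
    """Stand-in for the service assigned to the target role."""
    def migrate(self, vm_id, to):
        print(f"migrating {vm_id} to {to}")

# on anomaly_alert if severity == "high" do cloud_manager migrate(vm, sandbox)
policy = make_policy(
    "anomaly_alert",
    condition=lambda attrs: attrs["severity"] == "high",
    target_actions=[("cloud_manager",
                     lambda svc, attrs: svc.migrate(attrs["vm_id"],
                                                    to="sandbox-host"))])

handle_event({"name": "anomaly_alert",
              "attrs": {"severity": "high", "vm_id": "vm-42"}},
             [policy], {"cloud_manager": CloudManagerStub()})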

Authorisation policies specify what actions a subject is allowed (positive authorisation) or forbidden (negative authorisation) to invoke on a target. The subject and the target are role names. The action and the condition are defined as in obligations. Authorisation decisions could be made by one or more specific roles in the network, but implementations are commonly based on the target making decisions and enforcing the policy, as it is assumed that target roles wish to protect the resources they provide to the network.

auth[+/-] <subject> if <condition> then <target> <action>;

2.5 Related Work

The need for real-time operation and the more dynamic nature of cloud infrastructures make the task of defining a resilience framework very challenging. Policy-based management has proven to be very effective for complex system management, as is evident in the literature. The policy-based approach to network and system management proposed in [MAP96, HAW96] defines a framework for the management of policies, policy hierarchies and policy transformation. In [RPB93], the authors propose to enrich managed objects with policy goals as required by the management policy. Policies are described in two parts: an active part, containing application-specific functionality, and a passive part, which can be re-used without any change. In [KLM+99], the authors enforce policies by means of rules, but their understanding of a rule is more restrictive. The works mentioned above address the specification and implementation of policies and, more recently, SLAs; little focus has been given to the refinement of high-level requirements into low-level policies. Verma presented an approach to policy translation that is based on a set of tables [Ver00]. The tables identify the relationships between users, applications, servers, routers and the classes of service supported by the network. Whilst this technique offers the advantage of being fully automated, it is an inflexible approach, only supporting a very specific type of high-level SLA policy and low-level device configuration policy. The work in [CMBG00] outlines a policy authoring environment that provides a policy tool, called POWER, for refining policies. A domain expert first develops a set of policy templates, expressed as Prolog programs, and the policy authoring tools have an integrated inference engine that interprets these programs to guide the user in selecting the appropriate elements from the management information model to be included in the final policy. The main limitation of this approach is the absence of any analysis capabilities to evaluate the consistency of the refined policies. Similarly, the work presented in [BCV04] allows translation of service-level objectives into configuration parameters of a managed system. The transformation engine takes the service requirements of the user as input, and searches a database to determine the optimal parameter values that provide the required level of service. Limitations of this technique include its dependence on a sufficiently rich database, which is only possible by observing the system for some period of time, and its inability to deal with situations where a given requirement specification results in different configurations. Several relevant projects are highlighted below.

1. ResumeNet [res] defines a multi-level, systematic framework for network resilience.

2. TClouds [tcl] was an EU FP7 project aimed at developing a cloud infrastructure that achieves security, privacy and resilience. Its objectives include identifying and addressing legal and business issues, defining a security architecture for the cloud, and providing resilient middleware for adaptive security on the cloud-of-clouds.

3. PRECYSE [pre] is an EU FP7 project. The strategic goal of PRECYSE is to define, develop and validate a methodology, an architecture and a set of technologies and tools to improve by design the security, reliability and resilience of the information and communication technology (ICT) systems that support critical infrastructures (CIs).

4. The Cloud Controls Matrix (CCM) [Tea10] is specifically designed to provide fundamental security principles to guide cloud vendors and to assist prospective cloud customers in assessing the overall security risk of a cloud provider.

5. OrBAC [KBB+03] was developed inside the RNRT MP6 project (communication and information system models and security policies of healthcare and social matters). The purpose of this project is to define a conceptual and industrial framework to meet the needs of information security and sensitive healthcare communications.

ResumeNet provides blueprints and design guidelines for our cloud resilience management framework. The resilience strategy proposed in ResumeNet is validated by detailing guidelines which can be applied to the problem of channel interference in wireless mesh networks, and by exploring the implications of multi-staged and collaborative detection.

The TClouds project targets cloud computing security and the minimization of widespread concerns about the security of personal data, by focusing on privacy protection in cross-border infrastructures and on ensuring resilience against failures and attacks. They have published work about an advanced cloud infrastructure that can deliver computing and storage achieving a new level of security, privacy and resilience. Their demonstration focus is on socially significant application areas such as energy and healthcare.

OrBAC provides a well-defined access control policy model, which can be integrated into the CRMF and would enable fine-grained access control of resources. The OrBAC API has been created to help software developers introduce security mechanisms into their software. This API implements the OrBAC model, which is used to specify security policies, and also implements the AdOrBAC model [CM03], which is used to manage the administration of the security policies. The MotOrBAC [CCBC06] tool has been developed using this API to edit and manage OrBAC security policies. OrBAC has only been realized on homogeneous systems (such as firewalls) or at the software level.

The CSA (Cloud Security Alliance) CCM (Cloud Controls Matrix) provides a controls framework that gives a detailed understanding of security concepts and principles aligned to the Cloud Security Alliance guidance in 13 domains. The foundations of the Cloud Security Alliance Controls Matrix rest on its customized relationship to other industry-accepted security standards, regulations and controls frameworks, such as ISO 27001/27002, ISACA COBIT, PCI and NIST, and it will be augmented to provide internal control direction for SAS 70 attestations provided by cloud providers. This controls framework could serve as the backbone for evaluating the security levels of the CRMF.

3 Cloud Resilience Management Framework (CRMF)

The Cloud Resilience Management Framework defines several automatic functions of an administrative cloud domain, at various levels of the architecture, and how they interact to make the domain resilient to unplanned challenges. It also describes how functions in separate domains can collaborate in dealing with challenges. Implementations of CRMF functions are additional units to be closely integrated with an existing cloud management framework. Conceptually, resilience management can be instantiated in any virtual instance or level of an infrastructure service provider, addressing management in both single- and cross-domain cases, given that the infrastructure service provider implements it according to its specific management purposes and with respect to the available infrastructure.

Figure 7: Collaborative interface for resilience management (the CRMF sits alongside an existing management framework and the underlying compute, network and storage resources, and exposes a collaborative interface to other domains).

Information exchange between different administrative domains can be performed through the collaborative interface shown in Figure 7, possibly with the aid of the Coordination and Organization Engine (COE). Typically, the kinds of information available through these interfaces are detection metrics, performance measurements, configurations, enforcement actions, etc.

3.1 Architecture

The CRMF consists of two main functions, the Deployment Function (DF) and the Resilience Manager (RM).

The Deployment Function is concerned with supplying the Virtual or Cloud Infrastructure Manager with configurations describing the creation and deployment of virtual machines for instantiating a service. On the one hand, the DF acts as a controller for RMs by placing virtual machines in parts of the infrastructure supervised by an RM instance; it can also provide input to the RMs in terms of the types of instances started and their respective locations, which may help detection mechanisms to work. On the other hand, it directly creates and places instances according to the resilience requirements of the service. If high-availability service components needing (for example) backup VMs are defined, it deploys redundant instances for higher availability and places them so that the given levels of availability of the service components are met. Details of the deployment function are presented in Section 3.2 below.

Each Resilience Manager (RM) is composed of three software components, or engines, which are shown in Figure 8. (A in the figure represents a single hardware node in the cloud, and B represents the RM, with the aforementioned software engines residing within it.) For simplicity, only three nodes are shown, and the network links that connect them have been omitted. The anomalies of particular interest are those caused by activity associated with anomalous behaviour in the cloud, such as malware on cloud VMs or a denial-of-service attack on a host.

Figure 8: An overview of the CRMF architecture (legend: DF, deployment function; ADE, anomaly detection engine; PE, policy engine; COE, coordination and organization engine).

The software components within each RM are: the Anomaly Detection Engine (ADE), the Policy Engine (PE) and the Coordination and Organization Engine (COE). The RM on each node performs local detection based on features gathered, using a sub-component called the Data Collection Engine (DCE), from its node's VMs and from its local network view; these are handled by the System Analysis Engine (SAE) and Network Analysis Engine (NAE) components respectively. The PE component is in charge of remediation and recovery actions based on the output from the anomaly detection engine (ADE), which is conveyed to it by the COE. Finally, the COE component coordinates and disseminates information between other instances and the components within its own node. It is ultimately in charge of maintaining the connections to its RM's peers and embodies the self-organizing aspect of the overall system. In addition to physical node-level resilience, the CRMF is capable of gathering and analysing data at the network component level through the deployment of network RMs, as shown by C in Figure 8.

This component is deployed on an ingress/egress router for the cloud (D in the figure); the DCE is tightly coupled with the router/SDN and as such can gather features from all traffic passing through it. Another important component of the RM is the Coordination and Organization Engine (COE), since it performs a critical role in the maintenance of the overall system. The COE acts autonomously to control the other components within its RM while at the same time communicating with its RM peers. Figure 9 shows the relationship between the RM engines and the coordinating role that the COE plays. The dashed arrow in Figure 9 indicates communication between the local COE and the remote COEs of each peer RM.

Figure 9: The flow of information and hierarchy of engines within a CRMF (legend: DF, deployment function; ADE, anomaly detection engine; DCE, data collection engine; NAE, network analysis engine; SAE, system analysis engine; PE, policy engine; COE, coordination and organization engine).

Self-organization in a system of RMs is achieved through the dissemination and exchange of meaningful information with respect to the system and network activities of each VM. In practice, and as depicted in Figure 8, there are various system/network interfaces that act as information dispatch points in order to allow efficient event dissemination [MHP10]. Here we highlight some design requirements of our resilience framework. More detail about the individual components follows in subsequent sections.

Design requirements

The following requirements of the CRMF have been identified:

- The framework shall be reactive and pro-active to challenges for fast detection and identification, preferably before challenges cause noticeable degradation.
- The framework shall have a modular design that allows re-use of resilience strategies. This is vital for handling failure of surrounding components.
- The components of the framework shall easily integrate with each other to allow complete flexibility and adaptability.
- The framework should be easy to modify or integrate according to security and resilience requirements.
- The framework shall easily integrate with existing cloud management systems.
- The framework shall provide functionality for efficient and effective management of resilience where needed.
- The framework shall use policies to realize overall resilience.
- The framework shall be composed of various distributed components to provide overall resilience, which is desirable in the case of failure of individual components.

3.2 Deployment function

The deployment function concerns itself with supplying the Virtual or Cloud Infrastructure Manager with configurations describing the creation and deployment of virtual machines for the critical infrastructure. Its task is to translate high-level service descriptions and SLAs into automatically deployable descriptions, such as Heat templates in the OpenStack environment. It is therefore part of a Tenant Infrastructure Management System (TIMS), being located between the cloud user and the Cloud Infrastructure Operator, cf. Figure 3. While orchestration, of which the deployment function is a part, generally concerns itself with lifecycle management, configuration and resource assignment, these issues are not the focus of the particular Critical Infrastructure use cases relevant in this project. On the other hand, with respect to the project objectives and use cases, the inclusion of resilience patterns and mechanisms is of particular interest, since these support high availability of the infrastructure.

Resilience is currently handled by cloud infrastructure providers mainly by offering availability zones (or similar concepts). An availability zone denotes a collection of resources that is not included in other availability zones, thus ensuring isolation of parts of the infrastructure if instantiated in different zones. Thus, components that need a higher level of availability are recommended to be started redundantly in separate availability zones, which are promised to be fault-independent and to have a short interconnection delay (at least in the Amazon EC2 framework). Neither the resulting availability nor the actual delay (or bandwidth) is necessarily given in the form of an SLA. Comparing this with the higher availability and synchronization requirements of critical infrastructure shows the need for enhancing this area of service deployment. In particular, a deployment function is needed that places instances of virtual resources according to the resilience and performance requirements of the service user. However, this should be done without necessarily disclosing the detailed infrastructure of the cloud architecture, since this may be information an infrastructure provider is not willing to share.
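The following sketch illustrates one way such a placement decision could honour redundancy and anti-affinity by never putting two replicas of the same component into the same availability zone. Zone and component names are invented for illustration; this is not the project's placement algorithm.

# Illustrative sketch of a placement decision honouring redundancy and
# anti-affinity: replicas of the same component never share an availability zone.
from itertools import cycle

def place(components, zones):
    """components: {name: replica_count}; returns {(name, replica): zone}.
    Raises if a component needs more replicas than there are zones."""
    placement = {}
    for name, replicas in components.items():
        if replicas > len(zones):
            raise ValueError(f"{name}: not enough zones for anti-affinity")
        zone_iter = cycle(zones)
        for r in range(replicas):
            placement[(name, r)] = next(zone_iter)
    return placement

print(place({"scada-frontend": 2, "historian-db": 2}, ["az-1", "az-2", "az-3"]))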

Thus, so as not to discourage cloud providers from offering such a service, the deployment function needs to provide information at a relatively abstract level. However, the deployment should be documented and correlated to the infrastructure in order to provide another piece of the audit-trail puzzle. In addition, the placement of virtual machines can be done in cooperation with the anomaly detection component, towards at least two ends: (a) to place critical instances in AD-supervised parts of the infrastructure, and (b) to provide input to the AD in terms of the types of instances started and their respective locations, which may make it possible to infer standard traffic and behavioural patterns of and between these instances. The aforementioned functionality can be separated into the following sub-components.

SLA parser

This sub-component is responsible for parsing a user's description of their service, i.e., the needed service components and their relation/connectivity, as well as the required performance and availability.

Placement function

The requirements extracted by the SLA parser translate into a set of virtual machines that need to be deployed in the physical infrastructure. The placement function maps these instances to the offered infrastructure of the cloud infrastructure provider.

Deployment template generator

Finally, the mapping must be provided in a form that is processable by the cloud management system. This means generating a deployment description, e.g., a Heat template, using as input the placement and the additional service description parts generated by the SLA parser.
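As an illustration of what the deployment template generator might emit, the sketch below assembles a minimal Heat template from a placement decision such as the one above. The image, flavor and availability-zone values are placeholders, and the template is deliberately reduced to bare server resources; it is not the project's actual template format.

# Illustrative sketch: assembling a minimal OpenStack Heat template from a
# placement decision. Image, flavor and zone values are placeholders.
import yaml  # PyYAML

def heat_template(placement, image="ubuntu-14.04", flavor="m1.small"):
    resources = {}
    for (component, replica), zone in placement.items():
        key = f"{component}_{replica}".replace("-", "_")
        resources[key] = {
            "type": "OS::Nova::Server",
            "properties": {"name": f"{component}-{replica}",
                           "image": image,
                           "flavor": flavor,
                           "availability_zone": zone}}
    return {"heat_template_version": "2013-05-23", "resources": resources}

template = heat_template({("scada-frontend", 0): "az-1",
                          ("scada-frontend", 1): "az-2"})
print(yaml.safe_dump(template, default_flow_style=False))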

3.3 Anomaly detection

The anomaly detection engine is one of the core components of the CRMF and comprises three sub-components: the Data Collection Engine (DCE), the System Analysis Engine (SAE) and the Network Analysis Engine (NAE). The DCE monitors each VM and host and collects various metrics in order to produce feature vectors for the subsequent detection engines (i.e., the NAE and SAE). The system- and network-level engines identify potential symptoms of anomalous behaviour without performing much computation. They do not confirm the presence of anomalies, but merely suggest the possibility of one; further analysis is needed to confirm whether an anomaly is present and, if so, to classify it. The initial event is generated by looking at each metric for the system and network levels in isolation. Hence, a generated event indicates that either a host or a VM is experiencing an anomaly. A broader correlation between metrics can be very expensive, as the number of such comparisons grows exponentially with the number of VMs and monitored metrics. Fine Grain Analysis (FGA) is then required, which uses statistical algorithms to identify the cause of anomalies and to localize them. This analysis of the monitored data is required to further understand anomalous behaviour and to narrow down the scope of remediation. It can also generate an event in case the anomaly cannot be classified, for which manual intervention is required. Below we give a brief overview of the sub-components.

Data collection engine

An essential question for resilience management is what information should be provided and where the information should come from. Having identified information which might affect the decision on remediation, this can serve as the basis for subsequently reasoning about effective resilience actions. Therefore, the DCE is designed in such a way that it can collect and process various relevant metrics pertaining to the system (such as CPU, memory, etc.) and the network (such as number of packets, number of bytes and throughput) for every VM and physical host. All metrics are collected at periodic intervals with a configurable monitoring-interval parameter. The DCE has sub-components which perform normalization and smoothing of data and produce feature vectors; a feature vector is a sequence of values over a fixed interval of time and forms the basic input into the subsequent detection engines (i.e., the NAE and SAE).

Network analysis engine

The purpose of the NAE is to detect anomalous traffic at the physical node level of the cloud. This is achieved by modelling normal traffic patterns and identifying anomalies through online/offline monitoring of traffic on the network interfaces of the cloud node, with the aid of the DCE. The NAE provides reference implementations of different anomaly detection techniques for offline and online analysis.

System analysis engine

The System Analysis Engine (SAE) is designed to detect anomalies through the observation of VM properties. The SAE builds a model of normal VM operation and detects deviations from the normal through a selected anomaly detection technique (a minimal sketch of this idea follows at the end of this section).

Resilience policies (RP)

Policies are rules which govern the choices in behaviour of the cloud system. These policies follow the Event-Condition-Action (ECA) paradigm and take the form of certain reconfigurations and actions based on the output of both the SAE and the NAE. The output depends on the detection algorithm in use by the components, but is ultimately an indication of the current health of the VMs and of the host on which a VM is residing. The output could be as simple as an alert, or a boolean indicating whether the VM or host is behaving anomalously or not. Once such an alert is received, initial resilience policies (actions, such as sandboxing the VM, lowering its bandwidth, etc.) are instantiated, while in parallel Fine Grain Analysis (FGA) performs more rigorous analysis to classify and localize the anomalies, hence narrowing down the scope of remediation for more fine-grained actions.
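Section 4 and the quantitative evaluation later in this deliverable use a one-class SVM for the SAE. The sketch below shows, under illustrative features and parameters, how such a model could be trained on feature vectors of normal VM behaviour and produce the simple boolean alert that the resilience policies consume; it is not the project's implementation.

# Minimal sketch: a one-class SVM trained on feature vectors of normal VM
# behaviour, emitting the boolean "anomalous?" alert consumed by the policies.
import numpy as np
from sklearn.svm import OneClassSVM

# Each row: one monitoring interval, e.g. [cpu_util, mem_util, pkts/s, bytes/s].
normal_intervals = np.array([[0.20, 0.35, 120, 9.5e4],
                             [0.22, 0.36, 130, 1.0e5],
                             [0.19, 0.34, 110, 9.0e4]])

model = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(normal_intervals)

def is_anomalous(feature_vector):
    """True if the interval deviates from the learned model of normal."""
    return model.predict([feature_vector])[0] == -1  # -1 means outlier

print(is_anomalous([0.95, 0.90, 9000, 8.0e6]))  # likely True: resembles a DoS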

3.4 Policy Engine

In order to control resilience behaviour and to manage cloud environments, a policy-based approach has been chosen. Policies are enforced by IND²UCE, a framework originally developed for the enforcement of usage control policies. IND²UCE supports context-aware policies and freely configurable monitoring and enforcement components.

IND²UCE Framework

The policy enforcement framework IND²UCE consists of several components and is illustrated in Figure 10. This is a conceptual view of the infrastructure, as the components do not necessarily need to be on one system and not all components need to be instantiated. In the following, a short description of the IND²UCE components is given, followed by an overview of the generic interaction behaviour of the framework. Finally, the components are described in detail.

Figure 10: IND²UCE Framework.

Policy Administration Point (PAP): The PAP is a user interface for the specification of policies and for interacting with the Policy Management Point to manage existing and new policies.

Policy Management Point (PMP): The PMP is the main management point in the framework, responsible for storing policies in the PRP as well as for deploying or revoking policies in PDPs. Additionally, the PMP provides registration and query capabilities for other framework components, providing a dynamic enforcement infrastructure.

Policy Retrieval Point (PRP): The PRP is a secure store for specified policies and can be queried by the PDP for a policy deployment request.

Policy Enforcement Point (PEP): The technology-dependent PEP intercepts system events and provides a common event representation to the PDP for policy evaluation. Finally, the decision drawn by the PDP with respect to the installed policies is enforced.

Policy Decision Point (PDP): The PDP is the reasoning component in the policy enforcement framework. Policies installed by the PMP are instantiated and evaluated upon event notification by PEPs. Depending on the policies, additional information may be required from Policy Information Points and additional actions may be executed using Policy Execution Points.

Policy Information Point (PIP): The PIP is an independent component in the policy enforcement framework responsible for resolving additional attribute values. Among other things, this includes data-flow information between several layers of abstraction, and context information, e.g., of a mobile device or of the cloud environment.

Policy Execution Point (PXP): The PXP is a component for executing concrete actions such as notifying a user, writing a log entry or sending a mail. The execution is triggered by the PDP based on the installed policies.

Data Usage Control Policies

We distinguish between preventive and detective policies. The difference between these two types of policies lies in the result of applying them and in the level of control required over the system to enforce them. Detective policies report a policy violation, which can result in some compensation or correction of the action. In the broadest sense, the correction can be punishment by a court, which uses detected and reported policy violations in its line of argument. Preventive mechanisms, in contrast, ensure that an undesired action does not take place. The following example clarifies the difference: patient data is stored in a central storage system which has authentication, authorization and accounting mechanisms implemented. If an unauthorized doctor accesses the data and this action is blocked by the system, then preventive enforcement has been applied. If he can access the file and the violation is reported to the responsible person, then detective enforcement has been applied.

In general, preventive approaches require more control and possibly imply larger performance penalties, because every action needs to be stopped, transformed by the PEP and checked by the PDP. This results in some delaying overhead. The detective approach requires less control and imposes less performance overhead. However, it limits the policy to compensation or penalties in case of an observed violation. Finally, in some cases preventive policies cannot be enforced due to lack of support in the target systems, and the best resort is to observe and react in case violations happen.
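The difference between the two enforcement styles can be illustrated with a small sketch for the example policy that no personal data may leave the system; the message format and the check are invented for illustration and are not part of IND²UCE.

# Illustrative sketch: preventive enforcement blocks the message before it
# leaves the system; detective enforcement lets it pass but reports the
# violation for later compensation. Field names are examples only.

def contains_personal_data(message):
    return bool(message.get("patient_id"))

def preventive_pep(message):
    if contains_personal_data(message):
        return None                      # enforcement by inhibition: drop it
    return message                       # allow (white-listing default)

def detective_pep(message, report):
    if contains_personal_data(message):
        report(f"policy violation: personal data in message {message['id']}")
    return message                       # message is delivered regardless

msg = {"id": 7, "patient_id": "P-123", "body": "lab results"}
print(preventive_pep(msg))               # None: message blocked
detective_pep(msg, report=print)         # delivered, but violation reported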

We distinguish four enforcement mechanisms, which are explained with a short example. For simplicity, we will assume a message that is checked by a firewall-like component enforcing the policy "no personal data is allowed to leave the system".

- Enforcement by inhibition: The message is blocked or dropped.
- Enforcement by modification: All message fields related to personal data are modified (e.g., censoring by inserting blanks).
- Enforcement by delay: The message is delayed until a compensation action, for instance, has taken place.
- Enforcement by execution: An additional action is executed, such as sending a notification message to the system administrator or deleting the data source.

Enforcement by execution is the only possibility for detective enforcement, while inhibition, modification and delay are exclusively types of enforcement performed by preventive mechanisms. A fifth possibility is to simply allow the message; this is usually the standard case and is similar to the definition of firewall rules with a white-listing approach. In this case, the enforcement allow is used to permit the message exchange.

IND²UCE usage control policies are specified in a concrete XML syntax following an XSD schema. In general, a policy consists of one or more enforcement mechanisms. These enforcement mechanisms refer to the preventive and detective mechanisms introduced above and describe what may (or may not) happen to the affected data item. They are specified in the form of Event-Condition-Action (ECA) rules: if a system event E is detected and allowing its execution satisfies condition C, then action A should be performed. An example of such a policy is given in Listing 1.

<policy name="acp2">
  <preventivemechanism name="acpfire">
    <description>enocean DoorLocking Trigger</description>
    <timestep amount="3" unit="seconds" />
    <trigger action="smartintegomessage" istry="false">
      <parammatch name="msgtype" value="14" />
    </trigger>
    <condition>
      <within amount="60" unit="seconds">
        <eventmatch action="enoceantelegram" istry="false">
          <parammatch name="id" value="180080e" />
        </eventmatch>
      </within>
    </condition>
    <authorizationaction name="default">
      <allow>
        <executeaction name="smartintegoaction">
          <parameter name="action" value="shorttermactivation"/>
        </executeaction>
      </allow>
    </authorizationaction>
  </preventivemechanism>
</policy>

Listing 1: Example Data Usage Control Policy

Each mechanism consists of three main parts following the ECA principle. Additionally, it has an optional description and a timestep attribute, which is required for temporal evaluations. The event, i.e. the trigger event, is a concrete action executed in the system under observation. Once the trigger event is intercepted by a policy enforcement point, it is forwarded to the PDP and the specified condition is evaluated. Distinguishing between preventive and detective mechanisms, the preventive ones additionally contain an authorization action part describing the decision to be taken if the condition evaluates to true. This decision may be to allow the intercepted event, potentially with one or more modifications of event parameters, to delay its execution, or to inhibit it completely. Hence, for preventive mechanisms the specification of a trigger event, i.e., the action that potentially has to be inhibited, is mandatory, whereas for detective mechanisms it is optional, because here only compensating (i.e., additional execute) actions may be executed after the occurrence of a specific situation. Such a situation can be the interception of a system event and/or be specified as a condition. Detective mechanisms can be activated by an event or a time trigger; preventive mechanisms can only be activated by an event trigger.

Policy Engine in CRMF

The IND²UCE policy engine is based on the ECA principle. In the CRMF context, events from the anomaly detection (AD) or from the fine-grain analysis (FGA) are used as trigger events for the policy engine, which can then perform remediation actions. Depending on the deployed policy mechanisms, remediation actions can be performed by PXPs in the system, such as migrating a virtual machine to a dedicated host in the cloud environment (sandboxing) or starting another instance of the virtual machine to compensate for overload.

4 CRMF component detail

Based on the presented CRMF design and the D²R²+DR resilience strategy described in Section 1, we now illustrate how a cloud infrastructure could be enhanced to cope with challenges. The overall strategy we adopt to mitigate challenges is depicted in Figure 11. A more detailed description of the CRMF components follows.

Figure 11: CRMF system architecture

4.1 Deployment function

As described in Section 3, the deployment function component can be logically separated into sub-components (cf. Figure 12), each with a specific task. We describe these sub-components in the following.

Figure 12: Internal structure of the deployment function

4.1.1 SLA parser

A first sub-component is a parser for the input coming from the cloud user or the service operator. It needs to analyse the service description and extract the relevant information for further processing by the other sub-components. Ideally, the input parameters are formatted in a way that allows for automatic analysis and translation into input for the virtual infrastructure manager and orchestrator.

In [BSRS13], a YAML-based format was proposed, but no full specification of such a format exists yet. Therefore, we describe in the following the general parts of such an input file and outline which parameters the deployment function should be able to handle and translate into a resource request and VM deployment. It should be noted that it will not be possible to implement all of these within the scope of this project (also due to the application dependency of an implementation).

Nevertheless, the list should outline some basic input classes and parameters that the deployment function should ideally be able to translate into a deployment template, cf. Figure 13.

Figure 13: Parameter Classes

Non-functional parameters

This includes geo-location constraints, e.g., due to legal requirements on the storage of personal data. Storage nodes can then only be placed in zones that have the corresponding location characteristic. It also includes the placement of virtualized functions and components under the supervision of anomaly detection (and possibly the Transparency Enhancement Framework, for which the need was identified in D5.1 [BSH+13] and which is developed in WP5 and will be described in the forthcoming D5.3). This can be generalised to other security and management features the infrastructure might offer. While these components are functionality offered by the cloud infrastructure, they are not functional in the scope of the deployed services.

Resilience parameters

The protection model describes which resources and which configuration should be used to increase resilience. Examples are 1:1 or N:M protection with dedicated backup instances not normally used for production traffic, or active-active protection where all instances handle user load but can cope with the failure of instances by redirecting their load to the remaining instances. Related to the protection model is the required availability of the resilient components. This requires placing the protection instances in different availability zones in order to cope with the failure of parts of the infrastructure. Depending on the required level of availability, this means placing instances at a certain (topological) distance from each other, in contrast to the next class of requirements.

Particularly important here is the description of the synchronisation requirements, i.e., the (networking) resources necessary to interconnect and synchronise the instances belonging to a protection scheme. This mainly means the maximum acceptable delay and the minimum acceptable bandwidth for a connection between these instances. Since these connections are vital to keeping fail-over times short and data loss low, the placement of these instances with respect to each other and to the available network resources becomes highly important.

Scaling parameters

Basic parameters for scaling include the thresholds denoting when to scale out or in (e.g., based on CPU load, memory usage and network utilisation of virtual machines), and the minimum and maximum scaling group sizes. More advanced parameters describe the relation of a scaling group to other instances, i.e., having to scale out or in when another group of instances scales out or in.

Virtualized component parameters

This includes the resources necessary for the instances themselves, i.e., the number of CPU cores, memory, interfaces, etc. This is typically mapped to the flavours of the virtual machines to be started, as supplied by the CIMS. The connectivity describes the regular network resources necessary for the deployed service, both between different parts and components of the service (excluding protection schemes) and to the outside world, as seen from the cloud infrastructure. This group also includes the configuration scripts that need to be executed in order to set up and interconnect machines at the application level for their role, e.g., whether a server is configured as the active server or as a passive backup server.

4.1.2 Placement function

The main sub-component is a function that maps the specific requested resources and, if applicable, their resilience pattern to a placement of instances in the cloud infrastructure. In particular, it takes the number of redundant or otherwise connected instances, as well as their interconnection requirements (in particular delay and bandwidth), as an input from the service operator, via the SLA parser. A second input is a description of the cloud infrastructure with information about the availability of its components, the physical network characteristics, as well as geo-location and the status of supervision by anomaly detection. This infrastructure description does not need to be made public or even visible to the service operator or cloud user, but should be disclosed only in case of a dispute. It is based on the measured and/or configured performance data of the cloud infrastructure provider, or could use input from services like ALTO [alt]. With these two inputs, the placement function maps the instances necessary to cope with the resilience requirements of the user to physical resources or resource pools in the infrastructure that provide the needed performance. To this end, it identifies combinations

of resources that offer the requested performance and returns a selection of these zones (if any exist) as an output. With respect to the placement of the virtual machines, the goal is to re-use the concept of availability zones to place instances on specific servers. This necessitates a suitably fine-grained configuration of these zones on the part of the cloud infrastructure provider, i.e., configuring an availability zone not per region or data centre, but per rack of servers. In addition, the same concept can be used to define zones that are under the supervision of the anomaly detection component. The placement function can then implicitly request monitoring of certain critical instances by placing them in such a zone, without needing explicit communication with AD sub-components. However, if more information is to be provided (such as the type of instance to be monitored and its communication pattern), then an additional sub-component for this communication is needed (not shown in Figure 12), using output from both the SLA parser and the placement function.

4.1.3 Deployment template generator

Finally, another sub-component needs to format the output of the placement function and the SLA parser so that it can be processed by the cloud orchestration framework to actually provision the resources for the service. The specific output of the orchestrator component and deployment function, as well as its format, depends in principle on the cloud orchestration framework used (e.g., OpenStack Heat or VMware vCenter/vRealize). In general, the orchestration component produces the requests and commands to reserve infrastructure resources, to start instances of virtual machines and network functions, and to configure them for the requested service. Part of this sub-component's functionality is the necessary configuration and interconnection of instances as backup or load-sharing components in order to increase resilience and availability. However, this aspect is highly application-dependent, since the specific protocols and synchronisation procedures used for these purposes depend on the virtual machines and applications deployed. They thus probably need to be defined already in the service description of the service provider. Therefore, no general guidelines can be given for this part of the function. However, we will provide an example for the use case described in Section 6. A similar approach would be taken for the Mirasys use case of redundant VMS servers. If no orchestration framework exists, the template generator sub-component of the orchestrator would basically have to take over this part of the process as well, i.e., create a set of more basic instructions (e.g., for OpenStack Nova). However, we will not consider this case in the scope of the project. For the example of OpenStack, a Heat template can be generated that contains all of these steps. This template contains information about, e.g., scaling groups and triggers, VM flavours to be started, and configuration scripts to be run on instances. It also contains the placement of the virtual machines in terms of availability zones (as defined by OpenStack). While the precision of this information depends on the definition

of the zones by the cloud infrastructure provider (i.e., if only coarse-grained availability zones are defined, the location of an instance within such a zone cannot be pinpointed very precisely), our proposal is to utilise the availability zone mechanism to allow for a more fine-grained definition of these zones and thus a more precise placement of virtual machines on groups of physical resources. A trade-off exists here concerning the size of the configured zones: smaller zones mean tighter control over the placement of the virtual machines, while at the same time restricting the flexibility of the virtual infrastructure manager framework (e.g., Nova) to place and migrate instances according to the available resources. The generated template is also stored as part of the audit trail.

4.2 Interaction with tools for Audit Trails and Root-Cause Analysis (AT-RCA)

We believe a single metric alone is not sufficient to differentiate between different types of anomaly. Under the assumption that anomalies manifest themselves in the monitored metrics, we are not able to detect anomalies that do not manifest as a significant deviation, which is in fact a limitation of any AD technique. Therefore, additional sources, auditing possibilities, audit trails and root-cause analysis, as introduced in Deliverable 5.1 [Con14], will complement the AD technique. As described in the updated version of Deliverable 5.1, it would be worthwhile to have an independent API to check requirements. While trying to achieve increased transparency with respect to the operations of the cloud infrastructure, we simultaneously try to achieve the objective of minimal disclosure of the cloud infrastructure provider's operational practices and resources, as well as strict isolation between separate tenants.

Figure 14 provides an overview of such an API for Tools for Audit Trails and Root-Cause Analysis. Of particular interest is the interface I_TC between the Cloud Infrastructure Provider and the Tenant Infrastructure Provider. It is conceivable that such an interface or API not only provides information for tenants, but also provides data for the anomaly detection. According to Deliverable 5.1, several sources come into consideration: hardware, operating system, hypervisor and software. This could be any information about values from different devices, such as hosting cloud nodes, network devices (switch, firewall, IDS), the CIMS, or the auditing framework itself. Depending on the SLAs or legal requirements of a TIP, other functions can be defined or the returned information can vary.

For AD, it is conceivable that some sources will be monitored through the Tools for AT-RCA. Hence the AD can send a query directly to the API (I_TC) described in Deliverable 5.1 [Con14]. If, for example, the AD wants to know the current CPU load, it can simply send an enquiry to the Tools for AT-RCA to monitor this source. The other way around, it is feasible that the Tools for AT-RCA could request some values from the AD. If, for example, a tenant wants to audit the actual AD behaviour, they could send an enquiry to the Tools for AT-RCA, which would determine this for them. It goes without saying that one must take care not to build a feedback loop during implementation. It is likewise imaginable that auditing information is exchanged on a per-host basis, assuming that the AD and Tools for AT-RCA components run on each physical host and are able to communicate with each other. In this way, not every request has to be answered centrally; some requests could be answered locally.
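Purely as an illustration of the kind of local exchange envisaged above, the following Python sketch shows an AD component asking a local AT-RCA endpoint for a monitored value such as CPU load. The I_TC interface is not yet specified, so the endpoint path, parameters and response fields used here are assumptions for the sake of the example, not part of Deliverable 5.1.

# Illustrative sketch only: the AT-RCA interface (I_TC) is not yet specified,
# so the endpoint, query parameters and JSON fields are assumed for this example.
import json
import urllib.parse
import urllib.request

ATRCA_ENDPOINT = "http://localhost:8742/atrca/v1/query"   # hypothetical local endpoint

def query_atrca(source, metric, host):
    """Ask the local Tools for AT-RCA instance for the latest value of a metric."""
    params = urllib.parse.urlencode({"source": source, "metric": metric, "host": host})
    with urllib.request.urlopen(f"{ATRCA_ENDPOINT}?{params}", timeout=5) as resp:
        return json.load(resp)

# Example: the AD asks for the current CPU load of the local hosting node.
# Answering such requests locally (per physical host) avoids a central round
# trip; as noted above, care must be taken not to create a feedback loop if
# the AT-RCA tools in turn request values from the AD.
reply = query_atrca(source="hypervisor", metric="cpu_load", host="compute-node-1")
print(reply.get("value"), reply.get("timestamp"))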

Figure 14: A monitoring-oriented view identifying the different interfaces (I_SO, I_ST, I_TC, I_TO, I_AC, I_AS, I_AT) and auditors

Furthermore, it is worth mentioning that the Tools for AT-RCA can only check whether a specific technology is available or running, whether a failure has occurred, or retrieve a specific value. Currently it is not possible to check service-level functionality with the Tools for AT-RCA, especially once it leaves the sphere of the Cloud Infrastructure Provider. It therefore depends on the AD requirements whether the Tools for AT-RCA can provide additional data sets. The Tools for AT-RCA will thus complement AD by providing an alternative source of features, which can also be recorded for off-line analysis, and by providing live information in case of the failure of any data-collection component. Further interaction between AD and the Tools for Audit Trails and Root-Cause Analysis (AT-RCA), or other RTD output, will be discussed in detail in Deliverable 5.3, which will be available at the end of the project.

4.3 Online anomaly detection

The anomaly detection engine provides reference implementations of several anomaly detection techniques. We have chosen the one-class Support Vector Machine (SVM) algorithm for the implementation of the System Analysis Engine (SAE), while Recursive Density Estimation (RDE) [AY11] is used for the implementation of the Network Analysis Engine (NAE). The choice of these algorithms was based purely on their suitability for online detection and on the type of data they monitor. It is worth mentioning that in the case of the NAE we were particularly interested in an algorithm that allows us to build, accumulate, and self-learn a

dynamically evolving information model of normality. The process of feature extraction and a description of the algorithms are provided in the following sub-sections, along with a description of the online detection implementation and our evaluation procedure.

4.3.1 Data collection engine

The first stage of online detection is data collection, which comprises a set of scripts providing feature extraction and normalisation; within the SAE this is achieved through the use of Volatility in conjunction with libvmi. At 8-second intervals (an arbitrary binning of the data, which can be user-defined) the Volatility tool is invoked with our custom plug-in, which crawls the VM memory for every resident process structure. From each process a number of raw features are extracted, which include:

the current size of virtual memory belonging to each process
the peak virtual size (i.e. the requested memory allocation) of each process
the number of threads belonging to each process
the total number of handles belonging to each process (which includes process threads, file handles, registry entries, etc.)

The raw features are per process, which is not useful if we are to consider each sample, or snapshot, as a single feature vector. Therefore, the raw features are used to build meta-features: the mean, variance and standard deviation of each raw feature across all processes. The result of feature extraction is a feature vector of the form x = (x_1, x_2, ..., x_n), where n = 12 due to the three meta-features computed over each of the four raw features.

At the network level, the NAE collects traffic data through tcpdump from each host's network bridge interface (br0). This traffic is then passed to a Summary Extraction Script, which is based on libpcap and converts the traffic into normalised statistical properties on a per-packet basis. In order to capture the dynamics of varying attack types, we extracted both volume-based features (e.g., counts of bytes and packets) and distribution-based features (computed as the Shannon entropy of all values observed in the bin, as used in many seminal pieces of work [LCD05]). The resulting feature vector therefore has dimension n = 8 and consists of:

Number of packets
Number of bytes
Number of active flows in each bin
Entropy of source IP address distribution
Entropy of destination IP address distribution
Entropy of source port distribution

Entropy of destination port distribution
Entropy of packet size distribution

Figure 15 shows an overview of the data collection engine.

Figure 15: Overview of Data Collection Engine

4.3.2 One-class SVM for SAE

The one-class SVM algorithm, as proposed by [SWS+99], is an extension of the (supervised) Support Vector Machine algorithm to handle unlabelled data (i.e. novelty detection). The main goal of the algorithm is to produce a decision function that is able to return a class vector y for a given input matrix x based on the distribution of a

training data-set. This is achieved by solving the optimisation problem in Equation 1 using Lagrange multipliers:

\[ \min_{\omega,\,\xi_i,\,\rho} \;\; \frac{1}{2}\|\omega\|^2 + \frac{1}{\nu n}\sum_{i=1}^{n} \xi_i - \rho \quad \text{subject to} \quad (\omega \cdot \phi(x_i)) \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad \text{for all } i = 1,\dots,n \qquad (1) \]

In Equation 1, ν characterises the solution by setting an upper bound on the fraction of outliers and a lower bound on the fraction of support vectors. Increasing ν results in a wider soft margin, meaning there is a higher probability that the training data will fall outside the normal frontier. The resulting decision function is given in Equation 2, where the α_i are the Lagrange multipliers.

\[ f(x) = \sum_{i=1}^{N} \alpha_i\, k(x, x_i) - \rho \qquad (2) \]

The function k(x, x_i) denotes the kernel function and can be chosen to suit a particular problem. Available kernel functions include linear, polynomial, RBF and sigmoid; however, the RBF, or Gaussian Radial Basis Function, is the best choice for applications with non-linearly separable data and is what we use in our work. The RBF kernel function is defined as:

\[ k(x, x_i) = \exp\left(-\gamma\, \|x - x_i\|^2\right) \qquad (3) \]

The kernel parameter γ in Equation 3 is sometimes expressed as 1/σ², where a reduction in σ results in a decrease in the smoothness of the frontier between normal and anomalous data. It is therefore possible to produce a decision function that approximates a nearest-neighbour classifier by increasing the value of γ.

4.3.3 Recursive density estimation

The RDE concept was originally introduced by [AY11]. It uses a Cauchy function, which has similar properties to the Gaussian but can be updated recursively [Ang04] and is non-parametric. In addition, there is no need to make any assumptions about the distribution. This means that only a very small amount of data needs to be stored in memory and updated: the mean of all data samples, µ_k, and the scalar product quantity, Σ_k, calculated at the current moment in time k. The current data sample x_k is also used, but it is available anyway and there is no need to store or update it. This has significant implications, because it theoretically allows an infinite amount of data (infinitely large data sets or infinitely long, open-ended data streams) to be processed in real time.

Let all measurable physical variables form the vector x ∈ R^n, and let the data samples be divided into several clusters. Then, for any vector x ∈ R^n, its density value with respect to the Λ-th cluster is calculated for a Euclidean-type distance as [Ang12]:

\[ d_\Lambda = \frac{1}{1 + \frac{1}{N_\Lambda}\sum_{i=1}^{N_\Lambda} \| x_k - x_i \|^2} \qquad (4) \]

where d_Λ denotes the local density of cluster Λ and N_Λ denotes the number of data samples associated with cluster Λ. In the case of anomaly detection, x_k represents the feature vector with the values for instant k. The distance is calculated between a given data vector (e.g. measured at time instant k) and the other data vectors that belong to the cluster to which x belongs (measured at previous time instants). It can be shown that this formula can be derived as an exact (not approximated or learned) quantity as [Ang12]:

\[ D(x_k) = D_k = \frac{1}{1 + \|x_k - \mu_k\|^2 + \Sigma_k - \|\mu_k\|^2} \qquad (5) \]

where both the mean µ_k and the scalar product Σ_k can be updated recursively as follows:

\[ \mu_k = \frac{k-1}{k}\,\mu_{k-1} + \frac{1}{k}\,x_k, \qquad \mu_1 = x_1 \qquad (6) \]

\[ \Sigma_k = \frac{k-1}{k}\,\Sigma_{k-1} + \frac{1}{k}\,\|x_k\|^2, \qquad \Sigma_1 = \|x_1\|^2 \qquad (7) \]

The data is collected continuously, in on-line mode, during the detection process. Some of the new data reinforce and confirm the information contained in the previous data. Other data, however, bring new information, which could indicate a change in operating conditions, the development of an anomaly or simply a more significant change in the dynamics of the system [Ang02]. In order to detect anomalous behaviour, the variable ΔD is then calculated as follows:

\[ \Delta D = | D_k - D_{k-1} | \qquad (8) \]

where D_k is the density calculated for the current data sample (x_k) and D_{k-1} is the density calculated for the immediately previous data sample (x_{k-1}). Note that we only need to store one previous value of D. The mean of the density (µ_D) is then calculated as follows:

\[ \mu_{D_k} = \left( \frac{k_s - 1}{k_s}\,\mu_{D_{k-1}} + \frac{1}{k_s}\,D_k \right)\left(1 - \Delta D\right) + D_k\,\Delta D \qquad (9) \]

This quantity is used as the measure for deciding whether the system should enter or exit the anomalous state. Since it is based entirely on the concept of density in the data space, it is highly suitable for an online anomaly detection approach, such as in a cloud infrastructure, where it is not possible to pre-determine all possible anomalies.
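As an illustration of how compact the recursive updates in Equations 5 to 9 are in practice, the following is a minimal Python/numpy sketch of the per-sample density computation. It is only a sketch of the update step; the state-transition thresholds and windows of the full NAE algorithm (Section 5.5) are omitted, and the class and variable names are our own.

# Minimal sketch of the recursive density estimation updates in Equations 5-9;
# the thresholds and windows of the full NAE algorithm (Section 5.5) are omitted.
import numpy as np

class RDEState:
    def __init__(self):
        self.k = 0          # sample counter
        self.mu = None      # recursive mean of samples, Eq. 6
        self.sigma = None   # recursive mean of squared norms, Eq. 7
        self.D_prev = None  # previous density, for Eq. 8
        self.mu_D = None    # mean of density, Eq. 9
        self.k_s = 0        # samples seen since the last (re)initialisation

    def update(self, x):
        """Consume one feature vector x and return (D_k, delta_D, mu_D)."""
        x = np.asarray(x, dtype=float)
        self.k += 1
        self.k_s += 1
        if self.k == 1:
            self.mu, self.sigma = x.copy(), float(x @ x)
            D, delta = 1.0, 0.0
        else:
            self.mu = (self.k - 1) / self.k * self.mu + x / self.k              # Eq. 6
            self.sigma = (self.k - 1) / self.k * self.sigma + (x @ x) / self.k  # Eq. 7
            D = 1.0 / (1.0 + float(np.sum((x - self.mu) ** 2))
                       + self.sigma - float(self.mu @ self.mu))                 # Eq. 5
            delta = abs(D - self.D_prev)                                        # Eq. 8
        if self.mu_D is None:
            self.mu_D = D
        else:
            self.mu_D = ((self.k_s - 1) / self.k_s * self.mu_D
                         + D / self.k_s) * (1 - delta) + D * delta              # Eq. 9
        self.D_prev = D
        return D, delta, self.mu_D

# Example: feed a stream of 8-dimensional NAE feature vectors.
rde = RDEState()
for x_k in np.random.rand(5, 8):
    print(rde.update(x_k))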

4.4 Resilience policies

The central component of the CRMF is the Policy Engine (PE), shown in Figure 16, which makes policy-based decisions to activate management actions on the cloud infrastructure at both the VM and the host level, as well as functionality exposed by existing network elements, such as routers and switches. We propose an incremental approach to challenge identification, one of the outputs of situational comprehension, whereby an evolving understanding of the nature of a challenge is developed. The aim is to enable early remediation that protects the cloud infrastructure from potential collapse, using imperfect information, and subsequently to use more specific remediation activities as a more detailed picture is constructed, i.e., as comprehension improves. This approach is similar to that proposed by [Gam09]. When a challenge has been identified, or a hypothesis about the nature of an ongoing challenge has been reached, the anomaly detection engine generates an alert.

Figure 16: An Overview of the Policy Engine

As an example of how the architecture can be applied, consider a high-traffic-volume challenge, such as a Distributed Denial of Service (DDoS) attack or a flash crowd event. Initially, the NAE generates an alarm using the Recursive Density Estimation technique, which is computationally very fast, as soon as the volume of traffic exceeds a given threshold. On detection of high traffic volumes, the link on the host could be rate-limited, e.g., by randomly dropping packets, to protect the challenge target. A Fine Grain Analysis (FGA) engine could then be invoked to determine the target of the challenge, leading to more specific rate limiting of traffic to this destination. Finally, actions can be enforced using policy enforcement points, for example blocking flows that are deemed malicious, or re-directing flows, or no longer subjecting them to rate limiting if they are seen to be benign.

In Listing 2, we list a few example obligation policies that can be used to reconfigure resilience services in response to events generated by the monitoring mechanisms, NAE and FGA. This example highlights the use of refined policies to respond to a challenge such as a Denial of Service (DoS) attack targeted at the cloud infrastructure layer. In order to confront such challenges, resilience mechanisms can be used that must coordinate and cooperate to ensure resilience. Clearly, it is important that an attack be mitigated rapidly to reduce the impact on the other tenants and to protect the infrastructure. Such mechanisms include, but are not limited to, an anomaly detector, a flow classifier, a resilience metrics reporter, malware differentiators, etc.

4.5 Data-usage policies

Tenants run critical infrastructure IT services on different machines (VMs) in a virtual datacenter. However, the services inside the VMs are not allowed to share the same physical resources. If a tenant or the cloud infrastructure operator starts migrating VMs running critical services from different tenants to the same physical host, an anti-affinity policy separates the VMs. The migration of critical services from different tenants to the same physical host results in an automatic migration of services from one tenant to another location. If the tenants' security requirements are not addressed, the threats include:

unauthorised access from an untrusted service to a critical infrastructure IT cloud service


unintended interference between VMs
collateral damage from a tenant being attacked when insufficient tenant separation is employed

Listing 3 presents such an example policy for cloud environments running VMware. The policy mechanism is triggered by an event (VmMigratedEvent) signalling that a virtual machine has been migrated. In the condition part, the mechanism evaluates whether the target host (physical hardware) is already running critical services. If the condition is satisfied, our framework requests two actions to be executed. First, it logs the policy violation with the log message "Critical service violation detected.". Second, the virtual machine responsible for the violation is migrated to a host that is not running critical services. If no such host is available, the default host host-9 is chosen.

5 Quantitative evaluation

5.1 Test-bed

We established a cloud test-bed in which two hosts serve as compute nodes for running multiple VMs. Another host acts as a controller to initiate migrations and also to generate background traffic. A fourth host generates attack traffic. All are connected to a LAN, as shown in Figure 17.

Figure 17: Experimental setup of the test-bed

Each physical node runs the Kernel-based Virtual Machine (KVM) as virtualization infrastructure, and the Quick EMUlator (QEMU) provides hardware emulation. Migration is achieved with libvirt. All VMs on a node are connected to a virtual bridge interface, virbr0, so their own interfaces appear to be part of the LAN. For the experiments presented here, each VM runs Apache HTTPd. The client host runs custom scripts to initiate random HTTP requests to the VMs. The challenger host runs custom attack scripts to generate attack traffic directed towards the VMs' address range for a selected attack type and intensity (i.e., the volume of traffic it generates). Tcpdump is used to simultaneously collect packet traces from the two virtual bridge interfaces, one in each physical node, so these traces represent the aggregate traffic to/from all VMs on a node.

This set-up allows us to run experiments in which the legitimate traffic of several web servers is continuously emulated, while anomalous traffic is emulated by overlaying the legitimate traffic with attack traffic from the attack scripts that run during part of the experiment. Independently, one of the VMs running a web server can be live-migrated between the nodes during a period of either normal or anomalous traffic. Subsequently, the traces obtained at the virtual bridges can be fed into the anomaly detectors to observe their reactions to normal/anomalous traffic. Because we have control over when the anomaly and the migration occur, we can confidently label the obtained traces with ground truth about both conditions and therefore assess the performance of the proposed detection techniques for SAE and NAE.

5.2 Software tools

The data collection and analysis tools installed on each compute node include libvmi, Volatility, tcpdump, scikit-learn and numpy, along with our custom scripts for traffic dissection and per-packet statistical summary extraction for network-level analysis. Overall, the data acquisition, feature extraction and anomaly detection performed by both the SAE and NAE components of our resilience framework are achieved through custom Python scripts that operate on VMs in real time. The libvmi library enables the key aspect of Virtual Machine Introspection (VMI) in our SAE, since it allows fast and online access to the whole of the virtual memory of a VM. This allows memory forensics techniques, which are usually reserved for offline static analysis, to be employed in real-time scenarios. Volatility is a Python framework that is capable of a number of memory forensics functions and is extensible through the use of plug-ins.
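To illustrate the meta-feature construction described for the data collection engine in Section 4.3.1, the following is a minimal numpy sketch. It assumes the per-process raw features have already been extracted by the Volatility plug-in; the extraction itself is not shown, the array layout is an assumption for this example, and the numbers are invented.

# Minimal sketch of building the 12-dimensional SAE feature vector from
# per-process raw features (virtual memory size, peak virtual size, thread
# count, handle count). Only the meta-feature step is shown: the mean,
# variance and standard deviation of each raw feature across all processes.
import numpy as np

def build_sae_vector(per_process_features):
    """per_process_features: array-like of shape (num_processes, 4)."""
    raw = np.asarray(per_process_features, dtype=float)
    means = raw.mean(axis=0)       # 4 values
    variances = raw.var(axis=0)    # 4 values
    stds = raw.std(axis=0)         # 4 values
    return np.concatenate([means, variances, stds])   # n = 12

# Example snapshot with three resident processes (invented values):
snapshot = [
    # [virtual size, peak virtual size, threads, handles]
    [104857600, 120586240,  8, 150],
    [ 52428800,  60817408,  3,  90],
    [209715200, 230686720, 12, 310],
]
x = build_sae_vector(snapshot)
print(x.shape)   # (12,)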

5.3 Evaluation Metrics

We evaluate the NAE and SAE with several metrics, as a single metric alone is not sufficient to draw conclusions about the performance of the underlying anomaly detection technique. For example, in order to compare the sensitivities of the individual detectors, we use both precision and recall.

Each input entry submitted to the NAE and SAE by the DCE describes the features of the monitored network traffic and VM activity during a given time period (bin), and the related detector (NAE/SAE) computes the deviation from normal behaviour based on the selected technique. Therefore, the performance can be assessed by determining the difference between the class it produces for a given input and the class it should have produced. Correctly identified negatives are True Negatives (TN), incorrectly identified negatives are False Positives (FP), correctly identified positives are True Positives (TP) and incorrectly identified positives are False Negatives (FN). From this output we can compute the true-positive rate (TPR, sensitivity or recall; TP/(TP + FN)), the false-positive rate (FPR; FP/(FP + TN)), the precision (TP/(TP + FP)), the accuracy ((TP + TN)/(TP + TN + FP + FN)), the F score (2 · (Precision · Recall)/(Precision + Recall)), the G mean (√(Precision · Recall)), and the detection rate ((TP + FN)/(TP + FP + TN + FN)).

Accuracy is the degree to which the detector classifies data samples correctly; precision is a measure of how many of the positive classifications are correct, i.e. the probability that a detected anomaly has been correctly classified; and recall is a measure of the detector's ability to correctly identify an anomaly, i.e. the probability that an anomalous sample will be correctly detected. The final two metrics are the harmonic mean (F score) and the geometric mean (G mean), which provide a more rounded measure of the performance of a particular detector by accounting for all of the outcomes to some degree. Precision and recall are important for the evaluation because some classes of data instance may be few in number relative to others. For example, if portscan attack traffic represented 95% of the total captured traffic and normal data instances only 5%, and all instances were predicted to be portscan, the overall accuracy would still be 95%, even though the normal class had been missed entirely.

5.4 Tuning the SAE detection algorithm

As discussed in sub-section 4.3.2, the SAE employs the parametric detection technique SVM to identify anomalous behaviour of virtual machines. This requires tuning the detector by providing it with a data-set of normal samples and using it to generate a decision function that is capable of classifying novel samples. In general, the training process is determined by four factors: the size and content of the training data-set and the two parameters ν and γ.
The training data-set size is determined by the length of time over which VM monitoring is conducted, after which it is possible to select subsets of the available data, resulting in a refinement of the training data and a reduction in data-set size if required. The data-set content is determined by the behaviour of the processes in the VM and is not accurately controllable; hence the only influence that can be imposed on

the data is by varying the applications and the loads on each of them. In contrast, the parameters ν and γ can be finely controlled and are chosen at training time to alter the accuracy of the detector with respect to the available training data.

The choice of algorithm parameters is not obvious a priori, and a small change of ν or γ either way can result in less accurate detection. However, by choosing the parameters based on how accurately the classifier classifies its own training data-set, it is possible to optimise the detector for a particular VM profile. The process of parameter selection is conducted in an incremental manner by selecting the lowest reasonable values for ν and γ and incrementing the values of first ν and then γ in a pair of nested loops (since ν cannot be equal to 0, its search begins at a small positive value; γ can be any non-negative real number, so its search begins at 0). The increment for γ need not be as fine as that for ν because, within our experimentation, we have found it to have much less influence on the accuracy of the detector. At each step the False Positive Rate (FPR) is calculated for the pair of parameter values according to the formula presented in Section 5.3. Overall, by conducting this iterative process we have found that, once a minimum is reached, there may be several parameter pairs that yield the same minimum, after which the FPR rises again for all subsequent pairs of values. This is to be expected, because increasing both parameters past a certain point results in a frontier that fits too tightly to close neighbours in the training data and does not generalise well. Thus, a compromise needs to be reached between fitting the training data loosely with low values of the algorithm parameters, and being too restrictive with high values. Hence, with empirical experience of search times it is possible to stop the procedure long before the end of the exhaustive search and thereby reach an optimised set of parameters in reasonable time (within our experimentation this iterative process takes no more than 10 seconds on an average machine and need only be carried out once per training data-set).

Figure 18 shows the time it takes to train the classifier versus the size of the training data-set. For completeness we used a large range of data-set sizes, larger than would be practical to obtain under normal circumstances. The data-set used in our experiments contained around 200 samples, which resulted in a training time of between 2 and 10 ms. Considering that feature extraction takes on the order of 10 seconds to complete, the time taken to train the classifier is negligible, especially since training is only required once during the lifetime of the classifier. Classification could also potentially hold up the process of obtaining a class for a particular vector and, like training, depends on the data-set size. However, as Figure 19 shows, the time taken to produce a class is also negligible with respect to the time taken to obtain the feature vector itself, despite the fact that classification is carried out on every sample vector.

5.5 Proposed RDE algorithm for NAE

The original implementation of RDE as proposed in [AY11] made the assumption that, for a given set of features, the normal behaviour of the system is not substantially oscillatory.

Figure 18: Time taken to train the classifier vs. training data-set size

Figure 19: Time taken to output a class vs. training data-set size

The cloud traffic that we captured on our test-bed uses random client requests to the HTTP service, which creates high variation in the background traffic. Therefore, to overcome this variation we performed various experimental runs. Based on our analysis, we have made modifications to the original RDE algorithm with respect to the cloud network traffic data. These modifications are explained below:

1. In order to have more clearly defined normality regions, the mean of the density µ_D is recursively updated only for normal data.
2. The transition from one state to another is controlled by two tolerance thresholds Th_1 and Th_2, where:
   (a) Th_1, a percentage (%) of the mean, is the upper bound on the density for declaring a transition of the system state from normal to anomalous;
   (b) Th_2, a percentage (%) of the mean, is the lower bound on the density for declaring a transition of the system state from anomalous to normal.
3. We also introduce two windows of time-bins (Ws_1, Ws_2), which act as intuitive enter/exit thresholds; their values represent a good trade-off between the response time and the robustness of the detection system. This was necessary to account for oscillatory behaviour in the data, because the decision about the system state cannot be based on a single sample.
4. Once the system is back in a normal state, the density of the current sample is reset back to one, i.e., D_k = 1, to mitigate the impact of the current anomalous data density on subsequent density computations.

For on-line anomaly detection, the NAE starts with the assumption that the system is in a normal state by initialising the status variable to normal. In the same step, the tolerance thresholds (Th_1, Th_2) and window sizes (Ws_1 and Ws_2) are initialised and the current time step is set to k = 1, where k counts the number of data samples read (hence, the total number of iterations of the algorithm). From this point, the input data sample x_k is read from the interface provided by the DCE. In our case, x_k is an 8-dimensional feature vector. In the first execution (k == 1), the density (D_k = 1.0), the mean value of the density (µ_D = D_k), µ_k and Σ_k are initialised and the time step k is incremented by 1 (k = k + 1). From the second time step (k > 1) onwards, the variables D_k, µ_k and Σ_k are recursively updated by Equations 5, 6 and 7 respectively. The variable ΔD and the mean of the density are calculated using Equation 8 and Equation 9 respectively. The latter is used as the measure for deciding whether the system should enter or exit the anomalous state. Since it is calculated recursively, it does not require storing any previous values in memory, which is appropriate for an on-line approach.

The calculation of µ_D follows the premise of Equation 6; however, it is much less conservative, in that µ_D is based on the past values of D but is also sensitive to abrupt anomalous behaviour. The coefficient (1 − ΔD) will lead µ_D towards the actual mean of the density when there is a smooth change in the data, and ΔD will lead µ_D towards the new value of D_k in the presence of an anomaly or abrupt change. As discussed earlier, to have a more clearly defined normality region, we do not update the mean of the densities while an anomaly is declared, and start updating the mean again when the system goes back to normal, i.e., µ_D(k) = µ_D(k−1). We found this important so that the decision about an anomaly is not affected by a drop in the mean. At this point, the following scenarios can occur:

If the current status of the system is normal and D_k ≤ µ_D · Th_1 for the window Ws_1, then change the status to anomalous.

If the current status of the system is anomalous and D_k ≥ µ_D · Th_2 for the window Ws_2, then change the status to normal and re-initialise D_k = 1.0, µ_D(k) = D_k, µ_k = x_k and Σ_k = ||x_k||².

Algorithm 1: Proposed anomaly detection algorithm for NAE
Ensure: k = 1
Require: Th_1, Th_2, Ws_1, Ws_2
 1: % Th_1 and Th_2 are the tolerance thresholds
 2: % Ws_1 and Ws_2 are the window sizes
 3: status = normal
 4: while x_k = read next feature vector from the DCE interface do
 5:   if k == 1 then
 6:     D(k) = 1.0
 7:     µ_D(k) = D(k)
 8:     µ_k = x_k
 9:     Σ_k = ||x_k||²
10:   else
11:     µ_k = update by Equation 6
12:     Σ_k = update by Equation 7
13:     D(k) = update by Equation 5
14:     ΔD(k) = abs(D(k) − D(k−1))
15:     if status == normal then
16:       µ_D(k) = update by Equation 9
17:       if D(k) ≤ µ_D · Th_1 for the minimum window size Ws_1 then
18:         status = anomalous
19:         k_s = 0
20:       end if
21:     else
22:       µ_D(k) = µ_D(k−1)
23:       if D(k) ≥ µ_D · Th_2 for the maximum window size Ws_2 then
24:         status = normal
25:         D(k) = 1.0
26:         µ_D(k) = D(k)
27:         µ_k = x_k
28:         Σ_k = ||x_k||²
29:       end if
30:     end if
31:   end if
32:   k = k + 1
33: end while

We have observed better performance with the modified version of the RDE algorithm compared to the original version, as can be seen from Figure 20 and Figure 21 respectively. Both figures present the density graph (blue line) for a high-intensity DoS attack where the migration happens during the anomalous period (450th bin) and the attack anomaly starts half way through (300th bin). The significant oscillatory behaviour of the data can be seen in

both the normal (white) and the anomalous (pink) regions. This indicates that basing the decision about the anomalous state on a single sample, as proposed by the original algorithm, leads to a high number of false detections (as seen in Figure 20), where the green circles show the signalled anomalous bins. Therefore, the introduction of window sizes for detecting anomalies is a suitable modification, as seen in Figure 21, where the true-positive rate is high.

Figure 20: Original RDE

The comparison of the performance metrics (discussed in sub-section 5.3) for the original and the modified version is shown in Figure 22. The modified algorithm for on-line anomaly detection is detailed in Algorithm 1.

5.6 Evaluation of SAE with malware samples

The CRMF could not be evaluated without the ability to generate anomalies within a testing environment. For the SAE, it was therefore essential to utilise appropriate samples of genuine malware in our experiments (the specific samples used under experimental conditions are Trojan.Kelihos-5, Trojan.Zbot-1433, Trojan.Zbot-1023, Trojan.Zbot-18 and Trojan.Zbot-385, which were obtained from offensivecomputing.net). Moreover, by employing Volatility for the SAE and using it in conjunction with a custom plug-in, it is possible to crawl memory snapshots for process structures. While it may be quicker to do this by referencing the process list belonging to the OS, rather than performing an exhaustive search, some malware samples are able to unlink themselves from this list and would otherwise go undetected. Features are then extracted from each process such that anomaly detection can be carried out in relation to the current operating behaviour of the VM and node.

Figure 21: Modified RDE

Figure 22: Performance metric comparison (Recall, Precision, Accuracy, F score, G score) of the original vs. the modified algorithm

5.6.1 Detecting Kelihos

The first strain of malware is the Kelihos Trojan, which was chosen as a candidate for investigation due to its global impact and its end-system effects. Upon execution of the Kelihos sample, the malware spawns many child processes and subsequently exits from its main process. This is likely an obfuscation method to avoid detection, but it has the effect of skewing system-level features, resulting in an obvious anomaly. The main purposes of these child processes are to monitor user activity and to contact a Command and Control (C&C) server in order to join a botnet.

To test the performance of the SAE component, we use the Trojan.Kelihos-5 sample of Kelihos. Our trained and tuned SAE implementation was used in an online mode to classify feature vectors as they were collected from the test VM. The classifier was tuned according to the methods described in Section 5.4 and was trained using a data-set consisting of around 200 samples of normal behaviour gathered during normal server operation. The output class produced by the detector for each input vector was determined to be either correct or incorrect depending on the state of the malware sample at the time of feature extraction. The time-line for the experiment consisted of two phases: 10 minutes of normal activity, followed by 10 minutes of malware infection. Any positive detection classifications in the first phase were therefore false positives, whereas positive results in the second phase were true positives. Negative results were the opposite (i.e. true in the first phase and false in the second).

The results of this experiment can be seen in Figure 23. The bar charts shown in the figure were produced by calculating various performance metrics for each set of parameters according to the formulae in Section 5.3. The tuned classifier can be identified in the figure by its kernel parameters ν and γ. The results of this experiment show that tuning an SVM classifier according to the method in Section 5.4 results in a more reliable detector for our particular scenario. The results also show that it is possible to reliably classify feature vectors as they are produced, which enables the algorithm to be used in an online capacity to detect anomalies in a target VM as they occur. Furthermore, the anomalies produced by Kelihos as a result of its execution behaviour were detectable at high accuracy using the features collected by our analysis engine.

5.6.2 Detecting Zeus

The second strain of malware is Zeus, which has also seen a large global spread and has received considerably more media attention than Kelihos due to its use as a banking and key-logging Trojan. Four variants of this malware were chosen in order to test our hypothesis that the approach we take in this work is suitable for detecting multiple strains of malware, as well as variants within the same strain. Zeus, like most modern malware including Kelihos, exhibits obfuscation techniques, in this case manifesting as a mechanism for tampering with security software installed on the host. Its first action is to inject itself into one of the main system processes and to subsequently disable anti-virus and security centre applications.

Figure 23: Results of detection for Kelihos-5 using end-system features and a variety of kernel parameters

This makes it appear to be a legitimate process and makes detection systems that exist outside the execution environment of the malware (such as the method used in this work) particularly applicable.

Experiments using Zeus samples were conducted in the same manner as those using Kelihos. A sample was executed for the last 10 minutes of a 20-minute experiment, during which results were obtained from the classifier in real time. The first experiment using Zeus tested the ability of the SAE to detect samples other than Kelihos, to verify that the method is not limited to one type of malware. The result of this investigation can be seen in Figure 24. The results indicate that the detector performs equally well when detecting either Kelihos or Zeus.

The experiments thus far have tested the SAE against two strains of malware from different malware families. However, it is also important to test against different samples from the same strain in order to determine whether our approach is flexible in its classification of anomalous activity. Figure 25 shows experiments conducted with the same experimental procedure as the previous two experiments, but each using a different sample of Zeus. The excellent detection results from each show that the method is suitable not only for detecting multiple strains of malware, but also for detecting variants of the same strain.

Figure 24: Detection of Zeus-1433 using end-system features with a tuned classifier

Figure 25: Detection of various Zeus samples using end-system data and a tuned classifier
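To make the SAE training and online classification workflow described above concrete, the following is a minimal scikit-learn sketch. The training data, file layout and parameter values are illustrative assumptions; in practice the tuned ν and γ come from the search procedure of Section 5.4 and the training set consists of real feature vectors collected during normal operation.

# Minimal sketch of SAE training and online classification with a one-class
# SVM (RBF kernel), as described in Sections 4.3.2 and 5.4-5.6. Parameter
# values are placeholders; random data stands in for real training samples.
import numpy as np
from sklearn.svm import OneClassSVM

# ~200 twelve-dimensional feature vectors of normal behaviour.
X_train = np.random.rand(200, 12)

clf = OneClassSVM(kernel="rbf", nu=0.01, gamma=0.1)
clf.fit(X_train)

def classify_online(feature_vector):
    """Return True if the snapshot is flagged as anomalous."""
    # OneClassSVM.predict returns +1 for inliers and -1 for outliers.
    return clf.predict(np.asarray(feature_vector).reshape(1, -1))[0] == -1

# Online loop sketch: classify each new snapshot as it is produced by the DCE.
new_sample = np.random.rand(12)
print("anomalous" if classify_online(new_sample) else "normal")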

5.7 Evaluation of NAE with network-level attacks

Using a feature set capable of encapsulating changes to the volumetric properties of traffic on the network, we were able to detect Denial of Service (DoS) attacks on VMs using our NAE component. We first perform several experimental runs, each of which yields a pair of packet traces that are labelled with the ground truth regarding the presence of attack traffic or migration in the trace. In each 10-minute run, background traffic occurs continuously at a fixed rate and hence appears throughout a trace. Five minutes into each run, an attack script starts, so its traffic appears in each trace from the midpoint. At either 2.5 minutes or 7.5 minutes, a migration of one of the VMs is initiated. A run can therefore be characterised by the attack type and intensity, and by whether the migration occurs during the attack or during the normal period (i.e. the migration overlap). Each trace from a run can be further characterised by whether the node it was taken from experienced an outward (MDout) or inward (MDin) migration of the VM. The anomaly type is denoted as DoS for the denial-of-service attack, which is employed at either high (AH) or low (AL) intensity. In order to aid the evaluation process, we also present detection results for portscan (NPS) and netscan (NS) attacks in Appendices B and C.

The experiments are also characterised by migrant-targettedness: MT0 indicates that the DoS attack is targeted only at VMs that do not migrate during the experimental run, whereas MT1 indicates that the attack targets at least one VM that migrates. Figure 26 and Figure 27 show the interaction of MT0 with NM and AM respectively. Figure 28 and Figure 29 show the interaction of MT1 with NM and AM respectively. (The legend for these figures is given in Figure 30.) The migration overlap is denoted by either NM (the migration occurred during the first half of the run, i.e. during the normal period) or AM (during the anomalous period). The fixed background traffic is denoted by BC0, to distinguish it from future runs with alternative background characteristics. BC0 involves five VMs running identical HTTP servers: three run on one physical host, and two on the other. A host external to the VM infrastructure runs HTTP clients repeatedly connecting to each VM, two per VM. In order to obtain a coherent view of every experiment we run, each monitored packet trace of every experimental iteration was summarised as depicted in Table 1.

Figures 31 and 32 visually represent the detection performance under a high-intensity DoS attack with migration targettedness MT1 (i.e., the node experiencing the attack is migrating), where the migration happens during the anomalous and the normal period respectively. It can be seen from the figures that the anomalous region is detected precisely (green circles) with very few false positives. Each packet trace is filtered to eliminate the related management traffic between the VM host nodes. The output of the NAE was used to produce evaluation metrics according to the formulae in Section 5.3. The results in Table 2 and Table 3 show that our choice of network features is appropriate and sufficient for detecting network-based DoS attacks of both high and low intensity.
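As a small worked illustration of how the per-trace metrics reported in Tables 2 and 3 follow from the bin-level detector output, the following sketch computes the metrics defined in Section 5.3 from confusion-matrix counts. The counts in the example are made up for illustration and do not correspond to any experimental run.

# Worked example of the evaluation metrics defined in Section 5.3, computed
# from confusion-matrix counts. The counts below are invented.
import math

def evaluation_metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # TPR / sensitivity
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)           # harmonic mean
    g_mean = math.sqrt(precision * recall)                 # geometric mean
    return {"precision": precision, "recall": recall, "fpr": fpr,
            "accuracy": accuracy, "f_score": f_score, "g_mean": g_mean}

# Hypothetical counts for one trace: 40 anomalous bins of which 35 are
# detected and 5 missed, plus 2 of 35 normal bins falsely flagged.
print(evaluation_metrics(tp=35, tn=33, fp=2, fn=5))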

Figure 26: Depiction of migration during normal period, with anomaly directed at static host (NM-MT0)

Figure 27: Depiction of migration during anomalous period, with anomaly directed at static host (AM-MT0)

Figure 28: Depiction of migration during normal period, with anomaly directed at migrating host (NM-MT1)

Figure 29: Depiction of migration during anomalous period, with anomaly directed at migrating host (AM-MT1)

Figure 30: Legend for depictions of anomaly/migration interaction

Figure 31: Detection results of high-intensity DoS attack when migration is happening during the anomalous period (panels: DoS-AH-M1-AM-MT1-MDout, outward migration; DoS-AH-M1-AM-MT1-MDin, inward migration)

Figure 32: Detection results of high-intensity DoS attack when migration is happening during the normal period (panels: DoS-AH-M1-NM-MT1-MDout, outward migration; DoS-AH-M1-NM-MT1-MDin, inward migration)

on highutilisation(link, VM_ID)
    do FlowExporter enable(link, VM_ID) && sandbox(VM_ID, newlocation);

Flow exporter is disabled when link utilisation decreases (green policy) and the VM will not be sandboxed for further analysis.

on lowutilisation(link, VM_ID)
    if (LocalManager.anomalyList isempty(link, VM_ID))
    do FlowExporter disable(link, VM_ID);

Configuration policy for handling high risk alert

on highrisk(link, src, dst, VM_ID)
    do {
        FlowExporter notify(highrisk(link, VM_ID));
        LocalManager.anomalyList add(link, src, dst);
    }

Configuration policy for handling high risk alert

on highrisk(link, src, dst, VM_ID)
    if (LinkMonitor getutilisation() >= 75%)
    do RateLimiter limit(link, 60%);

Configure local manager for handling fine-grain analysis

on FGA_classification(flow, value, confidence, VM_ID)
    if ((value == DDoS) && (confidence < 0.4))
    do {
        Visualisation notify(alert(high));
        RateLimiter limit(flow.src, flow.dest, x%);
    }
    if ((value == DDoS) && (confidence >= 0.4) && (confidence <= 0.8))
    do {
        Visualisation notify(alert(high));
        RateLimiter limit(flow.src, flow.dest, y%);
    }
    if ((value == DDoS) && (confidence > 0.8))
    do {
        Visualisation notify(alert(high));
        Firewall block(flow.src, flow.dest);
    }

Configure local manager for handling low risk alert (recovery)

on lowrisk(link, src, dst)
    if ((LocalManager.anomalyList remove(link, src, dst, VM_ID)) isempty(link))
    do {
        FlowExporter notify(lowrisk(link, VM_ID));
        RateLimiter limit(link, 100%);
    }

Listing 2: Example obligation policies for resilience
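To indicate how such obligation policies map onto executable logic, the following Python sketch wires one of the triggers above to a handler via a tiny event dispatcher. The component classes are stand-ins invented for illustration; the actual enforcement is performed by the policy engine described earlier in this deliverable, not by code like this.

class FlowExporterStub:
    """Stand-in for the flow exporter referenced in Listing 2."""
    def disable(self, link, vm_id):
        print(f"FlowExporter: export disabled on {link} for {vm_id}")

class LocalManagerStub:
    """Stand-in for the local manager keeping the per-link anomaly list."""
    def __init__(self):
        self.anomaly_list = set()      # entries of the form (link, src, dst)
    def is_empty(self, link, vm_id):
        return not any(entry[0] == link for entry in self.anomaly_list)

HANDLERS = {}

def on(event_name):
    """Register a handler for a named trigger (the 'on ...' part of a policy)."""
    def register(handler):
        HANDLERS.setdefault(event_name, []).append(handler)
        return handler
    return register

def dispatch(event_name, **event):
    for handler in HANDLERS.get(event_name, []):
        handler(**event)

flow_exporter = FlowExporterStub()
local_manager = LocalManagerStub()

@on("lowUtilisation")
def green_policy(link, vm_id):
    # Condition: no outstanding anomalies recorded for this link.
    if local_manager.is_empty(link, vm_id):
        # Action: stop exporting flows for the link (the 'green' policy).
        flow_exporter.disable(link, vm_id)

dispatch("lowUtilisation", link="link-1", vm_id="vm-42")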

<policy name="dedicatedhw1">
  <detectivemechanism name="migrate1">
    <timestep amount="30" unit="seconds" />
    <trigger action="vmmigratedevent" istry="false" />
    <condition>
      <pip:boolean name="vmware" default="false">
        <param:string name="method" value="criticalserviceonhost" />
        <param:event name="host" value="host.morvalue" />
        <param:event name="ignorevm" value="vm.morvalue" />
      </pip:boolean>
    </condition>
    <executeaction name="log">
      <param:string name="msg" value="critical service violation detected." />
    </executeaction>
    <executeaction name="migratevm">
      <param:string name="priority" value="highpriority" />
      <param:event name="vm.mortype" value="vm.mortype" />
      <param:event name="vm.morvalue" value="vm.morvalue" />
      <param:string name="host.mortype" value="hostsystem" />
      <pip:string name="vmware" paramname="host.morvalue" default="host-9">
        <param:string name="method" value="getfreehost" />
      </pip:string>
    </executeaction>
  </detectivemechanism>
</policy>

Listing 3: Example policy Dedicated Hardware
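Listing 3 is evaluated by the IND²UCE policy engine rather than by application code, but the control flow it encodes can be paraphrased as in the Python sketch below. The stub classes and method names (e.g. critical_service_on_host, get_free_host) are invented to mirror the method strings in the XML and are not a real API; the 30-second timestep handling of the real engine is omitted.

class VMwarePipStub:
    """Stand-in for the policy information point (PIP) queried in the condition."""
    def critical_service_on_host(self, host, ignore_vm):
        # Illustrative answer: pretend host-7 already runs a critical service.
        return host == "host-7"
    def get_free_host(self):
        return "host-9"

class MigrationActionStub:
    """Stand-in for the executeAction 'migrateVM'."""
    def migrate(self, vm, target_host, priority):
        print(f"migrating {vm} to {target_host} ({priority})")

def on_vm_migrated(event, pip, migrator):
    """Invoked for each VmMigratedEvent, as declared by the <trigger> element."""
    # <condition>: does the destination host already run a critical service?
    if pip.critical_service_on_host(event["host"], ignore_vm=event["vm"]):
        # <executeAction name="log">
        print("Critical service violation detected.")
        # <executeAction name="migrateVM">: move the VM onto a free host.
        migrator.migrate(event["vm"], pip.get_free_host(), "highPriority")

on_vm_migrated({"vm": "vm-12", "host": "host-7"},
               VMwarePipStub(), MigrationActionStub())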

Table 1: Summary of characterisation

Background characterisation
  Type: BC0 = 5 HTTP servers; 2 clients each

Anomaly characterisation
  Type: DoS = Denial-of-service
  Intensity: AH = High; AL = Low

Experiment characterisation
  Migration overlap: NM = Normal period; AM = Anomalous period
  Migration direction: MDin = Inward; MDout = Outward
  Migrant-targettedness: MT0 = target VMs do not migrate; MT1 = at least 1 target VM migrates
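The scenario identifiers used in Tables 2 and 3 (e.g. BC0-DoS-AH-M1-AM-MT0-MDin) are simply the hyphen-joined characterisation codes of Table 1. The small Python helper below makes that naming explicit; the M1 component is copied verbatim from the scenario labels and is not itself defined in Table 1.

def scenario_id(background="BC0", anomaly="DoS", intensity="AH",
                migration="M1", overlap="AM", targetting="MT0",
                direction="MDin"):
    """Compose a scenario label from the characterisation codes of Table 1."""
    return "-".join([background, anomaly, intensity, migration,
                     overlap, targetting, direction])

print(scenario_id())                                    # BC0-DoS-AH-M1-AM-MT0-MDin
print(scenario_id(intensity="AL", overlap="NM",
                  targetting="MT1", direction="MDout")) # BC0-DoS-AL-M1-NM-MT1-MDout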

Table 2: Detection results of DoS attack with MT0 under high and low intensity
Columns: # of correct normal detections; # of correct anomalous detections; # of total predicted anomalies; Recall; Precision; Accuracy; F-mean; G-mean; Elapsed time

High-intensity (AH) scenarios:
  BC0-DoS-AH-M1-AM-MT0-MDin
  BC0-DoS-AH-M1-AM-MT0-MDout
  BC0-DoS-AH-M1-NM-MT0-MDin
  BC0-DoS-AH-M1-NM-MT0-MDout

Low-intensity (AL) scenarios:
  BC0-DoS-AL-M1-AM-MT0-MDin
  BC0-DoS-AL-M1-AM-MT0-MDout
  BC0-DoS-AL-M1-NM-MT0-MDin
  BC0-DoS-AL-M1-NM-MT0-MDout

[Per-scenario values not preserved in this transcription.]

Table 3: Detection results of DoS attack with MT1 under high and low intensity
Columns: # of correct normal detections; # of correct anomalous detections; # of total predicted anomalies; Recall; Precision; Accuracy; F-mean; G-mean; Elapsed time

High-intensity (AH) scenarios:
  BC0-DoS-AH-M1-AM-MT1-MDin
  BC0-DoS-AH-M1-AM-MT1-MDout
  BC0-DoS-AH-M1-NM-MT1-MDin
  BC0-DoS-AH-M1-NM-MT1-MDout

Low-intensity (AL) scenarios:
  BC0-DoS-AL-M1-AM-MT1-MDin
  BC0-DoS-AL-M1-AM-MT1-MDout
  BC0-DoS-AL-M1-NM-MT1-MDin
  BC0-DoS-AL-M1-NM-MT1-MDout

[Per-scenario values not preserved in this transcription, apart from NaN entries in one low-intensity row.]

6 Qualitative evaluation

Apart from the quantitative evaluation described in the last section, another important task is to show the feasibility and usability of the developed concepts in realistic use cases. A number of use cases are reported in deliverable D1.2 [SF13]; they are based on the real demands and requirements of the industrial partners in the project, Mirasys and the Valencia Traffic Control Centre. These use cases describe the type of cloud infrastructure and services the CRMF is expected to support, and they are used to guide its further development. The use cases can be grouped broadly under two scenarios:

Use case 1: Video surveillance in prevention, detection and solving of ordinary crime in urban areas
This use case is based on a real-life scenario, reflecting problems that can arise in a security environment where the critical infrastructure is distributed over wide areas (in this case a subway network) and where there are several interdependent parties in the value chain. It mainly focuses on issues arising in the Mirasys VMS infrastructure, a video surveillance management system that integrates recording devices such as cameras, recording and processing servers, and other management servers.

Use case 2: Moving mission-critical services to the cloud while preserving security
This scenario is set at the Valencia Traffic Control Centre, where currently no data is kept in the cloud; all information, together with the corresponding services and applications, is stored on local servers. As a first step, the managers of the infrastructure would like to move some data to the cloud, since they consider that it can save operational and maintenance costs. However, they are concerned about security and availability: once data has been moved to the cloud, any malfunction or undesired behaviour in the network may cause failures in the system.

Based on these use cases, test cases have been defined in D2.6, in which the research outputs of the different clusters are to be evaluated. In the following, we describe the test cases relevant to the CRMF and how the different components of the CRMF are to be evaluated in them.

6.1 Failure recovery of a virtual machine with minimum interruption to a service

The VM failure recovery test case is part of the video surveillance use case. In this test case, the Mirasys VMS service is to be resilient to the failure of an instance of the Master Server, a specific type of server which is connected to a back-end database and to recorder servers, and which also provides an endpoint to clients, cf. Figure 33. The Master Server component thus provides gateway functionality and should be highly available. The test case is used to show that the DF can translate this high-availability requirement into a deployment which ensures service resilience and availability.

Figure 33: The Mirasys test case for VM failure recovery

To achieve this goal, the deployment function will be used to provision redundant instances of the Master Server, i.e., one main Master Server and one backup Master Server on different physical machines (due to the scope of the test case and the size of the testbed, the placement function has only a limited degree of freedom). In addition, the generated deployment needs to configure OpenStack so that a fail-over between the main and the backup Master Server occurs automatically and transparently. New connections from clients are to be established automatically to the backup server, and from the backup server to the database and the recording servers. The backup server will thus, for all intents and purposes, become the new main server, and a new backup server might be instantiated in the background.

This test case is therefore related to the use case described in [BSRS13]. There, a PBX use case is described that necessitates a similar high-availability deployment of SIP signalling servers. The latter use case poses higher requirements on the placement of some of the instances due to their state synchronisation. The challenge lies in placing the redundant instances in the infrastructure so that (a) the probability of all servers becoming unavailable is sufficiently low and (b) the connection latency between them is low enough to allow fast synchronisation of session data. The latter requirement is meant to enable fast fail-over and, ideally, to avoid call drops. It is one example of requirements, defined by the service operator, on the placement of instances in the infrastructure. Similar requirements would be the placement of instances only in certain countries or within a specially supervised part of the infrastructure (e.g., supervised by anomaly detection), as defined in Deliverable D2.1 [CO13]. Due to the similarity of the cases, even if they do not match fully, the VM failure recovery test case described here was chosen to provide a proof-of-concept of the deployment function.

The test case thus consists in generating the deployment, triggering the instantiation of this deployment in OpenStack, and generating a failure of the main server to show that the service interruption is minimised by conducting a fail-over to the backup server. In
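The placement constraint described above (main and backup Master Server on different physical machines) maps naturally onto an OpenStack anti-affinity server group. As a rough sketch only, the following Python snippet creates such a group via the openstacksdk library; the cloud entry name and group name are assumptions, attribute names may differ between SDK releases, and the server instances themselves would subsequently be booted with a scheduler hint referencing the group.

import openstack

# Connect using credentials from clouds.yaml; 'seccrit-testbed' is an
# assumed cloud entry name, not one defined by this deliverable.
conn = openstack.connect(cloud="seccrit-testbed")

# Create an anti-affinity server group so that its members are scheduled
# onto different physical hosts.
group = conn.compute.create_server_group(
    name="master-server-ha",
    policies=["anti-affinity"],
)
print(f"server group {group.name} created with id {group.id}")

# The main and backup Master Server instances would then be booted as
# members of this group via the scheduler hint 'group=<group.id>'
# (on the CLI: openstack server create --hint group=<group.id> ...),
# so that Nova places them on distinct hypervisors.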
