An Integrated CyberSecurity Approach for HEP Grids. Workshop Report. http://hpcrd.lbl.gov/hepcybersecurity/



1. Introduction

The CMS and ATLAS experiments at the Large Hadron Collider (LHC), being built at CERN in Switzerland, each involve approximately 2000 physicists from around the world. DOE, as the host of the US CMS and ATLAS Tier 1 centers, is providing a key element of the global support infrastructure for these experiments. The LHC Grid will combine resources from many sites, including several very large compute clusters. Reliable and sustained access to these data, compute and communication resources holds the key to the productivity of the CMS and ATLAS communities.

The challenges posed in protecting and maximizing the utility of this widely distributed ensemble of resources, while providing open access to the community of physicists, are significant. Current experience dictates that we must be able to quickly identify, isolate and react to intentional unacceptable use of any part of the computing infrastructure. The sheer size and prominence of the LHC worldwide Grid attracts attention. The prospect of harnessing its enormous compute power may encourage malicious attacks which, if successful, can turn that power to further mischief.

At a March 2005 workshop in Oakland, CA, participants identified a number of critical areas to be addressed that will build on existing work and provide a coherent program to reduce the risk to the large investment in LHC computing. These fall into four general categories: risk analysis, the ability of a VO to monitor and control its resources, the ability to recover quickly from an incident, and vulnerability analysis of the middleware.

Each and every physicist is expected to be able to access each and every resource controlled by experiment policy and authorization.
By breaking into a single vulnerable system, therefore, an intruder can potentially gain access to many other resources. The program of work takes account of the inherently distributed nature of the problem by putting strong emphasis on coordinated response to and control of an incident, since security at one location can be compromised by events at another location. Last year the San Diego Supercomputer Center was completely offline for an entire week due to a security compromise.

The LHC Grid represents an extremely valuable resource, and our goal is to develop capabilities and procedures that will minimize the impact a security incident will have on the availability and effectiveness of our production infrastructure. In an environment such as the LHC Grid, covering a very diverse set of resources down to individual laptops, we plan on the assumption that some security compromise is inevitable. If a system is compromised, it must be quickly isolated, and recovery needs to be rapid, efficient, and thorough so that cost and latent risk are minimized. Sites must be able to regain control of their resources as quickly as possible and to prevent the compromise from spreading to other sites. This includes the ability to quickly and selectively disable both users and services. If a user credential is known to be compromised, it is important to be able to quickly determine the complete list of resources that were accessed using that credential since it was compromised.

The ability to quickly recover from a security incident has the added value of allowing fast recovery from non-malicious user errors. In fact, user or administrator errors can cause as much damage as a malicious hacker. It is also important to be able to quickly determine when problems are due to unauthorized activities and when they are due to activities triggered by legitimate members of the CMS and ATLAS communities.

To help our planning and the prioritization of the proposed program of work, we define the vulnerabilities to a potential incident as follows:

- Loss of unique data
- Insertion of fraudulent data
- Inability to reestablish control of the computing infrastructure after an incident
- Subversion of system software (loss of integrity)
- Inability to ingest detector output
- Massive coherent failure of the ensemble of resources
- Compromise of key infrastructure
- Pervasive slowdown due to a compromise that could not be removed

We have arrived at the program of work below by considering risk, likelihood, impact and our ability to mitigate and respond. Clearly, responsibility for the defense and continuous operation of the LHC computing systems spans all organizations involved: the experiments, the facility administrators, the middleware and service providers and the end users.
All these players are already closely involved in the planning and execution of tasks to protect the LHC systems, and to provide the end-to-end security and trust infrastructure that allows controlled access to and open use of the systems by the physics communities.

In this document we describe a set of tasks that, delivered as a coherent and managed program of work across the facility security teams, the technology providers and the experiments, will significantly reduce the vulnerability of the LHC computing environment to security incidents. We feel it is crucial to begin this work as soon as possible. We recommend that the community begin working on a set of best practice documents to help build consensus within the HEP community on what the risks and issues are, and what the best solutions are.

2. Goals and Requirements

An experimental collaboration constitutes a Virtual Organization, or VO, and is expected to operate information resources as its infrastructure. The VO has a duty to contribute to the overall security of the shared infrastructure. For example, it is important to ensure that a compromise at a Tier 3 site or on a scientist's laptop does not compromise the entire grid. Only the VO can know what jobs are running and what the current resource utilization should be. The VO is also responsible for detecting and terminating runaway jobs, which may be due to user error or software bugs.
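
As a rough illustration of the runaway-job responsibility described above, the following Python sketch flags jobs that exceed a VO-defined budget. The `Job` record, the field names, the site names and the two thresholds are all assumptions made for this example; the report does not specify how a VO would represent jobs or what limits it would apply.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Job:
    """Hypothetical job record; field names are illustrative only."""
    job_id: str
    user: str
    site: str
    started: datetime
    cpu_hours: float  # CPU time consumed so far


def find_runaway_jobs(jobs, now, max_wall, max_cpu_hours):
    """Return jobs that exceed either the wall-clock or the CPU budget."""
    runaway = []
    for job in jobs:
        wall = now - job.started
        if wall > max_wall or job.cpu_hours > max_cpu_hours:
            runaway.append(job)
    return runaway


# Made-up sample data: three jobs observed at a fixed point in time.
now = datetime(2005, 3, 15, 12, 0)
jobs = [
    Job("j1", "alice", "tier1-a", now - timedelta(hours=2), 1.5),
    Job("j2", "bob", "tier3-b", now - timedelta(hours=80), 10.0),
    Job("j3", "carol", "tier2-c", now - timedelta(hours=5), 500.0),
]

# Assumed VO policy: at most 72 hours of wall time and 200 CPU hours per job.
flagged = find_runaway_jobs(
    jobs, now, max_wall=timedelta(hours=72), max_cpu_hours=200.0
)
print([j.job_id for j in flagged])
```

A real VO would presumably draw the job list from its workload management system and pass flagged jobs to an operator or an automated termination step; that part is deliberately left out here.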

A virtual organization is composed of multiple real organizations, each of which has its own security requirements as well. Security tools and solutions must be designed and deployed in a manner that facilitates exchange of information between organizations and virtual organizations.

1. The impact of a compromised user credential should be restricted to that user's work, and the credential should ideally be short-lived so that its malicious capabilities time out on a manageable time scale. This goes for compromised host credentials as well.
2. The impact of a compromise (root account, etc.) on a resource should be restricted as much as possible to that resource.
3. Higher-risk services should be structured such that the impact and scope of any compromise is minimized.
4. Response to and control of incidents should be tested in a realistic distributed environment.
5. The latency of response to and containment of incidents should be minimized.
6. Usable and timely forensic information should be available to the incident response teams to allow tracing of the source and scope of an incident.
7. Stakeholders (site security, VO administration, etc.) need to collect and review information independently, and have the ability to share and compare their analyses.

3. Program of Work

As noted in the introduction, the critical areas identified at the March 2005 workshop fall into four general categories: risk analysis, the ability of a VO to monitor and control its resources, the ability to recover quickly from an incident, and vulnerability analysis of the middleware.

Item 1: Risk Analysis and Best Practices

It is essential to perform ongoing risk analysis of the LHC computing infrastructure.
This includes analysis of the software stacks, the configurations of resources and services, and the trust relationships between all parties. It also includes closely monitoring new security exploits as they appear. The activity will provide periodic information to guide the program of work and prioritize the focus of the security teams.

Item 2: Security Logging and Auditing Service

The core component of this task is a real-time Security Logging and Auditing Service. This information service would contain as much log data as possible related to a set of Grid jobs, including host syslogs, CA logs, middleware logs, and so on. Some level of logging from firewalls and IDSs would also be very useful, but these logs will likely need to be sanitized before sites would release them. This data will be used to help identify problems and to quickly recover from an incident. It will also be used to help debug authentication and firewall problems (situations where there is currently no useful error message to explain why something did not connect). It would also help provide the audit trail needed for fast recovery after a security incident.

Requirements:

1. Standardize the audit entry formats wherever possible to facilitate subsequent browsing, querying and filtering.
2. Instrument the middleware runtimes to securely log relevant audit information.
3. Provide an integrating and organizing framework to collect many diverse sources of information (e.g. routers, job logs, etc.) to reconstruct the thread of work through the Grid fabric.
4. Make the audit information discoverable and accessible to diverse organizations through common interfaces.
5. Provide real-time collection and analysis of the information to enable timely response.
6. Build in data filtering mechanisms so that we are not overwhelmed by too much log data.
7. Provide the trusted organizations secure access to the distributed audit information.

The tasks required to enable an organization or VO to monitor and control its Grid are:

Security Logging and Auditing Service: Deployment of a scalable and reliable real-time service. Existing solutions such as the EDG logging and bookkeeping service will be evaluated. Tools to integrate existing log files will be developed.

Auditing of all components: We will perform an analysis of what needs to be audited from each component, and work with middleware developers to ensure they are logging the necessary information. This logging will be integrated with the information service.

Resource vulnerability scanning: Organizations and VOs need the ability to scan site Grid resources for vulnerabilities, since small sites may not be doing this, and large sites might miss something. This will help VOs to perform security certification of the Grid resources they are responsible for, and help maximize the utility of their Grid.

IDS / IPS: Intrusion detection systems should be deployed to monitor Grid use and detect unauthorized behavior (whether due to user error, users breaking the rules, or unauthorized use). This data must be integrated into the information service.

Border Control (site and VO): The boundaries of enclaves of trust are places where information is gathered and control may be applied. These borders must be clearly defined, and then protected.

Configuration Verification: Many security mechanisms such as firewalls, VOMS servers, and so on are configured and maintained by hand, and the chance of misconfiguring something is high. Therefore it is critical that the various layers are integrated and that configuration of the system is automated to the extent possible. Mechanisms to check the configuration of each of the layers and to analyze the security of the configured whole are essential.

Item 3: Incident Response and Recovery

The key to incident response is to be able to quickly contain the scope of an incident and to recover. Often it is very difficult to determine the extent of the damage, and what must be done to clean up after an incident. For example, if a user credential is known to be compromised, what is the complete list of resources that were accessed using that credential? Or if a single host at a site has been rootkitted, what other hosts might be compromised as well? If a vulnerability is found in a Grid middleware component, how do we locate all places where that version of the middleware is installed, disable those resources until the vulnerability is fixed, and then patch or upgrade the software on all those resources?

This task includes the following work items:

Incident Response: Incident response typically needs to be coordinated between the local resource, local network, border, virtual organization, and wide-area network, and needs to be automated to the extent possible. Effective incident response requires accurate information and analysis of the attack, which will be provided by the VO information server. Effective incident response also requires coordination between several sites by means of a confidential communication channel. The team of responders must be able to rapidly create a communication channel to respond to incidents. A suite of secure information and communication services, tuned to the needs of security officers and their partners responding to an ongoing incident, is needed.

Forensics: Forensic data from all levels of the system are critical to long-term response (i.e. prosecution) and effective recovery. There are two primary goals: to collect evidence and to minimize recovery effort. This data is often high volume and from a diverse set of sources.
The responders need to be able to analyze the data in real time and determine exactly which hosts have been compromised and the nature of the compromises, in order to contain the attack and narrow the recovery effort.

Security Testing: Tiger teams will be formed to look for vulnerabilities, and response planning will be done. We will also perform two major security drills.

Item 4: Middleware Vulnerability Testing and Analysis

This activity will be responsible for evaluating and enhancing the quality of the middleware from a security perspective. This includes, but is not limited to, vulnerability to attacks, ease of patching, and installation procedures. From a security perspective, the end-to-end quality of a software stack is not determined only by its resistance to malicious attacks. The time it takes to replace a version that has a known vulnerability with a new version that eliminates it also plays an important role in determining the quality of our software stack. Testing and analysis of all middleware is required. External software audits are needed, which could be done as software peer reviews, where middleware developers review each other's architectures.
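
Two of the points above lend themselves to a small worked example: the Item 3 question of locating every installation of a vulnerable middleware version, and the Item 4 observation that time-to-patch is part of software quality. The Python sketch below assumes a simple software inventory; the `Install` record, the component name `gridmw`, the version strings, the host names and the dates are all hypothetical and are not taken from the report.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Install:
    """Hypothetical inventory entry for one component on one host."""
    host: str
    component: str
    version: str
    patched_on: Optional[date] = None  # date the fixed version went in, if any


def affected_hosts(inventory, component, bad_version):
    """Hosts still running the vulnerable version of a component."""
    return sorted(
        i.host for i in inventory
        if i.component == component and i.version == bad_version
    )


def patch_latency_days(inventory, component, disclosed):
    """Days from disclosure to patch, for each host already patched."""
    return {
        i.host: (i.patched_on - disclosed).days
        for i in inventory
        if i.component == component and i.patched_on is not None
    }


# Made-up inventory: version 1.0 of "gridmw" is assumed vulnerable, 1.1 fixed.
inventory = [
    Install("node1.site-a", "gridmw", "1.0"),
    Install("node2.site-a", "gridmw", "1.1", patched_on=date(2005, 3, 20)),
    Install("node1.site-b", "gridmw", "1.0"),
    Install("node2.site-b", "othermw", "1.0"),
]
disclosed = date(2005, 3, 15)

still_vulnerable = affected_hosts(inventory, "gridmw", "1.0")
latency = patch_latency_days(inventory, "gridmw", disclosed)
print(still_vulnerable)
print(latency)
```

The list of still-vulnerable hosts is what a responder would need in order to disable resources until a fix is installed, and the per-host latency is one possible way to track how quickly a fix propagates. How such an inventory would be collected and kept current across sites is outside the scope of this sketch.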

Other Work

This workshop also identified several other areas that we feel are important, but that are much broader than just LHC computing. We hope other projects will be addressing these issues. These include:

Wide-Area Network Monitoring: The wide-area network provides an excellent place to track attack trends, to detect worms and viruses, and to recognize attack patterns. Connection logs and netflow data from the routers or from a network IDS can be used for this. By monitoring key Internet exchange points, one can provide an early warning system for viruses, worms, and attacks, and potentially block an attack before it reaches the end sites. Also, through cooperation with the end sites, an attack manifesting at one site can be blocked from attacking other sites.

Data Integrity: Sources of corruption include user error, hardware error, TCP checksum issues, intentional corruption, and so on.

Authentication / Authorization Issues: Protection against theft of short-term credentials or session keys, and protection against session hijacking. As the revocation of credentials is very expensive from an operational and management perspective, short-lived assertions should be used wherever possible. This would require further development and deployment of credential issuing services, like MyProxy and GridLogon.

Authorized Audit Log Write/Read Access: The audit data is both sensitive and vital for investigations and recovery. Writes to those logs should therefore be integrity protected and authenticated. Furthermore, access to the logs should be subject to access control policy and should allow trusted audit officers to access the logs in other administrative domains to reconstruct the forensic trail through the Grid fabric. This would require a fine-grained access control policy framework integrated with the audit log and collection services.

Disposable Execution Environments: Virtual machine technologies such as Xen and VMware allow the creation of a restricted execution environment that can be destroyed or reloaded. The insulation properties of VM technologies may be able to help confine compromises to a single image and prevent rootkits from taking over complete physical machines. Furthermore, paused or frozen images of an OS with a selected set of installed and configured applications could be used to facilitate security-related updates and patches, and to substantially speed up the recovery process after a detected compromise. These technologies are maturing, and should be evaluated for use in the LHC Grid.

Rootkit detection: Better tools for detection of rootkits are needed.

Best Practices / Community Consensus

It is important to start to build community consensus on the best way to secure sites that are part of the LHC Grid. We recommend that a set of best practices documents on several aspects of Grid security be written. These include the following:

Risk Analysis of the LHC Grid: What are the main risks in terms of likelihood and recovery cost?

Key management: What are the issues involving user and host key management (e.g. caching, revocation)?

Logging and auditing: What components should be included for standard logging and auditing? This would include a detailed report on what we log today and some ideas on how this information can be collected and used. What information should be logged locally, what should be logged centrally, and what data filtering can be done?

Scanning and VO certification: What vulnerabilities can be monitored via scanning, and what checklist of items could a VO use to certify that a given Grid resource meets its security standards?

Integrated IDS: What should the IDSs be looking for, and what information should be exchanged between the sites?

Incident Response: What steps should be taken to contain and recover from an incident?
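
To make the logging and incident response questions above more concrete, here is a minimal Python sketch of a standardized audit entry format and of the query raised earlier in the report: which resources were accessed with a given credential since it was compromised. The one-JSON-object-per-line format, the field names (`time`, `credential`, `resource`, `action`), the credential strings and the host names are all assumptions for illustration; the report does not prescribe a format.

```python
import json
from datetime import datetime

# Hypothetical standardized audit entries, one JSON object per line,
# assumed to be in chronological order.
AUDIT_LOG = """\
{"time": "2005-03-14T09:00:00", "credential": "/DC=org/CN=alice", "resource": "se01.site-a", "action": "read"}
{"time": "2005-03-15T10:30:00", "credential": "/DC=org/CN=alice", "resource": "ce02.site-b", "action": "submit"}
{"time": "2005-03-15T11:00:00", "credential": "/DC=org/CN=bob", "resource": "ce02.site-b", "action": "submit"}
{"time": "2005-03-16T08:15:00", "credential": "/DC=org/CN=alice", "resource": "se03.site-c", "action": "write"}
{"time": "2005-03-16T09:00:00", "credential": "/DC=org/CN=alice", "resource": "ce02.site-b", "action": "submit"}
"""


def resources_touched(log_text, credential, since):
    """Map each resource used by `credential` at or after `since`
    to the time of its first matching entry in log order."""
    seen = {}
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        when = datetime.fromisoformat(entry["time"])
        if entry["credential"] == credential and when >= since:
            seen.setdefault(entry["resource"], when)
    return seen


# Assume the credential is believed compromised from 15 March 2005 onward.
compromised_since = datetime(2005, 3, 15)
scope = resources_touched(AUDIT_LOG, "/DC=org/CN=alice", compromised_since)
for resource in sorted(scope):
    print(resource, scope[resource].isoformat())
```

With entries in a common format, the same filter could be run over logs gathered from many sites. Collecting those logs securely and controlling who may read them are the harder problems the report identifies, and they are not addressed by this sketch.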