Security, Compliance and Sharing of Genomic Data on the Cloud




Angel Pizarro, Scientific and Research Computing, Amazon Web Services
angel@amazon.com
© 2015 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

The Cancer Genome Atlas
20 adult cancers
10,000 tumor pairs

Cancer Genomics Hub
50K genomes at ~$100/year/genome
Houses genomes for all major National Cancer Institute projects (TCGA, TARGET, etc.)
1.5 PB (growing to 2.5 PB in the next year)
Serving over 1 PB of data a month

CGHub Growth: Storage vs. Transfer
>1 PB / month
Requester must have at least that much storage to analyze
Requester must have a lot of compute
Unsustainable

Data held in silos, unshared
No one institute has enough on its own to make progress

Global Alliance for Genomics and Health (GA4GH)
Founded on June 5, 2013
Founding partners: 70+ leading healthcare, research, and disease advocacy organizations
Now: more than 170 members from 40 countries
Mission: to enable rapid progress in biomedicine
Plan:
o Create and maintain the interoperability of technology platform standards for managing and sharing genomic data in clinical samples
o Develop guidelines and harmonizing procedures for privacy and ethics in the international regulatory context
o Engage stakeholders across sectors to encourage the responsible and voluntary sharing of data and of methods

GA4GH API promotes sharing
[Figure: DNA sequence fragments]

Advantages of Storing and Sharing in a Federated, Cloud-based Commons
No single owner:
o Each institute keeps data on either a private or a commercial cloud
o No one entity controls the world's genome data
Globalization: Internet data exchange and containerized computation provide global transaction capability and ubiquitous availability
Reduced cost:
o Hardware and storage disk costs reduced by bulk purchases
o Operational costs reduced by building datacenters in optimal locations and deploying automated systems
Security: Many clouds are already proven for use with sensitive data
Elasticity: In a large cloud, the cost of 1 computer for 1,000 hours is the same as the cost of 1,000 computers for 1 hour (about $1.50/machine/hour). No machines sit idle.
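
The elasticity point is plain arithmetic: cost scales with machine-hours, so trading hours for machines leaves the bill unchanged. A minimal sketch, assuming the slide's approximate rate of $1.50 per machine-hour and no per-launch overhead (the function name is illustrative only):

```python
def cluster_cost(machines: int, hours: float, rate_per_machine_hour: float = 1.50) -> float:
    """Total cost = machines x hours x hourly rate (rate taken from the slide)."""
    return machines * hours * rate_per_machine_hour


one_machine_long = cluster_cost(machines=1, hours=1000)
many_machines_short = cluster_cost(machines=1000, hours=1)

# Both configurations consume 1,000 machine-hours, so they cost the same.
print(one_machine_long, many_machines_short)
```

The practical difference is turnaround time: the second configuration finishes in one hour instead of roughly six weeks.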

Under the Hood at GA4GH
We work together in an open source software development environment on the web: https://github.com/ga4gh
All groups are welcome to participate
Decision making follows protocols developed by the Apache Software Foundation
Leadership is determined by amount of contribution
Simple mantra: collaborate on interface, compete on implementation

Interoperability: One API, Many Apps
Apps: Genome Browser, Command-line Interface, MapReduce Wrapper, Beacons, New Algorithms
Repositories: EBI, NCBI, Google, AWS, Local Repository

GA4GH Driver Project: Beacons to Discover Data
Query: Do you have any genomes with an A at position 100,735 on chromosome 3?
Possible replies:
o YES
o NO
o I can neither confirm nor deny that request

rs6152
A SNP located in the first exon of the androgen receptor (AR) gene on the X chromosome; it is highly indicative of the ability to develop male pattern baldness.

Digging Deeper into the Data
Repository Discovery (Public): Does the data exist?
o Anonymous user
o Query: Do you have any genomes with an A at position 100,735 on chromosome 3?
o Reply: Yes or No
Data Context Queries (Registered): Does the data have the properties I require?
o Registered user
o Shows details of studies with data of interest to user and provides link for requesting full access
Full Data Access (Controlled): Give me the full dataset
o Approved user (signed contract)
o Permits full access to genotype, phenotype and raw DNA sequence
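
The first two access tiers can be sketched in a few lines. This is an illustration only, not the GA4GH Beacon API: the tier names, the in-memory index, the study metadata and the example.org URL are all made up for the example.

```python
from enum import IntEnum


class AccessTier(IntEnum):
    PUBLIC = 1      # anonymous user: existence queries only
    REGISTERED = 2  # registered user: study-level context
    CONTROLLED = 3  # approved user (signed contract): full data


# Made-up toy index: (chromosome, position, allele) -> study id
VARIANT_INDEX = {("3", 100735, "A"): "study-001"}
# Made-up study metadata shown to registered users
STUDY_INFO = {
    "study-001": {
        "title": "Example cohort",
        "access_request_url": "https://example.org/request-access",
    }
}


def beacon_query(chromosome: str, position: int, allele: str) -> bool:
    """Public discovery: reveal only whether a matching allele exists."""
    return (chromosome, position, allele) in VARIANT_INDEX


def study_context(tier: AccessTier, chromosome: str, position: int, allele: str):
    """Registered tier: study details plus a link for requesting full access."""
    if tier < AccessTier.REGISTERED:
        raise PermissionError("registration required for data context queries")
    study_id = VARIANT_INDEX.get((chromosome, position, allele))
    return STUDY_INFO.get(study_id) if study_id else None


print(beacon_query("3", 100735, "A"))
print(beacon_query("3", 100735, "G"))
print(study_context(AccessTier.REGISTERED, "3", 100735, "A")["title"])
```

The public answer is a bare yes or no; anything that describes the study sits behind the registered tier, and the data itself would sit behind the controlled tier.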

GA4GH Big Data Commons Model
The Web is based on three interlocking standards. The Global Alliance APIs define equivalents.
name: which object do I want?
o Web: URLs are universal identifiers for web objects.
o Global Alliance APIs: The GA4GH APIs define content digests that uniquely identify genomic and other objects.
protocol: how do I get it?
o Web: HTTP is the standard API that any web browser can use to request information from any web server.
o Global Alliance APIs: The GA4GH APIs define methods for requesting data. For example, a genome browser can use the searchReads method to request genomic reads, and get back a SearchReadsResponse, whether it's talking to a genome server at NCBI, EBI, or Google.
content: what does it mean?
o Web: HTML is the single most important content type on the web, connecting data like text, images, videos, and PDFs.
o Global Alliance APIs: The GA4GH APIs define logical schemas for every object, such as ReadGroup and Variant.

Data Content Digests as Identifiers
Cryptographic one-way hash function used
Data dependent, not format dependent
Unique to a data set
Unforgeable and verifiable
Privacy preserving
Decentralized
A file and a database with the same data encoded within have the same digest.
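
One way to make a digest depend on the data rather than on its storage format is to hash a canonical serialization. The sketch below is an assumption for illustration, not the GA4GH digest specification: SHA-256, the JSON canonical form and the field names are all choices made here.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Sorted keys and fixed separators: equal content gives equal bytes."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def content_digest(record: dict) -> str:
    """One-way hash of the canonical form (SHA-256 chosen for the example)."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


# The same variant, once parsed from a tab-separated file line...
chrom, pos, allele = "3\t100735\tA".split("\t")
from_file = {"chromosome": chrom, "position": int(pos), "allele": allele}
# ...and once as it might come back from a database row (different field order).
from_database = {"allele": "A", "position": 100735, "chromosome": "3"}

print(content_digest(from_file) == content_digest(from_database))
```

Anyone holding the data can recompute and verify the digest without a central registry, which is what makes the identifier decentralized.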

Methodologies for Data Content Digests
Hashing allows
o decentralized generation, privacy-preserving digests, unforgeable, unique
A digest is computed on the data itself, not the storage format
o serves as a distributed Rosetta Stone between repositories, formats, and accessions
o unique to each specific version of data, used to establish provenance
At a base level, digests are computed on coarse-grained, immutable data objects (if data are changed, a new digest is created)
Digests are composable to create hierarchies for larger and larger sets of objects without needing to create additional copies of base-level objects

Example Digest: SAM/BAM
Record-level digests
File or data set digest
At a dataset level, an order-independent digest of each read and alignment would identify the dataset.
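
An order-independent dataset digest can be built by hashing the sorted record-level digests. This is one possible construction, assumed here for illustration (the slides do not specify the scheme); the reads are made-up tab-separated strings rather than real SAM records.

```python
import hashlib


def record_digest(record: str) -> bytes:
    """Record-level digest of a single read/alignment line."""
    return hashlib.sha256(record.encode("utf-8")).digest()


def set_digest(member_digests) -> str:
    """Order-independent digest: hash the member digests in sorted order."""
    combined = hashlib.sha256()
    for digest in sorted(member_digests):
        combined.update(digest)
    return combined.hexdigest()


reads = [
    "read1\tchr3\t100700\tACGT",
    "read2\tchr3\t100720\tGGCA",
    "read3\tchrX\t5000\tTTAG",
]

forward = set_digest(record_digest(r) for r in reads)
shuffled = set_digest(record_digest(r) for r in reversed(reads))
print(forward == shuffled)
```

Because `set_digest` accepts any collection of digests, the same function can combine dataset digests into a digest for a larger collection without copying the underlying records.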

Putting it All Together
Compute Environment
The compute environments support running arbitrary analysis or processing pipelines. The only requirement for pipeline developers is that they use the GA4GH API to access data, so code running in any compute environment can work with data hosted in any data environment.
GA4GH Data Environment
Data repositories are responsible for:
o storing data, using whatever internal representation they deem appropriate
o managing globally unique data content digests, so that anyone anywhere can unambiguously refer to a particular data object (e.g. "all the reads from a given sample")
o keeping track of authorization: single access control list for who is allowed to see what data
All data is exposed to the outside world via the GA4GH API.
o users of the data don't care about the internals; they just have to know how to call the API.
[Diagram: compute environment (arbitrary pipeline) -> GA4GH API -> GA4GH data environment (server, source-of-truth data): 1) curate data (reads, calls, ...), 2) globally unique IDs/digests, 3) authz (who may see what)]
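
The split between data environment and compute environment can be sketched as a toy repository with a single read method. Everything here is hypothetical (class, method and user names); a real repository would expose the GA4GH API over the network, and hashing raw bytes is a simplification of a format-independent digest.

```python
import hashlib


class DataEnvironment:
    """Toy data repository: curated objects, digest IDs, one access control list."""

    def __init__(self):
        self._objects = {}  # digest -> payload; internal representation stays private
        self._acl = {}      # digest -> set of user ids allowed to read it

    def curate(self, payload: bytes, allowed_users) -> str:
        """Store a data object and return its content digest as the global ID."""
        digest = hashlib.sha256(payload).hexdigest()
        self._objects[digest] = payload
        self._acl[digest] = set(allowed_users)
        return digest

    def read(self, user: str, digest: str) -> bytes:
        """The only way out: an API call that checks authorization first."""
        if user not in self._acl.get(digest, set()):
            raise PermissionError(f"{user} may not read {digest[:12]}")
        return self._objects[digest]


def gc_fraction(env: DataEnvironment, user: str, digest: str) -> float:
    """An 'arbitrary pipeline' that touches data only through the API."""
    bases = env.read(user, digest)
    return sum(bases.count(b) for b in (b"G", b"C")) / len(bases)


env = DataEnvironment()
sample_id = env.curate(b"GATTTATCTGCTCTCGTTG", allowed_users={"alice"})
print(sample_id[:12], gc_fraction(env, "alice", sample_id))
```

The pipeline never learns how the repository stores its data, so the same `gc_fraction` code would work against any environment that offers the same `read` call.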

Security and Compliance
Shared responsibility
Emphasis on encryption
Authentication and authorization at the data object level
http://genomicsandhealth.org/files/public/%28october31-2014%29draft_securityframework-2.pdf

GA4GH Working Groups
Genome Data Working Group (co-led by David Haussler and Richard Durbin)
Clinical Data Working Group (co-led by Charles Sawyers and Kathryn North)
Security Working Group (co-led by Paul Flicek and Dixie Baker)
Regulatory and Ethics Working Group (co-led by Bartha Knoppers, Kazuto Kato and Partha Majumder)

GA4GH Driving Projects
Currently active:
1. Beacon Project (Steve Sherry, Marc Fiume): public querying of simple top-level information about genetic variants
2. Genomic Matchmaker (Heidi Rehm, Anthony Philippakis): expert sharing of rare genetic variants
3. BRCA Challenge (Gunnar Rätsch): aggregation of research and clinical data to assess pathogenicity and penetrance of all variants in BRCA1 and BRCA2

Ultimate Goal
[Diagram: a learning system centered on the patient: select patient treatment; assess clinical outcomes; compare treatment effectiveness; implement new evidence for treatment prioritization]

Amazon Web Services

[Diagram: AWS service stack]
Enterprise Applications: Virtual Desktops; Collaboration and Sharing
Platform Services:
o Databases: Relational, NoSQL, Caching
o Analytics: Hadoop, Real-time, Data Warehouse, Data Workflows
o App Services: Queuing, Orchestration, App Streaming, Transcoding, Email, Search
o Deployment & Management: Containers, Dev/ops Tools, Resource Templates, Usage Tracking, Monitoring and Logs
o Mobile Services: Identity, Sync, Mobile Analytics, Notifications
Foundation Services: Compute (VMs, Auto-scaling and Load Balancing); Storage (Object, Block and Archive); Security & Access Control; Networking
Infrastructure: Regions; Availability Zones; CDN and Points of Presence

Global Footprint
Every day, AWS adds enough new server capacity to support Amazon.com when it was a $7 billion global enterprise.
Over 1 million active customers across 190 countries
800+ government agencies
3,000+ educational institutions
11 regions
28 availability zones
52 edge locations

AWS Regions and Availability Zones
Customer decides where applications and data reside
[Diagram: US East (VA) region with Availability Zones A, B and C]
Note: Conceptual drawing only. The number of Availability Zones may vary.

Full stack sequence analysis platform
[Diagram: Storage and Compute (sequence data, upstream analysis); Databases (variants, expression, phenotypes); Analytics (data mining)]

1+ Million Cancer Genome Data Warehouse

Security is a Shared Responsibility

Shared responsibility model
Customer (customers implement their own set of controls):
o Customer Data
o Platform, Applications, Identity & Access Management
o Operating System, Network & Firewall Configuration
o Client-side Data Encryption & Data Integrity Authentication
o Server-side Encryption (File System and/or Data)
o Network Traffic Protection (Encryption/Integrity/Identity)
o Multiple customers with FISMA Low and Moderate ATOs
Amazon:
o Foundation Services: Compute, Storage, Database, Networking
o AWS Global Infrastructure: Regions, Availability Zones, Edge Locations
o SOC 1/SSAE 16/ISAE 3402
o SOC 2
o ISO 27001/2 Certification
o Payment Card Industry (PCI) Data Security Standard (DSS)
o NIST Compliant Controls
o DoD Compliant Controls
o FedRAMP
o HIPAA and ITAR Compliant

Shared responsibility model
AWS:
o Facilities
o Physical security
o Compute infrastructure
o Storage infrastructure
o Network infrastructure
o Virtualization layer (EC2)
o Hardened service endpoints
o Rich IAM capabilities
+ Customer/Partner:
o Network configuration
o Security groups
o OS firewalls
o Operating systems
o Applications
o Proper service configuration
o Auth & acct management
o Authorization policies
= Re-focus your security professionals on a subset of the problem
Take advantage of high levels of uniformity and automation
First global public cloud provider to achieve certification for security & quality management system

NIH dbGaP security best practices
Physical security
o Data center access and remote administrator access
Electronic security
o User account security (for example, passwords)
o Use of Access Control Lists (ACLs)
o Secure networking
o Encryption of data in transit and at rest
o OS and software patching
Data access security
o Authorization of access to data
o Tracking copies; cleaning up after use

Amazon Virtual Private Cloud (VPC)
Create secure network configurations for working with sensitive data
VPC network isolation
[Diagram: AWS region containing VPC 10.0.0.0/16 across AZ A and AZ B; subnet 10.0.1.0/24 (DMZ) with EC2 instance 10.0.1.11; subnet 10.0.2.0/24 (private) with EC2 instance 10.0.2.12; Internet gateway with service address 23.20.103.11; virtual gateway]
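
The address plan in the diagram can be sanity-checked with Python's standard `ipaddress` module. This sketch uses only the CIDR ranges and addresses shown on the slide; the assignment of each instance to a subnet is inferred from its address, and the helper name is illustrative.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = {
    "dmz": ipaddress.ip_network("10.0.1.0/24"),
    "private": ipaddress.ip_network("10.0.2.0/24"),
}
# Instance address -> subnet it is expected to live in (inferred from the diagram)
instances = {"10.0.1.11": "dmz", "10.0.2.12": "private"}


def check_layout():
    """Return a list of problems found in the address plan (empty if consistent)."""
    problems = []
    for name, subnet in subnets.items():
        if not subnet.subnet_of(vpc):
            problems.append(f"subnet {name} is outside the VPC range")
    if subnets["dmz"].overlaps(subnets["private"]):
        problems.append("subnets overlap")
    for address, name in instances.items():
        if ipaddress.ip_address(address) not in subnets[name]:
            problems.append(f"{address} is not in subnet {name}")
    return problems


print(check_layout())
# The service address on the Internet gateway is publicly routable,
# unlike the 10.x addresses inside the VPC.
print(ipaddress.ip_address("23.20.103.11").is_private)
```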

AWS Identity and Access Management (IAM)
Create and manage users in the AWS services
Identity federation with Active Directory
Control passwords, access keys, and multi-factor authentication (MFA) devices
o Hardware or software (OAuth)
Fine-grained permissions
Very familiar security model
o Users, groups, permissions
Integrated into the following:
o AWS Management Console
o AWS API access
o IAM resource-based policies: Amazon S3, Amazon SQS, Amazon SNS
[Diagram: root account with roles and groups (Administrators, Developers, Applications) containing example users and applications (Jim, Shandra, Alyson, Xiao, Susan, Anand, Reporting, Console, Tomcat); multi-factor authentication; AWS system entitlements]

Segregate duties between roles with IAM
You get to choose who can do what in your AWS environment and from where
[Diagram: AWS account owner (master) delegating to Network & security, Researcher and Operations roles, which manage and operate an EMR cluster in a VPC in US East across AZs A and B, with Internet gateway, service Elastic IPs (EIP) and virtual gateway]
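
As an illustration of "who can do what", a read-only policy for the Researcher role might look like the sketch below. It follows the standard IAM JSON policy grammar, but the bucket name is hypothetical and the choice of actions and the MFA condition are assumptions made for the example, not taken from the slides.

```python
import json

BUCKET = "example-genomics-bucket"  # hypothetical bucket name

# Require that the caller authenticated with MFA (assumption for this example)
MFA_REQUIRED = {"Bool": {"aws:MultiFactorAuthPresent": "true"}}

researcher_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListStudyBucket",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",      # bucket-level action
            "Condition": MFA_REQUIRED,
        },
        {
            "Sid": "ReadStudyObjects",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",    # object-level action
            "Condition": MFA_REQUIRED,
        },
    ],
}

print(json.dumps(researcher_policy, indent=2))
```

Anything not explicitly allowed is denied by default, so this role can read study data but cannot write objects, change networking, or manage other users; those duties would belong to separate roles.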

Data encryption in transit and at rest
Amazon S3
o HTTPS
o AES-256 server-side encryption
o AWS or customer managed keys
o Each object gets its own key
Amazon EBS
o End-to-end secure network traffic
o Whole volume encryption
o AWS or customer managed keys
o Encrypted incremental snapshots
o Minimal performance overhead (utilizes Intel AES-NI)
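
For Amazon S3, both requirements can be enforced with a bucket policy that denies non-HTTPS requests and uploads that do not ask for AES-256 server-side encryption. A minimal sketch, assuming a hypothetical bucket name; the helper function is illustrative, while the condition keys (`aws:SecureTransport`, `s3:x-amz-server-side-encryption`) are standard policy keys.

```python
import json


def build_encryption_policy(bucket: str) -> dict:
    """Bucket policy: HTTPS only, and uploads must request SSE with AES-256."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                # Matches requests that did not arrive over HTTPS
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                # Matches uploads whose encryption header is not AES256
                "Condition": {
                    "StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}
                },
            },
        ],
    }


policy = build_encryption_policy("example-genomics-bucket")  # hypothetical name
print(json.dumps(policy, indent=2))
```

Explicit Deny statements override any Allow, so a misconfigured client fails loudly instead of silently storing unencrypted data.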

Use AWS CloudTrail to track access to APIs and IAM
Records API calls, no matter how those API calls were made (console, SDK, CLI)
Who did what and when and from what IP address
Logs saved to Amazon S3
Includes EC2, Amazon EBS, VPC, Amazon RDS, IAM, AWS STS, and Amazon Redshift
Be notified of log file delivery by using the Amazon Simple Notification Service (SNS)
Aggregate log information across services into a single S3 bucket
Out-of-the-box integration with log analysis tools from AWS partners including Splunk, AlertLogic, and SumoLogic
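
CloudTrail delivers log files as JSON with a top-level `Records` array, so "who did what, when, and from what IP address" can be pulled out with a few lines of code. The sample events below are made up for illustration (documentation IP ranges and a placeholder account ID), and only a handful of the available fields are read.

```python
import json

# Made-up sample in the shape of a CloudTrail log file
SAMPLE_LOG = """
{"Records": [
  {"eventTime": "2015-03-02T14:05:11Z", "eventSource": "iam.amazonaws.com",
   "eventName": "CreateUser", "sourceIPAddress": "203.0.113.10",
   "userIdentity": {"type": "IAMUser",
                    "arn": "arn:aws:iam::123456789012:user/admin1"}},
  {"eventTime": "2015-03-02T13:59:40Z", "eventSource": "ec2.amazonaws.com",
   "eventName": "RunInstances", "sourceIPAddress": "198.51.100.7",
   "userIdentity": {"type": "IAMUser",
                    "arn": "arn:aws:iam::123456789012:user/ops1"}}
]}
"""


def summarize(log: dict):
    """Return (when, who, what, source IP) tuples in chronological order."""
    rows = []
    for event in log.get("Records", []):
        who = event.get("userIdentity", {}).get("arn", "unknown")
        what = f'{event["eventSource"]}:{event["eventName"]}'
        rows.append((event["eventTime"], who, what, event.get("sourceIPAddress")))
    return sorted(rows)  # ISO 8601 UTC timestamps sort chronologically as text


for when, who, what, source_ip in summarize(json.loads(SAMPLE_LOG)):
    print(when, who, what, source_ip)
```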

AWS Config
AWS Config is a fully managed service that provides you with an inventory of your AWS resources, lets you audit the resource configuration history, and notifies you of resource configuration changes.
[Diagram: changing resources -> AWS Config recording -> continuous change history, stream, and snapshot (ex. 2014-11-05)]

NIH dbGaP security best practices, with the AWS services that address them
Physical security (AWS Cloud)
o Data center access and remote administrator access
Electronic security (IAM, Virtual Private Cloud)
o User account security (for example, passwords)
o Use of Access Control Lists (ACLs)
o Secure networking
o Encryption of data in transit and at rest
o OS and software patching
Data access security (Amazon S3, Amazon EBS, CloudTrail + Config)
o Authorization of access to data
o Tracking copies; cleaning up after use

HIPAA: Protected Health Information on the Cloud

HIPAA Security and Compliance on AWS
We sign Business Associate Agreements
o As do other cloud providers
All data (including PHI) is under the complete control of the customer
o Encrypt data
o Lock down resources and applications
o Monitor anything and everything

Thank you!
Architecting for Genomic Data Security and Compliance in AWS
http://bit.ly/aws-dbgap
Creating Healthcare Data Applications to Promote HIPAA and HITECH Compliance
http://bit.ly/aws-hipaa
http://bit.ly/aws-hipaa-faq