
Work Package 13.5: Report summarising the technical feasibility of the European Genotype Archive to collect, store, and use genotype data stored in European biobanks in a manner that complies with all applicable medical and privacy regulations

Authors: Paul Flicek and Ilkka Lappalainen

1. Introduction

The European Genome-phenome Archive (EGA; http://www.ebi.ac.uk/ega) promotes the sharing of all types of potentially identifiable genetic and phenotypic data consented for research use but not for full open public release. This report presents the details of the EGA design and implementation and demonstrates both the technical feasibility and the effectiveness of the EGA for its stated purpose. As a working and active project of the European Bioinformatics Institute (EBI), the EGA provides a critical service in the ELIXIR infrastructure for European biobanks and other researchers.

2. EGA Infrastructure

The EGA infrastructure has been designed to provide a secure and scalable system for the archiving and dissemination of such data. The EGA security policy includes the development of a safe computing facility within EBI and a comprehensive suite of protocols for information management. All implemented protocols are consistent with the European Union Data Protection Directive (95/46/EC) and are subject to regular independent audit. Archived data must follow the EGA data access policy model, whereby data access decisions are made by a data access-granting organisation (DAO) and not by the EGA. The EGA project provides data management and distribution services for users of the database.

2.1 Overview of the EGA data model

The EGA data model is modularized to provide high security and optimal performance for large data archiving. The model stores information on a subject, on any sample that may have been derived from the subject, and on the phenotypic and genetic variations acquired from a particular sample in separate relational databases. The databases are located in a secure area accessible only to the EGA team. The raw data supporting the variant calls are stored outside these databases in an encrypted format. The links between the data points stored in the databases are made using abstract EGA identifiers.

The EGA application programming interface (API) has been developed to provide unified and transparent tools for archiving and distributing data. The API gives neither authorized users nor members of the public direct access to the databases. Access is instead provided to data files that have been created from the archive based on the agreement with the corresponding DAO. A dataset constitutes a single unit of released data that is governed by a single data access policy. The EGA provides a user with access to a dataset once the governing DAO has granted authorization.

2.2 The EGA data model for phenotypic information

A sample object is created for each submitted sample, and the provided phenotype variables are associated with it as key-value pairs. The EGA supports archiving of longitudinal sample types by linking samples to a particular subject once this information has been made available. It is also possible to hold this key outside of the EGA or to authorize access to the subject-sample mapping separately from the data. The EGA does not have an internal process to harmonize phenotype data on samples across different submissions. The phenotype data are distributed as they were submitted into our system. The submission of phenotype data using a standardized ontology vocabulary is encouraged.
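
As an illustration of the separation described in sections 2.1 and 2.2, the following minimal Python sketch stores phenotype variables as key-value pairs on sample objects and keeps the subject-sample mapping in a separate store, linked only by abstract identifiers. The class name, method names and identifier prefixes are hypothetical and chosen for illustration; they do not reflect the actual EGA schema or API.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List


@dataclass
class Sample:
    """One submitted sample with its phenotype variables as key-value pairs."""
    sample_id: str
    phenotypes: Dict[str, str] = field(default_factory=dict)


class ArchiveSketch:
    """Toy stand-in for the separate stores; not the real EGA schema."""

    def __init__(self) -> None:
        self._ids = count(1)
        self.samples: Dict[str, Sample] = {}   # sample store
        self.subject_of: Dict[str, str] = {}   # subject-sample mapping, kept apart

    def _new_id(self, prefix: str) -> str:
        # Abstract identifier that carries no information about the subject.
        return f"{prefix}{next(self._ids):06d}"

    def add_sample(self, phenotypes: Dict[str, str]) -> str:
        # Phenotype values are stored exactly as submitted (no harmonization).
        sample = Sample(self._new_id("SAMPLE"), dict(phenotypes))
        self.samples[sample.sample_id] = sample
        return sample.sample_id

    def new_subject(self) -> str:
        return self._new_id("SUBJECT")

    def link(self, sample_id: str, subject_id: str) -> None:
        # The link can be added later, once the information is available.
        if sample_id not in self.samples:
            raise KeyError(sample_id)
        self.subject_of[sample_id] = subject_id

    def samples_for(self, subject_id: str) -> List[str]:
        # Longitudinal view: all samples linked to one subject.
        return sorted(s for s, subj in self.subject_of.items() if subj == subject_id)


archive = ArchiveSketch()
baseline = archive.add_sample({"sex": "female", "visit": "baseline"})
followup = archive.add_sample({"sex": "female", "visit": "year 2"})
subject = archive.new_subject()
archive.link(baseline, subject)
archive.link(followup, subject)
print(archive.samples_for(subject))
```

Keeping `subject_of` apart from `samples` mirrors the option of holding the subject-sample key outside the archive or authorizing access to it separately from the data.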
The EGA supports submission updates.

2.3 The EGA data model for genetic data

The accepted data types include manufacturer-specific raw data formats from array-based genotyping and raw DNA sequence data arising from resequencing, transcriptomics or other assays. The raw data files, such as the information for each probe in an array-based experiment or the raw reads from a next-generation sequencing experiment, are encrypted and archived into a file repository that shares the same design as that used for the European Nucleotide Archive (ENA) [1]. Only the EGA team members have access to the archival encryption key.

The variant types called from the raw data are stored in relational databases optimized for each data type. The schema requirements for genotypes are very different from those of structural variations. The EGA API facilitates the storage and retrieval of variants, together with any associated information recorded during the calling process, such as intensity values or quality scores. It is possible to archive variants called with a number of different algorithms for the same experiment. The API also allows us to merge genotype data acquired with different technologies, phase submitted data or impute unobserved genotypes using public reference panels such as those being developed in the 1000 Genomes Project. The EGA archives any submitted summary-level statistical analysis. Results of the quality analysis connected to the submission are also stored, but without altering the original data.

2.4 Feasibility of the EGA data model

In summary, the EGA data model allows for the storage of phenotypic and genetic information for samples in physically separate locations. Security is therefore controlled specifically for each stored data type without compromising access across the data archived for a particular subject, and the design provides a scalable system that is able to respond to future storage, analysis and distribution requirements.

[1] Leinonen et al., Nucleic Acids Research 2010

3. How users interact with the EGA

The EGA supports both the submission of data from individual researchers or research groups in support of publications and prepublication data release for large-scale community resource projects, as recommended by the Toronto workshop [2].

3.1 Data submissions to the EGA

All data files must be encrypted prior to their upload to a dedicated submission account. The EGA only accepts the encryption keys using an out-of-band method such as telephone, postal mail or a courier. The EGA also provides a public key that allows secure encryption for the data submissions. In addition to data files, the EGA requires each submission to provide accurate information on the experimental and analytical methods used in the study. Each submission must also include DAO contact details, applied policy information and a certification of the authority to submit the data to the EGA for archiving and dissemination on behalf of the submitting organisation.

The EGA accepts information in pre-defined submission formats, such as the Excel-sheet-based MAGE-TAB format or XML. The EGA submission website [3] includes the most recent documentation for the submission process and provides examples of the data submission formats. The experienced EGA help-desk [4] also provides additional help during the submission. Once the data have been submitted to us, the EGA team members work together with the submitter to make sure that the data are correctly presented in our system. The release of any data from the EGA requires DAO authorization.

3.2 Data release from the EGA

Submissions to the EGA must be consistent with national laws and regulations. The archived data are required to follow the EGA data access policy model, whereby the data access decisions are made by a data access-granting organisation (DAO) and not by the EGA.
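
A minimal Python sketch of this policy model follows: each dataset is governed by exactly one DAO, only that DAO can add or remove a grant, and the archive releases a dataset to an account only if such a grant exists. The `AccessRegistry` class, its methods and the example identifiers are hypothetical and are not the EGA API.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class AccessRegistry:
    """Toy registry: the archive stores grants but never decides them."""
    dataset_dao: Dict[str, str] = field(default_factory=dict)    # dataset -> governing DAO
    grants: Set[Tuple[str, str]] = field(default_factory=set)    # (dataset, account)

    def register_dataset(self, dataset_id: str, dao_id: str) -> None:
        # A dataset is a single unit governed by a single data access policy.
        self.dataset_dao[dataset_id] = dao_id

    def _require_governing_dao(self, acting_dao: str, dataset_id: str) -> None:
        if self.dataset_dao.get(dataset_id) != acting_dao:
            raise PermissionError(f"{acting_dao} does not govern {dataset_id}")

    def grant(self, acting_dao: str, dataset_id: str, account: str) -> None:
        self._require_governing_dao(acting_dao, dataset_id)
        self.grants.add((dataset_id, account))

    def revoke(self, acting_dao: str, dataset_id: str, account: str) -> None:
        self._require_governing_dao(acting_dao, dataset_id)
        self.grants.discard((dataset_id, account))

    def may_release(self, dataset_id: str, account: str) -> bool:
        # Release is allowed only when the governing DAO has recorded a grant.
        return (dataset_id, account) in self.grants


registry = AccessRegistry()
registry.register_dataset("DATASET-A", "DAO-1")
registry.grant("DAO-1", "DATASET-A", "researcher@example.org")
print(registry.may_release("DATASET-A", "researcher@example.org"))
```

The design choice illustrated here is that the release check consults only decisions recorded by the governing DAO, so the archive itself holds no authority to approve access.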
The DAO may be the same organisation that approved and monitored the initial study protocol or a designate of this approving organisation, such as a dedicated data access committee (DAC). Access to the data must be granted in a timely fashion to all bona fide researchers whose use of the data is consistent with the original consent agreements. The data access agreement that dictates how data must be stored, transferred or analysed is made directly between the applicant and the corresponding DAO.

The EGA associates the data access rights with personal accounts within our system. All account actions are logged in our audit system. The EGA project provides data management tools that allow the DAO to directly add, remove or summarize permissions for those EGA accounts that are linked to the data they have a mandate to govern. These tools also show the full audit trail, which includes when a particular data access was added to an EGA account and who performed this action. The EGA supports workflows that require multiple authorizations, which can include administrators from several organisations. These complex workflows have been implemented to allow strict checks of the applicants prior to data access authorization. The EGA user management tools are integrated into our website and can be linked to an account by DAO authorization. The EGA provides full documentation regarding the use of these tools, and further training for data access management is available upon request.

[2] Toronto International Data Release Workshop Authors, Nature 2009
[3] http://www.ebi.ac.uk/ega/page.php?page=data_submission
[4] Email: ega-helpdesk@ebi.ac.uk

4. Response to changing scientific environment

Since the launch of the service, scientific developments have impacted the EGA operations significantly, and the data models and procedures have been robust to these changes. For example, next-generation sequencing has changed the size and type of genetic information collected from the samples and submitted to our system, resulting in developments to the data model and infrastructure, as these data are generally larger per sample than array-based genotype data.
Additionally, in the fall of 2008, a publication [5] described computational methods to predict whether a given individual had participated in a particular research project using summary-level data. The EGA, together with the DAOs, responded to this publication by removing all public access to data that could lead to the identification of the research participants. These data are now made available to users that either have been granted access only to the summary data or have access to the individual genotypes and hence would be able to produce the same results. The EGA is able to provide summary data from studies archived in our system to other EBI resources or to other ESFRI projects should the policy change in the future.

[5] Homer N et al., PLoS Genet 4(8): e1000167

Summary

The EGA service has been used successfully since April 2008. It currently manages genetic and phenotypic information for approximately 80 000 samples listed in more than 40 different studies and serves 1700 authorized data users worldwide. These data include samples that are stored in European biobanks such as the UK DNA Banking Network. The EGA provides appropriate security and has built a robust and scalable infrastructure that is responsive to changes in the science and regulatory environment. Taken as a whole, it is clear that the operation of the EGA as a service is technically feasible and that the EGA can be used as a critical tool for ELIXIR and European ESFRI projects.