Work Package 13.5: Report summarising the technical feasibility of the European Genotype Archive to collect, store, and use genotype data stored in European biobanks in a manner that complies with all applicable medical and privacy regulations

Authors: Paul Flicek and Ilkka Lappalainen

1. Introduction

The European Genome-phenome Archive (EGA; http://www.ebi.ac.uk/ega) promotes the sharing of all types of potentially identifiable genetic and phenotypic data consented for research use but not for full open public release. This report presents the details of the EGA design and implementation and demonstrates both the technical feasibility and the effectiveness of the EGA for its stated purpose. As a demonstrably working and active project of the European Bioinformatics Institute (EBI), the EGA provides a critical service in the ELIXIR infrastructure for European biobanks and other researchers.

2. EGA Infrastructure

The EGA infrastructure has been designed to provide a secure and scalable system for the archiving and dissemination of such data. The EGA security policy includes the development of a safe computing facility within EBI and a comprehensive suite of protocols for information management. All implemented protocols are consistent with the European Union Data Protection Directive (95/46/EC) and are subject to regular independent audit. The archived data are required to follow the EGA data access policy model, whereby data access decisions are made by a data access-granting organisation (DAO) and not by the EGA. The EGA project provides data management and distribution services for users of the database.

2.1 Overview of the EGA data model

The EGA data model is modularized to provide high security and optimal performance for large data archiving. The data model stores information on a subject, on any sample derived from that subject, and on the phenotypic and genetic variations acquired from a particular sample in separate relational databases. The databases are located in a secure area accessible only to the EGA team. The raw data supporting the variant calls are stored outside these databases in an encrypted format. The links between the data points stored in the databases are made using abstract EGA identifiers.

The EGA application programming interface (API) has been developed to provide unified and transparent tools for archiving and distributing data. Neither the databases themselves nor direct access to them is provided through the API, whether to authorized users or to members of the public. Instead, access is provided to data files created from the archive in accordance with the agreement with the corresponding DAO. A dataset constitutes a single unit of released data that is governed by a single data access policy. The EGA provides access to a dataset once the governing DAO has granted authorization to the user.

2.2 The EGA data model for phenotypic information

A sample object is created for each submitted sample, and the provided phenotype variables are associated with it as key-value pairs. The EGA supports archiving of longitudinal sample types by linking samples to a particular subject once this information has been made available. It is also possible to hold this key outside of the EGA or to authorize access to the subject-sample mapping separately from the data. The EGA does not have an internal process to harmonize phenotype data on samples across different submissions. The phenotype data are distributed as they were submitted into our system. The submission of phenotype data using a standardized ontology vocabulary is encouraged.
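The sample-centred model described above can be illustrated with a minimal sketch. It is not the EGA schema: the class name, field names and identifier formats below are hypothetical, chosen only to show phenotype variables held as key-value pairs and samples linked to a subject once the mapping becomes available.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Sample:
    """A submitted sample; phenotype variables are stored as key-value pairs."""
    sample_id: str                                   # abstract identifier (hypothetical format)
    phenotypes: Dict[str, str] = field(default_factory=dict)
    subject_id: Optional[str] = None                 # unknown until the mapping is provided


def link_to_subject(sample: Sample, subject_id: str) -> None:
    """Attach a sample to a subject after the subject-sample mapping is supplied."""
    sample.subject_id = subject_id


def samples_for_subject(samples: List[Sample], subject_id: str) -> List[Sample]:
    """Collect the (possibly longitudinal) samples that share one subject."""
    return [s for s in samples if s.subject_id == subject_id]


# Phenotype values are kept exactly as submitted; no harmonization is applied.
baseline = Sample("SAMPLE-0001", {"sex": "female", "age_at_collection": "42"})
follow_up = Sample("SAMPLE-0002", {"sex": "female", "age_at_collection": "47"})
unlinked = Sample("SAMPLE-0003", {"sex": "male"})

link_to_subject(baseline, "SUBJECT-0001")
link_to_subject(follow_up, "SUBJECT-0001")

linked = samples_for_subject([baseline, follow_up, unlinked], "SUBJECT-0001")
print([s.sample_id for s in linked])
```

Keeping the subject link optional mirrors the point that the subject-sample key may arrive later or be held outside the archive altogether.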
The EGA supports submission updates.

2.3 The EGA data model for genetic data

The accepted data types include manufacturer-specific raw data formats from array-based genotyping and raw DNA sequence data arising from resequencing, transcriptomics or other assays. The raw data files, such as the information for each probe in an array-based experiment or the raw reads from a next-generation sequencing experiment, are encrypted and archived in a file repository that shares the same design as that used for the European Nucleotide Archive (ENA) [1]. Only the EGA team members have access to the archival encryption key.

The variant types called from the raw data are stored in relational databases optimized for each data type. The schema requirements for genotypes are very different from those for structural variations. The EGA API facilitates the storage and retrieval of variants, together with any associated information recorded during the calling process, such as intensity values or quality scores. It is possible to archive variants called with a number of different algorithms for the same experiment. The API also allows us to merge genotype data acquired with different technologies, to phase submitted data, or to impute unobserved genotypes using public reference panels such as those being developed in the 1000 Genomes Project. The EGA archives any submitted summary-level statistical analysis. Results of the quality analysis connected to the submission are also stored, but without altering the original data.

2.4 Feasibility of the EGA data model

In summary, the EGA data model allows for the storage of phenotypic and genetic information for samples in physically separate locations. Security is therefore controlled specifically for each stored data type without compromising access across the data archived for a particular subject, and the model provides a scalable system that is able to respond to future storage, analysis and distribution requirements.

[1] Leinonen et al., Nucleic Acids Research 2010

3. How users interact with the EGA

The EGA supports both the submission of data from individual researchers or research groups in support of publications and prepublication data release for large-scale community resource projects, as recommended by the Toronto workshop [2].

3.1 Data submissions to the EGA

All data files must be encrypted prior to their upload to a dedicated submission account. The EGA accepts encryption keys only through an out-of-band method such as telephone, postal mail or courier. The EGA also provides a public key that allows secure encryption of the data submissions. In addition to data files, the EGA requires each submission to provide accurate information on the experimental and analytical methods used in the study. Each submission must also include DAO contact details, applied policy information and a certification of the authority to submit the data to the EGA for archiving and dissemination on behalf of the submitting organisation.

The EGA accepts information in pre-defined submission formats, such as the Excel spreadsheet-based MAGE-TAB format or XML. The EGA submission website [3] includes the most recent documentation for the submission process and provides examples of the data submission formats. The experienced EGA help-desk [4] also provides additional help during the submission. Once the data have been submitted to us, the EGA team members work together with the submitter to make sure that the data are correctly presented in our system. The release of any data from the EGA requires DAO authorization.

3.2 Data release from the EGA

Submissions to the EGA must be consistent with national laws and regulations. The archived data are required to follow the EGA data access policy model, whereby data access decisions are made by a data access-granting organisation (DAO) and not by the EGA.
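The access policy model can be sketched as follows. This is an illustrative assumption rather than the EGA implementation: the `AccessRegistry` class, its method names and the identifiers are hypothetical. The sketch shows only that permissions on a dataset change when its governing DAO acts, that the archive merely checks them, and that each action is recorded for audit.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set, Tuple


@dataclass
class AuditEntry:
    when: datetime
    dao: str        # the DAO that performed the action
    action: str     # "grant" or "revoke"
    account: str
    dataset: str


@dataclass
class AccessRegistry:
    """Hypothetical registry: only the governing DAO may change permissions."""
    governing_dao: Dict[str, str]                       # dataset -> DAO
    permissions: Set[Tuple[str, str]] = field(default_factory=set)
    audit_log: List[AuditEntry] = field(default_factory=list)

    def _check_mandate(self, dao: str, dataset: str) -> None:
        if self.governing_dao.get(dataset) != dao:
            raise PermissionError(f"{dao} does not govern {dataset}")

    def _record(self, dao: str, action: str, account: str, dataset: str) -> None:
        now = datetime.now(timezone.utc)
        self.audit_log.append(AuditEntry(now, dao, action, account, dataset))

    def grant(self, dao: str, account: str, dataset: str) -> None:
        self._check_mandate(dao, dataset)
        self.permissions.add((account, dataset))
        self._record(dao, "grant", account, dataset)

    def revoke(self, dao: str, account: str, dataset: str) -> None:
        self._check_mandate(dao, dataset)
        self.permissions.discard((account, dataset))
        self._record(dao, "revoke", account, dataset)

    def may_access(self, account: str, dataset: str) -> bool:
        """The archive releases a dataset only if the DAO has granted access."""
        return (account, dataset) in self.permissions


registry = AccessRegistry(governing_dao={"DATASET-A": "DAO-1"})
registry.grant("DAO-1", "alice", "DATASET-A")
granted = registry.may_access("alice", "DATASET-A")
registry.revoke("DAO-1", "alice", "DATASET-A")
revoked = registry.may_access("alice", "DATASET-A")
print(granted, revoked, [e.action for e in registry.audit_log])
```

Separating the mandate check from the permission lookup keeps the archive's role limited to enforcing decisions that the DAO has made.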
The DAO may be the same organisation that approved and monitored the initial study protocol or a designate of this approving organisation, such as a dedicated data access committee (DAC). Access to the data must be granted in a timely fashion to all bona fide researchers whose use of the data is consistent with the original consent agreements. The data access agreement that dictates how data must be stored, transferred or analysed is made directly between the applicant and the corresponding DAO.

[2] Toronto International Data Release Workshop Authors, Nature 2009
[3] http://www.ebi.ac.uk/ega/page.php?page=data_submission
[4] email to ega-helpdesk@ebi.ac.uk

The EGA associates the data access rights with personal accounts within our system. All account actions are logged in our audit system. The EGA project provides data management tools that allow the DAO to directly add, remove or summarize permissions for those EGA accounts that are linked to the data they have a mandate to govern. These tools also show the full audit trail, which includes when a particular data access right was added to an EGA account and who performed this action. The EGA supports workflows that require multiple authorizations, which can include administrators from several organisations. These complex workflows have been implemented to allow strict checks of the applicants prior to data access authorization. The EGA user management tools are integrated into our website and can be linked to an account by DAO authorization. The EGA provides full documentation regarding the use of these tools, and further training for data access management is available upon request.

4. Response to changing scientific environment

Since the launch of the service, scientific developments have impacted the EGA operations significantly, and the data models and procedures have been robust to these changes. As an example, next-generation sequencing has changed the size and type of genetic information collected from the samples and submitted to our system. This has resulted in developments to the data model and infrastructure, as these data are generally larger in size per sample than array-based genotype data.
Additionally, in the fall of 2008, a publication [5] described computational methods to predict whether a given individual had participated in a particular research project using summary-level data. The EGA, together with the DAOs, responded to this publication by removing all public access to data that could lead to the identification of the research participants. These data are now made available only to users who have been granted access either to the summary data alone or to the individual genotypes; users in the latter group would in any case be able to produce the same results. The EGA is able to provide summary data from studies archived in our system for other EBI resources or to other ESFRI projects should the policy change in the future.

[5] Homer N et al., PLoS Genet 4(8): e1000665

Summary

The EGA service has been used successfully since April 2008 and currently manages genetic and phenotypic information for approximately 80 000 samples listed in more than 40 different studies and serves 1700 authorized data users worldwide. These data include samples that are stored in European biobanks such as the UK DNA Banking Network. The EGA provides appropriate security and has built a robust and scalable infrastructure that is responsive to changes in the scientific and regulatory environment. Taken as a whole, it is clear that the operation of the EGA as a service is technically feasible and that the EGA can be used as a critical tool for ELIXIR and European ESFRI projects.