Three data delivery cases for EMBL-EBI's Embassy. Guy Cochrane www.ebi.ac.uk


EMBL European Bioinformatics Institute
- Genes, genomes & variation: European Nucleotide Archive, 1000 Genomes, Ensembl, Ensembl Genomes, Ensembl Plants, European Genome-phenome Archive, Metagenomics portal, GWAS Catalog browser
- Protein sequences: InterPro, Pfam, UniProt
- Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
- Literature & ontology: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
- Expression: ArrayExpress, Expression Atlas, MetaboLights, PRIDE
- Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
- Chemical biology: ChEMBL, ChEBI
- Systems: BioModels, Enzyme Portal, BioSamples


Sequence data at EMBL-EBI
[Diagram labels: Sample/method; Read; Alignment; Assembly; Annotation]
European Nucleotide Archive
- Unrestricted data
- Pan-species and application
- http://www.ebi.ac.uk/ena/
European Genome-phenome Archive
- Controlled access data
- Human data around molecular medicine
- http://www.ebi.ac.uk/ega/
Infrastructure provision
- BBSRC: RNAcentral, MG Portal
- MRC: 100k Genomes data implementation
- EC: COMPARE, MicroB3, ESGI, BASIS
- etc.

Challenges
- Data have high volume and grow rapidly
- Data are dynamic (continuous feed) and their application has urgency
- Users require arbitrary and ad hoc access

Tara Oceans

Tara Oceans Capacity

Infectious disease
Opportunity: A methodological revolution in clinical and public health towards shotgun sequencing-based methods
Scientific power: Sequence harbours rich information
- Diagnostic: identification, typing, resistance profiling, etc.
- Public health: outbreak detection, response strategy, vaccine development
- Mechanistic: host interactions, pathogenicity, virulence, transmission, antimicrobial resistance
COMPARE: recently launched Horizon 2020 project in which EMBL-EBI is informatics provider
Global Microbial Identifier: Initiative with EMBL-EBI involvement supporting technologies, standards and data sharing for pathogen surveillance
Informatics roles for EMBL:
- COMPARE: Rapid global sharing of surveillance and outbreak data, systematic integrated analysis, compute provision (Embassy)
- Standards for reporting, analysis and the communication of results
- New algorithms and analysis methods
- User interfaces for surveillance data reporting, across the domains


COMPARE platform
[Diagram. Column headings: Sources; Processes; Portals and environments. Labels: COMPARE Registry; COMPARE Data Resource; Public data; Managed access data; Private data; INSDC data exchange; COMPARE workflow engine; Assembly & alignment; Annotation; Typing; Workflow integration; Default tools; Hosted tools; API; COMPARE Portal; Food workflow development; Clinical workflow development; Outbreak workflow development; EBI infrastructure; Embassy infrastructure; DTU infrastructure; Embassy virtual domain. Callout: Urgency]


Personalised medicine
Motivation: Personalised studies of variation, cancer mutation, epigenetics, regulation, expression require references for comparison and interpretation
As part of GA4GH, EMBL-EBI is working on
- Resources serving reference human genomic and transcriptomic data, including Google read API, variant Beacons, etc.
- CRAM compression supporting greater data fluidity and APIs to allow direct computational access
- Delivery and synchronisation of high volume datasets to local Embassy and remote cloud infrastructures
Past and current FP7 projects include SLING, BASIS, ESGI
Callout: Arbitrary access
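The variant Beacon idea named on this slide can be illustrated with a toy example. A Beacon answers a yes/no question (has this allele been observed at this position?) without exposing the underlying records. The sketch below is an in-memory illustration only: the class name `ToyBeacon`, its methods and the sample variant are hypothetical and are not the GA4GH Beacon API.

```python
from typing import Set, Tuple

# (chromosome, 1-based position, alternate allele); hypothetical key layout
Variant = Tuple[str, int, str]


class ToyBeacon:
    """In-memory stand-in for a Beacon: answers only yes or no."""

    def __init__(self) -> None:
        self._observed: Set[Variant] = set()

    def add(self, chrom: str, pos: int, alt: str) -> None:
        # Normalise allele case so lookups are case-insensitive.
        self._observed.add((chrom, pos, alt.upper()))

    def exists(self, chrom: str, pos: int, alt: str) -> bool:
        # Only a boolean leaves the beacon; no sample data is returned.
        return (chrom, pos, alt.upper()) in self._observed


beacon = ToyBeacon()
beacon.add("17", 41244000, "t")  # made-up example variant
```

Calling `beacon.exists("17", 41244000, "T")` then reports whether the allele is present, which is the only information a caller receives.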

ENA conventional read data delivery
[Diagram labels: Conventional infrastructure (FTP, Aspera, GridFTP); ENA metadata; FIRE1; ENA data (NFS)]


ENA Embassy read data delivery
[Diagram labels: Conventional infrastructure (FTP, Aspera, GridFTP); Embassy cloud infrastructure (VMware -> OpenStack); ENA metadata; FIRE2; ENA data (Cleversafe); FUSE; HTTP; Marine cache (Tara Oceans Embassy); Pathogen cache (COMPARE Embassy); CRAM cache (GA4GH Embassy)]
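The per-community caches in this diagram (marine, pathogen, CRAM) suggest a read-through pattern: a request is served locally when possible and fetched from the archive otherwise. The slides do not describe how the caches are implemented, so the following is only a minimal sketch under assumed behaviour: least-recently-used eviction, a fixed byte budget, and a hypothetical `fetch` callable standing in for a read from the archive.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Serve objects from a local cache, fetching from the archive on a miss."""

    def __init__(self, fetch: Callable[[str], bytes], capacity_bytes: int) -> None:
        self._fetch = fetch              # slow path, e.g. a read from the archive
        self._capacity = capacity_bytes  # total bytes the cache may hold
        self._items: "OrderedDict[str, bytes]" = OrderedDict()
        self._used = 0
        self.misses = 0

    def get(self, key: str) -> bytes:
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        self.misses += 1
        data = self._fetch(key)
        if len(data) <= self._capacity:   # objects larger than the cache are not stored
            self._items[key] = data
            self._used += len(data)
            while self._used > self._capacity:
                _, evicted = self._items.popitem(last=False)  # drop least recently used
                self._used -= len(evicted)
        return data


# Hypothetical archive: each made-up run identifier maps to 4 bytes of data.
archive = {"RUN1": b"AAAA", "RUN2": b"CCCC", "RUN3": b"GGGG"}
cache = ReadThroughCache(archive.__getitem__, capacity_bytes=8)
```

With an 8-byte budget only two of the three objects fit at once, so repeated access to a working set larger than the cache keeps falling back to the archive.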

ENA external read data delivery phase II

EMBL-EBI Embassy Cloud. Steven Newhouse, Head of Technical Services

The Challenge Facing EMBL-EBI
Volume and variety of genomic data expanding
- EMBL-EBI data doubling every year - replication is challenging
- Infrastructure currently 50,000 CPUs & 60+ PB
Need to support complex analysis scenarios
- Web and programmatic access to services (3M unique users)
- Access to both public and managed access data sets
- Bespoke workflows and tools across a variety of domains
Hard for users to replicate data sets for local analysis
- Use the cloud to bring local analysis to EMBL-EBI data
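A rough calculation shows why replication is described as challenging. The sketch below assumes steady annual doubling, treats the 60 PB infrastructure figure as the volume to be copied (the slide does not state this), and assumes a fully used 10 Gbit/s link; the function names and the link speed are illustrative choices, not figures from the talk.

```python
def projected_petabytes(start_pb: float, years: int, doubling_years: float = 1.0) -> float:
    """Archive size after `years`, assuming steady exponential growth."""
    return start_pb * 2 ** (years / doubling_years)


def transfer_days(size_pb: float, link_gbit_per_s: float) -> float:
    """Days needed to copy `size_pb` over a fully used link (decimal units)."""
    bits = size_pb * 1e15 * 8              # 1 PB = 10^15 bytes
    seconds = bits / (link_gbit_per_s * 1e9)
    return seconds / 86_400


for year in range(4):
    size = projected_petabytes(60, year)
    print(f"year {year}: {size:.0f} PB, {transfer_days(size, 10):.0f} days at 10 Gbit/s")
```

Under these assumptions a single copy of 60 PB already takes roughly a year and a half, during which the source would have more than doubled, which is the argument for moving the analysis to the data.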

EMBL-EBI Embassy Cloud
Service hosted at EMBL-EBI data centres
- Direct network access to public and managed data sets
- Direct network access to public services
- Expect both academic and commercial users
Technical Implementation
- Logically isolated outside EMBL-EBI's LANs
- Secure flexible infrastructure for both tenant and host
- Resources exposed using VMware's vCloud Director & OpenStack
- Provide isolated IaaS clouds to multiple users

Why Embassy Cloud?
An embassy is sovereign territory in a host country
- Host Country: EMBL-EBI Data Centre
- Sovereign Territory: Host Country not allowed to enter
Virtualisation provides the protection for tenant and host
- Host puts boundaries in place to protect it from the tenant
- Tenant has freedom and control within those boundaries

Embassy Cloud Concept
[Diagram labels: PanCancer; Public Data; Public Services; Managed Data; Embassy Cloud 1; Embassy Cloud 2; Embassy Cloud 3; Private Data; Virtualised EMBL-EBI Hardware]

User Benefits for the IaaS Model
Tenant organisations get an empty virtual infrastructure
- They establish their own virtual machines and networks
- System administration performed by the tenant
- EMBL-EBI staff have no access to the VMs
Added value from EMBL-EBI over other clouds
- Machines and data hosted in known jurisdiction
- Direct network access to data sets (public & managed access)
- Direct network access to public EMBL-EBI services

Benefits to EMBL-EBI of the IaaS Model
A secure collaborative workspace
- Work does not contend with main EMBL-EBI resources
- Clearly define the committed IT resources and data
Explore how to build more data focused analysis services
- Move the analysis to where the big data is located
- Learn from and inform other big data scientific communities

Embassy Cloud: Typical Uses
Collaborative Environment
- Neutral ground outside internal network
- CTTV: Resources and VMs to host intranet, databases,
Data Staging
- Undertake submission from local machine (following data staging) rather than from remote location
- BRAEMBL: Remote submission unreliable due to file upload
Data Analysis
- Large scale management and analysis of data
- PanCancer: 1,000 cores, 2.5 TB RAM, 1.0 PB HDD

Issues
Object Store Storage Infrastructure
- Essential for scalable high-performance storage
- Applications need to adapt to flat model
- Current caching strategy will have a limit
Sharing resources between sites/communities/clouds
- Adopt a standards based model for federating resources
- Solutions for uploading and distributing VMs (+containers?)
- Replicating large data sets to attract workloads to a cloud
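The point that applications need to adapt to a flat model can be made concrete with a small sketch. An object store has no directories, only keys, so hierarchy has to be encoded in the key and listings become prefix filters. The class `FlatObjectStore`, the helper `path_to_key` and the example paths are hypothetical and do not represent any particular object store's API.

```python
from typing import Dict, List


class FlatObjectStore:
    """Toy object store: one flat namespace of key -> bytes, no directories."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list_keys(self, prefix: str = "") -> List[str]:
        # "Directories" are emulated by filtering keys on a shared prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


def path_to_key(path: str) -> str:
    """Turn a filesystem-style path into a flat key (slashes become part of the name)."""
    return path.strip("/")


store = FlatObjectStore()
store.put(path_to_key("/reads/projectA/run1.cram"), b"...")
store.put(path_to_key("/reads/projectA/run2.cram"), b"...")
store.put(path_to_key("/reads/projectB/run9.cram"), b"...")
```

An application that previously walked a directory tree would instead call `store.list_keys("reads/projectA/")`; operations such as renaming a directory have no direct equivalent and must be rewritten as per-object copies.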

Gaps → Activities → Solutions?
Data Set Replication
- Strategic pre-positioning of data into clouds
- Leverage JANET/GEANT, GridFTP + Globus Transfers,
Cloud federation for mobile computing
- EGI has a federated cloud and VM distribution model
- ELIXIR plans to build on existing infrastructure where possible
Wide-area file access needed for collaborative data analysis
- High performance wide-area object-store
- Need access control for human related data
Coordinated investment in infrastructure
- Where is the UK coordination? What coordination is needed?
- Integrating commercial resources where they add value
- Integration with EU Infrastructure (ELIXIR)