Big Data in BioMedical Sciences. Steven Newhouse, Head of Technical Services, EMBL-EBI



Similar documents
Big Data in BioMedical Sciences. Steven Newhouse, Head of Technical Services, EMBL-EBI

Steven Newhouse, Head of Technical Services

Enabling a federated environment to support biomedical research. Gianmauro Cuccuru CRS4

Three data delivery cases for EMBL- EBI s Embassy. Guy Cochrane

European Molecular Biology Laboratory Case Example

EMBL. International PhD Training. Mikko Taipale, PhD Whitehead Institute/MIT Cambridge, MA USA

Data integration and modelling in health sciences Science as a conversation across borders

Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences

Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova

EMBL Identity & Access Management

Enabling multi-cloud resources at CERN within the Helix Nebula project. D. Giordano (CERN IT-SDC) HEPiX Spring 2014 Workshop 23 May 2014

Keystones for supporting collaborative research using multiple data sets in the medical and bio-sciences

Microsoft Research Worldwide Presence

Embargoed until 14:30 CEST European time, 13:30 BST UK, 8:30 Eastern US summer time Contacts:

Attacking the Biobank Bottleneck

EMBO Short Term Fellowships Information for applicants

Office 365. Service Overview with a focus on Identity Federation and Directory Synchronization. Jono Luk, Program Manager jluk@microsoft.

Personalized Medicine and IT

Health Care Systems: An International Comparison. Strategic Policy and Research Intergovernmental Affairs May 2001

Big Data and Storage Management at the Large Hadron Collider

Cloud Computing Solutions for Genomics Across Geographic, Institutional and Economic Barriers

European Research Council

European Genome-phenome Archive database of human data consented for use in biomedical research at the European Bioinformatics Institute

187/ December EU28, euro area and United States GDP growth rates % change over the previous quarter

TOYOTA I_SITE More than fleet management

ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE. October 2013

99/ June EU28, euro area and United States GDP growth rates % change over the previous quarter

Agenda. Company Platform Customers Partners Competitive Analysis

Big Data for health. Farr Institute, Administrative Data Research Centres, Medical Bioinformatics. 9 July Jacky Pallas, UCL

Trials community. Yannick Legré. EGI InSPIRE RI

Thermo Scientific Compound Discoverer Software. A New Generation. of integrated solutions for small molecule structure ID

COST Presentation. COST Office Brussels, ESF provides the COST Office through a European Commission contract

TOWARDS PUBLIC PROCUREMENT KEY PERFORMANCE INDICATORS. Paulo Magina Public Sector Integrity Division

Increasing Innovation in R&D - Seizing early stage external growth opportunities

About us. As our customer you will be able to take advantage of the following benefits: One Provider. Flexible Billing. Our Portal.

CISCO IP PHONE SERVICES SOFTWARE DEVELOPMENT KIT (SDK)

CERN s Scientific Programme and the need for computing resources

Report on Government Information Requests

Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences

SEERA-EI. Introduction to Cloud Computing. SEERA-EI training, 13 April Aneta Karaivanova, IICT-BAS, Bulgaria

Data Centers and Cloud Computing

ORGANISATION FOR ECONOMIC CO-OPERATION AND DEVELOPMENT

Students: undergraduate and graduate students who are currently enrolled in universities

ICT budget and staffing trends in Healthcare

EGI services for distribution and federation of data and computing

Hong Kong s Health Spending 1989 to 2033

IAB Europe AdEx Benchmark Daniel Knapp, IHS Eleni Marouli, IHS

The EGI Federated Cloud e-infrastructure

EMC BACKUP-AS-A-SERVICE

Genuine BMW Accessories. The Ultimate Driving Machine. BMW Trackstar. tracked. recovered. BMW TRACKSTAR.

Reporting practices for domestic and total debt securities

How many students study abroad and where do they go?

Helix Nebula, the Science Cloud

DCA QUESTIONNAIRE V0.1-1 INTRODUCTION AND IDENTIFICATION OF THE DATA CENTRE

13 th Economic Trends Survey of the Architects Council of Europe

Funding and network opportunities for cluster internationalization

CHANGING DATA CENTRE SITE SELECTION CRITERIA

A Proof of Concept Cloud Based Solution. Mark Evans Tessella Inc. PASIG Austin, TX - January 13 th 2012

Deliverable D1.1. Building data bridges between biological and medical infrastructures in Europe. Grant agreement no.:

INDIGO DataCloud. Technical Overview RIA INFN-Bari

The Private Cloud Your Controlled Access Infrastructure

Standard Big Data Architecture and Infrastructure

Cisco Conference Connection

Workprogramme

IBM Spectrum Protect in the Cloud

The EMBL-European Bioinformatics Institute

Electricity, Gas and Water: The European Market Report 2014

opening the clouds qualitative overview of the state-of-the-art open source cloud management platforms. ACM, middleware 2009 conference

PUBLIC VS. PRIVATE HEALTH CARE IN CANADA. Norma Kozhaya, Ph.D Economist, Montreal economic Institute CPBI, Winnipeg June 15, 2007

EMBL Postdoctoral Programme An exciting and stimulating environment. European Molecular Biology Laboratory

How To Improve Cloud Infrastructure

AMADEUS: Analyse MAjor Databases from EUropean Sources - A financial database of 4 million European companies, including Eastern Europe MODULE.

Harmonizing Change Control Processes Globally

Preventing fraud and corruption in public procurement

A Service for Data-Intensive Computations on Virtual Clusters

Web Application Hosting Cloud Architecture

A public-private partnership building a multidisciplinary cloud platform for data intensive science

1. Perception of the Bancruptcy System Perception of In-court Reorganisation... 4

European Data Infrastructure - EUDAT Data Services & Tools

GE Grid Solutions. Providing solutions that keep the world energized Press Conference Call Presentation November 12, Imagination at work.

ORGANISATION FOR ECONOMIC CO-OPERATION AND DEVELOPMENT

A public-private partnership building a multidisciplinary cloud platform for data intensive science

Enterprise Mobility Suite (EMS) Overview

The Power of Pentaho and Hadoop in Action. Demonstrating MapReduce Performance at Scale

How to Obtain CODIS. Tim Zolandz. FBI Laboratory (703)

Size and Development of the Shadow Economy of 31 European and 5 other OECD Countries from 2003 to 2015: Different Developments

IDENTITY & ACCESS MANAGEMENT IN THE CLOUD

EBiSC the first European bank for induced pluripotent stem cells

E-Seminar. Financial Management Internet Business Solution Seminar

Big Data Challenges for e-science Infrastructure

SURFsara Data Services

How To Build An Open Source Data Infrastructure

relating to household s disposable income. A Gini Coefficient of zero indicates

A leader in the development and application of information technology to prevent and treat disease.

CISCO PIX SECURITY APPLIANCE LICENSING

BT Premium Event Call and Web Rate Card

Is there an elephant in the room?

The 100,000 genomes project

Transcription:

Big Data in BioMedical Sciences Steven Newhouse, Head of Technical Services, EMBL-EBI

Big Data for BioMedical Sciences EMBL-EBI: What we do and why? Challenges & Opportunities Infrastructure Requirements European Context The Future

EMBL-EBI What we do and why?

The European Molecular Biology Laboratory Heidelberg Hamburg Hinxton, Cambridge Basic research Administration EMBO Structural biology Grenoble Bioinformatics Monterotondo, Rome EMBL staff: 1700 people >60 nationalities Structural biology Mouse biology

EMBL member states Austria, Belgium, Croatia, Czech Republic, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Luxembourg, the Netherlands, Norway, Portugal, Spain, Sweden, Switzerland and the United Kingdom Associate member states: Argentina, Australia

EMBL-EBI MISSION To provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress

European Bioinformatics Institute (EBI) International, non-profit research institute Europe s hub for biological data services and research 570 members of staff from 53 nations Funded primarily by member states and research bodies (EC, USA, UK, Wellcome Trust)

What is bioinformatics? The science of storing, retrieving and analysing large amounts of biological information An interdisciplinary science involving: biologists biochemists computer scientists mathematicians

Challenges & Opportunities

Drug discovery From discovering a target to a drug reaching the market: 12 years Bioinformatics shortens time to target discovery. EMBL-EBI services support all stages of drug discovery.

Interpreting human variation How and why do we differ from one another? What causes susceptibility to disease? We explore individual genomes by comparing them with reference genomes. Combining that information with other types of data provides valuable insights into human variation. This is largely a data-driven process.

Making choices There are 3 billion base pairs in the human genome. Figuring out which regions are involved in disease and what they do is a major challenge. Time, Inc.

Personalised Medicine Reduced sequencing costs enable new techniques Past: 13 yrs & 2Bn Now: 2 days & 1000 Clinical integration raises privacy & ethical concerns UK: 100K Genomes Link to medical records

From Data to Knowledge in the Life Sciences Molecular data Information Knowledge Mutation X disrupts enzyme function, which causes Y disease

Answering these Challenges Data Infrastructure Distributed trans-national sources ( 1000s+ sequencers) Individually and collectively producing LOTS of data Incredibly sensitive meta-data (i.e. your medical records) Data Consumption Annotating data leads to information and knowledge Data Exploitation Enable access beyond specialised researcher Improve usability across broad user base

Short Break

Infrastructure Requirements Role of the Technical Services Cluster

Life science: many data types Genes, genomes & variation Gene, protein & metabolite expression Protein sequences, families & motifs Macromolecular structures Chemogenomics & metabolomics Interactions, reactions & pathways Cross-domain tools & resources

Data Growth: A Community Challenge

Challenges of Big Data Infrastructure 2015+ 50000 00000 50000 00000 50000 0 Cores 110 100 90 80 70 60 50 40 2014 2015 2016 2017 2018 201910 2020 Racks 2000 1500 1000 2014 2015 2016 2017 2018 2019 2020 500 Cost (M /year) 0 OpEx CapEx Storage (PB) 2014 2015 2016 2017 2018 2019 2020 In 2020: Storage: 1,920PB Cores: 230,000 5 0 2014 2015 2016 2017 2018 2019 2020

EMBL-EBI Data Centre Infrastructure JANET Production Area Hinxton (HX) Staging Area Hemel Hempstead (HH) WLB Flint Cross (FC) Disk LAN SAN Standby DBs Cluster Disk LAN SAN Web DB Virtual Servers LAN SAN Cluster Disk LAN SAN WLB Web DB Virtual Servers Cluster Disk Cluster Redundant LAN Redundant SAN Web DB Virtual Servers Web DB Delphix DB Archive B Disk Virtual Servers 10Gb Network Connections JANET Archive A WLB Embassy Cloud JANET

EBI Provides Services and Data Resources Data Resources Public and Managed Access Individual sequence or bulk download Services Web & programmatic access for common tools Run jobs on EMBL-EBI hardware Volume and variety of genomic data expanding EMBL-EBI data doubling every year - replication is challenging Infrastructure currently 50,000 CPUs & 55+PB

Management Technical services Basic Users Research Teams Power users Research Partners Service Teams Infrastructur e Providers Web Development Web development User experience Web Production Web platforms Web infrastructure Web deployment Systems Applications Database Desktop Cloud Systems Infrastructure Storage Network Compute

The Challenge Facing Services @ EMBL- EBI Need to support complex analysis scenarios Access to both public and managed access data sets Bespoke workflows and tools across a variety of domains Issues with disk to memory bandwidth Web and programmatic access to services (3M unique users) Hard for users to replicate data sets for local analysis Use the cloud to bring local analysis to EMBL-EBI data

Public Data Public Services Managed Data Embassy Cloud 1 Embassy Cloud 2 Embassy Cloud 3 Private Data Embassy Cloud Concept PanCancer Virtualised EMBL-EBI Hardware

Embassy Cloud Our partners work directly beside EMBL-EBI data. High bandwidth Low latency Robust, secure environment. Not in competition with commercial cloud services. Other cloud initiatives: ELIXIR-facing cloud support HELIX Nebula.

Typical Uses Web Application Hosting Limited need for resources & VMs CTTV: Host intranet, databases, Data Staging Undertake submission from local machine (following data staging) rather from remote location BRAEMBL: Remote submission unreliable due to file upload Data Analysis Large scale management and analysis of data PanCancer: 1,000 cores, 2.5 TB RAM, 1.0 PB HDD

European Context

ELIXIR: a distributed data infrastructure EMBL-EBI is a major driver in ELIXIR, the pan-european research infrastructure for biological information. Central Hub at EMBL-EBI, with Nodes at centres of excellence throughout Europe. The goal of ELIXIR: Build a sustainable European infrastructure for biological information Support life science research and its translation to medicine, the environment, the bioindustries and society.

Building capacity in Europe ELIXIR: a sustainable infrastructure for biological information in Europe. Supporting life science research and its translation to: medicine agriculture the environment the bioindustries society.

AAI Architecture Relying services EGA wiki Cloud Intranet Data archive Credential translation ELIXIR Proxy IdP ELIXIR Directory Dataset authorisation management Bona fide management ELIXIR AAI Group/role management Attribute self-management edugain IdPs Common IdPs External authentication (e-infrastructures)

Authentication & Authorization Cloud & Storage Architecture Portals Workflow Engines Relying services Cloud IaaS Metrics ELIXIR Cores Storage Service Registry Storage Cloud IaaS External (e-infrastructures) Cores Storage Storage Data Transfer Service Directory Accounting Software Library

Initial Prioritisation and Grouping Analysis AAI 0: Federated ID, Other ID 1: Elixir ID 2: Credential Translation, Group/Attribute Mgmt, Endorsed Attributes Cloud Compute 1: Cloud IaaS, HTC Cluster, Cloud Storage, Federated Cloud IaaS 2: Infrastructure Service Registry & Directory, VM Library 3: Operational Integration, Resource Accounting Data Transfer 1:Network File Storage, File Transfer 2: Data Set Replication, PID & Meta-data Registry

Centralization & specialization Data is submitted to specialized centralized repositories. Current situation. 35 Data production Data centralization Data

Federation If data gets bigger, the data might have to stay where it is produced. We might have to provision data producers with storage and computation. Data might be pulled instead of pushed into centralized repositories. 36 36 Data production Data centralization Data

The Future

Centre for Therapeutic Target Validation Collaboration to pinpoint the processes in the human body that have a demonstrable effect on disease. Public-private initiative: GSK: expertise in disease biology EMBL-EBI: expertise in life science data integration and analysis Wellcome Trust Sanger Institute: expertise in the role of genetics in disease.

Technical Services Cluster: IT as a Service Professionalising our IT Services Looking at lightweight FitSM portfolio Developing a Service Portfolio Internal and External (Public & Private) Users Putting the Service Portfolio into a governance process To manage and communicate change of the portfolio Contributing to the Elixir Research Infrastructure

External Infrastructure Activities EUDAT Leverage CDI to strategically replicate large popular data sets HelixNebula Meet future needs through hybrid model External hosted data centres with direct cloud access EGI: European Grid Infrastructure Provides federated cloud & cluster resources CERN OpenLab Identifying and building common IT services for science Cancer Research UK Bringing islands of protected data together for cloud based analysis

Other Cloud Activity at EMBL-EBI Use Amazon to provide geographical distribution Direct link to globally replicate databases HelixNebula Integration of commercial cloud providers with big research Benefit of additional security assurances For use by pharmaceutical companies For on-demand personalised medicine Explore using IaaS to supplement/replace data centres Put DC on cloud, scale out services (service + database), etc.

The Future Private Analysis Public Service Integrating Platform (Deal with discovery, provision & placement) EMBL-EBI IT (Services, Research, Clusters) Virtualised Infrastructure Virtualised Infrastructure Elixir Community Services Virtualised Infrastructure Elixir Service Storage Compute Cloud Providers Data Storage Compute EMBL-EBI Data Centre Data Storage Compute Infrastructure Provider Data Data Provider Geant Network

Thank you steven.newhouse@ebi.ac.uk