HPC infrastructure at King's College London and Genomics England




HPC infrastructure at King's College London and Genomics England. Tim Hubbard @timjph. King's College London, King's Health Partners, Genomics England, Wellcome Trust Sanger Institute. Farr-ADRN-MB e-infrastructure workshop, 16th January 2015

King's College London is anchor tenant at the Infinity Data Centre in Slough. Existing HPC recently procured for KCL will be relocated alongside new HPC procured for two BRCs to create a large research facility

HPC3: KCL's existing HPC environment consists of a Beowulf Linux cluster; InfiniBand network; Grid Engine; 1464 CPU cores, 8 GPUs; 4.2TB RAM; 87TB usable Lustre storage

KCL collaborates with two other members of King's Health Partners to form NIHR-funded Biomedical Research Centres (BRCs): the Biomedical Research Centre for Mental Health, with the South London and Maudsley NHS Foundation Trust; and the Comprehensive Biomedical Research Centre, with Guy's and St Thomas' NHS Foundation Trust

Both BRCs have a significant (and similar) research computing requirement to process biomedical imaging, -omics and clinical record data. They have recently been awarded grants totalling around £2.5 million from the Maudsley Charity and Guy's and St Thomas' Charity to expand research computing capacity. They are joining forces with KCL HPC3 to create a flexible research computing environment for KHP in the new Infinity SDC...

Hardware. InfiniBand compute, in addition to HPC3: 64 x Haswell (2 x 10 core) nodes, 128GB RAM; 32 x Haswell (2 x 10 core), 256GB RAM. 10GbE compute: existing BRC-MH 15 x 16-core, 256GB RAM HP blade servers; 32 x Haswell (2 x 10 core), 256GB RAM. Tier 1 storage (Lustre): ~500TB scratch. Tier 2 storage (Ceph): ~3PB usable. Off-site backup storage

Research Computing Service Platform: Grid Engine compute cluster (InfiniBand network, Lustre scratch); OpenStack cloud (10GbE network; cloud nodes may be used to temporarily expand the Grid Engine cluster); iRODS research data store (built on the Ceph object store); all layered over the shared hardware
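The burst model described above, in which cloud nodes temporarily join the Grid Engine cluster when the queue backs up, can be sketched as pure decision logic. All names and thresholds below are illustrative, not KCL's actual configuration:

```python
# Illustrative sketch: decide how many OpenStack cloud nodes to add to a
# Grid Engine cluster based on queue depth. Thresholds are hypothetical.

def burst_nodes_needed(pending_jobs: int, idle_slots: int,
                       slots_per_node: int = 20, max_burst: int = 32) -> int:
    """Return how many cloud nodes to boot so pending work fits."""
    shortfall = pending_jobs - idle_slots
    if shortfall <= 0:
        return 0                                 # cluster can absorb the backlog
    nodes = -(-shortfall // slots_per_node)      # ceiling division
    return min(nodes, max_burst)                 # never exceed the cloud quota

# A booted node would then be registered as a Grid Engine execution host
# (e.g. via qconf) -- that step is omitted here.
```

The point of the sketch is that bursting is a scheduling decision layered on top of two independent systems; the cloud never needs to know about the batch queue beyond this one number.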

Three parallel domains, each with its own stack: Public Domain (OpenStack, iRODS store); Research Domain (Grid Engine, OpenStack, iRODS store); NHS Domain (Grid Engine, OpenStack, iRODS store)

Genomics England

Genomics England: proposed data flows. Sequencing centres and a sample repository feed refreshable identifiable clinical data, linked to anonymised whole genome sequence, with annotation apps on top. Sample, patient consent, EHR, primary care and hospital episode data flow in; clinical reports flow back to clinical genetics, cancer and public health in NHS Trusts, patients and the public. Pilots: selected centres, CRUK, BRCs. Main programme: Genomic Medicine Centres. A firewall ensures patient data stays on the NHS side; only processed results pass outside. Safe haven: anonymised clinical data and DNA sequence, accessed by clinicians and academics (GeCIP) and industry

MRC infrastructure award: 2014-15, £8m (Skyscape; 5PB storage and tape through NSSA tendering; rental of CPUs); 2015-16, £16m (full procurement)

Data Sharing. Open to all: human genome projects where subjects consented (HapMap, 1000 Genomes); repositories: GenBank, ENA, DDBJ (INSDC). Managed distribution (must be a bona fide researcher): genetic data for disease cohorts, with phenotypes; repositories: dbGaP, EGA (encrypted distributions etc.). Managed access, no redistribution: Genomics England datasets; repository: GeL Datacentre

A future with closed datasets Multiple sets of Hospital/National datasets with no redistribution policies Value for research in generating statistics across this global set

Global Alliance for Genomics and Health http://genomicsandhealth.org/

Developing the UK infrastructure for e-health research John Ainsworth Deputy Director, Farr @ HeRC 16 January 2015 UCL Workshop

From Big Data to Big Scale. DATA: vast data volume, velocity, variety (a tsunami). METHODS & MODELS: supra-linear growth in papers & tools (a blizzard). EXPERTISE: a similar number of analysts (a drought). Three Big Health e-Research Challenges: 1. Assist hypothesis formation with data. 2. Weed out non-reproducible findings early. 3. Couple data-intensive healthcare and research

Who is Farr? "Diseases are more easily prevented than cured, and the first step to their prevention is the discovery of their exciting causes." William Farr, 30 November 1807 to 14 April 1883

What is Farr? A distributed research institute that will integrate and scale, at the UK level, the work of four Health Informatics Research Centres (HIRCs)

History. In August 2012, ten UK funding agencies awarded four Centres of Excellence in e-health informatics research. The four HIRCs aim to optimize the use of health records in research and to address the UK's capacity-building requirements to support a sustainable health informatics research base.

Health Informatics Research Centres. Scotland: Dundee, Glasgow, Edinburgh, St Andrews, Aberdeen, Strathclyde, MRC HGU, NHS NSS. HeRC: Manchester, York, Lancaster, Liverpool, Sheffield, AHSNs. CIPHER: Swansea, Bristol, Cardiff, Exeter, Leicester, Sussex, NWIS, Public Health Wales. UCL Partners: UCL, LSHTM, Queen Mary, Public Health England. Map source: www.m62.net

More History. In 2013, the Farr Institute was created to support the HIRCs' collective work: Farr Institute @ CIPHER, Farr Institute @ HeRC, Farr Institute @ Scotland, Farr Institute @ UCL Partners. Together, they bring a total of 20 academic institutions and two MRC units. Farr will act as the nexus of the UK Health Informatics Research Network

Aims of the Farr: create a physical and electronic infrastructure to support and accelerate the Centres' collaborative work; support partnerships by providing a physical structure to co-locate NHS organisations, industry, and other UK academic centres; facilitate collaboration, the sharing of datasets, and the adoption of common standards; develop new opportunities for future data linkage at scale

UK Health Informatics Research Network. Farr will lead the UK Health Informatics Research Network, develop the Network's 5-year strategy plan and provide a blueprint for its activities. The Network aims to strengthen the UK's capability in health informatics research by harnessing the expertise in the Farr and the wider UK research community, and is open to all members of the research community. Prof Carole Goble

HeRC elab. Based at Vaughan House; IGTK L2; ISO27001:2013 in process. Initial Farr investment: labs, safe haven, HPC, devices, VC. Additional MRC CRI funding: Clinical Proteomics Centre, UK Dementia Platform, Single Cell Genomics. Services: secure file storage; secure file exchange; secure file transfer across NHS N3; secure file transfer across public networks; elab data management services via web interface; data linkage; data repository; research data extracts; data analysis software and compute; virtual machine service from remote locations; virtual machine service from the secure data analysis environment; dataset inventory; personal health data repository; HPC remote access

(Architecture diagram.) NHS users on N3 and researchers on Janet reach the HeRC Safe Haven (ISO27001 ISMS) over a higher assurance network (HAN), with two-factor authentication and remote desktop into the elab. Inside sit a research repository, a transient repository, applications & compute, and single sign-on. An NHS-side zone (NHS IGTK) holds a pseudonymised data repository and a remote repository with AAAI, fed by data transfer from patients & devices. A dataset catalogue and N8 HPC connectivity complete the picture, governed by the HeRC Governance Board

Big Data funding for health, medical and administrative data: MRC £20M for the four Farr Institute nodes, for e-infrastructure and buildings, June 2013; ESRC £34M for four Administrative Data Research Centres (ADRCs) and the Administrative Data Service, Nov 2013; MRC £39M for six Medical Bioinformatics Initiative projects, Feb 2014

The safe share project. Background: there is significant investment in medical research trying to unlock the value of data collected by the NHS and the wider government, in order to further knowledge of disease and ill-health and improve medical treatments. It builds on the recent development of the MRC- and partner-funded Farr Institute, Medical Bioinformatics Initiative and Administrative Data Research Network, and their infrastructure requirements. Challenges: health data is very personal and sensitive, and there is rightly public concern about any real or perceived inappropriate access; there are significant ethical, consensual and practical hurdles to making use of the data for research

Meeting the Big Data challenge: being able to access data securely; being able to share data safely; being able to work together collaboratively. Solve the problem once for everyone, with potential solutions at scale, and give the public confidence that data is appropriately protected. The project will run in two parts, each with a set of pilots: 1. Secure connectivity, a higher assurance network (HAN). 2. An Authentication, Authorisation and Accounting Infrastructure (AAAI)

Secure Connectivity Use Cases. Inter-Farr: initial trial between the Farr centre at Manchester and the N8 HPC at Leeds, extending to the other Farr centres (Swansea, London and Dundee). Intra-Farr: securely linking the Swansea Farr centre with one of its collaborative projects in Bristol (ALSPAC). ADRC / Farr: pod-to-data-centre connectivity between accredited secure rooms and ADRC data centres for remote working

Authentication, Authorisation and Accounting Infrastructure (AAAI) Use Cases. A dementia study by the University of Oxford, with the objective of demonstrating researchers using home-institution credentials and a generic user request model to authenticate access to a set of relevant national and study-specific datasets. HeRC, N8 HPC and DiRAC: access between these facilities using home-institution credentials. emedlab: partners will be able to analyse human genome data, medical images, and clinical, psychological and social data, demonstrating a common AAAI with access via a common credential. Swansea University Health Informatics Group: investigating whether Moonshot can provide an authentication mechanism allowing use of home-institution credentials

Partners The project is funded and managed by Jisc working in partnership with: Wider Initiatives: The Farr Institute The MRC Medical Bioinformatics Initiative The Administrative Data Research Network Incorporating organisations involved in the pilots: University of Manchester UCL Swansea University University of Dundee Francis Crick Institute University of Oxford University of Leeds University of Sheffield University of Southampton University of Bristol HSCIC

Timetable: agreement on requirements and use cases - complete; funding approval - complete; detailed project planning - in progress; detailed design and architecture of infrastructure - in progress; operational standards, development controls - Q1 2015; infrastructure deployment, installation and commissioning - Q2 2015; initial operation and testing with customers - Q3 2015; customer trials begin - Q4 2015; external certification (ISO27001 process) - Q1 2016; recommendations - Q2 2016

The 3Rs of data science: Repeat, Reproduce, Reuse The 1T of data science: Transparency

Reproducibility A principle of the scientific method Evidence to test and justify claims Comparison of results and methods Peer review http://xkcd.com/242/ Prof Carole Goble

Defining drug exposure: 192 different datasets arise from four decision nodes: 1. selecting the stop date; 2. handling a missing stop date; 3. overlapping prescriptions; 4. small treatment gaps
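The slide's point, that a handful of defensible analytic choices multiplies into hundreds of candidate datasets, can be illustrated with `itertools.product`. The option counts and labels below are hypothetical, chosen only so the product happens to be 192; they are not the study's actual decision tree:

```python
import itertools

# Hypothetical options at each decision node when defining drug exposure.
decision_nodes = {
    "stop_date_rule":    ["prescribed_duration", "quantity_over_dose",
                          "fixed_90d", "last_refill"],
    "missing_stop_date": ["impute_median", "drop_record"],
    "overlap_handling":  ["truncate", "shift", "sum_durations", "ignore"],
    "max_gap_days":      [0, 15, 30, 60, 90, 180],
}

# One candidate exposure dataset per combination of choices.
datasets = list(itertools.product(*decision_nodes.values()))
print(len(datasets))   # 4 * 2 * 4 * 6 = 192
```

Even modest per-node option counts explode combinatorially, which is exactly why reproducibility requires recording which branch of each decision node a published analysis took.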

A Data Science Commons Publish, Discover, Reuse Data Science Artefacts as Research Objects Rules 1. Each unique research object placed into the Commons must have a unique identifier. 2. That unique identifier must allow the research object to be found, shared and attributed. 3. Attribution requires associated provenance that, minimally, identifies the creator(s) of the unique research object, those that have subsequently modified it, and how it was modified. More at www.farrcommons.org
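Rules 1-3 above can be expressed as a minimal record type: every research object carries a resolvable unique identifier plus a provenance trail of creators and modifications. This is a sketch; the field names are invented, not the Farr Commons schema:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class ProvenanceEvent:
    agent: str    # who created or modified the object (Rule 3)
    action: str   # "created", or a description of the modification

@dataclass
class ResearchObject:
    creator: str
    content: str
    # Rule 1: every object placed in the Commons gets a unique identifier.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    provenance: list = field(default_factory=list)

    def __post_init__(self):
        # Rule 3: provenance minimally identifies the creator.
        self.provenance.append(ProvenanceEvent(self.creator, "created"))

    def modify(self, agent: str, description: str, new_content: str):
        # Rule 3: subsequent modifiers, and how they modified it, are recorded.
        self.content = new_content
        self.provenance.append(ProvenanceEvent(agent, description))

# Rule 2: the id allows the object to be found, shared and attributed.
ro = ResearchObject(creator="alice", content="v1 of analysis workflow")
ro.modify("bob", "parameter sweep added", "v2 of analysis workflow")
```

A registry keyed on `id` would complete Rule 2's "found and shared" requirement; the dataclass only shows what each record must carry.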

Farr ADRN Medical Bioinformatics e-infrastructure Workshop Simon Thompson The Swansea University version simon@chi.swan.ac.uk

Health Informatics Group, Swansea University FARR ADRC Swansea Bio-Info (SAIL)

FARR: based on the SAIL Databank of linked routine data, an internationally recognised data linkage system. 4.7 million people; 9 billion rows of data; over 20 core national datasets and 200+ project-specific datasets: GP primary care; inpatient & outpatient secondary care; A&E, emergency care; pathology & LIMS; births & deaths; child health & perinatal screening; breast and cervical screening; cancer registries (WCB, CARIS, WCISU); education data. Central repository/warehouse; 300 users; over £70m of research induced. NHS Wales connectivity (DAWN2-N3); infrastructure inside NHS core data centres

Based on the Split File Principle. The supplier's file (File 2) holds identifiers plus clinical data, e.g.: ID 56, Fred Bloggs, The Big House, BP 120/80, diag G33; ID 78, Jim Jones, 87 Peterson Rd, 135/45, P123; ID 45, Harry Lucas, 19 Meirwen, 125/75, G77. It is split: File 1, demographics plus link key (ID, name, address), goes to the matching service; the clinical data plus link key (ID, BP, diag) travels separately. File 3 returns the link key mapped to an Anonymised Linking Field with a match confidence: 56 -> ALF 65276573, conf 88; 78 -> 32377722, conf 97; 45 -> 27638236, conf 95. On load into SAIL the link key is replaced by an encrypted ALF_E (4252, 7482, 8436) attached to the clinical payload, so identifiers never enter the databank
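The split-file flow can be sketched in a few lines: demographics go one way for matching, the clinical payload travels only with a link key, and the databank only ever sees an encrypted ALF. The data, the ALF assignment and the "encryption" below are all toy stand-ins, not the real SAIL matching service:

```python
# Toy sketch of the SAIL split-file principle. Real matching is done by a
# trusted third party with probabilistic methods; here it is faked.

supplier_file = [  # File 2 at the supplier: identifiers + clinical payload
    {"id": 56, "name": "Fred Bloggs", "bp": "120/80", "diag": "G33."},
    {"id": 78, "name": "Jim Jones",   "bp": "135/45", "diag": "P123."},
]

def split(records):
    demographics = [{"id": r["id"], "name": r["name"]}            # File 1 -> matcher
                    for r in records]
    clinical     = [{"id": r["id"], "bp": r["bp"], "diag": r["diag"]}  # payload + link key
                    for r in records]
    return demographics, clinical

def ttp_match(demographics):
    # File 3: link key -> ALF plus a match confidence (toy values).
    return {d["id"]: {"alf": 65276573 + d["id"], "conf": 90} for d in demographics}

demo, clin = split(supplier_file)
alf_map = ttp_match(demo)

# Load into SAIL: the link key is swapped for an "encrypted" ALF_E
# (a toy transform here); names never reach the databank.
sail_rows = [{"alf_e": alf_map[r["id"]]["alf"] * 7 % 10**6,
              "bp": r["bp"], "diag": r["diag"]} for r in clin]
```

The crucial property is structural: no single party ever holds identifiers and clinical payload joined together.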

FARR Evolution: remote desktop (VDI) technology; single sign-on (Active Directory); shared security model / provisioning (v3); two-factor authentication; introduction of additional services (secure filestore, wiki, helpdesk, training); anonymising of GIS datasets (residences and geo data). Components: Active Directory; pooled standard config; VMware View Security Server (VPN) (x3); VMware View Connection Broker; dedicated configurable data warehouse; two-factor authentication server; specialist / custom config

FARR Evolution, building on initiatives: data/dataset documentation; data quality measurement; automation of processes / self-service; new probabilistic matching engine; natural language processing. New technologies: SQL Server 2014 cluster, Hadoop, R cluster; local & remote capabilities; Data Appliance. UKSeRP: white-labelling the SAIL infrastructure; security model v3 and provisioning v3 (some federation); choice of two-factor authentication platform; geo restrictions; project-level encryption
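The probabilistic matching engine mentioned above scores candidate record pairs field by field; a Fellegi-Sunter-style score sums log-likelihood ratios of agreement. A toy version follows, with invented weights that are not SAIL's actual parameters:

```python
import math

# Toy Fellegi-Sunter scoring. For each field, m = P(agree | true match)
# and u = P(agree | non-match). Agreement adds log2(m/u); disagreement
# adds log2((1-m)/(1-u)). The weights here are purely illustrative.
WEIGHTS = {
    "surname":  (0.95, 0.01),
    "dob":      (0.99, 0.001),
    "postcode": (0.90, 0.05),
}

def match_score(rec_a: dict, rec_b: dict) -> float:
    score = 0.0
    for fld, (m, u) in WEIGHTS.items():
        if rec_a.get(fld) == rec_b.get(fld):
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score

a = {"surname": "JONES", "dob": "1970-01-02", "postcode": "SA2 8PP"}
b = {"surname": "JONES", "dob": "1970-01-02", "postcode": "SA2 8PP"}
c = {"surname": "SMITH", "dob": "1980-12-31", "postcode": "CF10 3AT"}
```

Pairs above an upper threshold are accepted as matches, below a lower one rejected, and those in between sent for clerical review; the match confidence attached to each ALF in the split-file flow comes from exactly this kind of score.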

National Research Data Appliance (NRDA), simplistic viewpoint. NRDA1, NRDA2: user interface for dataset management; matching and linkage; data loader; data quality; data catalogue; pluggable architecture. NRDA3: 1st deployment to an NHS Trust this month

UK Secure Research Platform (UKSeRP), simplistic view: portal; virtual desktops; NRDA; security; probabilistic linkage; data catalogue, documentation, metrics, quality. Tiers T1/T2/T3: IBM DB2 MP-DB; SQL 2014 cluster; PostgreSQL + PostGIS; ArcGIS; Hadoop cluster; virtualisation stack; IBM ICA; HPC / specialist; shared filestore; doc / community support

(Component diagram.) UKSeRP uses NRDA: user portal and service desk; data appliance with security v3, provisioning, capabilities, permissions, people and datasets; data loading, management, documentation, quality and versioning; data catalogue; probabilistic linkage; transport/sharing; anonymisation; trusted third party linkage; NLP. Shared infrastructure: IBM DB2 and SQL Server OLAP/BI; Cloudera Hadoop (Pig, HDFS); DFS/WebDAV filestore; SAS, SPSS; VMware View VDI and SCVMM virtual servers with templates; IBM Content Analytics, EDMS, CliniThink; VMware and Hyper-V; backup, recovery and DR; core Active Directory, accounts, DHCP, DNS, WSUS

Data Science Building: a new building solely for MRC / ESRC work; the whole building is considerably more secure/controlled than any existing building on campus; a SEAP Level 4 area on the top floor incorporates a server room and safe setting.

The tour so Farr!!

Health Informatics Group, Swansea University FARR ADRC Swansea Bio-Info (SAIL)

ADRC: linked to FARR, but administrative data. Makes use of previous investment in systems / knowledge / development. Very similar to FARR at the 1,000-foot view, with lots of differences in detail. Lots of time spent perfecting the design of the wheel (there must be a better design than square?). These datasets have not been shared at scale before, so there is lots of nervousness. NRDA, UKSeRP

New world for these data suppliers. Not a repository model: compile dataset, do research, publish, destroy; data is transitory and specific to a project. Data linkage: new linkage capabilities required in NRDA; possible encryption at source, with linkage based on encrypted demographics

New world for these data suppliers. Security: much higher security requirements; hoping for shared infrastructure, with ADRC on UKSeRP. All researchers must have Safe Researcher Training / certification; system admins and developers must be security cleared. Safe settings: physical locations with datasets locked to them; remote locations in Cardiff and Bristol (link back to FARR)


The joining up of efforts and re-use is absolutely critical. (Diagram:) routine data, free text, devices, medical images and bespoke data flow through remote NRDA systems, a TTP NRDA, an image NRDA and a bio-info NRDA into the UKSeRP research platform, alongside a compute cluster, documentation/metadata, a research image repository, structured data, anonymised images and the CLIMB system

(The following slides repeat the "joining up of efforts and re-use is absolutely critical" diagram, highlighting in turn the components being re-used: SAIL, FARR, ADRC, the MS Platform, the Biobank project, CLIMB, UKDP and Safe Share.)

The tour of routine data ends here!!

ADRC-Scotland & Farr Institute - Scotland Dr Stephen Pavis NHS Scotland

History in Scotland. NHS National Services Scotland has been linking data for over 20 years. Scottish Health Informatics Programme: empirical research; infrastructural design; public engagement; law and subsequent Guiding Principles; computing infrastructure (with separation of function). Data Linkage Framework (Scottish Government). Funding from ESRC (ADRC-S), MRC and 9 others (Farr and HIRC), and the Scottish Government (Data Linkage and Sharing Service)

The Scottish Model: facilitating research that is in the public interest whilst protecting individuals' privacy; avoiding large data warehouses but ensuring data can be brought together efficiently to answer important research questions; creating partnerships and networks across sectors (academia, public and commercial), but not selling data or allowing commercial companies direct access to individuals' personal information; sharing resources and expertise to create efficient public services (Campbell Christie report). http://www.scotland.gov.uk/topics/statistics/datalinkageframework

Farr Scotland and ADRC-S data resources. Health (birth to death): neonatal record; GP consultations; mental health; substance misuse; community care; dental; outpatients; hospital admissions; maternity; prescribing; A&E; screening; suicide; cancer registrations; child health surveillance; immunisation; imaging; laboratory. Administrative (birth to death): education; looked-after children; marriage; community care; care homes; HMRC; DWP; census (Scotland & UK)

IT Security Assurance. The NHS requires a System Security Protocol approved by the IT Security Officer within National Services Scotland; ADRC-S data suppliers require UK Government security classification. ADRN have agreed that project data will not exceed the Official-Sensitive category, and that each ADRC will provide an environment able to process data at the Official-Sensitive level

Scottish Informatics and Linkage Collaboration (SILC): shared services for research initiatives that process sensitive data. Farr Institute (MRC); Administrative Data Research Centre (ESRC); Urban Big Data Centre? Shared computing resources at the University of Edinburgh; eDRIS research coordination and advice (NSS); shared TTP linkage service at NRS; shared office space at BioQuarter (UoD and UoE)

eDRIS (edata Research & Innovation Service): a named person from start to finish. Services: single point of entry for health research; support for projects from start to finish; help with study design; expert advice on coding, terminology, metadata and study feasibility; liaison with data suppliers to secure data; agreement of deliverables and timelines; facilitation of required permissions; building relationships between data suppliers and customers; liaison with technical infrastructure (safe havens); analyses, interpretation and intelligence about data (where required)

ADS Essex. A researcher requiring access to linked data makes an advice/data request; the eDRIS coordinator provides advice and guidance, refers the data request to sources, and arranges training and researcher approval. Data sources (e.g. NHS, Social Services, Police or local datasets) send personal IDs with project IDs to the TTP, which maps them through the linking service; each source then supplies its project IDs with the payload data. Once the researcher is trained and approvals for linkage are in place, the researcher can access the de-identified linked dataset within the safe haven
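The flow above keeps personal IDs and payload apart by having the TTP map each personal ID to a project-specific ID, so datasets released to different projects cannot be joined on the ID. One common way to implement such a mapping is a keyed hash with a per-project secret; this is a sketch of that general technique, not the actual ADRN/eDRIS mechanism, and the "CHI-..." identifier is illustrative:

```python
import hashlib
import hmac

def project_id(personal_id: str, project_key: bytes) -> str:
    """TTP-side mapping: personal ID -> project-specific pseudonym."""
    digest = hmac.new(project_key, personal_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:12]

# Same person, two projects: the pseudonyms differ, so the two projects'
# linked datasets cannot be joined on the ID.
p1 = project_id("CHI-1234567890", b"project-1-secret")
p2 = project_id("CHI-1234567890", b"project-2-secret")

# Within one project the mapping is stable, so payload arriving from
# multiple data sources still links correctly.
p1_again = project_id("CHI-1234567890", b"project-1-secret")
```

Because only the TTP holds the project keys, neither the data sources nor the safe haven can reverse the pseudonyms or bridge projects.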

Challenges. Software: various packages with different pricing mechanisms; can we negotiate once for ADRC and Farr UK-wide? Being clear with researchers on the roles of ADS and eDRIS. Different funding and charging models between ADRC and Farr Scotland

Thank you for listening Stephen Pavis s.pavis@nhs.net

CLIMB Simon Thompson Research Computing Team University of Birmingham

CLIMB Project. Funded by the Medical Research Council (MRC). Four partner universities: Birmingham, Cardiff, Swansea, Warwick. ~£8m (~$13M) grant. A private cloud running 1,000 VMs over 4 sites, for microbial bioinformatics

The CLIMB Consortium. Joint PIs: Professor Mark Pallen (Warwick) and Dr Sam Sheppard (Swansea). Co-Is: Professor Mark Achtman (Warwick), Professor Steve Busby FRS (Birmingham), Dr Tom Connor (Cardiff)*, Professor Tim Walsh (Cardiff), Dr Robin Howe (Public Health Wales). MRC Research Fellows: Dr Nick Loman (Birmingham)* and Dr Chris Quince (Warwick). * Principal bioinformaticians architecting and designing the system

And Marius Bakke (University of Warwick, CLIMB) Since January 2015 Simon Thompson (University of Birmingham) Matthew Ismail (University of Warwick) Simon Thompson (Swansea University)

CLIMB: a separate OpenStack region per site, federated with a single gateway for access. Local high-performance GPFS, ~0.5PB per site. Ceph storage cluster replicated across sites for archive of VMs; between 2-5PB total usable over the 4 sites

Where are we? - OpenStack Birmingham kit delivered for OpenStack Proof of concept running (with real users) Cardiff, Swansea and Warwick awaiting deployment with OCF (NSSA mini tender) Collaborating with IBM GPFS development team on OpenStack issues

Where are we? - Ceph. Mini-tender under NSSA awarded to Dell; Ceph cluster orders placed with Dell; Inktank/Red Hat engaged to provide architecture and services assistance

What is emedlab? Jacky Pallas, UCL David Fergusson, Crick

emedlab is a joint project of 6 institutions: UCL, QMUL, LSHTM, the Crick, Sanger and EBI. Clinical, imaging and genomics data; cancer, cardiovascular and rare diseases. Linked to KCL, Farr London and Genomics England. Shared infrastructure in an off-site datacentre: a minimum of 9,000 cores and 4PB of data; colocation costs, networking

What is emedlab?

Benefits Data/compute architecture designed for medical bioinformatics Shared expertise and training 4 junior group leaders funded Farr/eMedLab Training Academy

Biomedical compute requirement: bags of memory. Not so much about compute power; lots of low-power cores for throughput. More storage, MORE, MORE! And not just storage volume: data complexity and heterogeneity

Data First Design? compute STORAGE

Logical Architecture for emedlab

Technology Highlights: x86 (6,000 cores); high-capacity 40Gb Mellanox networking; "chubby" nodes with ~500GB RAM per node; OpenStack / Red Hat Enterprise; GPFS storage (9PB raw)

iRODS (digression). Data management is critical, but enforcing systems in research is difficult. iRODS (integrated Rule-Oriented Data System), from the DICE team (UNC / San Diego), https://www.irods.org. A federated system with different zones and administrative domains. Micro-services (rules/policies) are triggered by specific events to implement project workflows, so each group can implement workflows to suit its needs. Federated instances support large-scale data management; wide-area instances have been implemented
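The core iRODS idea, that policies (micro-services) fire automatically when events such as a data-object put occur, can be caricatured in a few lines. This is a toy event dispatcher in Python, not iRODS rule-language syntax (real policies are written as rules such as `acPostProcForPut`), and the two policies are invented examples:

```python
# Toy sketch of event-triggered policy, in the spirit of iRODS rules.

RULES = {}  # event name -> list of registered policy functions

def on(event):
    """Decorator: register a policy to fire on an event."""
    def register(fn):
        RULES.setdefault(event, []).append(fn)
        return fn
    return register

def trigger(event, obj):
    """Fire every policy registered for this event, in order."""
    for rule in RULES.get(event, []):
        rule(obj)

@on("post_put")
def checksum_policy(obj):
    obj["checksummed"] = True   # e.g. verify and record a checksum

@on("post_put")
def replicate_policy(obj):
    obj["replicas"] = 2         # e.g. replicate to a second zone

data_object = {"path": "/zone/home/alice/reads.fastq"}
trigger("post_put", data_object)
```

Each group customising its own `RULES` table is the toy analogue of each iRODS zone defining its own policies while sharing the same data fabric.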

Shared co-location: under the Janet framework, any research organisation can contract with the supplier without a full OJEU process. Anchor tenants: UCL, King's, LSE, QMUL, Crick, Sanger. Interested: Bristol, Cancer Research Institute, Imperial. Genomics England? The aim is to physically co-locate large data sets to allow secure shared computation across them.

Offsite data centre, community cloud model: participants include UCL, NIMR, the Crick, King's College, Sanger, Imperial, LRI and others, with clinical data sharing over private networks (through lightpaths?). UK Janet pilot projects are expected this year. ELIXIR/CSC (Finland) have arrived at the same technical solutions independently; we hope UK-Finland collaboration will extend the connections.

Collaborative Space: Life Science Hub, emedlab & beyond (?)
- Promotes skills development (systems, informatics)
- Prototyping and deployment of standards across multiple entities (Global Alliance)
- Promotes collaboration at both the IT and informatics levels: faster development, less duplication of effort, de-facto standards
- Produces real-world infrastructure tools (production use across collaborating partners)
- Provides sandboxes (testing, development)
- Attractive to industry partners (hardware evaluations, new-technology deployment)
- Prototypes public-cloud techniques in a private setting (a safe environment)
- Safe haven for sensitive data that should not move to the public cloud
- Easier access to larger data sets
- Pooled resources maximise capital-investment benefits for small and large users

WHO? MRC Medical Informatics Project, UK MED-BIO: aggregation, integration, visualisation and analysis of large, complex data.
Contact: Dr Sarah Butcher (s.butcher@imperial.ac.uk), Head, Bioinformatics Support Service. Applicant: Prof. Paul Elliott. Co-Is: Nicholson, Glen, Guo.
Partner institutions: Imperial; Institute of Cancer Research (ICR, Ashworth); European Bioinformatics Institute (EMBL-EBI, Steinbeck); Centre for the Improvement of Population Health through E-health Research (CIPHER, Lyons); MRC Clinical Sciences Centre (CSC, Petretto); MRC Human Nutrition Research (MRC-HNR, Griffin).
Industrial partners: Waters Corp., Bruker BioSpin, Huawei Technologies Co. Ltd., Thomson Reuters, AstraZeneca.
Awarded later than the others (April 2014), BUT with the same deadlines.
Science case: the exposome. The exposome concept drives the strategy for knowledge generation by UK MED-BIO. The main primary data volume producer is the Phenome Centre (metabolomics). Also: NGS (exomes, genomes, targeted); proteomics (mass spec); transcriptomics and methylation-based assays; gut metagenomics and meta-transcriptomics; genome-wide association studies. So the centre must support primary data analyses AND the integration and intelligent data-mining of large, heterogeneous, high-dimensional datasets from all of the above.

Metabolomics Data: Pre-grant Starting Point - Storage
- A single UPLC-MS profile (abundance vs m/z) is ~8 GB; maximum annual throughput is 50k samples, ~2 PB of data. Intermediate data modelling will inflate this further.
- Raw data is copied straight to archive and may be re-used perhaps twice in 5 years for methods validation.
- De-noising can shave 15-40% off data sizes; peak picking extracts ~1 MB of data from each profile.
- Proprietary formats are rife; open formats are possible but tend to compress less.
- No central storage, limited back-up and archiving for research data, and no direct link to the HPC centre.
- The Phenome Centre has its own limited storage capacity (250 TB) and managed backups, but is projected to need a multi-petabyte raw-data archive.
- The Bioinformatics service underpins some groups but has limited (old, full) storage (~200 TB) and back-up.
- Several crucial data-management solutions sit in different places, e.g. the Phenome Centre LIMS server and the IC Healthcare Tissue Bank Database.
- Very little physical data-centre space, on one College site only.
- Pressing need for a centralised tiered storage system with archiving.

Pre-grant Starting Point - Compute Challenges
- Heterogeneous job profiles; heavy but piecemeal use of cluster and cache-coherent memory systems.
- Sequence-based analyses run mainly on bioinformatics servers (max. 128 GB RAM per server); Windows desktops for some non-scaling analyses.
- No shared compute environment, software stack, job scheduling or storage between all groups.
- Already a significant compute bottleneck for large jobs: number of processors, but particularly jobs requiring large RAM. Some jobs already require >1 TB RAM for extended periods, and they are getting larger.
- Requirements for sand-boxed development environments and for centrally hosting non-HPC services.
- The system must be fit for purpose when the purpose will change over the project lifetime: big unknowns in user requirements (new groups, new fellowships, emerging technologies, software, methods, partners) and heterogeneous user profiles.
- Emerging codebase: e.g. metabolomics feature extraction currently runs on commercial Windows software, moving towards open-source solutions on the cluster (or even GPU eventually); Matlab/R code is being ported to C++.
- Little central infrastructure to build on; limited central knowledge base for parallel file systems, iRODS etc.
- tranSMART & eTRIKS integration not specifically funded.
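The storage figures above can be sanity-checked with simple arithmetic. A sketch, using only the numbers quoted on the slide: 8 GB per profile at 50k samples/year gives roughly 0.4 PB/year of raw profiles, so the quoted ~2 PB presumably also reflects replicate runs, intermediate modelling data and multi-year accumulation.

```python
# Rough capacity arithmetic from the slide's figures (a sketch, not an
# official planning number): ~8 GB per UPLC-MS profile, up to 50k
# samples/year, de-noising saving 15-40%, peak picking ~1 MB/profile.
PROFILE_GB = 8
SAMPLES_PER_YEAR = 50_000

raw_tb_per_year = PROFILE_GB * SAMPLES_PER_YEAR / 1024   # ~390 TB/year raw
denoised_low = raw_tb_per_year * (1 - 0.40)              # best-case de-noising
denoised_high = raw_tb_per_year * (1 - 0.15)             # worst-case de-noising
peaks_gb_per_year = 1 * SAMPLES_PER_YEAR / 1024          # peak-picked output

print(f"raw archive growth : {raw_tb_per_year:.0f} TB/year")
print(f"after de-noising   : {denoised_low:.0f}-{denoised_high:.0f} TB/year")
print(f"peak-picked output : {peaks_gb_per_year:.0f} GB/year")
```

The three-orders-of-magnitude gap between raw profiles and peak-picked output is what makes the tiered raw-archive design below so important.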

Location, Location, Location
South Kensington data centre: cluster nodes, SGI cache-coherent memory nodes, tiered storage, tape archive, and a video wall with touch overlay for the meeting centre. Second site: tiered-storage duplication, tape-archive duplication, high-memory servers.

System Summary
- Cluster nodes: PowerEdge C6000/C6220, Xeon E5-2660v2 2.2 GHz, 3,040 cores already.
- High-memory servers: 7x PowerEdge R920, each with 1 TB RAM, 40 cores, 16 TB fast internal storage, a 20 TB local array and InfiniBand to tier 1.
- Cache-coherent memory nodes: SGI UV 2000 with 640 cores, 8 TB RAM and 350 TB usable locally attached scratch.
- Tiered storage from DDN on each of the 2 sites: 350 TB usable tier 1 (GPFS), 2 PB tier 2 (WOS), and a TSM tape archive on a Spectra T950 (2 PB LTO-6 capacity), with asynchronous replication between sites.

Where Are We Now? Unpacking, racking, installing; in use.
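The tier 1 GPFS / tier 2 WOS / tape layout above implies data-placement rules of the kind GPFS ILM policies express. A minimal Python sketch of the idea, assuming hypothetical 30- and 180-day access-age thresholds (real GPFS tiering rules are written in GPFS's own SQL-like policy language, not Python):

```python
# Hedged sketch of the placement logic a tiered system like the one above
# might apply. The 30/180-day thresholds are hypothetical examples, not
# figures from the slides.
from datetime import datetime, timedelta

def choose_tier(last_access: datetime, now: datetime) -> str:
    """Map a file's last-access age onto a storage tier."""
    age = now - last_access
    if age < timedelta(days=30):
        return "tier1-gpfs"     # fast parallel file system
    if age < timedelta(days=180):
        return "tier2-wos"      # high-capacity object store
    return "tape-archive"       # TSM tape, asynchronously replicated

now = datetime(2015, 6, 1)
print(choose_tier(datetime(2015, 5, 20), now))   # recently used -> tier 1
print(choose_tier(datetime(2015, 2, 1), now))    # cooling off   -> tier 2
print(choose_tier(datetime(2014, 1, 1), now))    # cold          -> tape
```

This matches the slide's workflow of raw data going straight to archive and only occasionally being recalled for methods validation.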

Challenges/questions
Operations: all hardware set up; existing data transferred and tiering rules configured; establish a standardised software environment for compute; data flow established; user grouping established; data flow outwards with partners. THEN iRODS?? (a test setup is ready to configure). A data-sharing environment? Interaction with patient data systems? tranSMART/eTRIKS? BUSINESS MODEL.
Operations Group: full-time sysadmin (TBC, being recruited); bioinformatician/data manager: James Abbott (Bioinformatics Support Service) + TBC; Sarah Butcher (ops chair), Bioinformatics Support Service; Steve Lawlor, ICT Data Centre Manager; Simon Burbidge, ICT HPC Manager; Jake Pearce, NIHR/MRC Phenome Centre Data Manager.

UVRI/MRC Medical Informatics Centre (UMIC)
PIs: Pontiano Kaleebu (MRC Uganda), Manj Sandhu (Sanger)
Budget: ~£2.9m funded by MRC: ~£900k capital equipment, ~£2m resource budget (staff, network connectivity, ...). Capital spend all committed as of 12/2014:
- Physical infrastructure (£280k): host building funded by the Wellcome Trust (£0); existing DR building (£0); contributions to the on-site electrical upgrade (£60k); data centre and DR upgrades (£220k)
- IT equipment (£620k)

UMIC Location Ugandan Virus Research Institute (UVRI) Campus Entebbe, Uganda

UMIC Physical Infrastructure: offices 30 m²; data centre 32 m²

UMIC Compute & Storage
Compute equipment:
- 4x HP BLc7000 blade enclosures (the main compute resource): 512 cores each (AMD CPUs), 4 TB RAM each (8 GB/core), 2x 10 GbE per enclosure
- 4x HP DL380p servers (virtual machine hosts for infrastructure): 20 cores and 256 GB RAM each
Storage equipment:
- 2x high-speed scratch filesystems: Intel Enterprise Edition Lustre, 2 MDT/MGS servers (HA), 4 OSS servers each, 256 TB usable on each filesystem
- 2x long-term reliable (aka "slow") storage: HP SL4540 tray-node servers, 348 TB replicated across two servers (one in the DR building)

UMIC Networking
Network equipment: Juniper MX104 router HA pair; Juniper SRX3400 firewall HA pair; 5x Juniper EX4300 1 GbE switches; 3x Juniper EX4550 10 GbE switches; 3x Aruba Instant 115 wireless access points.
Connectivity: Google is installing 2x (redundant) 1 Gb fibre links; regional connectivity at up to 1 Gbps via RENUnet; overseas connectivity initially at 10 Mbps. Resource spend will stay constant over time, so bandwidth will increase.
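The gap between the 10 Mbps overseas link and the 1 Gbps regional links is easiest to see as transfer time. A back-of-envelope sketch (the 100 GB dataset size is a hypothetical example, not a figure from the slides, and the calculation ignores protocol overhead):

```python
# Idealised transfer-time arithmetic for the link speeds quoted above,
# illustrating why overseas bandwidth matters for genomics-scale data.
def transfer_hours(size_gb: float, link_mbps: float) -> float:
    """Overhead-free transfer time in hours (decimal units: 1 GB = 8000 Mb)."""
    size_megabits = size_gb * 1000 * 8
    return size_megabits / link_mbps / 3600

print(f"100 GB over 10 Mbps : {transfer_hours(100, 10):.1f} h")
print(f"100 GB over 1 Gbps  : {transfer_hours(100, 1000):.2f} h")
```

Roughly a day versus a quarter of an hour for the same hypothetical dataset, which is why the plan above trades constant resource spend for steadily increasing bandwidth.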

Management & Personnel
- Technical Infrastructure Working Group; Scientific Working Group
- Support staff: Project Administrator (hired)
- Informatics staff: 1x Senior Bioinformatician (recruitment ongoing)
- Technical staff: 1x Systems Manager (hired; currently training in the UK); 3x other systems posts (recruitment in 2015 Q2)

MRC Medical Bioinformatics Centre, ESRC Consumer Data Research Centre, Integrated Research Campus (V1)
David Golding, Tom Fleming, January 2015
University of Leeds 2015

Organisational design
- A Joint Projects Board (LTHT & University) oversees the Leeds Institute of Data Analytics (LIDA), which houses the MRC Medical Bioinformatics Centre (MBC) and the ESRC Consumer Data Research Centre (CDRC).
- MBC researchers' example specialisms: clinical, data scientist, statistician, epidemiologist, health economist. CDRC researchers' example specialisms: data scientist, geographer, statistician.
- Each centre has its own Centre Director, Centre Manager, Research Operations and Centre Operations Team; an IT Director and IRC Lead sit across both, supported by an IRC Development Manager and IRC Developer (and steady state).
- Integrated Research Campus (IRC) Team: Head of Service Management, plus Service Support, HPC, Servers and Storage, Networking, Datacentre, Desktop Support, Desktop Development and Security teams, drawn from the University IT Operations Teams.

Service design
Research services (Centre/IRC staff):
- Data administration (Research Operations; the Centre Manager reports to the Principal Investigators): data profiling, data linkage, data cleaning, data analysis, job obfuscation, data aggregation/abstraction, security, audit
- Data transfer management and application & environment management (IRC Data Services Team; the IRC Development Manager reports to the UoL IT Head of Development, supported by the IRC Developer)
Technology services (IT Operations staff; Centre Operations Teams and UoL IT Operations):
- Desktop Development Team: physical desktop builds
- Desktop Support Team: virtual desktops, applications support, operating systems
- Servers and Storage Team: virtual servers (e.g. SQL, Achiever), applications, operating systems; logical storage management (storage areas and access controls for research groups, for data administration services, and for data deposit/gateway); virtual desktop platform, virtualisation hypervisor, physical servers, physical storage
- HPC Team: applications on HPC, operating systems for HPC, storage for HPC
- Networking Team: network
- Datacentre Team: power and cooling, racks

Platform design
Researchers and data scientists both inside and outside the physical Centre (Worsley L11) work through virtual desktops; the Integrated Research Campus is divided into zones:
- External gateway zone: a deposit gateway (data deposited by data providers under Data Transfer Agreements; projects have ethical and governance approval; landing areas with data profiling, security and audit of datasets arriving from providers) and a publishing gateway (external release of authorised, risk-profiled datasets to data consumers, with data controller, code controller, security and audit checks on candidates for publishing).
- Data Services zone (IRC Data Services Team): Data Services servers and virtual desktops under a VDI control service, with database servers, risk-profiling, linking and analysis tools; a Data Services working area for data profiling, cleaning, linkage and analysis; and a Data Services Store (provided datasets, master linking table, published datasets, linked risk-profiled datasets) with version control and a store manager. Linked, risk-profiled datasets are released internally to the research zone.
- Research zone: research-group servers and researcher virtual desktops under a VDI control service, with database servers, shared-licence applications, statistics, analysis, visualisation, exploration, collaboration and research-group sharing tools; a research working area; and a Research Working Store (working copies, outputs) with check-out/check-in of working copies for daily working and version control.
- Internal gateway zone: an analysis gateway and transit area for the release of obfuscated/anonymised data to HPC for analysis, with outputs returned to storage; security, audit and virus-scan controls throughout.
- System administration zone: IRC administration, virtual desktop administration, identity management, system/application/audit monitoring and alerting, system-update services and a directory service, linked to the corporate directory and University core systems (IT Operations Servers & Storage Team).
- High Performance Compute: the MBC HPC cluster (home, nobackup, scratch; BCGene; LICAP), the MBC SGI UV2 (scratch) and the Farr SGI UV2 (HMR only; scratch), run by the IT Operations HPC team; shared network and deployment hosts?