HPC infrastructure at King's College London and Genomics England. Tim Hubbard (@timjph), King's College London, King's Health Partners, Genomics England, Wellcome Trust Sanger Institute. Farr-ADRN-MB e-infrastructure workshop, 16th January 2015
King's College London: anchor tenant at the Infinity Data Centre in Slough. Existing HPC recently procured for KCL will be relocated alongside new HPC procured for two BRCs to create a large research facility.
HPC3: KCL's existing HPC environment consists of a Beowulf Linux cluster, InfiniBand network, Grid Engine, 1,464 CPU cores, 8 GPUs, 4.2TB RAM and 87TB usable Lustre storage.
KCL collaborates with two other members of King's Health Partners to form NIHR-funded Biomedical Research Centres (BRCs): the Biomedical Research Centre for Mental Health, with the South London and Maudsley NHS Foundation Trust, and the Comprehensive Biomedical Research Centre, with Guy's and St Thomas' NHS Foundation Trust.
Both BRCs have a significant (and similar) research computing requirement to process biomedical imaging, -omics and clinical record data. They have recently been awarded grants totalling around £2.5 million from the Maudsley Charity and Guy's and St Thomas' Charity to expand research computing capacity. They are joining forces with KCL HPC3 to create a flexible research computing environment for KHP in the new Infinity SDC.
Hardware. InfiniBand compute (in addition to HPC3): 64 x Haswell (2 x 10-core) nodes with 128GB RAM; 32 x Haswell (2 x 10-core) nodes with 256GB RAM. 10GbE compute: existing BRC-MH 15 x 16-core, 256GB RAM HP blade servers; 32 x Haswell (2 x 10-core) nodes with 256GB RAM. Tier 1 storage (Lustre): ~500TB scratch. Tier 2 storage (Ceph): ~3PB usable. Off-site backup storage.
Research Computing Service Platform. Grid Engine compute cluster: InfiniBand network, Lustre scratch. OpenStack cloud: 10GbE network; cloud nodes may be used to temporarily expand the Grid Engine cluster (see the sketch below). iRODS research data store: built on the Ceph object store. All layered over the shared hardware.
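A minimal sketch of how a cloud node might be temporarily "burst" into the Grid Engine cluster, assuming the `openstack` and `qconf` command-line tools are installed and configured on the head node; the image, flavour and host-group names are placeholders, not the real KCL configuration.

    import subprocess

    NODE = "burst-node-01"  # hypothetical hostname

    # 1. Boot a cloud instance on the OpenStack side (10GbE network).
    subprocess.run(["openstack", "server", "create", "--image", "compute-image",
                    "--flavor", "m1.xlarge", "--wait", NODE], check=True)

    # 2. Make the host schedulable by Grid Engine, e.g. by adding it to a burst
    #    host group (the exec-host declaration itself is site-specific).
    subprocess.run(["qconf", "-aattr", "hostgroup", "hostlist", NODE, "@burst"],
                   check=True)

    # 3. When the extra capacity is no longer needed, reverse both steps.
    subprocess.run(["qconf", "-dattr", "hostgroup", "hostlist", NODE, "@burst"],
                   check=True)
    subprocess.run(["openstack", "server", "delete", NODE], check=True)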
Three domains, each with its own stack: public domain (OpenStack, iRODS store); research domain (Grid Engine, OpenStack, iRODS store); NHS domain (Grid Engine, OpenStack, iRODS store).
Genomics England
Genomics England: proposed data flows. Sequencing centres and a sample repository feed a safe haven holding anonymised clinical data and DNA sequence, with refreshable identifiable clinical data linked to anonymised whole-genome sequence. Samples, patient consent, EHR, primary care and hospital episode data sit on the NHS side; patient data stays on the NHS side of the firewall and only processed results pass outside. Annotation apps and clinical reports support clinical genetics, cancer and public health across NHS Trusts, patients and the public. Access: clinicians and academics (GeCIP) and industry. Pilots: selected centres, CRUK, BRCs. Main programme: Genomic Medicine Centres.
MRC infrastructure award. 2014-15: £8m; Skyscape; 5PB of storage and tape through NSSA tendering; rental of CPUs. 2015-16: £16m; full procurement.
Data sharing. Open to all, where the subject consented: human genome projects such as HapMap and 1000 Genomes; repositories: GenBank, ENA, DDBJ (INSDC). Managed distribution (must be a bona fide researcher): genetic data for disease cohorts, with phenotypes; repositories: dbGaP, EGA (encrypted distributions etc.). Managed access, no redistribution: Genomics England datasets; repository: the GeL data centre.
A future with closed datasets: multiple hospital and national datasets with no-redistribution policies; the value for research lies in generating statistics across this global set.
Global Alliance for Genomics and Health http://genomicsandhealth.org/
Developing the UK infrastructure for e-health research John Ainsworth Deputy Director, Farr @ HeRC 16 January 2015 UCL Workshop
From Big Data to Big Scale. Data: vast volume, velocity and variety (a tsunami). Methods and models: supra-linear growth in papers and tools (a blizzard). Expertise: a similar number of analysts (a drought). Three big health e-research challenges: 1. assist hypothesis formation with data; 2. weed out non-reproducible findings early; 3. couple data-intensive healthcare and research.
Who is Farr? "Diseases are more easily prevented than cured, and the first step to their prevention is the discovery of their exciting causes." William Farr, 30 November 1807 to 14 April 1883.
What is Farr? A distributed research institute that will integrate and scale, at the UK level, the work of four Health Informatics Research Centres (HIRCs)
History. In August 2012, ten UK funding agencies awarded four Centres of Excellence in e-health informatics research. The four HIRCs aim to optimise the use of health records in research and address the UK's capacity-building requirements to support a sustainable health informatics research base.
Health Informatics Research Centres. Scotland: Dundee, Glasgow, Edinburgh, St Andrews, Aberdeen, Strathclyde, MRC HGU, NHS NSS. HeRC: Manchester, York, Lancaster, Liverpool, Sheffield, AHSNs. CIPHER: Swansea, Bristol, Cardiff, Exeter, Leicester, Sussex, NWIS, Public Health Wales. UCL Partners: UCL, LSHTM, Queen Mary, Public Health England. (Map source: www.m62.net)
More history. In 2013, the Farr Institute was created to support the HIRCs' collective work: Farr Institute @ CIPHER, Farr Institute @ HeRC, Farr Institute @ Scotland, Farr Institute @ UCL Partners. Together they bring a total of 20 academic institutions and two MRC units. Farr will act as the nexus of the UK Health Informatics Research Network.
Aims of the Farr: create a physical and electronic infrastructure to support and accelerate the Centres' collaborative work; support partnerships by providing a physical structure to co-locate NHS organisations, industry and other UK academic centres; facilitate collaboration, the sharing of datasets and the adoption of common standards; develop new opportunities for future data linkage at scale.
UK Health Informatics Research Network. Farr will lead the UK Health Informatics Research Network, develop the Network's 5-year strategy and provide a blueprint for its activities. The Network aims to strengthen the UK's capability in health informatics research by harnessing the expertise in the Farr and the wider UK research community, and is open to all members of the research community. Prof Carole Goble.
HeRC eLab. Based at Vaughan House; IGTK Level 2; ISO 27001:2013 certification in process. Initial Farr investment: labs, safe haven, HPC, devices, VC. Additional MRC CRI funding: Clinical Proteomics Centre, UK Dementia Platform, Single Cell Genomics. Services: secure file storage; secure file exchange; secure file transfer across NHS N3 and across public networks; eLab data management services via a web interface; data linkage; data repository; research data extracts; data analysis software and compute; virtual machine service from remote locations and from the secure data analysis environment; dataset inventory; personal health data repository; HPC remote access.
[Architecture diagram] NHS users, patients and devices connect over N3 to the NHS side (NHS IGTK: remote repository, pseudonymised data repository, data transfer, NHS eLab). Researchers connect over Janet and the HAN, with two-factor authentication, to the HeRC safe haven (ISO 27001 ISMS): research repository, transient repository, applications and compute, remote desktop, single sign-on, dataset catalogue and AAAI, with access to N8 HPC; oversight by the HeRC Governance Board.
Big Data funding for health, medical and administrative data. MRC: £20M for the four Farr Institute nodes, for e-infrastructure and buildings, June 2013. ESRC: £34M for four Administrative Data Research Centres (ADRC) and the Administrative Data Service, Nov 2013. MRC: £39M for six Medical Bioinformatics Initiative projects, Feb 2014.
The Safe Share project. Background: there is significant investment in medical research trying to unlock the value of data collected by the NHS and wider government, in order to further knowledge of disease and ill health and improve medical treatments. The project builds on the recent development of the MRC- and partner-funded Farr Institute, Medical Bioinformatics Initiative and Administrative Data Research Network, and their infrastructure requirements. Challenges: health data is very personal and sensitive, and there is rightly public concern about any real or perceived inappropriate access; there are significant ethical, consent-related and practical hurdles to making use of the data for research.
Meeting the Big Data challenge: being able to access data securely; being able to share data safely; being able to work together collaboratively. Solve the problem once for everyone, with solutions that can work at scale, and give the public confidence that data is appropriately protected. The project will be run in two parts, each with a set of pilots: 1. secure connectivity, a higher-assurance network (HAN); 2. Authentication, Authorisation and Accounting Infrastructure (AAAI).
Secure connectivity use cases. Inter-Farr: an initial trial between the Farr centre at Manchester and the N8 HPC at Leeds, extending to the other Farr centres (Swansea, London and Dundee). Intra-Farr: securely link the Swansea Farr centre with one of its collaborative projects at Bristol (ALSPAC). ADRC/Farr pod to data centre: connectivity between accredited secure rooms and ADRC data centres for remote working.
Authentication, Authorisation and Accounting Infrastructure (AAAI) use cases. Dementia study (University of Oxford): demonstrate researchers using home-institution credentials and a generic user-request model to authenticate access to a set of relevant national and study-specific datasets. HeRC: access between N8 HPC and DiRAC facilities using home-institution credentials. eMedLab: partners will be able to analyse human genome data, medical images, and clinical, psychological and social data, demonstrating a common AAAI with access via a common credential. Swansea University Health Informatics Group: investigating whether Moonshot can provide an authentication mechanism allowing use of home-institution credentials.
Partners. The project is funded and managed by Jisc, working in partnership with the wider initiatives (the Farr Institute, the MRC Medical Bioinformatics Initiative, the Administrative Data Research Network) and incorporating the organisations involved in the pilots: University of Manchester, UCL, Swansea University, University of Dundee, Francis Crick Institute, University of Oxford, University of Leeds, University of Sheffield, University of Southampton, University of Bristol, HSCIC.
Timetable. Agreement on requirements and use cases: complete. Funding approval: complete. Detailed project planning: in progress. Detailed design and architecture of infrastructure: in progress. Operational standards, development controls: Q1 2015. Infrastructure deployment, installation and commissioning: Q2 2015. Initial operation and testing with customers: Q3 2015. Customer trials begin: Q4 2015. External ISO 27001 certification process: Q1 2016. Recommendations: Q2 2016.
The 3Rs of data science: Repeat, Reproduce, Reuse The 1T of data science: Transparency
Reproducibility: a principle of the scientific method; evidence to test and justify claims; comparison of results and methods; peer review. (http://xkcd.com/242/) Prof Carole Goble
Defining drug exposure: 192 different datasets arise from the choices made at four decision nodes: 1. selecting the stop date; 2. handling a missing stop date; 3. overlapping prescriptions; 4. small treatment gaps.
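A small illustration of how decision-node choices multiply into 192 candidate exposure datasets. The option counts and option names below are hypothetical (chosen so the product is 192 to match the slide), not taken from the underlying study.

    from itertools import product

    # Hypothetical options at each of the four decision nodes (4 * 4 * 4 * 3 = 192).
    decision_nodes = {
        "stop_date_rule":        ["prescribed_duration", "fixed_30d", "fixed_60d", "next_prescription"],
        "missing_stop_date":     ["drop", "impute_median", "impute_fixed", "carry_forward"],
        "overlap_handling":      ["sum", "truncate", "shift", "ignore"],
        "small_gap_threshold_d": [0, 15, 30],
    }

    variants = list(product(*decision_nodes.values()))
    print(len(variants))  # 192 candidate drug-exposure datasets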
A Data Science Commons: publish, discover and reuse data science artefacts as Research Objects. Rules: 1. Each unique research object placed into the Commons must have a unique identifier. 2. That unique identifier must allow the research object to be found, shared and attributed. 3. Attribution requires associated provenance that, minimally, identifies the creator(s) of the unique research object, those who have subsequently modified it, and how it was modified. More at www.farrcommons.org
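A minimal sketch of the three Commons rules as a data structure; the field names and example values are illustrative, not part of the Farr Commons specification.

    import uuid
    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class ProvenanceEvent:
        """Who changed the object, when, and how (Rule 3)."""
        agent: str
        action: str
        timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    @dataclass
    class ResearchObject:
        """A findable, shareable, attributable artefact (Rules 1 and 2)."""
        creator: str
        payload_uri: str
        identifier: str = field(default_factory=lambda: str(uuid.uuid4()))  # Rule 1: unique ID
        provenance: list = field(default_factory=list)                      # Rule 3: provenance

        def modify(self, agent: str, action: str):
            self.provenance.append(ProvenanceEvent(agent, action))

    ro = ResearchObject(creator="j.bloggs@example.ac.uk", payload_uri="doi:10.0000/example")
    ro.modify("a.n.other@example.ac.uk", "re-ran analysis with an updated cohort definition")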
Farr ADRN Medical Bioinformatics e-infrastructure Workshop Simon Thompson The Swansea University version simon@chi.swan.ac.uk
Health Informatics Group, Swansea University FARR ADRC Swansea Bio-Info (SAIL)
FARR: based on the SAIL Databank of linked routine data, an internationally recognised data linkage system: 4.7 million people; 9 billion rows of data; over 20 core national datasets and 200+ project-specific datasets. Datasets include GP primary care; inpatient and outpatient secondary care; A&E and emergency care; pathology and LIMS; births and deaths; child health and perinatal screening; breast and cervical screening; cancer registries; WCB, CARIS, WCISU; education data. Central repository/warehouse; 300 users; over £70m in research income; NHS Wales connectivity (DAWN2-N3); infrastructure inside NHS core data centres.
Based on the split-file principle. The supplier's source data holds identity and clinical content together (ID, name, address, BP, diagnosis). It is split into: File 1, demographics plus link key (e.g. 56, Fred Bloggs, The Big House; 78, Jim Jones, 87 Peterson Rd; 45, Harry Lucas, 19 Meirwen); File 2, clinical data plus link key (56, 120/80, G33..; 78, 135/45, P123.; 45, 125/75, G77..). File 3, returned by the matching service, maps each link key to an Anonymised Linking Field (ALF) with a match confidence (56 -> 65276573, 88; 78 -> 32377722, 97; 45 -> 27638236, 95). Load into SAIL: ALF_E plus clinical payload only (4252, 120/80, G33..; 7482, 135/45, P123.; 8436, 125/75, G77..).
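A minimal sketch of the split-file flow on the slide: demographics plus link key go for matching, the clinical payload stays with the supplier, and the returned ALF replaces identity before load. The record values are the slide's toy examples; the hash used to stand in for the trusted third party's matcher is purely illustrative.

    import hashlib

    records = [
        {"id": 56, "name": "Fred Bloggs", "address": "The Big House", "bp": "120/80", "diag": "G33.."},
        {"id": 78, "name": "Jim Jones",   "address": "87 Peterson Rd", "bp": "135/45", "diag": "P123."},
    ]

    # File 1: demographics + link key (sent to the trusted third party for matching).
    file1 = [{"id": r["id"], "name": r["name"], "address": r["address"]} for r in records]

    # File 2: clinical payload + link key (identity never travels with the clinical data).
    file2 = [{"id": r["id"], "bp": r["bp"], "diag": r["diag"]} for r in records]

    # File 3: the TTP returns an Anonymised Linking Field (ALF) and a confidence score per
    # link key. A real matcher is probabilistic; a hash of the demographics stands in here.
    def assign_alf(demog):
        digest = hashlib.sha256((demog["name"] + demog["address"]).encode()).hexdigest()
        return {"id": demog["id"], "alf": int(digest[:8], 16), "conf": 95}

    file3 = [assign_alf(d) for d in file1]

    # Load into SAIL: join the clinical payload to the ALF and drop the original identifiers.
    alf_by_id = {row["id"]: row["alf"] for row in file3}
    sail_rows = [{"alf_e": alf_by_id[r["id"]], "bp": r["bp"], "diag": r["diag"]} for r in file2]
    print(sail_rows)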
FARR evolution: remote desktop (VDI) technology; single sign-on (Active Directory); shared security model / provisioning (v3); two-factor authentication; introduction of additional services (secure filestore, wiki, helpdesk, training); anonymisation of GIS datasets (residences and geo data). [Diagram: Active Directory; pooled standard-config and specialist/custom-config desktops; VMware View security servers (VPN, x3); VMware View connection broker; dedicated configurable data warehouse; two-factor authentication server.]
FARR evolution, building on initiatives: data/dataset documentation; data quality measurement; automation of processes / self-service; a new probabilistic matching engine (see the sketch below); natural language processing. New technologies: SQL Server 2014 cluster, Hadoop, R cluster; local and remote capabilities; Data Appliance; UKSeRP; white-labelling of the SAIL infrastructure; security model v3 and provisioning v3 (some federation); choice of two-factor authentication platform; geo restrictions; project-level encryption.
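A minimal Fellegi-Sunter-style sketch of probabilistic matching, for orientation only; the fields, m/u probabilities and example records are made up and are not the SAIL/NRDA matching engine.

    import math

    # Illustrative m/u probabilities (how often a field agrees among true matches vs
    # non-matches); real values would be estimated from the data, not hard-coded.
    FIELD_WEIGHTS = {
        "surname":  (0.95, 0.01),
        "dob":      (0.97, 0.005),
        "postcode": (0.90, 0.02),
    }

    def match_score(a, b):
        """Sum of log-likelihood ratios over agreeing/disagreeing fields."""
        score = 0.0
        for fld, (m, u) in FIELD_WEIGHTS.items():
            if a.get(fld) and a.get(fld) == b.get(fld):
                score += math.log2(m / u)              # agreement weight
            else:
                score += math.log2((1 - m) / (1 - u))  # disagreement weight
        return score

    rec_a = {"surname": "JONES", "dob": "1970-01-02", "postcode": "SA2 8PP"}
    rec_b = {"surname": "JONES", "dob": "1970-01-02", "postcode": "SA2 8PP"}
    print(match_score(rec_a, rec_b))  # high score -> likely the same person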
National Research Data Appliance (NRDA), simplistic viewpoint: a user interface for dataset management; matching and linkage; data loader; data quality; data catalogue; a pluggable architecture (NRDA1, NRDA2, NRDA3). First deployment to an NHS Trust this month.
UK Secure Research Platform (UKSeRP), simplistic view: a portal fronting virtual desktops; NRDA (security, probabilistic linkage, data catalogue, documentation, metrics, quality); tiered data platforms (T1/T2/T3): IBM DB2 MP-DB, SQL Server 2014 cluster, PostgreSQL + PostGIS, ArcGIS, Hadoop cluster; a virtualisation stack; IBM ICA; HPC/specialist resources; shared filestore; documentation and community support.
UKSeRP uses NRDA. User portal and service desk. Data Appliance: security v3; provisioning (capabilities, permissions, people, datasets); data loading; data management; data documentation; data quality; versioning; data catalogue; probabilistic linkage; transport/sharing; anonymisation. Trusted third party: probabilistic linkage, data catalogue, NLP. Shared infrastructure: IBM DB2 (data, OLAP, data management); SQL Server (data, OLAP, BI); Cloudera Hadoop (Pig, HDFS); files (DFS, WebDAV filestore); SAS, SPSS; VDI (VMware View templates, virtual servers, SCVMM); DB2, IBM Content Analytics, EDMS, CliniThink; VMware and Hyper-V; backup, recovery and DR; core services (Active Directory accounts, DHCP, DNS, WSUS).
Data Science Building New building solely for MRC / ESRC Whole building considerably more secure/controlled than any existing building on campus. SEAP Level 4 area on top floor incorporating a server room and safe setting.
The tour so Farr!!
Health Informatics Group, Swansea University FARR ADRC Swansea Bio-Info (SAIL)
ADRC: linked to FARR, but for administrative data. It reuses previous investment in systems, knowledge and development. Very similar to FARR at the 1,000-foot view, with lots of differences in detail. A lot of time is spent perfecting the design of the wheel (there must be a better design than square?). These datasets have not been shared at scale before, so there is a lot of nervousness. Built on NRDA and UKSeRP.
A new world for these data suppliers. Not a repository model: compile dataset, do research, publish, destroy; data is transitory and specific to a project. Data linkage: new linkage capabilities are required in NRDA; possibly encryption at source, with linkage based on encrypted demographics (see the sketch below).
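A sketch of linkage on encrypted demographics, assuming each supplier applies a keyed hash (HMAC) to normalised demographics with a shared key before anything leaves site; the key handling and normalisation shown are deliberately simplified and are not the NRDA design.

    import hmac, hashlib

    # A shared secret distributed to data suppliers out of band; in practice key
    # management would sit with the trusted third party, not in source code.
    LINKAGE_KEY = b"example-shared-key"

    def encrypted_link_token(surname: str, dob: str, postcode: str) -> str:
        """Derive a linkage token from normalised demographics at the data source."""
        normalised = f"{surname.strip().upper()}|{dob}|{postcode.replace(' ', '').upper()}"
        return hmac.new(LINKAGE_KEY, normalised.encode(), hashlib.sha256).hexdigest()

    # Two suppliers derive the same token for the same person without exchanging
    # names or addresses, so records can be linked on the token alone.
    token_gp  = encrypted_link_token("Jones", "1970-01-02", "SA2 8PP")
    token_hos = encrypted_link_token("JONES ", "1970-01-02", "sa2 8pp")
    print(token_gp == token_hos)  # True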
A new world for these data suppliers: security. Much higher security requirements; hoping for shared infrastructure, with the ADRC on UKSeRP. All researchers must have Safe Researcher training/certification (linking back to FARR). System administrators and developers must be security-cleared. Safe settings: physical locations that datasets are locked to, with remote locations at Cardiff and Bristol (linking back to FARR). NRDA and new linkage; encryption at source for linkage.
The joining up of efforts and re-use is absolutely critical. [Diagram, repeated with different elements highlighted: the UKSeRP research platform sits at the centre, fed by remote NRDA systems (routine data, free text), a TTP NRDA (routine data), devices, bespoke data, a compute-cluster NRDA (documentation/metadata), medical images via a research image repository and image NRDA (structured data, anonymised images), and a bio-informatics NRDA linked to the CLIMB system. Successive slides highlight SAIL, FARR, the ADRC, the MS Platform, a biobank project, CLIMB, UKDP and Safe Share as users of the same shared infrastructure.]
The tour of routine data ends here!!
ADRC-Scotland & Farr Institute - Scotland Dr Stephen Pavis NHS Scotland
History in Scotland. NHS National Services Scotland has been linking data for over 20 years. Scottish Health Informatics Programme: empirical research; infrastructural design; public engagement; law and the subsequent Guiding Principles; computing infrastructure (with separation of function). Data Linkage Framework (Scottish Government). Funding from the ESRC (ADRC-S), the MRC and 9 others (Farr and HIRC), and the Scottish Government (Data Linkage and Sharing Service).
The Scottish model. Facilitating research that is in the public interest whilst protecting individuals' privacy. Avoiding large data warehouses, but ensuring data can be brought together efficiently to answer important research questions. Creating partnerships and networks across sectors (academia, public and commercial), but not selling data or allowing commercial companies direct access to individuals' personal information. Sharing resources and expertise to create efficient public services (Campbell Christie report). http://www.scotland.gov.uk/topics/statistics/datalinkageframework
Farr Scotland and ADRC-S data resources. Health records across the life course (birth to death): neonatal record; GP consultations; mental health; substance misuse; community care; dental; outpatients; hospital admissions; maternity; prescribing; A&E; screening; suicide; cancer registrations; child health surveillance; immunisation; imaging; laboratory. Administrative records (birth to death): education; looked-after children; marriage; community care; care homes; HMRC; DWP; census (Scotland & UK).
IT security assurance. The NHS requires a System Security Protocol approved by the IT Security Officer within National Services Scotland. ADRC-S data suppliers require UK Government security classification. The ADRN has agreed that project data will not exceed the Official-Sensitive category, and that each ADRC will provide an environment able to process data at the Official-Sensitive level.
Scottish Informatics and Linkage Collaboration (SILC): shared services for research initiatives that process sensitive data. Members: Farr Institute (MRC); Administrative Data Research Centre (ESRC); Urban Big Data Centre (?). Shared computing resources at the University of Edinburgh; eDRIS research coordination and advice (NSS); shared TTP linkage service at NRS; shared office space at BioQuarter (UoD and UoE).
eDRIS (electronic Data Research and Innovation Service): a single point of entry for health research, with a named person supporting each project from start to finish. Services: help with study design; expert advice on coding, terminology, metadata and study feasibility; facilitating completion of required permissions; agreeing deliverables and timelines; liaison with data suppliers to secure data; building relationships between data suppliers and customers; liaison with technical infrastructure (safe havens); analyses, interpretation and intelligence about data (where required).
[Diagram: linkage workflow.] A researcher requiring access to linked data approaches the ADS at Essex for advice, or makes a data request; the eDRIS coordinator provides advice and guidance, arranges training and researcher approval, and refers the data request to the data sources (e.g. NHS, Social Services, Police or local datasets). Each source sends personal IDs and its own project IDs to the TTP linking service, which maps between the project IDs; the sources then release their project IDs with the payload data. Once training and approvals for linkage are in place, the researcher can access the de-identified, linked dataset within the safe haven.
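A minimal sketch of the two-level indirection in that flow, with made-up identifiers: each source assigns its own project-specific ID, the TTP sees only the ID columns, and the researcher never sees a personal identifier or the mapping to it.

    # Data source 1 (e.g. NHS): personal ID -> project ID 1
    source1_index = {"CHI-0001": "P1-aaa", "CHI-0002": "P1-bbb"}
    # Data source 2 (e.g. Social Services): personal ID -> project ID 2
    source2_index = {"CHI-0001": "P2-xxx", "CHI-0003": "P2-yyy"}

    # The TTP linking service sees only the ID columns and produces the mapping
    # between project IDs for people present in both sources.
    mapping = {p1: source2_index[pid]
               for pid, p1 in source1_index.items() if pid in source2_index}

    # Each source releases payload data keyed only by its own project ID.
    payload1 = {"P1-aaa": {"admissions": 3}}
    payload2 = {"P2-xxx": {"benefit_claims": 1}}

    # Inside the safe haven the researcher joins on the mapping, never on personal IDs.
    linked = {p1: {**payload1[p1], **payload2[p2]}
              for p1, p2 in mapping.items() if p1 in payload1 and p2 in payload2}
    print(linked)  # {'P1-aaa': {'admissions': 3, 'benefit_claims': 1}}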
Challenges: software, with various packages and different pricing mechanisms (can we negotiate once for ADRC and Farr, UK-wide?); being clear to researchers about the respective roles of the ADS and eDRIS; different funding and charging models across ADRC and Farr Scotland.
Thank you for listening Stephen Pavis s.pavis@nhs.net
CLIMB Simon Thompson Research Computing Team University of Birmingham
CLIMB project: funded by the Medical Research Council (MRC); four partner universities: Birmingham, Cardiff, Swansea, Warwick; ~£8m (~$13M) grant; a private cloud running 1,000 VMs over 4 sites, for microbial bioinformatics.
The CLIMB Consortium. Joint PIs: Professor Mark Pallen (Warwick) and Dr Sam Sheppard (Swansea). Co-Is: Professor Mark Achtman (Warwick), Professor Steve Busby FRS (Birmingham), Dr Tom Connor (Cardiff)*, Professor Tim Walsh (Cardiff), Dr Robin Howe (Public Health Wales). MRC Research Fellows: Dr Nick Loman (Birmingham)* and Dr Chris Quince (Warwick). * Principal bioinformaticians architecting and designing the system.
And Marius Bakke (University of Warwick, CLIMB). Since January 2015: Simon Thompson (University of Birmingham), Matthew Ismail (University of Warwick), Simon Thompson (Swansea University).
CLIMB: a separate OpenStack region per site, with a single federated gateway for access; local high-performance GPFS, ~0.5PB per site; a Ceph storage cluster replicated across sites for archiving VMs, between 2 and 5PB total usable over the 4 sites.
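A minimal sketch of addressing one OpenStack region per site through a single client, using openstacksdk; the cloud name and region names are placeholders and assume matching entries in a local clouds.yaml, not the real CLIMB federation setup.

    import openstack

    SITES = ["Birmingham", "Cardiff", "Swansea", "Warwick"]  # hypothetical region names

    for region in SITES:
        # One credential set, one region per site; "climb" is a placeholder cloud entry.
        conn = openstack.connect(cloud="climb", region_name=region)
        servers = list(conn.compute.servers())
        print(f"{region}: {len(servers)} running VMs")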
Where are we? - OpenStack Birmingham kit delivered for OpenStack Proof of concept running (with real users) Cardiff, Swansea and Warwick awaiting deployment with OCF (NSSA mini tender) Collaborating with IBM GPFS development team on OpenStack issues
Where are we? - CEPH Mini tender under NSSA, awarded to Dell CEPH cluster orders placed with Dell Inktank/RedHat engaged to provide architecture and services assistance
What is emedlab? Jacky Pallas, UCL David Fergusson, Crick
eMedLab is a joint project between 6 institutions: UCL, QMUL, LSHTM, Crick, Sanger and EBI. Clinical, imaging and genomics data; cancer, cardiovascular and rare diseases. Linked to KCL, Farr London and Genomics England. Shared infrastructure in an off-site datacentre: a minimum of 9,000 cores and 4PB of data; colocation costs, networking.
What is emedlab?
Benefits Data/compute architecture designed for medical bioinformatics Shared expertise and training 4 junior group leaders funded Farr/eMedLab Training Academy
Biomedical compute requirement: bags of memory; not so much about compute power; lots of low-power cores for throughput; more storage (MORE, MORE!); not just storage volume but data complexity and heterogeneity.
Data First Design? compute STORAGE
Logical Architecture for emedlab
Technology highlights: x86 (6,000 cores); high-capacity 40Gb Mellanox networking; "chubby" nodes with ~500GB RAM per node; OpenStack / Red Hat Enterprise; GPFS storage (9PB raw).
iRODS (digression). Data management is critical, but enforcing systems in research is difficult. iRODS (integrated Rule-Oriented Data System), from the DICE team (UNC, San Diego), https://www.irods.org. A federated system with different zones and administrative domains. Project workflows: micro-services (rules/policies) triggered by specific events implement workflows, so each group can implement workflows to suit its needs. Federated instances for large data management; wide-area instances have been implemented.
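A conceptual sketch, in plain Python rather than the iRODS rule language, of the idea that policies fire on specific events such as a data object landing in a collection; the event names, zone name and actions are hypothetical.

    from collections import defaultdict

    policies = defaultdict(list)

    def on(event):
        """Register a function as a policy for a named event."""
        def register(fn):
            policies[event].append(fn)
            return fn
        return register

    def fire(event, **ctx):
        for policy in policies[event]:
            policy(**ctx)

    @on("put")
    def checksum_and_replicate(path, zone, **_):
        print(f"checksum {path}; replicate from {zone} to the federated archive zone")

    @on("put")
    def extract_metadata(path, **_):
        print(f"extract and register metadata for {path}")

    # A file landing in a group's collection triggers that group's workflow.
    fire("put", path="/exampleZone/home/projectA/sample001.bam", zone="exampleZone")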
Shared co-location. Janet framework: any research organisation can contract with the supplier without a full OJEU process. Anchor tenants: UCL, King's, LSE, QMUL, Crick, Sanger. Interested: Bristol, Cancer Research Institute, Imperial, Genomics England? Physically co-locating large datasets to allow secure shared computation across them.
Offsite data centre: community cloud model. [Diagram: LRI, UCL, NIMR, the Crick, King's College, Sanger, Imperial and others sharing clinical data over private networks, possibly through lightpaths.] UK Janet pilot projects are expected this year. ELIXIR/CSC (Finland) have come to the same technical solutions independently; the hope is to collaborate between the UK and Finland to extend the connections.
Collaborative space: a life-science hub for eMedLab and beyond (?). Promote skills development (systems, informatics); prototype and deploy standards across multiple entities (Global Alliance); promote collaboration at both the IT and informatics levels (faster development, less duplication of effort, de facto standards); produce real-world infrastructure tools (production use across collaborating partners); provide sandboxes (testing, development); attractive to industry partners (hardware evaluations, new technology deployment); prototype public-cloud techniques in a private setting (a safe environment); a safe haven for sensitive data that should not move to a public cloud; easier access to larger datasets; pooled resources maximise the benefit of capital investment for small and large users.
WHO? MRC Medical Informatics project UK MED-BIO: aggregation, integration, visualisation and analysis of large, complex data. Dr Sarah Butcher (s.butcher@imperial.ac.uk), Head of the Bioinformatics Support Service. Applicant: Prof. Paul Elliott; Co-Is: Nicholson, Glen, Guo. Partner institutions: Imperial; Institute of Cancer Research (ICR, Ashworth); European Bioinformatics Institute (EMBL-EBI, Steinbeck); Centre for the Improvement of Population Health through E-health Research (CIPHER, Lyons); MRC Clinical Sciences Centre (CSC, Petretto); MRC Human Nutrition Research (MRC-HNR, Griffin). Industrial partners: Waters Corp., Bruker Biospin, Huawei Technologies Co. Ltd., Thomson Reuters, AstraZeneca. Awarded later than the others (April 2014) but with the same deadlines.
Science case: the exposome concept, and a strategy for knowledge generation by UK MED-BIO. Data: the main primary data-volume producer is the Phenome Centre (metabolomics); also NGS (exomes, genomes, targeted), proteomics (mass spec), transcriptomics and methylation-based assays, gut metagenomics and meta-transcriptomics, and genome-wide association studies. So the need is to support primary data analyses AND the integration and intelligent data-mining of large, heterogeneous, high-dimensional datasets from all of the above.
Metabolomics data. Pre-grant starting point, storage: a single UPLC-MS profile is ~8 GB; maximum annual throughput is 50k samples, ~2 PB of data (see the worked figures below); intermediate data modelling will inflate this further. Raw data is copied straight to archive and maybe re-used twice in 5 years for methods validation; de-noising can shave 15-40% off data sizes; peak picking will extract ~1 MB of data from each profile; proprietary formats are rife, and open formats are possible but tend to compress less. There is no central storage and limited back-up and archiving for research data, not linked directly to an HPC centre. The Phenome Centre has its own limited storage capacity (250TB) and managed backups, but is projected to need multiple petabytes of raw-data archive. The bioinformatics service underpins some groups but has limited (old, full) storage (~200TB) and back-up. Several crucial data management solutions sit in different places, e.g. the Phenome Centre LIMS server and the IC Healthcare Tissue Bank database. Very little physical data-centre space, on one College site only. There is a pressing need for a centralised tiered storage system with archiving.
Pre-grant starting point, compute: heterogeneous job profiles; heavy, piecemeal use of cluster and cache-coherent memory systems; sequence-based analyses mainly on bioinformatics servers (max. 128GB RAM per server); Windows desktops for some non-scaling analyses; no shared compute environment, software stack, job scheduling or storage between all groups. There is already a significant compute bottleneck for large jobs, in processor count but particularly for jobs requiring large RAM; some jobs already need >1 TB RAM for extended periods and are getting larger. Requirements for sandboxed development environments and to centrally host non-HPC services.
Challenges: make the system fit for purpose when the purpose will change over the project lifetime; big unknowns in user requirements (new groups, new fellowships, emerging technologies, software, methods, partners); heterogeneous user profiles; an emerging codebase, e.g. metabolomics feature extraction currently runs on commercial Windows software and is moving towards open-source solutions on the cluster (or even GPU eventually), and Matlab/R code is being ported to C++; little central infrastructure to build on; a limited central knowledge base for parallel file systems, iRODS etc.; tranSMART and eTRIKS integration not specifically funded.
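A worked version of the storage arithmetic above. The per-profile size, throughput, de-noising saving and peak-picking yield are the slide's own figures; the number of assay profiles per sample is an assumption introduced only to reconcile 8 GB per profile with the quoted ~2 PB per year.

    profile_gb = 8                 # one UPLC-MS profile (slide figure)
    samples_per_year = 50_000      # maximum annual throughput (slide figure)
    profiles_per_sample = 5        # assumed, so that the total matches the slide's ~2 PB

    raw_tb = profile_gb * profiles_per_sample * samples_per_year / 1000
    denoised_tb = [raw_tb * (1 - saving) for saving in (0.40, 0.15)]   # 15-40% saving
    picked_gb = profiles_per_sample * samples_per_year / 1024          # ~1 MB of peaks per profile

    print(f"raw archive per year: {raw_tb:,.0f} TB (~{raw_tb/1000:.0f} PB)")
    print(f"after de-noising:     {denoised_tb[0]:,.0f}-{denoised_tb[1]:,.0f} TB")
    print(f"picked peaks only:    {picked_gb:,.0f} GB")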
Location, location, location. South Kensington data centre: cluster nodes; SGI cache-coherent memory nodes; tiered storage; tape archive; video wall with touch overlay for the meeting centre. Duplication site: tiered-storage duplication; tape-archive duplication; high-memory servers.
System summary: cluster nodes, PowerEdge C6000/C6220, Xeon E5-2660v2 2.2GHz, 3,040 cores in total already; high-memory servers, 7 x PowerEdge R920 with 1TB RAM each, 40 cores, 16TB fast internal storage, a 20TB local array and InfiniBand to tier 1; cache-coherent memory nodes, SGI UV 2000 with 640 cores, 8TB RAM and 350TB usable locally attached scratch; tiered storage from DDN on each of 2 sites (350TB usable tier 1 GPFS, 2 petabyte tier 2 WOS); TSM tape archive on a Spectra T950 (2 petabyte LTO6 capacity); asynchronous replication between sites.
Where are we now? Unpacking, racking, installing; in use.
Challenges/questions: all hardware set up; existing data transferred and tiering rules configured; establish a standardised software environment for compute; data flow established; user grouping established; data flow outwards with partners; THEN iRODS?? (a test setup is available to configure); a data sharing environment?; interaction with patient data systems, tranSMART/eTRIKS?; the business model.
Operations group: full-time sysadmin (TBC, being recruited); bioinformatician/data manager, James Abbott (Bioinformatics Support Service) + TBC; Sarah Butcher (ops chair, Bioinformatics Support Service); Steve Lawlor (ICT Data Centre Manager); Simon Burbidge (ICT HPC Manager); Jake Pearce (NIHR/MRC Phenome Centre Data Manager).
UVRI/MRC Medical Informatics Centre (UMIC). PIs: Pontiano Kaleebu (MRC Uganda), Manj Sandhu (Sanger). Budget: ~£2.9m funded by the MRC; ~£900k capital equipment; ~£2m resource budget (staff, network connectivity, ...). Capital spend all committed as of 12/2014. Physical infrastructure (£280k): host building funded by the Wellcome Trust (£0); existing DR building (£0); contributions to the on-site electrical upgrade (£60k); data centre and DR upgrades (£220k). IT equipment (£620k).
UMIC Location Ugandan Virus Research Institute (UVRI) Campus Entebbe, Uganda
UMIC physical infrastructure. Offices: 30m². Data centre: 32m².
UMIC compute & storage. Compute equipment: 4x HP BLc7000 blade enclosures (the main compute resource), 512 cores each (AMD CPUs), 4TB RAM each (8GB/core), 2x 10GbE per enclosure; 4x HP DL380p servers (virtual machine hosts for infrastructure), 20 cores and 256GB RAM each. Storage equipment: 2x high-speed scratch filesystems (Intel Enterprise Edition Lustre, 2 MDT/MGS servers in HA, 4 OSS servers each, 256TB usable on each filesystem); 2x long-term reliable (aka "slow") storage, HP SL4540 tray-node servers, 348TB replicated across two servers (one in the DR building).
UMIC networking. Network equipment: Juniper MX104 router (HA pair); Juniper SRX3400 firewall (HA pair); 5x Juniper EX4300 1GbE switches; 3x Juniper EX4550 10GbE switches; 3x Aruba Instant 115 wireless access points. Connectivity: Google is installing 2x (redundant) 1Gb fibre links; regional connectivity at up to 1Gbps via RENUnet; overseas connectivity initially at 10Mbps; resource spend will stay constant over time, so bandwidth will increase.
Management & personnel. Technical Infrastructure Working Group; Scientific Working Group. Support staff: Project Administrator (hired). Informatics staff: 1x Senior Bioinformatician (recruitment ongoing). Technical staff: 1x Systems Manager (hired; currently training in the UK); 3x other systems posts (recruitment in 2015 Q2).
MRC Medical Bioinformatics Centre, ESRC Consumer Data Research Centre, Integrated Research Campus. V1, David Golding, Tom Fleming, January 2015, University of Leeds.
Organisational design. The MRC Medical Bioinformatics Centre (MBC) and the ESRC Consumer Data Research Centre (CDRC) sit within the Leeds Institute of Data Analytics (LIDA), overseen by a Joint Projects Board (LTHT & University). Each centre has researchers (example specialisms: clinical, data scientist, statistician, epidemiologist, health economist for the MBC; data scientist, geographer, statistician for the CDRC), a Centre Director, a Centre Manager, research operations and a centre operations team. An IT Director, IRC Lead, IRC Development Manager and IRC Developer (development and steady state) work alongside the Integrated Research Campus (IRC) team: Head of Service Management and the service support, HPC, servers and storage, networking, datacentre, desktop support, desktop development and security teams, drawing on the University IT Operations teams.
Service design. Research services (Centre/IRC staff): the Centre Operations Teams, under the Centre Manager (reporting to the Principal Investigators), provide data administration, data profiling, data linkage, data cleaning, data analysis, job obfuscation, data aggregation/abstraction, security and audit (Research Operations); the IRC Data Services Team, under the IRC Development Manager (reporting to the UoL IT Head of Development) and IRC Developer, provides data transfer management, application and environment management, applications, operating systems, and logical storage management (storage areas and access controls for research groups, for data administration services, and for data deposit/gateway). Technology services (UoL IT Operations): Desktop Development Team (physical desktop builds); Desktop Support Team (virtual desktops, applications support, operating systems); Servers and Storage Team (virtual servers, e.g. SQL and Achiever, virtual desktop platform, virtualisation hypervisor, physical servers, physical storage); HPC Team (High Performance Computing, applications on HPC, operating systems for HPC, storage for HPC); Networking Team (network); Datacentre Team (power and cooling, racks).
Platform design. Researchers and data scientists work inside or outside the physical Centre (Worsley L11) against the Integrated Research Campus research centres, supported by the IRC Data Services Team.
External gateway zone: a deposit gateway where data providers deposit data under Data Transfer Agreements (projects have ethical and governance approval), with security, audit, data profiling and landing areas; a publishing gateway for the external release of authorised, risk-profiled datasets to data consumers.
Data Services zone: a VDI control service, Data Services virtual desktops and servers (database servers, risk-profiling tools, analysis tools, linking tools, working areas); data profiling, cleaning, linkage and analysis; internal release of datasets for risk-profiling and linking; the Data Services Store (data controller, code controller, security, audit, storage areas, version control) holds provided datasets, the master linking table and linked, risk-profiled datasets, managed by the Data Services Store Manager.
Research zone: a VDI control service, researcher virtual desktops and research-group servers (database servers, shared-licence applications, statistics and analysis tools, collaboration tools, research-group sharing, working areas); data cleaning, analysis, exploration and visualisation; HPC job obfuscation and data aggregation/abstraction; the Research Working Store (working copies, outputs, published datasets, storage areas, version control) managed by the Working Store Manager, with internal release of working copies for daily working (check-out, check-in), of candidates for publishing for risk profiling, and of published datasets after risk profiling and assurance.
Internal gateway zone: an analysis gateway and analysis transit area for the release of obfuscated/anonymous data for analysis and the storage of outputs.
System administration zone: IRC administration, virtual desktop administration, identity management, system update management and services, system and application monitoring and alerting, virus-scan management, directory service and audit, linked to the corporate directory, corporate system update services, identity management and University core systems (IT Operations Servers & Storage Team).
High Performance Compute: the MBC HPC cluster (home, nobackup, scratch, BCGene) and the MBC and Farr SGI UV2 systems (scratch; the Farr UV2 for HMR only), run by the IT Operations HPC team; shared network and deployment hosts (?).