Provisioning robust automated analytical pipelines for whole genome-based public health microbiological typing Anthony Underwood Bioinformatics Unit, Infectious Disease Informatics, Microbiological Services, Public Health England
Public Health England
PHE is an executive agency sponsored by the Department of Health, UK. We protect and improve the nation's health and wellbeing, and reduce health inequalities.
Microbiology Services
- We provide specialist investigation and control of communicable disease outbreaks, chemical incidents, radiation and other environmental hazards.
- We provide the evidence-based science and clinical practice in specialist microbiology in support of the wider public health system and NHS hospitals.
PHE Reference Microbiology
Reference Microbiology carries out a broad spectrum of work relating to the prevention of infectious disease. The remit of the centre at Colindale includes:
- Infectious disease surveillance
- Providing specialist and reference microbiology and microbial epidemiology
- Research & Development
- Coordinating the investigation of the causes of national and uncommon outbreaks
- Helping advise government on the risks posed by various infections
- Responding to international health alerts
PHE Specialist Microbiology Services
The PHE Specialist Microbiology Services (SMS) consist of 8 specialist clinical laboratories operating across England. These laboratories provide a comprehensive range of clinical diagnostic and public health microbiology tests and services to the NHS and the allied healthcare provider sector. SMS also includes a further five dedicated food, water and environmental (FW&E) testing laboratories that undertake statutory testing for the NHS, local authorities, and other key stakeholders.
Bioinformatics Unit, Infectious Disease Informatics
Formed in 2013: 3 staff, 2 Linux servers
Amongst the first public health institutes to see the potential of bioinformatics and fund it
Now: 15 staff, 512 cores (UGE), 300 TB usable HPC storage
MS Public Health Functions
Questions we often ask of a pathogen isolate:
1. What is it?
2. What characteristics does it have?
3. How does it relate to other isolates?
1. What is it?
   Identify the infectious agent or exclude particular infections and associated risks.
2. What characteristics does it possess?
   Antibiotic resistant? Presence of toxins?
3. How does it relate to other isolates?
- Do cases of an infection have a common source or are they linked?
- What is the source and what are the risk factors? e.g. food, school, travel to certain countries
- What is the best way of treating the affected, protecting others and limiting further spread?
Pathogen typing
Giving bugs a label
If we discover isolates with the same type we can:
- Include individual cases in, or exclude them from, an outbreak (e.g. MRSA in hospitals)
- Establish an association between an outbreak of food poisoning and a specific food vehicle (e.g. egg mayo sandwich)
- Trace the source of contaminants within a manufacturing process (e.g. chocolate factory, baby feed)
The type also helps us to:
- Determine changes in microbial populations in response to interventions (e.g. vaccination strategies, vaccine escape)
- Study variations and trends in the pathogenicity, virulence and antibiotic resistance within a species (e.g. acquisition of new antibiotic resistance, ABr)
20th Century Microbiology
Bacterial Identification: Culture
Bacterial Identification: Gram Stain and API Strips
Phenotypic Characterisation of Microbes: Serotyping, Sensitivity Testing, Phage Typing
Gel-based Typing methods
[Figure: Ciprofloxacin-resistant Salmonella Kentucky in Travellers, http://wwwnc.cdc.gov/eid/article/12/10/06-0589-f1.htm]
MLST: multi-locus sequence typing
Locus | Putative function of gene | Size of sequenced fragment (bp) | No. of alleles identified | No. (%) of polymorphic nucleotide sites | % G+C | dN/dS | Position in GBS genome (bp)
adhP | Alcohol dehydrogenase (gbs0054) | 498 | 11 | 12 (2.4) | 43.1 | 0.13 | 72286
pheS | Phenylalanyl tRNA synthetase | 501 | 5 | 7 (1.4) | 37.1 | 0.17 | 912817
atr | Amino acid transporter (gbs0538) | 501 | 8 | 12 (2.4) | 36.9 | 0.14 | 560085
glnA | Glutamine synthetase | 498 | 6 | 6 (1.2) | 35.7 | 0.12 | 1868862
sdhA | Serine dehydratase (gbs2105) | 519 | 6 | 13 (2.5) | 41.4 | 0.12 | 2179923
glcK | Glucose kinase (gbs0518) | 459 | 4 | 7 (1.5) | 42.6 | 0.13 | 538770
tkt | Transketolase (gbs0268) | 480 | 5 | 8 (1.7) | 38.9 | 0.42 | 287111
Sørensen U B S et al. mBio 2010; doi: 10.1128/mBio.00178-10
Virus Identification
Typing of Viruses
Our vision for 21st Century Microbiology
Whole genome sequencing
Why whole genome sequencing?
- Cost: e.g. replacement of Salmonella serotyping
- Speed: e.g. replacement of TB drug resistance testing
- Added value: multiple outputs from one test
- Extra resolution: increased discriminatory power over traditional techniques
Salmonella population structure
Minimal spanning tree of MLST data for S. enterica subspecies enterica
- Each circle corresponds to a sequence type (ST)
- eBGs (eBurst groups) are natural clusters of genetically related isolates
- MLST STs correlate with serotypes
Achtman et al., 2012
Hypothetical WGS-based workflow for Diagnostics & Reference Microbiology
Achieving the ambition
Pilot studies: Lab protocols
[Figure: clinical scientist and sequencer]
Pilot studies: Bioinformatics process
[Figure: bioinformatician saying "Blah, blah, blah... X, Y, Z... A, B, C... Blah, blah..."]
Pilot studies: Interpretation?
[Figure: Department of Health officials, doctors and epidemiologists faced with "Blah, blah, blah... X, Y, Z... A, B, C... Blah, blah..."]
Moving from pilot studies to routine WGS for public health microbiology
Writing scripts is easy. Creating software is hard.
In order for WGS to replace current tests, the assays require accreditation (ISO 15189):
- Quality
- Reproducibility
- Audit trail
Any WGS-based test suitable for public health intervention will also need:
- Speed
- Resilience
Quality
Working with laboratory scientists and epidemiologists in a 3-phase approach:
1. Generation of a command-line workflow based on user requirements
2. User testing and familiarisation using Galaxy
3. Automation
How do we generate outputs?
- Quality assessment and trimming
- Important to be able to provide a quality score for the result as well as for the reads
- The majority of our workflows use mapping rather than assembly
- Derivation of 7-locus MLST from mapping to the loci that comprise the scheme
- Gene profiling for antibiotic resistance (ABr) and virulence factors, using a single-copy housekeeping gene as a positive control
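The final step of 7-locus MLST, turning per-locus allele calls into a sequence type, can be sketched in Python as follows. This is a minimal illustration, not PHE's implementation: it assumes the mapping stage has already produced one allele number per locus. The locus names are those of the GBS scheme shown on the MLST slide, while the profile table, allele numbers, ST values and the `assign_sequence_type` helper are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Seven loci of the GBS MLST scheme; the order fixes the profile tuple.
LOCI: Tuple[str, ...] = ("adhP", "pheS", "atr", "glnA", "sdhA", "glcK", "tkt")

# Hypothetical profile table: allele numbers per locus -> sequence type (ST).
# A real table would come from the curated database for the scheme.
PROFILES: Dict[Tuple[int, ...], int] = {
    (1, 2, 3, 4, 5, 6, 7): 101,
    (2, 2, 3, 4, 5, 6, 7): 102,
}


def assign_sequence_type(alleles: Dict[str, int]) -> Optional[int]:
    """Return the ST for a complete allele profile.

    Returns None when a locus is missing or the profile is not in the
    table (for example a possible novel ST that needs manual review).
    """
    if any(locus not in alleles for locus in LOCI):
        return None
    profile = tuple(alleles[locus] for locus in LOCI)
    return PROFILES.get(profile)


# Hypothetical allele calls for one sample.
called = {"adhP": 2, "pheS": 2, "atr": 3, "glnA": 4,
          "sdhA": 5, "glcK": 6, "tkt": 7}
print(assign_sequence_type(called))
```

Returning None rather than raising keeps an incomplete or unknown profile as an explicit "no result", which a downstream quality gate can then treat as a failure.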
Reproducibility
Fastq files are tagged: UID-sample_name-workflow-version.fastq.gz
Workflows are described in a config file:
    campylobacter-jejuni-complex-typing:
      2-0-0:
        kmerid_pattern: "Campylobacter (jejuni coli)"
        components:
          - component_name: "phe/qa_and_trim"
            version: "1-1"
          - component_name: "phe/kmerid"
            version: "1-0"
          - component_name: "phe/mlst_typing"
            version: "1-1"
          - component_name: "phe/gene_finder"
            version: "1-0"
          - component_name: "phe/combine_xml"
            version: "1-0"
The results for each sample are tagged with the same workflow and version.
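The tagging scheme can be illustrated with a short Python sketch. The workflow name, version, component names and component versions are taken from the slide. The `Workflow` and `Component` classes, their methods, and the UID and sample name used in the example are hypothetical and only show how a versioned workflow description could drive both the file tag and a provenance record.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Component:
    name: str
    version: str


@dataclass(frozen=True)
class Workflow:
    name: str
    version: str
    components: Tuple[Component, ...]

    def fastq_name(self, uid: str, sample_name: str) -> str:
        # Follows the slide's pattern: UID-sample_name-workflow-version.fastq.gz
        return f"{uid}-{sample_name}-{self.name}-{self.version}.fastq.gz"

    def provenance(self) -> List[str]:
        # One "component:version" entry per step, in execution order.
        return [f"{c.name}:{c.version}" for c in self.components]


WORKFLOW = Workflow(
    name="campylobacter-jejuni-complex-typing",
    version="2-0-0",
    components=(
        Component("phe/qa_and_trim", "1-1"),
        Component("phe/kmerid", "1-0"),
        Component("phe/mlst_typing", "1-1"),
        Component("phe/gene_finder", "1-0"),
        Component("phe/combine_xml", "1-0"),
    ),
)

# "12345" and "sampleA" are made-up identifiers for illustration.
print(WORKFLOW.fastq_name("12345", "sampleA"))
print(WORKFLOW.provenance())
```

Because the workflow name and version both contain hyphens, a tag built this way is easy to generate but ambiguous to split back into fields; storing the fields alongside the file is safer than parsing the name.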
Auditability
- Each sample is tracked throughout the process, from sending lab to report output
- Metrics from lab processing are recorded
- Sequencing quality is logged
- Each component of the bioinformatics process logs its own progress and success/failure
- Results/reports are transferred to end-users only when all quality thresholds are achieved and all components have completed
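The release rule in the last point can be sketched as a simple gate in Python. This is an assumed shape, not the production code: the `ComponentLog` record, the `ready_for_release` function, the metric names and the threshold values are all hypothetical, and every metric is assumed to be "higher is better".

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class ComponentLog:
    name: str
    succeeded: bool


def ready_for_release(
    logs: List[ComponentLog],
    expected_components: Iterable[str],
    metrics: Dict[str, float],
    thresholds: Dict[str, float],
) -> bool:
    """Release only if every expected component logged success and every
    quality metric meets its minimum threshold."""
    succeeded = {log.name for log in logs if log.succeeded}
    all_completed = all(name in succeeded for name in expected_components)
    # A missing metric counts as a failure rather than a pass.
    all_thresholds_met = all(
        metrics.get(name, float("-inf")) >= minimum
        for name, minimum in thresholds.items()
    )
    return all_completed and all_thresholds_met


expected = ["phe/qa_and_trim", "phe/kmerid", "phe/mlst_typing"]
logs = [ComponentLog(name, True) for name in expected]
thresholds = {"mean_read_quality": 30.0, "coverage_depth": 20.0}  # hypothetical

print(ready_for_release(logs, expected, {"mean_read_quality": 34.1, "coverage_depth": 55.0}, thresholds))
print(ready_for_release(logs, expected, {"mean_read_quality": 34.1}, thresholds))
```

Treating an absent component log or an absent metric as a failure means nothing is released by default, which matches the "only when all" wording of the slide.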
[Figure: end-to-end sample workflow, from receipt of pathogen isolates to reporting. Laboratory steps: plate and form submission, culture, nucleic acid extraction (liquid handling robots), sample dilution and library preparation in 96-well plates, sequencing on Illumina HiSeq 2500 in Rapid Mode. Bioinformatics steps: deplexing, trimming, kmer identification, then workflow-specific components (MLST type, gene profile, serotype, drug resistance, consensus sequence), run as automated workflows on the UGE computing cluster with DDN Lustre-based high performance storage, replacing ad hoc scripts and reports. Metrics are sent to the LIMS, and a web-based form is used to select workflows for samples. Roles shown: clinical scientist, sequencing technician, bioinformatician, UNIX system administrator. Report recipients: Department of Health officials, doctors, epidemiologists. Timings shown: sample preparation 24 hrs; library preparation and sequencing 72 hrs; deplexing 4 hrs.]
Speed and Resilience: Infrastructure
Speed and Resilience
[Figure: infrastructure diagram]
PHE/Colindale zone: sequencing machines; UGE High Performance Computing cluster; HAProxy and iRODS server with SSL SAN certificate; High Performance Storage (DDN EXAScaler / Lustre filesystem); DDN WOS object storage system
PHE/Birmingham zone: sequencing machines; computing and storage system; HAProxy; iRODS server; SSL SAN certificate; DDN WOS object storage system
Connection: PHE WAN
iRODS zones: PHE zone; other zone? (WTSI?)
Examples of WGS in action
Routine samples processed from April to September 2014
Organism | Number processed
Salmonella | 3954
Staphylococcus aureus | 913
Streptococcus pyogenes | 1274
Streptococcus pneumoniae | 959
Other bacteria | 238
HCV | 114
HEV | 3
HIV | 257
Influenza | 187
MeV | 47
Other viruses | 35
Total | 7981
Current and Future Challenges
Lustre FS problems
- Apparently random failures to write, or writing of incomplete files
Bio-banking for phenotype-genotype studies
Data release
- Timely release of raw data
- Minimal metadata
  o Date of isolation
  o Source (Human/Environmental/Food)
  o Place (Country?)
Policy makers still wary
Current and Future Challenges
Scalability
- Plan to scale to 3000 samples/week
- OpenStack for surge compute
- Currently have 300 TB usable Lustre storage
- Medium-term archive to object store
- Release data to ENA (SRA) at EBI
Lustre (16 weeks): Fastqs (CRAM?); BAM files; all result/log/error and meta files
Object store (6 months): Fastqs (with workflow descriptor); text/PDF result files and reports
SRA (for ever): metadata; Fastqs (CRAM?)
Acknowledgements
Virtual Pathogen Genomics Unit
Bioinformatics Unit
Francesco Giannoccaro, Matthew Goulden, Steven Platt, Rediat Tewolde, Aleksey Jironkin, Ali Al-Shahib, Ulf Schaefer, Kieren Lythgow, Jonathan Green
Other Bioinformaticians
Tim Dallman, Phil Ashton, Michel Doumith
Reference and Specialist Microbiology laboratories
Icons made by Freepik from www.flaticon.com