The IT Challenges of Next-Gen Sequencing
Tony Cox, Head of Sequencing Informatics
Sanger Institute, Cambridge, UK
24th November 2009
avc@sanger.ac.uk
Outline
» Next-generation sequencing presents big challenges in informatics and data management, driven by rapid change in:
» Chemistry/instrumentation
» Analysis techniques and software
» Storage/processing requirements
» Problems we have faced at Sanger and some solutions we have implemented
Capillary Sequencing Limitations
» Number of samples per experiment: 96
» 0.0001 Gb/run
» 1000-base reads
» 1-2 hr run time
» $100,000 / Gb
Since the human genome is 3 Gb this approach is fundamentally limiting; a change was needed to make routine genome-scale sequencing viable.
Moore's Law vs. Sequencing
Sequencing is a key research technique that drives biological discovery. Pressure to sequence faster and more cheaply has been relentless.
Next Generation Sequencing Instrumentation
» Illumina Genome Analyser
» Life Technologies SOLiD
» Roche/454 Titanium
Single Base Sequencing
Cyclic process:
» incorporate a single, terminated, dye-labelled base
» illuminate with laser and detect
» de-protect
» repeat until the chemistry becomes unreliable
GAIIx Optics
Illumina Single Base Sequencing
» Flowcell similar in size to a microscope slide
» 8 sample lanes
» Two lasers + two filters detect four bases/channels
» 120 image tiles/lane
» 1 image = 8 MB
» ~500k images per run
Raw Image Data to DNA Sequence
Images are acquired at each chemistry cycle, where one base is added; the base calls across cycles spell out the read (e.g. cycles 1-9 give the base sequence T G C T A C G A T).
Sanger Illumina Production Facility: 40 x GAIIx running RTA
IT Challenges
» What are the IT challenges associated with running multiple next-generation sequencers in a high-throughput environment?
» Understanding the data:
» How much will we produce?
» How much will we keep?
» How much must we move?
How much data will we produce?
» Raw instrument data: a huge number of large images
» Intermediate pipeline processing data (the product of image processing), typically very many text files; a run folder has >1 million files in it
» Results data: a small number of large files, perhaps 100x smaller than the raw data
» QC and LIMS
» Bases and qualities
» Alignments
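The raw-data volume per run can be estimated from the flowcell figures on the earlier slide (~500k images at ~8 MB each). A rough sketch; the lane/tile/cycle breakdown in the comment is illustrative, not an exact instrument specification:

```python
# Back-of-envelope estimate of raw image data per GAIIx run.
# Figures taken from the slides: ~500k images of ~8 MB each.
IMAGE_SIZE_MB = 8           # one tile image, one channel
IMAGES_PER_RUN = 500_000    # roughly 8 lanes x 120 tiles x 4 channels x ~130 cycles

raw_tb = IMAGE_SIZE_MB * IMAGES_PER_RUN / 1_000_000   # MB -> TB (decimal)
print(f"Raw image data per run: ~{raw_tb:.0f} TB")    # ~4 TB
```

At ~4 TB of images per run, the "keep images only for days or weeks" policy on the next slide follows almost inevitably.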
How much data will we keep?
» Images (raw data) are not interesting in the long term; keep for only days or a few weeks (allows for re-analysis)
» Keep whatever intermediate data you need to validate the experiment as a success
» QC data, LIMS and tracking information may be stored longer term (years?)
» Results data: keep forever
» Bases and qualities
» Alignments, SNPs
How much data will we move?
» Data has to be separated from the instrument at some point (RTA now does this for us)
» May need to move to several locations for analysis, safe archive, etc.
» Terabytes of data are likely to be involved
» Moving terabyte datasets around networks is non-trivial, even in an advanced IT infrastructure
Sanger NGS Data Output
(chart: yearly data output with instrument upgrade points marked; capillary output shown for comparison)
Storage Planning
» This is difficult, and getting it wrong can break budgets and science projects
» Think first in terms of bases produced, not in bytes needed
» Work out bytes-per-base multipliers that are sensible for your scientific objectives
Storage Planning: An Example from Sanger
» We allow ~15 bytes/base for pipeline output storage
» Drive this down with more efficient storage formats!
» Allow 15x-20x inflation for analysis (e.g. alignments and SNP calling)
» Allow ~5x for long-term storage of results
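The bases-first approach above can be sketched as a simple calculation. The yearly yield is a hypothetical example figure, and the reading of the "20x" and "~5x" multipliers (inflation of pipeline output, and bytes/base for results, respectively) is an assumption about the slide's intent:

```python
# Storage planning from bases produced, applying the multipliers
# quoted in the talk. GBASES_PER_YEAR is a hypothetical target, and
# the interpretation of the 20x / ~5x multipliers is an assumption.
GBASES_PER_YEAR = 1000                      # hypothetical facility yield

pipeline_tb = GBASES_PER_YEAR * 15 / 1000   # ~15 bytes/base -> TB
analysis_tb = pipeline_tb * 20              # worst-case 20x inflation for analysis
archive_tb = GBASES_PER_YEAR * 5 / 1000     # ~5 bytes/base for long-term results

print(f"pipeline: {pipeline_tb:.0f} TB, analysis: {analysis_tb:.0f} TB, "
      f"archive: {archive_tb:.0f} TB")
```

The point of working in bases is that the yield figure changes with every instrument upgrade, while the multipliers change only when storage formats or analyses change.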
Compute Planning
» Depends on the type of analysis
» Work out how many millions of short reads your preferred aligner can process per hour
» Extrapolate to the number of CPU-days/day you will need to keep up
» Analysis is rarely a clean process; much re-analysis takes place
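The extrapolation described above is simple arithmetic. All input figures here are hypothetical examples, not measured values from the talk:

```python
# Extrapolate compute need from aligner throughput, as the slide
# suggests. All numbers below are hypothetical example inputs.
reads_per_day = 400_000_000              # short reads produced per calendar day
aligner_reads_per_cpu_hour = 5_000_000   # benchmarked aligner throughput
reanalysis_factor = 2                    # headroom for re-runs; analysis is rarely clean

cpu_hours = reads_per_day / aligner_reads_per_cpu_hour * reanalysis_factor
cpu_days_per_day = cpu_hours / 24
print(f"Need ~{cpu_days_per_day:.1f} CPU-days of alignment per calendar day")
```

The re-analysis factor deserves benchmarking of its own: protocol and software changes often mean whole projects are realigned.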
Compute + Storage = I/O
» If your compute and storage requirements are big, your network and disk I/O will be critical to efficiency
» Moving data around is very slow
» Keep compute and storage close and well connected
Sequencing Data Flow
(diagram) 1. RTA/CIFs written from the sequencing farm to a 10 x 50 TB NFS staging area; 2. pipeline analysis on the analysis farm using Lustre scratch storage; 3. archive to an Oracle database (100 TB) and the ENA archive; 4. secondary analysis.
Instrument Data Management
(diagram: per-instrument staging storage fed by RTA/CIFs, with Incoming, Analysis and Outgoing areas tracked by a pipeline monitor)
» 10-15 TB staging storage per instrument
» 4-6 week production buffer
» Staged data deletion policy
What have we learned?
Manufacturers are upgrading instruments constantly
» Illumina went from 10 Gbases per run in Q1 2009 to 50 Gbases now, with 95 Gbases per run projected by end 2009
» Storage requirements increase 10-fold in one year
» But real-world data yields rarely match those advertised
» At some point the informatics/IT budget passes the sequencing budget
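One way to act on the "yields rarely match advertised" caveat is to apply a derating factor when planning. A minimal sketch; the yield figures come from the slide, but the 0.7 derating factor is a hypothetical assumption:

```python
# Derate advertised per-run yields for planning purposes.
# Yield figures are from the slide; the 0.7 factor is hypothetical.
advertised_gb_per_run = {"Q1 2009": 10, "now": 50, "end 2009 (projected)": 95}
real_world_factor = 0.7   # assumed fraction of advertised yield actually achieved

for when, gb in advertised_gb_per_run.items():
    planned = gb * real_world_factor
    print(f"{when}: advertised {gb} Gb/run, plan for ~{planned:.0f} Gb/run")
```

Even derated, the year-on-year growth still implies roughly a 10-fold storage increase, so the budgeting problem remains.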
Plan for Change
» We just have to accept that instruments, software and data processing requirements are changing very rapidly (month by month)
» Plan your storage infrastructure carefully, or data management quickly gets out of control and projects will suffer
Precision is Difficult
» We almost always underestimate the informatics resources needed to support data production and analysis
» Lab protocols and analysis techniques are changing rapidly; we need an agile approach to developing our software
» It will probably be obsolete in less than 12 months
In Conclusion
» Next-gen sequencing is still a very rapidly moving field
» Plan for change!
» Keep infrastructure flexible
» Keep disk space expandable
» Keep software agile