The IT Challenges of Next-Gen Sequencing


Tony Cox, Head of Sequencing Informatics, Sanger Institute, Cambridge, UK
24th November 2009
avc@sanger.ac.uk

Outline
» Next generation sequencing presents big challenges in informatics and data management, driven by rapid change in:
  » Chemistry/instrumentation
  » Analysis techniques and software
  » Storage/processing requirements
» Problems we have faced at Sanger and some solutions we have implemented

Capillary Sequencing Limitations
» Number of samples per experiment (96)
» 0.0001 Gb/run
» 1000 base reads
» 1-2 hrs run time
» $100,000 / Gb
Since the human genome is 3 Gb, this approach is fundamentally limiting. A change was needed to make routine genome-scale sequencing viable.

Moore's Law vs. Sequencing
Sequencing is a key research technique that drives biological discovery. Pressure to sequence faster and more cheaply has been relentless.

Next Generation Sequencing Instrumentation
» Illumina Genome Analyser
» Life Sciences SOLiD
» Roche/454 Titanium

Single Base Sequencing
Cyclic process of:
» Incorporate a single, terminated, dye-labelled base
» Illuminate with laser and detect
» De-protect
» Repeat until the chemistry becomes unreliable

GAIIx Optics

Illumina Single Base Sequencing
» Flowcell similar in size to a microscope slide
» 8 sample lanes (L1 to L8)
» Two lasers + two filters detect four bases/channels (A, C, G, T)
» 120 image tiles per lane (1 to 120)
» 1 image = 8 MB
» ~500k images
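The figures on this slide can be combined into a rough raw-image volume. The sketch below is only an illustration: it assumes the ~500k images are per run, that each tile is imaged once per base channel per cycle, and that "8Mb" means megabytes. None of these is stated explicitly on the slide, and the function names are made up for this example.

```python
"""Rough raw-image volume, using the figures quoted on the slide."""

LANES_PER_FLOWCELL = 8      # from the slide
TILES_PER_LANE = 120        # from the slide
IMAGES_PER_RUN = 500_000    # "~500k images"; assumed here to be per run
IMAGE_SIZE_MB = 8           # "1 image = 8Mb", read here as megabytes

# Assumption: one image per base channel (A, C, G, T) per tile per cycle.
CHANNELS = 4


def images_per_cycle(lanes=LANES_PER_FLOWCELL, tiles=TILES_PER_LANE,
                     channels=CHANNELS):
    """Images captured in one chemistry cycle across the whole flowcell."""
    return lanes * tiles * channels


def implied_cycles(images_per_run=IMAGES_PER_RUN):
    """Number of cycles that would produce the quoted image count."""
    return images_per_run / images_per_cycle()


def raw_image_volume_tb(images_per_run=IMAGES_PER_RUN,
                        image_size_mb=IMAGE_SIZE_MB):
    """Total image volume in decimal terabytes (1 TB = 1,000,000 MB)."""
    return images_per_run * image_size_mb / 1_000_000


if __name__ == "__main__":
    print(f"images per cycle: {images_per_cycle()}")
    print(f"implied cycles:   {implied_cycles():.0f}")
    print(f"raw image volume: {raw_image_volume_tb():.1f} TB")
```

Under these assumptions the flowcell yields 3,840 images per cycle, the quoted image count corresponds to roughly 130 cycles, and the raw images come to about 4 TB.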

Raw Image Data to DNA Sequence
Images are acquired at each chemistry cycle, where one base is added.
Cycle:          1 2 3 4 5 6 7 8 9
Base sequence:  T G C T A C G A T

Sanger Illumina Production Facility
40 x GAIIx/RTA

IT Challenges
» What are the IT challenges associated with running multiple next-generation sequencers in a high-throughput environment?
» Understanding the data
  » How much will we produce?
  » How much will we keep?
  » How much must we move?

How much data will we produce?
» Raw instrument data (a huge number of large images)
» Intermediate pipeline processing data (the product of image processing). Typically very many text files.
  » A run folder has >1 million files in it
» Results data: a small number of large files. May be 100x smaller than the raw data
  » QC and LIMS
  » Bases and qualities
  » Alignments

How much data will we keep?
» Images (raw data) are not interesting in the long term. Keep them for only days or a few weeks (this allows for re-analysis)
» Keep whatever intermediate data you need to validate the experiment as a success.
» QC data, LIMS and tracking information may be stored longer term (years?).
» Results data: keep forever
  » Bases and qualities
  » Alignments, SNPs

How much data will we move?
» Data has to be separated from the instrument at some point (RTA now does this for us)
» May need to move to several locations for analysis, safe archive, etc.
» Terabytes of data are likely to be involved
» Moving terabyte datasets around networks is non-trivial even in an advanced IT infrastructure
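To see why moving terabyte datasets is non-trivial, a lower bound on transfer time can be worked out from dataset size and link speed. This is a minimal sketch: the link speeds and the `efficiency` factor are hypothetical inputs, not figures from the talk, and real transfers of run folders with millions of small files are usually slower than raw bandwidth suggests.

```python
def transfer_hours(size_tb, link_gbit_per_s, efficiency=1.0):
    """Hours needed to move `size_tb` decimal terabytes over a network link.

    `efficiency` is the fraction of the nominal link speed actually achieved
    (a hypothetical tuning knob; 1.0 means an ideal, uncontended link).
    """
    if link_gbit_per_s <= 0:
        raise ValueError("link speed must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = size_tb * 1e12 * 8                       # terabytes -> bits
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Illustrative cases only; sizes and link speeds are assumptions.
    for size_tb, gbit, eff in [(1, 1, 1.0), (10, 1, 0.5), (10, 10, 0.5)]:
        hours = transfer_hours(size_tb, gbit, eff)
        print(f"{size_tb:>3} TB over {gbit:>2} Gbit/s at {eff:.0%}: {hours:6.1f} h")
```

Even on an ideal 1 Gbit/s link, a single terabyte takes a little over two hours.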

Sanger NGS Data Output
[Figure: Sanger NGS data output, annotated with instrument upgrades and yearly capillary output]

Storage Planning
» This is difficult, and getting it wrong can break budgets and science projects
» Think first in terms of bases produced, not bytes needed
» Work out bytes-per-base multipliers that are sensible for your scientific objectives

Storage Planning: An Example from Sanger
» We allow ~15 bytes/base for pipeline output storage.
  » Drive this down with more efficient storage formats!
» Allow 15x-20x inflation for analysis (e.g. alignments and SNP calling)
» Allow ~5x for long-term storage of results
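The bases-first approach can be sketched as a small calculator. The slide does not say what the 15x-20x and ~5x multipliers are relative to; this sketch assumes they are bytes-per-base multipliers, like the ~15 bytes/base figure. Adjust the defaults if your reading differs. The class and function names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BytesPerBase:
    """Bytes-per-base multipliers (ASSUMED reading of the slide's figures)."""
    pipeline_output: float = 15.0   # "~15 bytes/base" for pipeline output
    analysis_low: float = 15.0      # "15x-20x" for analysis, lower bound
    analysis_high: float = 20.0     # "15x-20x" for analysis, upper bound
    long_term: float = 5.0          # "~5x" for long-term results storage


def _to_tb(n_bytes):
    """Bytes to decimal terabytes."""
    return n_bytes / 1e12


def storage_estimate_tb(gbases, multipliers=BytesPerBase()):
    """Estimate storage (TB) per category for `gbases` gigabases of sequence."""
    bases = gbases * 1e9
    return {
        "pipeline_output": _to_tb(bases * multipliers.pipeline_output),
        "analysis_low": _to_tb(bases * multipliers.analysis_low),
        "analysis_high": _to_tb(bases * multipliers.analysis_high),
        "long_term": _to_tb(bases * multipliers.long_term),
    }


if __name__ == "__main__":
    # 50 Gbases is the per-run yield quoted later in the talk.
    for category, tb in storage_estimate_tb(50).items():
        print(f"{category:>16}: {tb:.2f} TB")
```

Keeping the multipliers in one place makes it easy to re-run the estimate when a more efficient storage format drives the bytes-per-base figure down.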

Compute Planning
» Depends on the type of analysis.
» Work out how many millions of short reads your preferred aligner can process per hour
» Extrapolate to the number of CPU days/day you will need to keep up
» Analysis is rarely a clean process. Much reanalysis takes place
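The extrapolation from aligner throughput to CPU days/day can be written down directly. In this sketch the read counts, the aligner rate, and the `reanalysis_factor` (a multiplier standing in for "much reanalysis takes place") are all hypothetical inputs chosen for illustration.

```python
import math


def cpu_days_per_day(mreads_per_day, mreads_per_cpu_hour, reanalysis_factor=1.0):
    """CPU-days of alignment work generated per calendar day.

    mreads_per_day:      millions of reads produced per day
    mreads_per_cpu_hour: millions of reads one CPU can align per hour
    reanalysis_factor:   >= 1; extra work from reruns (hypothetical knob)
    """
    if mreads_per_cpu_hour <= 0:
        raise ValueError("aligner throughput must be positive")
    if reanalysis_factor < 1:
        raise ValueError("reanalysis_factor must be >= 1")
    cpu_hours = mreads_per_day / mreads_per_cpu_hour
    return cpu_hours / 24 * reanalysis_factor


def cores_needed(mreads_per_day, mreads_per_cpu_hour, reanalysis_factor=1.0):
    """Whole CPUs that must run continuously just to keep up."""
    return math.ceil(
        cpu_days_per_day(mreads_per_day, mreads_per_cpu_hour, reanalysis_factor)
    )


if __name__ == "__main__":
    # Made-up example: 2,400 M reads/day, 10 M reads per CPU-hour, 50% reruns.
    print(cpu_days_per_day(2400, 10, 1.5), "CPU-days per day")
    print(cores_needed(2400, 10, 1.5), "cores running continuously")
```

A result of N CPU-days per day means N cores must be busy around the clock with no headroom, so real provisioning would sit above that figure.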

Compute + Storage = I/O
» If your compute and storage requirements are big, your network and disk I/O will be critical to efficiency.
» Moving data around is very slow
» Keep compute and storage close and well connected.

Sequencing Data Flow
Steps:
1. RTA/CIFS
2. Pipeline analysis
3. Archive
4. Secondary analysis
[Diagram components: sequencing farm, 10 x 50 TB NFS staging area, analysis farm (shown twice), Lustre scratch storage, Oracle database (100 TB), archive (ENA)]

Instrument Data Management
» Staging storage, written via RTA/CIFS, with an area per instrument (IL1, IL2, IL3)
» 10-15 TB per instrument
» 4-6 week production buffer
» Staged data deletion policy
» Stages: Incoming, Analysis, Outgoing
» Pipeline monitor
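The per-instrument figure can be cross-checked against numbers quoted elsewhere in the talk: a 10 x 50 TB staging area and 40 GAIIx instruments. The sketch below assumes the staging area is shared evenly across all 40 instruments and that the buffer is sized to hold 4-6 weeks of output. Both are readings of the slides, not stated facts.

```python
STAGING_VOLUMES = 10     # "10 x 50Tb NFS staging area"
VOLUME_SIZE_TB = 50
INSTRUMENTS = 40         # "40 x GAIIx"


def staging_per_instrument_tb(volumes=STAGING_VOLUMES,
                              volume_tb=VOLUME_SIZE_TB,
                              instruments=INSTRUMENTS):
    """Staging capacity per instrument, assuming an even split (assumption)."""
    return volumes * volume_tb / instruments


def weekly_output_range_tb(per_instrument_tb, buffer_weeks=(4, 6)):
    """Weekly output per instrument implied by a buffer lasting 4-6 weeks.

    Returns (low, high): the buffer lasts longest at the low rate.
    """
    shortest, longest = buffer_weeks
    return per_instrument_tb / longest, per_instrument_tb / shortest


if __name__ == "__main__":
    share = staging_per_instrument_tb()
    low, high = weekly_output_range_tb(share)
    print(f"staging per instrument: {share} TB")
    print(f"implied output: {low:.2f} to {high:.2f} TB/week per instrument")
```

The even split gives 12.5 TB per instrument, which falls inside the 10-15 TB range on the slide.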

What have we learned?

Manufacturers are upgrading instruments constantly
» Illumina went from 10 Gbases per run in Q1 2009 to 50 Gbases now, with a projected 95 Gbases per run by the end of 2009.
» Storage requirements increase 10-fold in one year.
» But real-world data yields rarely match those advertised
» At some point the informatics/IT budget passes the sequencing budget
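The 10-fold figure follows from the quoted per-run yields, on the assumption that storage scales linearly with bases produced. A quick check, with dictionary keys invented for the example:

```python
# Per-run yields in gigabases, as quoted on the slide.
YIELD_GBASES = {
    "q1_2009": 10,
    "talk_date": 50,             # "now", i.e. the date of the talk
    "end_2009_projected": 95,
}


def fold_change(start, end):
    """How many times larger `end` is than `start`."""
    if start <= 0:
        raise ValueError("start must be positive")
    return end / start


if __name__ == "__main__":
    base = YIELD_GBASES["q1_2009"]
    for label, gbases in YIELD_GBASES.items():
        print(f"{label:>20}: {gbases:>3} Gbases, {fold_change(base, gbases):.1f}x Q1 2009")
```

The projected end-of-year yield is 9.5 times the Q1 figure, which rounds to the 10-fold storage increase quoted.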

Plan for Change
» We just have to accept that instruments, software and data processing requirements are changing very rapidly (month by month).
» Plan storage infrastructure carefully, or data management quickly gets out of control and projects will suffer

Precision is Difficult
» We almost always underestimate the informatics resources needed to support data production and analysis.
» Lab protocols and analysis techniques are changing rapidly. We need an agile approach to developing our software
  » The software will probably be obsolete in less than 12 months

In Conclusion
» Next-gen sequencing is still a very rapidly moving field.
» Plan for change!
  » Keep infrastructure flexible
  » Keep disk space expandable
  » Keep software agile