The NGS IT notes. George Magklaras PhD RHCE




The NGS IT notes
George Magklaras PhD RHCE
Biotechnology Center of Oslo & The Norwegian Center of Molecular Medicine, University of Oslo, Norway
http://www.biotek.uio.no
http://www.ncmm.uio.no
http://www.no.embnet.org

Agenda
- NGS basic notions and history
- Outline of NGS IT challenges
- The storage part
- The networking part
- The computing part
- The big picture

How not to hit the NGS IT platform brick wall
[Figure labels: NGS, brave IT engineer]

NGS basics: The genome
The genome is the entirety of an organism's hereditary information. It is encoded either in DNA or RNA. The genome includes both the genes and the non-coding sequences of the DNA/RNA. The DNA/RNA sequence is the code. This code is used as the blueprint to construct proteins, the building blocks living things are made of.

NGS (why?)
Next Generation Sequencing (NGS, also known as High Throughput Sequencing) aims to help us:
- Record the code of life in a reasonable time and at a reasonable cost (sequencing).
- Understand how living things are made (transcriptome, proteome) and how the code works.
- Understand how the environment affects us (metagenomics) and how we differ from each other (comparative genomics).
- Assemble this information into a complete genome project.

Sequencing throughput (ST)
ST = sequences generated / time (or cost)
Gb = Gigabases, Mb = Megabases, kb = Kilobases
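As a rough illustration of the throughput definition, the sketch below divides the number of bases generated by the run time and prints the result in the kb/Mb/Gb units listed on the slide. The function names and the example figures (30 Gb in 240 hours) are hypothetical and are not taken from any instrument specification.

```python
# Units from the slide: Gigabases, Megabases, Kilobases.
UNITS = (("Gb", 1e9), ("Mb", 1e6), ("kb", 1e3))


def sequencing_throughput(bases_generated, hours):
    """Return throughput in bases per hour (ST = sequences generated / time)."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return bases_generated / hours


def format_bases(bases):
    """Format a base count using the largest unit that fits."""
    for name, size in UNITS:
        if bases >= size:
            return f"{bases / size:.2f} {name}"
    return f"{bases:.0f} bases"


# Hypothetical run: 30 Gb produced over 240 hours.
st = sequencing_throughput(30e9, 240)
print(format_bases(st) + "/hour")
```

The same function works for the cost variant of the formula if a cost figure is passed instead of hours.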

Reverse engineering workflow (software) This program performs functions A and B by using algorithms X and Y.

NGS workflow Sequence Assembly Annotated Genome Project

NGS technologies
- 454 (Roche)
- Illumina Genome Analyzer
- Applied Biosystems SOLiD Analyzer
- Single Molecule Sequencing Analyzers

Illumina NGS workflow

NGS history (2)
1984: Leroy E. Hood builds the first semi-automated DNA sequencer
1987: ABI 370 DNA sequencer by Applied Biosystems

NGS history
Walter Gilbert, Frederick Sanger, Paul Berg
1980 - Nobel Prize in Chemistry
"for their contributions concerning the determination of base sequences in nucleic acids" (Gilbert, Sanger)
"for his fundamental studies of the biochemistry of nucleic acids, with particular regard to recombinant-DNA" (Berg)

NGS history (3)
1996: Pål Nyrén and Mostafa Ronaghi invent pyrosequencing at the Royal Institute of Technology (KTH), Sweden
2004: 454 Pyrosequencer

NGS history (4)
David Walt, Shankar Balasubramanian, David Klenerman
"We have an idea that could increase the rate and lower the cost of gene sequencing by [a factor of] 10^4 or 10^5"

The NGS IT challenge
Life scientists generate more data than they can analyze. They need the data, but they need IT to manage the NGS workflow. You need to know how to:
- Store: How much and at what speed?
- Move: NGS data needs to move around.
- Process the data: CPU and memory requirements.

Enter the Petabyte era

The Petabyte table

HTS device  Runs/year  Tier 1 (Gbytes)  Tier 2 (Gbytes)  Tier 3 (Gbytes)  Tier 4 (Gbytes)  TOTAL (Tbytes)
Illumina    100        9728             100              100              400              990
454         100        200              50               25               75               27
SOLiD       100        6144             100              100              200              80

NGS devices data production Courtesy Elliott H. Margulies PhD, NIH

Tiered storage
- Tier 1: Raw, unprocessed data as they come out of the instrument (mostly images).
- Tier 2: Base (or colour) calls, intensities and first-pass quality scores.
- Tier 3: Aligned and analyzed data (alignments of all the reads to a reference, or de novo assembly if required).
- Tier 4: Off-site backup of Tiers 2 and 3, to meet disaster recovery and regulatory requirements.

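Tiered storage equations

$$\mathrm{Tier1}_{store} = \sum \left( N_{hts} \times G_{bpr} + \frac{N_{hts} \times G_{bpr}}{4} \right) (\times N_{runs})$$
where $N_{hts}$ = number of HTS devices per type and $G_{bpr}$ = Gigabytes per run.

$$\mathrm{Tier2,3}_{store} = \sum \left( N_{runs} \times G_{analysis} + \frac{N_{runs} \times G_{analysis}}{3} \right)$$
where $N_{runs}$ = expected number of runs per year and $G_{analysis}$ = Gigabytes per run for Tiers 2 and 3 (Table 1).

$$\mathrm{Tier4}_{store} = \mathrm{Tier2,3}_{store} + R_{period} \times \mathrm{Tier2,3}_{store}$$
where $R_{period}$ = number of years to keep the data.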

Tiered storage example
- 2 x Illumina
- 2 x 454
- 1 x SOLiD
- 3-year data retention period
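The tiered storage equations can be applied to this example configuration with a short script. This is only a sketch under stated assumptions: the per-run Gbytes are copied from the Petabyte table, Ganalysis is assumed to be the sum of the Tier 2 and Tier 3 columns, the optional Nruns factor in the Tier 1 equation is applied, and each sum runs over device types. The function and variable names are hypothetical.

```python
# Per-run Gbytes copied from the Petabyte table: (Tier 1, Tier 2, Tier 3).
PER_RUN_GBYTES = {
    "Illumina": (9728, 100, 100),
    "454": (200, 50, 25),
    "SOLiD": (6144, 100, 100),
}


def tier1_store(devices, runs_per_year):
    """Tier1store = sum((Nhts*Gbpr + Nhts*Gbpr/4) * Nruns) over device types."""
    total = 0.0
    for name, n_hts in devices.items():
        base = n_hts * PER_RUN_GBYTES[name][0]
        total += (base + base / 4) * runs_per_year
    return total


def tier23_store(devices, runs_per_year):
    """Tier2,3store = sum(Nruns*Ganalysis + Nruns*Ganalysis/3) over device types."""
    total = 0.0
    for name in devices:
        _, tier2, tier3 = PER_RUN_GBYTES[name]
        # Assumption: Ganalysis is the Tier 2 plus Tier 3 volume per run.
        base = runs_per_year * (tier2 + tier3)
        total += base + base / 3
    return total


def tier4_store(tier23_gbytes, retention_years):
    """Tier4store = Tier2,3store + Rperiod * Tier2,3store."""
    return tier23_gbytes + retention_years * tier23_gbytes


# Example configuration from the slide: device type -> number of devices.
devices = {"Illumina": 2, "454": 2, "SOLiD": 1}
t1 = tier1_store(devices, runs_per_year=100)
t23 = tier23_store(devices, runs_per_year=100)
t4 = tier4_store(t23, retention_years=3)
for label, gbytes in (("Tier 1", t1), ("Tier 2+3", t23), ("Tier 4", t4)):
    print(f"{label}: {gbytes / 1024:.1f} Tbytes")
```

Note that the Tier 2,3 equation as written does not multiply by the number of devices, so the sketch follows it literally; if Nruns is meant per device, the result scales accordingly.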

Filesystems
A filesystem is a key component of the operating system that dictates how files are stored and accessed.
- Commonly used disk filesystems: ext3/4 (Linux), NTFS (Windows), HFS+ (Mac OS X), ZFS (Sun), XFS (SGI)
- Shared/clustered/SAN filesystems: GFS (Red Hat), Xsan (Apple)
- Distributed file systems (network file systems): NFS(9), CIFS/SMB
- Distributed parallel fault-tolerant file systems: GPFS (IBM), XtreemFS, OneFS (Isilon), PanFS (Panasas), Lustre

Filesystem Requirements
Next Gen Sequencing (NGS) filesystems need to:
- Be scalable in size
- Be scalable in the number of IOPS for reads/writes/nested directory access
- Allow concurrent access
- Have file redundancy/replication features

The data network and storage (2)

The data network and storage (3)
Questions for your IT architect/system administrator(s):
- Can you afford pure Fibre Channel solutions today?
- How many storage interconnects do you have (GigE, FC, InfiniBand)?
- Would it not be nice to have a smaller number of storage interconnects (consolidation)?
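To put the interconnect question in numbers, the sketch below estimates how long it takes to move one run's worth of Tier 1 data over links of different nominal speeds. It assumes the 9728 Gbytes Illumina Tier 1 figure from the Petabyte table is a per-run volume, uses decimal gigabytes, and ignores protocol overhead unless an efficiency factor is passed. The function name and the efficiency parameter are hypothetical.

```python
def transfer_hours(gigabytes, link_gbit_per_s, efficiency=1.0):
    """Hours needed to move `gigabytes` over a link of `link_gbit_per_s`.

    `efficiency` (0 < efficiency <= 1) is an assumed fraction of the nominal
    line rate that is actually usable; 1.0 means the ideal case.
    """
    if link_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    bits = gigabytes * 8 * 1e9
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 3600


run_gbytes = 9728  # Illumina Tier 1 figure from the Petabyte table
for name, speed in (("GigE", 1), ("4 Gbit FC", 4), ("8 Gbit FC", 8), ("10GbE", 10)):
    print(f"{name}: {transfer_hours(run_gbytes, speed):.1f} hours")
```

Even in the ideal case, a single Gigabit Ethernet link needs most of a day for one such run, which is one argument for consolidating onto a faster interconnect.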

The data network and storage (4) FCoE

The data network and storage (5) - FCoE

NGS Computing
- Multi-core CPU-intensive jobs: image processing, quality scores
- Multi-core memory-intensive jobs: sequence assembly
Sequence assembly is the most challenging computational task in NGS, besides data storage.

Sequence assembly graphs Sequencing Reads Graph Data Structure

De Bruijn Graph
http://www.nature.com/nbt/journal/v29/n11/full/nbt.2023.html
How to get the complete sequence from short, overlapping reads
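A toy version of the idea can be sketched in a few lines: reads are cut into k-mers, each k-mer becomes an edge between its (k-1)-mer prefix and suffix, and a path that uses every edge once spells out the sequence. This is a minimal illustration, not how any of the production assemblers work: it assumes error-free reads, keeps each distinct k-mer only once (real assemblers track coverage), and assumes an Eulerian path exists. All function names are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:  # keep each distinct k-mer as a single edge
                continue
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in graph if len(graph[n]) - in_degree[n] == 1),
        next(iter(graph)),
    )
    edges = {node: list(successors) for node, successors in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges.get(node):
            stack.append(edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path of the de Bruijn graph."""
    path = eulerian_path(build_de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCG", "GCGTGC", "GTGCA"]  # overlapping reads of one short sequence
print(assemble(reads, k=4))
```

With repeats longer than k-1 bases the path is no longer unique, which is part of why real assemblies need large k-mer tables and a lot of memory.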

Common Sequence Assemblers
- Velvet: http://www.ebi.ac.uk/~zerbino/velvet/
- ALLPATHS-LG: ftp://ftp.broadinstitute.org/pub/crd/allpaths/Release-LG/
- SOAPdenovo: http://soap.genomics.org.cn/soapdenovo.html
- Cortex: http://cortexassembler.sourceforge.net/
- ABySS: http://www.bcgsc.ca/platform/bioinfo/software/abyss
- Curtain: http://code.google.com/p/curtain/

Sequence assembly requirements
- Installation: OpenMP, other middleware and libraries
- Large RAM requirement: get at least one memory-fat node with RAM >= 256 Gbytes (512 Gbytes to 1 Tbyte of RAM are common)
- Standalone or part of a queue/batch system
- Multi-core <-> multi-thread

Tier 1 Architecture http://ctdb.samba.org/

Bill of materials (data network)
- Cisco Nexus 5000 series switch
- QLE8152 Dual Port 10GbE Ethernet to PCIe Converged Network Adapter (CNA), www.qlogic.com

Bill of materials (storage and computing)
- Dell EMC CX4-960 (8Gbit and 4Gbit FC/ 10 with FCoE support modules)
- Dell R815 memory-fat node: 32 cores, 512 Gbytes of RAM
- Dell 1950 (as access/front-end nodes): 8 cores, 64 Gbytes of RAM, QLogic CNA cards

Questions?
admin@embnet.uio.no and http://www.embnet.org/join/contactregistration