CHALLENGES IN NEXT-GENERATION SEQUENCING



BASIC TENETS OF DATA AND HPC

Gray's Laws of data engineering[1]: Scientific computing is very data-intensive, with no real limits. The solution is a scale-out architecture with distributed data access: bring the computations to the data, rather than the data to the computations.

Amdahl's Law in high-performance computing (HPC)[2]: Adding more microprocessor cores to a process does not speed it up linearly.

    Speedup = 1 / ((1 - P) + P/N)

where:
- N = number of parallel parts
- P = parallel portion of the program (as a fraction)
- (1 - P) = the sequential piece
- P/N = the concurrent/parallel piece

Even though NGS is highly parallelized, multicore processors do not alleviate this issue: modern HPC systems are CPU rich and I/O poor.

Figure 1: Amdahl's Law

The goal of next-generation sequencing (NGS) is to create large, biologically meaningful contiguous regions of DNA sequence, the building blocks of the genome, from billions of short-fragment data pieces. Whole genome shotgun sequencing is the best approach based on cost per run, compute resources, and clinical significance. Shotgun sequencing is the random sampling of read sequences from NGS instruments with optimal coverage, where NGS coverage is defined as:

    coverage = number of reads x (read length / length of genome)

The number of reads is usually in the millions, with the read length and genome length quoted in base pairs. The human genome is about 3 billion base pairs long.

The steps in shotgun sequencing are:
1. Extract and fragment DNA.
2. Clone the DNA and sequence both ends of each clone.
3. Get raw data from the sequencer and call bases.
4. Assemble the sequence by creating a De Bruijn graph[3] (which can have 100 million-plus nodes, with segments called k-mers) and its sub-graphs; detect overlaps and join; reduce the graph and create scaffolds.
5. Finish the sequence by closing gaps; higher coverage means fewer and smaller gaps.
6. Find phenotypic and clinical significance via single-nucleotide polymorphisms (SNPs), insertions/deletions (indels), variants, copy-number variations (CNVs), and mutations.

The architectural components of NGS are the instrument(s), high-performance computing (HPC), storage, and the network. The goal of HPC in NGS is to reduce latency and maximize the amount of DNA sequence data processed in unit time. The slowest components in an HPC environment are the network and disk. Several reasons for not achieving total parallelization[4] include:

- Algorithmic limitations: These occur due to mutual dependencies, or portions of the process that can only be performed sequentially.
- Bottlenecks: Data access is the largest bottleneck in HPC workflows. Because loop-based algorithms move large amounts of data in and out of the CPU, on-chip resources tend to be underutilized, and performance is limited by the slowest data paths to memory and storage. Access to a shared resource (execution paths in a core, shared memory paths in multi-core chips, and I/O devices) serializes execution, which also hurts concurrency.
- Startup overhead: Base calling and other steps that write large numbers of small files incur this overhead before caching.
- Communication: Full concurrency between parts of a parallel system is more theoretical than practical. Since communication is the cornerstone of parallel algorithms, some serialization is always involved. Other bottlenecks, however small, stem from driver and version jitter within the OS and network.
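Both formulas above, Amdahl's speedup and NGS coverage, are easy to sanity-check with a short Python sketch. The run parameters plugged in below (a 95 percent parallel fraction, 1.5 billion reads of 100 bp against a 3 Gbp genome) are illustrative assumptions, not figures from any particular instrument.

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n),
    where p is the parallel fraction and n is the number of cores."""
    return 1.0 / ((1.0 - p) + p / n)

def ngs_coverage(num_reads, read_length_bp, genome_length_bp):
    """NGS coverage = number of reads x (read length / length of genome)."""
    return num_reads * read_length_bp / genome_length_bp

# Even a 95%-parallel pipeline scales far below linearly:
print(round(amdahl_speedup(0.95, 72), 1))    # 15.8x on 72 cores
print(round(amdahl_speedup(0.95, 1000), 1))  # 19.6x on 1,000 cores

# Illustrative run: 1.5 billion 100 bp reads over a 3 Gbp human genome
print(ngs_coverage(1.5e9, 100, 3e9))         # 50.0, i.e., 50x coverage
```

Note how going from 72 to 1,000 cores lifts the speedup only from about 15.8x to 19.6x: with a 5 percent serial fraction, the ceiling is 20x no matter how many cores are added.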

NGS use cases

The faster-than-Moore's-Law advances in genomics science over the past five years have led to the acceleration of Molecular Diagnostics (MDx), which Dr. Leroy Hood describes as P4 Medicine[5]: personalized, predictive, preventive, and participatory. The use cases that enable clinical genomics using NGS are shown in Table 1.

Table 1: NGS use cases and analysis strategy

- RNA-Seq
  Nucleic acid population: RNA (may be poly-A, mRNA, or total RNA)
  Analysis strategy: Alignment of reads to genes; variations for detecting splice junctions and quantifying abundance
- Small RNA sequencing
  Nucleic acid population: Small RNA (often miRNA)
  Analysis strategy: Alignment of reads to small RNA references (e.g., miRBase), then to the genome; quantify abundance
- ChIP-Seq
  Nucleic acid population: DNA bound to protein, captured via antibody (ChIP = Chromatin ImmunoPrecipitation)
  Analysis strategy: Align reads to the reference genome; identify peaks and motifs
- Structural variation analysis
  Nucleic acid population: Genomic DNA, with two reads (mate-pair reads) per DNA template
  Analysis strategy: Align mate-pairs to the reference sequence and interpret structural variants
- De novo sequencing
  Nucleic acid population: Genomic DNA, possibly with external data (e.g., cDNA, genomes of closely related species)
  Analysis strategy: Piece together reads to assemble contigs, scaffolds, and (ideally) a whole-genome sequence
- Metagenomics
  Nucleic acid population: Entire RNA or DNA from a (usually microbial) community
  Analysis strategy: Phylogenetic analysis of sequences

The analysis strategy is driven mostly by open source software, as shown in Table 2 below. However, as the clinical genomics field matures, supported platforms like Avadis and CLCbio are gaining traction due to documentation, audit, and regulatory requirements. Techniques like small RNA sequencing, ChIP-Seq, and RNA-Seq, along with de novo sequencing, are showing clinical promise.

Table 2: Commonly used analysis tools in NGS, by step in the bioinformatics process

- Integrated toolkits: CASAVA, Avadis, CLCbio Genomics Workbench, Galaxy, GATK, SAMtools
- Align/assemble to reference: BoWTie, BWA, ELAND, Maq, BFAST
- De novo alignment: ABySS, SOAPdenovo
- SNP/indel discovery: SOAPsnp, SAMtools
- Annotation, browser, and genome assembly: EagleView, Harvard Genotator, NCBI MapView, UCSC Genome Browser

ILLUMINA HISEQ

The NGS architecture shown in Figure 2 is for a single production sequencer, in this example the Illumina HiSeq. Illumina HiSeq 2000 data throughput:

- Throughput of approximately 30 to 50 Gbp per day
- Read length of about 100 bp; approximately 50x to 75x coverage
- ~8 TB raw data per run (excluding images and log files; including base call files)
- ~100 GB results data per run
- 2 runs per week per instrument; ~8 TB per week per instrument
- 4 whole genome sequences (WGS) or 16 exomes per week per instrument
- Realistic throughput of 160 whole genomes or 480 exomes per year per instrument, or about 350 TB per year

With sequencing as a service gaining popularity, the researcher focuses on the Binary Alignment Map (BAM) file to begin sequence analysis from a functional perspective: SNPs, insertions/deletions (indels), variants, CNVs, and mutations.

Figure 2: Reference architecture
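The yearly per-instrument sizing follows from simple arithmetic, sketched below in Python. The 40-week duty cycle is an assumption chosen to reconcile the weekly figures with the "realistic" 160-genome yearly throughput quoted above; it is not an Illumina specification.

```python
# Per-instrument sizing sketch using the HiSeq 2000 figures quoted above.
TB_PER_WEEK = 8              # ~8 TB raw data per instrument per week
WGS_PER_WEEK = 4             # whole genome sequences per week
ACTIVE_WEEKS_PER_YEAR = 40   # assumed realistic duty cycle (not a vendor number)

wgs_per_year = WGS_PER_WEEK * ACTIVE_WEEKS_PER_YEAR
raw_tb_per_year = TB_PER_WEEK * ACTIVE_WEEKS_PER_YEAR

print(wgs_per_year)     # 160 whole genomes per instrument per year
print(raw_tb_per_year)  # 320 TB, in line with the ~350 TB per year quoted above
```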

The solution: EMC Isilon

EMC ISILON ONEFS

The EMC Isilon OneFS operating system combines the three layers of traditional storage architectures (the file system, volume manager, and RAID) into one unified software layer, creating a single intelligent distributed file system that runs on one storage cluster. The advantages of OneFS for NGS are many:

- Scalable: Scale out as needs grow, with linear scaling from 18 TB to 20 PB of capacity in a single file system and a single global namespace.
- Predictable: Dynamic content balancing is performed as nodes are added or upgraded, or as capacity changes. The process is simple and requires no added management time.
- Available: OneFS is self-healing. It protects your data from power loss, node or disk failures, and loss of quorum, and it accelerates storage rebuilds, by distributing data, metadata, and parity across all nodes.
- Efficient: Compared to the average 50 percent efficiency of traditional RAID systems, OneFS provides over 80 percent efficiency, independent of CPU compute or cache. This efficiency is achieved by tiering into the three node types shown in Figure 3, and by the pools within these node types.
- Enterprise-ready: Administration of the storage clusters is via an intuitive Web-based UI. Connectivity to your process is through standard protocols: CIFS, SMB, NFS, FTP/HTTP, Object, and HDFS. Standardized authentication and access control are available at scale: AD, LDAP, and NIS.

Figure 3: EMC Isilon storage types and functionality

Note that either the X- or S-Series and the NL-Series should be procured together to build a balanced system.

Assumptions:
a. Raw files are handled separately.
b. The process starts at base call files.
c. A common archival layer is used.

Output files:
- BAM files: 30 bytes per read + 2 bytes per base pair; approximately 100 GB to 250 GB for a human WGS
- SRA files: 10 bytes per base pair (~30 GB for a human WGS)

Analysis and interpretation times are not included; these vary from a week to several months.
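The BAM and SRA rules of thumb above translate directly into a size estimator. The 30x, 100 bp run below is an assumed example, chosen to land inside the 100 GB to 250 GB range quoted for a human WGS.

```python
def bam_size_gb(num_reads, read_length_bp):
    """BAM rule of thumb: 30 bytes per read + 2 bytes per base pair sequenced."""
    return (num_reads * 30 + num_reads * read_length_bp * 2) / 1e9

def sra_size_gb(genome_length_bp):
    """SRA rule of thumb: 10 bytes per base pair of the genome."""
    return genome_length_bp * 10 / 1e9

# Assumed run: 30x coverage of a 3 Gbp human genome with 100 bp reads
genome_bp = 3e9
reads = 30 * genome_bp / 100      # 0.9 billion reads
print(bam_size_gb(reads, 100))    # 207.0 GB, inside the quoted 100-250 GB range
print(sra_size_gb(genome_bp))     # 30.0 GB
```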
Table 3: Two storage reference architectures

ONE-TO-ONE ARCHITECTURE: one sequencer to one storage cluster (3 sequencers; the example below covers a single sequencer)
- EMC Isilon X-Series, 18 TB raw (3x 6 TB), with a minimum of 400 GB SSD and 48 GB RAM
- Move data to the archival layer after every run using SmartPools or SnapshotIQ
- Scratch area (tmp folders) on the HPC cluster if the NGS analysis software allows this configuration; otherwise, a scratch area on an S-Series cluster
- IB backend and a minimum of 1 Gbps front-end Ethernet (2x 1 Gbps LACP, or 10 Gbps for improved performance)
- Archival layer: NL-Series, 108 TB raw (3x 36 TB); design with 324 TB raw (3x 108 TB) if planning for one year of storage
- Scale this out as sequencers are added

MANY-TO-ONE ARCHITECTURE: many sequencers to one storage cluster (3 sequencers; the example below covers all 3)
- EMC Isilon S-Series, 36 TB raw (3x 12 TB), with a minimum of 800 GB SSD and 144 GB RAM
- Scratch area (tmp folders) on the storage cluster
- IB backend and 10 Gbps front-end Ethernet
- Archival layer, with batched movement using SmartPools rules or SnapshotIQ: NL-Series, 324 TB raw (3x 108 TB); design with 2x NL clusters if planning for one year of storage
- Scale this out as sequencers are added
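The archival-tier sizes in Table 3 can be sanity-checked against the per-instrument data rate. The helper below assumes ~8 TB per sequencer per week, as quoted earlier; it is a planning sketch, not an EMC sizing rule.

```python
def weeks_until_full(archive_tb, sequencers, tb_per_week_each=8):
    """How long an archival tier lasts at the quoted per-instrument data rate."""
    return archive_tb / (sequencers * tb_per_week_each)

# One-to-one example: 108 TB NL-Series tier behind a single sequencer
print(weeks_until_full(108, 1))   # 13.5 weeks
# Sized for a year: 324 TB raw
print(weeks_until_full(324, 1))   # 40.5 weeks, about a year of realistic duty cycle
# Many-to-one example: 324 TB tier shared by 3 sequencers
print(weeks_until_full(324, 3))   # 13.5 weeks; hence 2x NL clusters for a full year
```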

Data flow and performance

NGS DATA FLOW AND SIZING

Despite the similarity of NGS components (sequencer(s), HPC, and storage), the workflows can be quite different, as shown in Table 1. It is imperative that the research and IT teams understand and plan the HPC and storage architecture accordingly. Figure 4 illustrates one such workflow with its data paths, files, and data sizes; it is representative of most WGS workflows. It is also critical to understand the instrument throughput.

Figure 4: Data flow and sizing

PERFORMANCE METRICS

As with understanding the workflow, the degree of parallelization of the algorithms in the process is critical for performance. The other main factors are RAM (3 GB/core), NFS tuning, and jumbo frames (TCP MTU 9000). Thread tuning for SGE and NFS can also help.

Test setup:
- 72-core (2.6 GHz) HPC platform with 144 GB RAM on CentOS 6.2
- Son of Grid Engine and PDSH; parallel make with GCC
- EMC Isilon S-Series node with 12 TB raw, 540 GB SSD, and 144 GB RAM, with IB backend, on OneFS 6.5.4
- 1 Gbps front-end Ethernet, NFSv4
- Illumina CASAVA[6] 1.8 analysis platform
- Human whole genome sequence data from Illumina

Figure 5: NGS performance with Illumina CASAVA

References:

1. Jim Gray, "Scalable Computing," presentation at Nortel, Microsoft Research, April 1999.
2. Gene Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," AFIPS Conference Proceedings (30): 483-485, 1967.
3. Compeau, Pevzner, and Tesler, "How to apply de Bruijn graphs to genome assembly," Nature Biotechnology, Volume 29, Number 11, November 2011.
4. Hager and Wellein, Introduction to High Performance Computing for Scientists and Engineers, CRC Press, 2011.
5. Hood and Friend, "Predictive, personalized, preventive, participatory (P4) cancer medicine," Nature Reviews Clinical Oncology 8, 184-187, March 2011.
6. CASAVA 1.8 User Guide, Illumina Part #15011196 Rev B, May 2011.

CONCLUSION

Next-generation sequencing is a complex and multivariate endeavor, with the instrument end changing and evolving faster than the storage or HPC layers as the process moves into clinical genomics. EMC Isilon makes the storage design, implementation, and upgrade process for NGS painless and simple.

CONTACT US

To learn more about how EMC products, services, and solutions can help solve your business and IT challenges, contact your local representative or authorized reseller, or visit us at www.emc.com/isilon.

EMC2, EMC, the EMC logo, Isilon, OneFS, SmartPools, and SnapshotIQ are registered trademarks or trademarks of EMC Corporation in the United States and other countries. All other trademarks used herein are the property of their respective owners. Copyright 2013 EMC Corporation. All rights reserved. Published in the USA. 06/13 Solution Overview H10628.3

EMC believes the information in this document is accurate as of its publication date. The information is subject to change without notice.

www.emc.com