Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data
Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang
The Ohio State University
HiCOMB 2014, May 19th, Phoenix, Arizona

Outline
- Introduction
- Sequence Data Format Converter Design
- Experimental Results
- Conclusion

Explosion of Next-Generation Sequencing Data
NGS Advantages
- Faster and cheaper, e.g., over one billion short reads per instrument run
- More accurate: higher resolution and deeper coverage
Challenges
- Urgent need for turning raw data into knowledge
- Parallelism is the key

Historical Trends in Storage Prices vs. DNA Sequencing Costs
[Figure: hard disk storage price (MB per dollar) and DNA sequencing cost (base pairs per dollar), 1990 to 2010, on logarithmic axes, with the pre-next-generation and next-generation sequencing periods marked. Reported by Lincoln Stein.]

Varieties of NGS Data Formats
Different Formats
- SAM (Sequence Alignment/Map): the de facto text format for storing large nucleotide sequence alignments
- BAM (Binary Alignment/Map): the compressed, indexable, binary form of the SAM format; indexing is supported by the BAI (BAM Index) file
- Other formats: BED (Browser Extensible Data), FASTA, FASTQ, WIG (wiggle), GFF (General Feature Format), etc.

Analysis Pipeline
Current Pipeline
- Parallelism mainly focuses on the analysis steps, e.g., SNP discovery and BLAST
Reality
- Cross-utilization
- Problem: sequencing data input
- Some other analysis steps stay sequential
- The other sequential bottlenecks need to be removed

Motivation: Removing Other Sequential Bottlenecks
Parallel Format Conversion
- Current format conversion commonly uses a single core
- Current downstream tools may not be exchangeable between different aligners
- Not hard to implement, but important to scale out
Parallelizing Certain Statistical Analysis Steps
- E.g., parallel analysis on the histogram data

Framework
Sequence Data Format Converter
- Input: SAM/BAM
- Output: BAM/SAM, FASTA, FASTQ, BED, BEDGRAPH, JSON and YAML
Statistical Analysis Module
- Parallelizes other statistical analysis steps
- E.g., non-local means (NL-Means) and false discovery rate (FDR) computation
(Only the first component is discussed in this talk.)
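To make the kind of per-record work the converter performs concrete, here is a minimal Python sketch that turns one SAM alignment line into a BED line and a FASTA entry. It is an illustration only, not the authors' implementation: the function names `sam_to_bed` and `sam_to_fasta` are hypothetical, the BED score column is filled with MAPQ as an assumption, and the FASTA output emits the sequence exactly as stored in the SAM record.

```python
import re

# CIGAR operations; M, D, N, = and X consume reference bases.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")


def sam_to_bed(line):
    """Convert one SAM alignment line to a 6-column BED line (None if unmapped)."""
    fields = line.rstrip("\n").split("\t")
    qname, flag, rname = fields[0], int(fields[1]), fields[2]
    pos, mapq, cigar = int(fields[3]), fields[4], fields[5]
    if flag & 0x4 or cigar == "*":
        return None  # unmapped reads have no interval
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar)
                  if op in REF_CONSUMING)
    start = pos - 1  # SAM POS is 1-based, BED start is 0-based
    strand = "-" if flag & 0x10 else "+"
    return "\t".join([rname, str(start), str(start + ref_len),
                      qname, mapq, strand])


def sam_to_fasta(line):
    """Convert one SAM alignment line to a two-line FASTA entry."""
    fields = line.rstrip("\n").split("\t")
    return f">{fields[0]}\n{fields[9]}"


record = "r1\t16\tchr1\t101\t60\t5M2D3M1I2M\t*\t0\t0\tACGTACGTACG\tIIIIIIIIIII"
print(sam_to_bed(record))
print(sam_to_fasta(record))
```

Because each output line depends only on its own input record, this kind of conversion can be applied to disjoint pieces of the input independently.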

Sequence Data Format Converter
3 Converter Instances
- SAM Format Converter
- BAM Format Converter
- Preprocessing-Optimized SAM Format Converter
Supports partial format conversion on a specific chromosome region

SAM Format Converter
- No communication among processes after partitioning
- Partitioning is the key step for parallelization
- Extensibility and programmability

Partitioning Algorithm
Key: each SAM record is delimited by a line break
1. Initial even partitioning
2. Adjust partition boundaries by detecting line breaks
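The two partitioning steps can be sketched in Python as follows. This is a minimal illustration under assumptions, not the authors' code: the function name `partition_by_lines` is hypothetical, the file is treated as plain newline-delimited bytes, and SAM header lines are not handled specially.

```python
import os


def _next_line_start(f, offset, size):
    """Return the first line-start offset at or after `offset`."""
    if offset <= 0:
        return 0
    if offset >= size:
        return size
    # Start one byte early: if that byte is a line break, `offset`
    # is already a line start and readline() stops right there.
    f.seek(offset - 1)
    f.readline()
    return f.tell()


def partition_by_lines(path, num_parts):
    """Split a text file into byte ranges that end on line breaks."""
    size = os.path.getsize(path)
    # Step 1: initial even partitioning by byte offset.
    cuts = [size * i // num_parts for i in range(num_parts + 1)]
    # Step 2: move each interior boundary forward to the next line start.
    with open(path, "rb") as f:
        for i in range(1, num_parts):
            cuts[i] = _next_line_start(f, cuts[i], size)
    return [(cuts[i], cuts[i + 1]) for i in range(num_parts)]
```

Each returned `(start, end)` range holds only whole records, so a worker can seek to `start`, read `end - start` bytes, and convert them without coordinating with any other worker. A range may be empty when the file has fewer lines than partitions.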

BAM Format Converter
Challenge
- No explicit delimiter: even partitioning -> unparsable records
Solution: add a preprocessing phase
- Partition data by supporting random access
- Cannot be parallelized because of the third-party API

BAMX and BAIX
BAMX (BAM extended) File
- Transform each varying-length BAM record into a regular-layout BAMX record
- Align varying-length BAM fields by padding
BAIX (BAI extended) File
- Index file of the BAMX file
- Stores the alignment starting positions in BAM (logically) and in BAMX (physically)
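The idea of padding variable-length fields into a regular layout can be sketched in Python as below. The layout here is purely hypothetical (a 4-byte position followed by a name and a sequence padded to fixed widths); the actual BAMX field set and widths are not given in the slides. The point is only that fixed-size records make the i-th record addressable by simple arithmetic.

```python
import struct

# Hypothetical fixed layout: little-endian int32 position,
# then name and sequence padded with NUL bytes to fixed widths.
NAME_WIDTH = 16
SEQ_WIDTH = 32
RECORD = struct.Struct(f"<i{NAME_WIDTH}s{SEQ_WIDTH}s")


def pack_record(pos, name, seq):
    """Pad one variable-length record into the fixed-size layout."""
    name_b, seq_b = name.encode("ascii"), seq.encode("ascii")
    if len(name_b) > NAME_WIDTH or len(seq_b) > SEQ_WIDTH:
        raise ValueError("field longer than its padded width")
    # struct pads short byte strings with NUL bytes on the right.
    return RECORD.pack(pos, name_b, seq_b)


def unpack_record(buf, index):
    """Random access: record `index` starts at index * RECORD.size."""
    pos, name_b, seq_b = RECORD.unpack_from(buf, index * RECORD.size)
    return (pos,
            name_b.rstrip(b"\0").decode("ascii"),
            seq_b.rstrip(b"\0").decode("ascii"))
```

With every record occupying `RECORD.size` bytes, an even split of the byte range always falls on record boundaries, which is what a delimiter-free format otherwise prevents. The cost is extra storage for the padding.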

Partial Conversion
- If only a subset is of interest, there is no need for full conversion
- Based on the BAIX file
- Given logical alignment starting and ending positions, locate the physical starting and ending positions in the BAMX file (by binary search)
- Evenly partition the subset and proceed in parallel
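A minimal Python sketch of the lookup and split described above, under stated assumptions: the index is modeled as a sorted list of alignment start positions for a single chromosome, record `i` of the index corresponds to record `i` of a fixed-size-record file, and the names `locate_subset` and `split_evenly` are hypothetical. The real BAIX layout is not specified in the slides.

```python
from bisect import bisect_left, bisect_right


def locate_subset(index_positions, start, end):
    """Binary-search the sorted index for records starting in [start, end].

    Returns a half-open range (first, last) of record numbers.
    """
    first = bisect_left(index_positions, start)
    last = bisect_right(index_positions, end)
    return first, last


def split_evenly(first, last, num_parts):
    """Evenly partition the record range [first, last) among workers."""
    n = last - first
    return [(first + n * i // num_parts, first + n * (i + 1) // num_parts)
            for i in range(num_parts)]


positions = [100, 150, 150, 300, 420, 900]   # sorted alignment starts
first, last = locate_subset(positions, 150, 420)
print(first, last)                  # 1 5
print(split_evenly(first, last, 2)) # [(1, 3), (3, 5)]
```

With fixed-size records, multiplying a record number by the record size gives the physical byte offset, so each worker can seek directly to its share of the subset.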

Preprocessing-Optimized SAM Format Converter
Main Ideas
- Preprocessing can also optimize the SAM format conversion
- Such preprocessing can be parallelized because the SAM format is easy to partition
[Figure: M procs, N procs, M N target files]

Outline
- Introduction
- Sequence Data Format Converter Design
- Parallelization of Statistical Analysis Steps
- Experimental Results
- Conclusion

Experimental Setup
Dataset
- Whole genome DNA-sequencing of three mouse samples
- Approximately 125 million sequences providing about 40-fold coverage of the genome
- In the SAM/BAM format
Cluster
- 8 GB memory
- Up to 32 8-core machines (256 cores in total)

Performance of SAM Format Converter
- Input: 100 GB SAM data
- Output: BED, BEDGRAPH and FASTA
[Figure: speedup vs. number of cores (8, 16, 32, 64, 128) for BED, BEDGRAPH and FASTA output]

Performance of BAM Format Converter
- Input: 117 GB BAM data
- Output: BED, BEDGRAPH and FASTA
[Figure: speedup vs. number of cores (8, 16, 32, 64, 128) for BED, BEDGRAPH and FASTA output]

SAM Format Converter Comparison: Preprocessing-Optimized vs. Original
- Input: 15.7 GB BAM data
- Output: BED, BEDGRAPH and FASTA
[Figure: speedup vs. number of cores (8, 16, 32, 64, 128) for BED_P, BED, BEDGRAPH_P, BEDGRAPH, FASTA_P and FASTA]

Conclusion
- In the NGS analysis pipeline, the overall latency cannot be reduced unless all sequential bottlenecks are removed
- The first framework that can easily support parallel sequence format conversion in a distributed environment
  - SAM format converter
  - BAM format converter
  - Preprocessing-optimized SAM format converter