Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data. Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang, The Ohio State University. HiCOMB 2014, May 19th, Phoenix, Arizona.
Outline: Introduction; Sequence Data Format Converter Design; Experimental Results; Conclusion.
Explosion of Next-Generation Sequencing Data. NGS advantages: faster and cheaper (e.g., over one billion short reads per instrument run); more accurate, with higher resolution and deeper coverage. Challenges: urgent need to turn raw data into knowledge; parallelism is the key.
Historical Trends in Storage Prices vs. DNA Sequencing Costs. [Figure: hard disk storage price (MB per dollar) and DNA sequencing cost (base pairs per dollar), 1990 to 2010, logarithmic axes; series: hard disk storage, pre-next-generation sequencing, next-generation sequencing. Reported by Lincoln Stein.]
Varieties of NGS Data Formats. SAM (Sequence Alignment/Map): the de facto text format for storing large nucleotide sequence alignments. BAM (Binary Alignment/Map): the compressed, indexable, binary form of the SAM format; indexing is supported by the BAI (BAM Index) file. Other formats: BED (Browser Extensible Data), FASTA, FASTQ, WIG (wiggle), GFF (General Feature Format), etc.
Analysis Pipeline. Current pipeline: parallelism mainly focuses on the analysis steps, e.g., SNP discovery and BLAST. Reality: cross-utilization problem (sequencing data input); some other analysis steps stay sequential; the other sequential bottlenecks need to be removed.
Motivation: Removing Other Sequential Bottlenecks. Parallel format conversion: current format conversion commonly uses a single core; current downstream tools may not be exchanged between different aligners; not hard to implement, but important to scale out. Parallelizing certain statistical analysis steps: e.g., parallel analysis of histogram data.
Framework. Sequence Data Format Converter: input is SAM/BAM; output is BAM/SAM, FASTA, FASTQ, BED, BEDGRAPH, JSON and YAML. Statistical Analysis Module: parallelizes other statistical analysis steps, e.g., non-local means (NL-Means) and false discovery rate (FDR) computation. Only the first component is discussed today.
Sequence Data Format Converter. 3 converter instances: SAM Format Converter; BAM Format Converter; Preprocessing-Optimized SAM Format Converter. Supports partial format conversion on a specific chromosome region.
SAM Format Converter. No communication among processes after partitioning, so partitioning is the key step for parallelization. Extensibility and programmability.
Partitioning Algorithm. Key: each SAM record is delimited by a line break. 1. Initial even partitioning. 2. Adjust partition boundaries by detecting line breaks.
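The two steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `partition_by_lines` and the toy input are hypothetical, and it assumes the file is seekable and that every record ends with a line break.

```python
import io

def partition_by_lines(stream, size, num_parts):
    """Split [0, size) into num_parts byte ranges that end on line breaks."""
    bounds = [0]
    for i in range(1, num_parts):
        # Step 1: tentative boundary from an even split
        # (never placed behind the previous boundary).
        guess = max(size * i // num_parts, bounds[-1])
        if guess > 0:
            # Step 2: read from the byte before the guess through the next
            # line break, so the boundary lands on the start of a record.
            stream.seek(guess - 1)
            stream.readline()
            guess = stream.tell()
        bounds.append(guess)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))

# Toy SAM-like input: one tab-separated record per line (made-up content).
data = b"r1\tA\nr2\tBB\nr3\tCCC\nr4\tD\n"
ranges = partition_by_lines(io.BytesIO(data), len(data), 3)
chunks = [data[a:b] for a, b in ranges]
```

Each range can then be handed to a separate process, which parses only whole records and needs no communication with the others.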
BAM Format Converter. Challenge: BAM has no explicit record delimiter, so even partitioning yields unparsable records. Solution: add a preprocessing phase that makes the data partitionable by supporting random access. The preprocessing cannot be parallelized because of the third-party API.
BAMX and BAIX. BAMX (BAM extended) file: transforms each varying-length BAM record into a regular-layout BAMX record; aligns varying-length BAM fields by padding. BAIX (BAI extended) file: the index file of the BAMX file; stores the alignment starting positions in BAM (logically) and in BAMX (physically).
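The idea of padding to a regular layout can be illustrated with a simplified sketch. It is not the actual BAMX layout, which the slides do not specify: here whole records, rather than individual fields, are padded to a common width, and the 4-byte length prefix and the function names are assumptions made for the example.

```python
import struct

def to_fixed_layout(records):
    """Pad each variable-length record to one common width (hypothetical layout)."""
    width = 4 + max((len(r) for r in records), default=0)
    out = bytearray()
    for r in records:
        out += struct.pack("<I", len(r)) + r      # 4-byte length prefix + payload
        out += b"\x00" * (width - 4 - len(r))     # zero padding up to the width
    return width, bytes(out)

def read_record(blob, width, i):
    """Random access: record i always starts at byte offset i * width."""
    start = i * width
    (n,) = struct.unpack_from("<I", blob, start)
    return blob[start + 4:start + 4 + n]

# Made-up byte strings standing in for BAM records.
records = [b"ACGT", b"A", b"ACGTACGT\x00"]
width, blob = to_fixed_layout(records)
```

Because every record occupies the same number of bytes, any record range can be located by arithmetic alone, which is what makes even partitioning safe after preprocessing. The cost is the extra space taken by padding.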
Partial Conversion. If only a subset is of interest, full conversion is unnecessary. Based on the BAIX file: given the logical alignment starting and ending positions, locate the physical starting and ending positions in the BAMX file (by binary search), then evenly partition the subset and proceed in parallel.
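The lookup and partitioning described above can be sketched as follows. This is a hedged illustration only: it assumes the index is a sorted list of alignment starting positions for one chromosome, with list position i corresponding to record i in the fixed-layout file, and the names `locate_subset` and `split_evenly` are hypothetical.

```python
from bisect import bisect_left, bisect_right

def locate_subset(starts, lo, hi):
    """Binary-search the record index range whose start lies in [lo, hi]."""
    first = bisect_left(starts, lo)    # first record with start >= lo
    last = bisect_right(starts, hi)    # one past the last record with start <= hi
    return first, last

def split_evenly(first, last, num_parts):
    """Divide the record range [first, last) into num_parts near-equal ranges."""
    n = last - first
    return [(first + n * i // num_parts, first + n * (i + 1) // num_parts)
            for i in range(num_parts)]

# Made-up sorted alignment starting positions.
starts = [100, 150, 150, 300, 420, 900]
first, last = locate_subset(starts, 150, 420)
parts = split_evenly(first, last, 2)
```

With fixed-width records, a record index converts to a physical byte offset by multiplying by the record width, so each process can seek directly to its own range.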
Preprocessing-Optimized SAM Format Converter. Main ideas: preprocessing can also optimize the SAM format conversion; such preprocessing can be parallelized because the SAM format is easy to partition. [Figure: diagram with labels "M procs", "N procs", "M", "N" and "target files".]
Experimental Setup. Dataset: whole-genome DNA sequencing of three mouse samples; approximately 125 million sequences providing about 40-fold coverage of the genome; in the SAM/BAM format. Cluster: 8 GB memory; up to 32 8-core machines (256 cores in total).
Performance of SAM Format Converter. Input: 100 GB SAM data. Output: BED, BEDGRAPH and FASTA. [Figure: speedup vs. number of cores (8, 16, 32, 64, 128); series: BED, BEDGRAPH, FASTA.]
Performance of BAM Format Converter. Input: 117 GB BAM data. Output: BED, BEDGRAPH and FASTA. [Figure: speedup vs. number of cores (8, 16, 32, 64, 128); series: BED, BEDGRAPH, FASTA.]
SAM Format Converter Comparison: Preprocessing-Optimized vs. Original. Input: 15.7 GB BAM data. Output: BED, BEDGRAPH and FASTA. [Figure: speedup vs. number of cores (8, 16, 32, 64, 128); series: BED_P, BED, BEDGRAPH_P, BEDGRAPH, FASTA_P, FASTA.]
Conclusion. In the NGS analysis pipeline, the overall latency cannot be reduced unless all sequential bottlenecks are removed. This is the first framework that can easily support parallel sequence format conversion in a distributed environment: SAM format converter; BAM format converter; preprocessing-optimized SAM format converter.