Processing NGS Data with Hadoop-BAM and SeqPig

Keijo Heljanko 1, André Schumacher 1,2, Ridvan Döngelci 1, Luca Pireddu 3, Matti Niemenmaa 1, Aleksi Kallio 4, Eija Korpelainen 4, and Gianluigi Zanetti 3

1 Helsinki Institute for Information Technology HIIT and Department of Computer Science, Aalto University (firstname.lastname@aalto.fi)
2 International Computer Science Institute, Berkeley, CA, USA
3 CRS4 Center for Advanced Studies, Research and Development, Italy
4 CSC IT Center for Science

2 Next Generation Sequencing and Big Data
- The amount of NGS data worldwide is predicted to double every 5 months
- This growth is much faster than:
  - Moore's law for the growth rate of computing (historically, transistor counts have doubled every [...] months)
  - Kryder's law for the growth of storage capacity (historically doubling approx. every 13 months)
  - Butters' law for the growth in optical communications bandwidth (historically doubling approx. every 9 months)
- Without increased expenditure in distributed computing methods, genomics research will hit computational limits
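The doubling times quoted on this slide can be turned into yearly growth factors with a few lines of arithmetic. The sketch below is only an illustration: the function names are made up here, and it uses just the 5, 13, and 9 month figures given above. Doubling every 5 months works out to a bit more than fivefold growth per year, while doubling every 13 months is slightly less than a doubling per year.

```python
def annual_growth_factor(doubling_months: float) -> float:
    """Growth factor over 12 months for a quantity that doubles every `doubling_months`."""
    return 2.0 ** (12.0 / doubling_months)


def relative_gap(years: float, data_doubling_months: float,
                 resource_doubling_months: float) -> float:
    """Factor by which the data outgrows a resource after `years` years."""
    months = 12.0 * years
    return 2.0 ** (months / data_doubling_months - months / resource_doubling_months)


if __name__ == "__main__":
    # Doubling times (in months) as quoted on the slide.
    trends = [
        ("NGS data", 5),
        ("storage capacity (Kryder's law)", 13),
        ("optical bandwidth (Butters' law)", 9),
    ]
    for name, months in trends:
        print(f"{name}: x{annual_growth_factor(months):.2f} per year")
    # How far NGS data pulls ahead of storage capacity within three years.
    print(f"data vs. storage after 3 years: x{relative_gap(3, 5, 13):.1f}")
```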

3 No Processor Clock Speed Increases Ahead
Herb Sutter: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb's Journal, 30(3), March 2005 (updated graph in August 2009).

4 Implications of the End of Free Lunch
- The clock speeds of microprocessors are not going to improve much in the foreseeable future
- The efficiency gains in single-threaded performance are going to be only moderate
- The number of transistors in a microprocessor is still growing at a high rate
- One of the main uses of transistors has been to increase the number of computing cores the processor has
- The number of cores in a low-end workstation (such as those employed in large-scale datacenters) is going to keep on steadily growing
- Programming models need to change to efficiently exploit all the available concurrency: scalability to a high number of cores/processors will need to be a major focus

5 Tape is Dead, Disk is Tape, RAM locality is King
Trends of RAM, SSD, and HDD prices. From: H. Plattner and A. Zeier: In-Memory Data Management: An Inflection Point for Enterprise Applications

6 Tape is Dead, Disk is Tape, RAM locality is King
- RAM (and SSDs) are radically faster than HDDs: one should use RAM/SSDs whenever possible
- RAM is roughly the same price as HDDs were a decade earlier
- Workloads that were viable with hard disks a decade ago are now viable in RAM
- One should only use hard-disk-based storage for datasets that are not yet economically viable in RAM (or SSD)
- In-memory distributed filesystems such as Tachyon are needed for temp files!
- The Big Data applications (HDD-based massive storage) should consist of applications that were not economically feasible a decade ago using HDDs

7 Hadoop - Linux of Big Data
- Hadoop = Open Source Distributed Operating System Distribution for Big Data
- Based on Google architecture design
- Cheap commodity hardware for storage
- Fault-tolerant distributed filesystems: HDFS, Tachyon
- Batch processing systems: Hadoop MapReduce, Apache Hive, and Apache Pig (HDD); Apache Spark (RAM)
- Parallel SQL implementations for analytics: Apache Hive, Cloudera Impala, Apache Shark, Facebook Presto
- Fault-tolerant distributed database: HBase
- Distributed machine learning libraries, text indexing & search, etc.
- Project Web page:
- Hadoop MapReduce is just one example application on top of the Hadoop Open Source distribution!

8 Commercial Hadoop Support
- Cloudera: Probably the largest Hadoop distributor, partially owned by Intel (740 million USD investment for 18% share). Available from:
- Hortonworks: Yahoo! spin-off from their large Hadoop development team:
- MapR: A rewrite of much of Apache Hadoop in C++, including a new filesystem. API-compatible with Apache Hadoop.

9 Hadoop-BAM
- A library to interface NGS data formats with both Hadoop and Spark
- Includes tools for, e.g., sorting of reads, as needed when merging the results of parallel read aligners
- Supported file formats: BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF
- Some file formats, like BAM, are notoriously badly designed for parallel processing
- Released in Dec 2010, currently at version 7.0 of Hadoop-BAM:
- Downloads of the library
- Niemenmaa, M., Kallio, A., Schumacher, A., Klemelä, P., Korpelainen, E., and Heljanko, K.: Hadoop-BAM: Directly Manipulating Next Generation Sequencing Data in the Cloud. Bioinformatics 28(6): [...]
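One reason BAM is awkward to process in parallel is that, per the SAM/BAM specification, a BAM file is a sequence of independently compressed BGZF blocks, so a worker that is handed an arbitrary byte offset first has to locate the next block boundary. The sketch below illustrates that search on synthetic data. It is not Hadoop-BAM's code: the function names are made up here, it assumes the "BC" extra subfield comes first in the gzip header (the usual layout), and it only finds block boundaries. Locating BAM record boundaries inside a block needs further checks that are not shown.

```python
import struct
import zlib

BGZF_MAGIC = b"\x1f\x8b\x08\x04"  # gzip ID1, ID2, CM=deflate, FLG=FEXTRA


def make_bgzf_block(payload: bytes) -> bytes:
    """Build one BGZF block around `payload` (test helper for this sketch)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate stream
    cdata = comp.compress(payload) + comp.flush()
    bsize = len(cdata) + 25                   # total block size minus 1
    header = BGZF_MAGIC + struct.pack("<IBBH", 0, 0, 255, 6)  # MTIME, XFL, OS, XLEN
    extra = b"BC" + struct.pack("<HH", 2, bsize)              # SI1 SI2, SLEN, BSIZE
    trailer = struct.pack("<II", zlib.crc32(payload) & 0xFFFFFFFF, len(payload))
    return header + extra + cdata + trailer


def find_block_start(data: bytes, offset: int) -> int:
    """Return the first plausible BGZF block start at or after `offset`, or -1."""
    pos = data.find(BGZF_MAGIC, offset)
    while pos != -1:
        # Assumption: the "BC" subfield directly follows the 12-byte gzip header.
        if pos + 18 <= len(data) and data[pos + 12:pos + 14] == b"BC":
            bsize = struct.unpack_from("<H", data, pos + 16)[0]
            nxt = pos + bsize + 1
            # Guard against magic bytes occurring by chance inside compressed
            # data: the candidate must be followed by another block or by EOF.
            if nxt == len(data) or data.startswith(BGZF_MAGIC, nxt):
                return pos
        pos = data.find(BGZF_MAGIC, pos + 1)
    return -1


def read_block(data: bytes, pos: int) -> bytes:
    """Decompress the BGZF block starting at `pos`."""
    bsize = struct.unpack_from("<H", data, pos + 16)[0]
    cdata = data[pos + 18:pos + bsize + 1 - 8]  # strip header and CRC32/ISIZE
    return zlib.decompress(cdata, -15)


if __name__ == "__main__":
    blocks = [make_bgzf_block(b"ACGT" * 50), make_bgzf_block(b"TTGA" * 80)]
    data = b"".join(blocks)
    start = find_block_start(data, 1)  # pretend a split begins mid-block
    print("next block at byte", start, "payload length", len(read_block(data, start)))
```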

10 [Figure: mean speedup versus number of workers for two tasks, "50 GB sorted" and "50 GB summarized for B=2,4,8,16, [...]". Plotted series: Ideal, Input file import, Sorting or Summarizing, Output file export, Total elapsed.]

11 SeqPig
- Parallel scripting language for NGS data sets based on the Apache Pig language
- Compiles into Java, executed by Hadoop MapReduce
- SQL-like functionality with helper functions for NGS data: filtering data, computing aggregate statistics, doing joins
- Supported file formats: BAM, SAM, FASTQ, QSEQ, and FASTA
- Schumacher, A., Pireddu, L., Niemenmaa, M., Kallio, A., Korpelainen, E., Zanetti, G., and Heljanko, K.: SeqPig: Simple and scalable scripting for large sequencing data sets in Hadoop. Bioinformatics 30(1): [...] (dx.doi.org/ /bioinformatics/btt601)
- See also supplement: suppl/2013/10/17/btt601.dc1/supplement.pdf

12 SeqPig Use Case Examples
Automatically parallelizing Pig example scripts for:
- File format conversion
- Filtering out unmapped reads and PCR or optical duplicates
- Filtering out reads with low mapping quality
- Filtering by regions (samtools syntax)
- Sorting BAM files
- Computing read coverage
- Computing base frequencies (counts) for each reference coordinate
- Pileup
- Collecting read-mapping-quality statistics
- Collecting per-base statistics of reads
- ...
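The filtering use cases above boil down to simple per-read predicates, which is what makes them easy to parallelize. The following is a minimal single-machine sketch in Python, not a SeqPig script: it assumes SAM-formatted text input, and the names `Read`, `keep_read`, and `filter_sam` as well as the default threshold of 20 are illustrative. The flag bits 0x4 (unmapped) and 0x400 (PCR or optical duplicate) are taken from the SAM specification.

```python
from typing import Iterable, Iterator, NamedTuple

FLAG_UNMAPPED = 0x4     # SAM flag bit: segment unmapped
FLAG_DUPLICATE = 0x400  # SAM flag bit: PCR or optical duplicate


class Read(NamedTuple):
    name: str
    flag: int
    ref: str
    pos: int
    mapq: int


def parse_sam_line(line: str) -> Read:
    """Parse the first five mandatory SAM fields of one alignment line."""
    f = line.rstrip("\n").split("\t")
    return Read(f[0], int(f[1]), f[2], int(f[3]), int(f[4]))


def keep_read(read: Read, min_mapq: int = 20) -> bool:
    """True for mapped, non-duplicate reads with sufficient mapping quality."""
    if read.flag & (FLAG_UNMAPPED | FLAG_DUPLICATE):
        return False
    return read.mapq >= min_mapq


def filter_sam(lines: Iterable[str], min_mapq: int = 20) -> Iterator[Read]:
    """Yield the reads that pass the filter, skipping '@' header lines."""
    for line in lines:
        if line.startswith("@"):
            continue
        read = parse_sam_line(line)
        if keep_read(read, min_mapq):
            yield read


if __name__ == "__main__":
    sam = [
        "@HD\tVN:1.6\tSO:coordinate",
        "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",        # unmapped
        "r3\t1024\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",  # duplicate
        "r4\t16\tchr1\t200\t5\t4M\t*\t0\t0\tACGT\tIIII",     # low mapping quality
    ]
    print([r.name for r in filter_sam(sam)])
```

Because each read is judged independently, the same predicate could be applied to every input split in parallel without any coordination between workers.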

13 Scalability of SeqPig
[Plot: speedup versus FastQC as a function of the number of worker nodes. Series: avg readqual, read length, basequal stats, GC contents, all at once.]
Figure: Scalability of SeqPig vs sequential FastQC. Computing statistics on a 61.4 GB input file with a Hadoop cluster of up to 63 computers

14 SeqPig Benefits and Drawbacks
Benefits:
- Automatic parallelization of data processing scripts
- Easy-to-learn scripting language with the full power of MapReduce
- Most scripts are at most tens of lines of code vs. hundreds to thousands of lines of Java
- Also allows calling back user-defined functions written in Java/Python
- Implements SQL-like functionality
Drawbacks:
- MapReduce has a 10+ second startup delay: not suitable for interactive use
- A specialized language instead of a standard like SQL
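To make the MapReduce model behind such scripts concrete, here is a toy, single-process sketch of read coverage expressed as a map step, a shuffle (group by key), and a reduce step. It is an illustration only, not SeqPig or Hadoop code: all names are made up here, and reads are simplified to (reference, 1-based start, aligned length) tuples, ignoring CIGAR strings.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Simplified read: (reference name, 1-based start, aligned length).
Read = Tuple[str, int, int]
Key = Tuple[str, int]  # (reference name, position)


def map_read(read: Read) -> Iterator[Tuple[Key, int]]:
    """Map step: emit ((ref, pos), 1) for every reference position a read covers."""
    ref, start, length = read
    for pos in range(start, start + length):
        yield (ref, pos), 1


def shuffle(pairs: Iterable[Tuple[Key, int]]) -> Dict[Key, List[int]]:
    """Shuffle step: group all emitted values by key."""
    groups: Dict[Key, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: Key, values: List[int]) -> int:
    """Reduce step: the coverage at one position is the sum of its values."""
    return sum(values)


def coverage(reads: Iterable[Read]) -> Dict[Key, int]:
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for read in reads for pair in map_read(read))
    return {key: reduce_counts(key, values) for key, values in shuffle(pairs).items()}


if __name__ == "__main__":
    reads = [("chr1", 1, 3), ("chr1", 2, 3), ("chr2", 10, 2)]
    for (ref, pos), depth in sorted(coverage(reads).items()):
        print(ref, pos, depth)
```

In a real cluster the map calls run on different input splits and the reduce calls on different key ranges, which is the part a high-level scripting language arranges automatically.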

15 Apache Spark
- Apache Spark is a fast and general-purpose in-memory cluster-computing system.
- Spark can cache datasets and has a much more flexible DAG execution scheme.
- More suitable than Hadoop for iterative algorithms.
- Can be up to 100 times faster than Apache Hadoop.
- Runs standalone, on YARN, Mesos, and EC2.
- Can utilize HDFS, HBase, Cassandra, and any Hadoop source (including Hadoop-BAM).

16 Spork
- Can run existing Pig scripts with minimal change.
- Compiles Pig scripts to Spark jobs instead of Hadoop jobs.
- Extends Pig CACHE operator.
- Open source but in alpha stage.
- Codebase:
- Some open issues:

17 Pig vs Spork Execution Path

18 SeqSpork
- Can run existing SeqPig scripts with minimal change.
- Extends Spork to read genomics file types.
- Extends Spork with some genomics functionality.
- Tries to optimize Spark parameters according to the needs of the genomics domain.
- Compiles into Apache Spark jobs.
- Currently up to 2 times faster than SeqPig.
- Codebase:

19 SeqSpork Advantages & Disadvantages
Advantages:
- More flexible and faster than SeqPig.
- Provides caching functionality.
- No separate MapReduce jobs, thus no temp files between MapReduce jobs.
- Data processing is done mostly in memory.
Disadvantages:
- Alpha-quality software and somewhat unstable.
- Still imperative like SeqPig, not declarative like SQL.
- Start-up delay still exists for interactive jobs.

20 Future of SeqSpork
- Increase software quality and stability.
- Address some performance issues in group and join operations.
- Implement more genomic functionality.
- Parallel variant detection needs to be integrated with the pipeline.
- Use data sources other than HDFS, such as object storage systems or in-memory filesystems.

21 Moving from BAM files to SQL Data Warehouse
A proper data warehouse system:
- Can very efficiently evaluate parallel queries over petabytes of data
- Allows for efficient compression and indexing, including indexing several BAM files in a single table
- Allows riding on the Hadoop+Spark software investments
Many new analytics SQL implementations on top of Hadoop are designed for handling petabyte-class datasets:
- Apache Hive
- Cloudera Impala
- Spark SQL
- Presto from Facebook

22 Hadoop-BAM Integrated with Hive
- We have interfaced Hive to Hadoop-BAM, allowing Hive to directly read BAM files
- Hive can then convert the NGS data into columnar formats such as RCFile or Parquet for improved compression and query performance
- Using RCFile or Parquet storage also allows the contents of BAM files to be queried by Spark SQL (and Shark), Impala, and Presto
- See details: Master's Thesis: Matti Niemenmaa: Analysing sequencing data in Hadoop: The road to interactivity via SQL,
- Hive interfacing code available on request

23 Initial Hive, Shark, and Impala SQL Benchmarks
- Benchmarks on SQL queries on BAM file contents, including quality control, joining with BED file, etc.
- RCFile columnar format used (currently BAM reading only supported by Hive), frameworks as of end of 2013: Hive 0.11, Shark 0.7, Impala [...]
- Hive was stable but never faster than 10 seconds, making it unsuitable for interactive use
- The new frameworks do really well on small queries; on large data sets and node counts Hive catches up
- Both Shark and Impala had some stability and/or correctness issues at that time
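The kind of query benchmarked here, joining read records against BED intervals with a quality filter, can be sketched as follows. SQLite (from the Python standard library) is used purely as a stand-in so the example is runnable on one machine. It is not Hive, Shark, or Impala, and the table layout, column names, and sample rows are all assumptions made for illustration. Both tables use 0-based, half-open coordinates to keep the overlap test simple.

```python
import sqlite3

# Hypothetical schema: one row per aligned read, one row per BED interval.
SCHEMA = """
CREATE TABLE reads (name TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER, mapq INTEGER);
CREATE TABLE bed   (chrom TEXT, start_pos INTEGER, end_pos INTEGER, label TEXT);
"""

# Count well-mapped reads overlapping each BED interval.
QUERY = """
SELECT b.label, COUNT(*) AS n_reads
FROM reads AS r
JOIN bed AS b
  ON r.chrom = b.chrom
 AND r.start_pos < b.end_pos   -- half-open interval overlap test
 AND b.start_pos < r.end_pos
WHERE r.mapq >= ?
GROUP BY b.label
ORDER BY b.label
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few made-up rows."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany("INSERT INTO reads VALUES (?, ?, ?, ?, ?)", [
        ("r1", "chr1", 100, 150, 60),
        ("r2", "chr1", 140, 190, 10),   # low mapping quality
        ("r3", "chr1", 480, 530, 40),
        ("r4", "chr2", 10, 60, 50),
        ("r5", "chr1", 900, 950, 60),   # overlaps no interval
        ("r6", "chr1", 130, 180, 35),
    ])
    con.executemany("INSERT INTO bed VALUES (?, ?, ?, ?)", [
        ("chr1", 120, 200, "geneA"),
        ("chr1", 500, 600, "geneB"),
        ("chr2", 0, 100, "geneC"),
    ])
    return con


def reads_per_interval(con: sqlite3.Connection, min_mapq: int = 30):
    """Return (label, read count) pairs for reads with mapq >= min_mapq."""
    return con.execute(QUERY, (min_mapq,)).fetchall()


if __name__ == "__main__":
    for label, n_reads in reads_per_interval(build_demo_db()):
        print(label, n_reads)
```

Because the query is plain SQL, the same statement could in principle be handed to a parallel engine once the reads are stored in a table, which is the motivation for the warehouse approach.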

24 Hive, Shark, Impala Benchmarks (Linear Scale)

25 Hive, Shark, Impala Benchmarks (Log Scale)

26 Future plans
- Porting all our tools over to Spark from Hadoop
- Parallel variant detection using black-box variant callers needs to be integrated with the pipeline (work in progress)
- Releasing a tool with SeqPig-like functionality on a parallel SQL Data Warehouse, exploiting all available query engines: Spark SQL, Impala, Presto, Hive
- Using standard columnar data formats like Parquet and RCFile for storage of NGS data
- Interactive ad-hoc queries of Big Data in Genomics
- In-memory filesystem and object storage integration
- Parallel cloud datastore (HBase) for parallelized database functionality
- We are open to working with you on Parallel Next Generation Data Processing on Hadoop and Spark Ecosystems