Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1

1 Department of Information and Computer Science, School of Science, Aalto University (firstname.lastname@aalto.fi)
2 CSC IT Center for Science (firstname.lastname@csc.fi)
Next Generation Sequencing
- The amount of NGS data worldwide is predicted to grow very rapidly
- The datasets are large: the 1000 Genomes Project (http://www.1000genomes.org) is a freely available 50 TB+ data set of human genomes (as of June 2010)
- Without scalable cloud computing and storage methods, genomics research will hit computing infrastructure and/or budget limits
2/17
Google MapReduce
- A scalable batch processing framework developed at Google for Big Data: computing the Web index
- Uses commodity PC hardware to create a massively parallel storage subsystem
- Each node is usually a Linux PC with 4-12 hard disks; data is by default stored on three nodes for very high availability
- Adding a new server improves not only CPU capacity but also adds storage capacity and bandwidth
3/17
Google MapReduce (cnt.)
- Much cheaper per terabyte than traditional SAN storage
- The MapReduce framework takes care of all issues related to parallelization, synchronization, load balancing, and fault tolerance: all these details are hidden from the programmer
4/17
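The map/shuffle/reduce dataflow hidden by the framework can be illustrated with a toy single-machine model (a sketch in Python, not the Hadoop API; all names here are illustrative):

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Minimal single-machine model of the MapReduce dataflow:
    map each input record to (key, value) pairs, group the pairs
    by key (the 'shuffle'), then reduce each group independently."""
    groups = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):   # map phase
            groups[key].append(value)
    # Reduce phase: each key's group could run on a different worker.
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count, the canonical MapReduce example:
def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    return sum(counts)

result = map_reduce(["a b a", "b c"], mapper, reducer)
# result == {"a": 2, "b": 2, "c": 1}
```

In the real framework the grouping step is a distributed sort over the network, and the per-key reduce calls run in parallel on many machines.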
MapReduce Diagram
[Figure: MapReduce dataflow. The user program forks a master and workers; the master assigns map and reduce tasks. Map workers read input splits and write intermediate files to local disks; reduce workers read these remotely and write the output files. Phases shown: input files, map phase, intermediate files (on local disks), reduce phase, output files.]
Figure: J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004
5/17
Apache Hadoop
- An open-source implementation of the MapReduce framework, heavily used by e.g. Yahoo! and Facebook
- "Moving computation is cheaper than moving data": ship code to data, not data to code
- Map and reduce workers are also storage nodes for the underlying distributed filesystem: job allocation is first tried on a node holding a copy of the data, and if that fails, on a node in the same rack (to maximize network bandwidth)
- Project Web page: http://hadoop.apache.org/
6/17
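The locality preference above (data-local, then rack-local, then anywhere) can be sketched as a simple selection function. This is an illustrative model, not Hadoop's actual scheduler; all names are hypothetical:

```python
def assign_task(split_replicas, racks, free_workers):
    """Sketch of locality-aware task assignment in the Hadoop style:
    prefer a free worker that stores a replica of the input split,
    then one in the same rack as a replica, then any free worker.
    split_replicas: set of nodes storing the split's data
    racks: mapping from node name to rack id
    free_workers: list of currently idle workers"""
    # 1. Data-local: a free worker that already holds the data.
    for w in free_workers:
        if w in split_replicas:
            return w, "data-local"
    # 2. Rack-local: a free worker in the same rack as some replica.
    replica_racks = {racks[n] for n in split_replicas}
    for w in free_workers:
        if racks.get(w) in replica_racks:
            return w, "rack-local"
    # 3. Fall back to any free worker (data crosses rack switches).
    return (free_workers[0], "remote") if free_workers else (None, None)

racks = {"n1": "r1", "n2": "r1", "n3": "r2"}
# Replica on n1; n1 is busy, n2 is free in the same rack -> rack-local.
print(assign_task({"n1"}, racks, ["n2", "n3"]))  # ('n2', 'rack-local')
```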
Apache Hadoop (cnt.)
- Builds reliable systems out of unreliable commodity hardware by replicating data and, in case of node failures, automatically rerunning computations
- Tuned for large files (gigabytes of data)
- Designed to handle very large (1 PB+) data sets
- Designed for streaming data access in batch processing: high bandwidth rather than low latency
- For scalability, NOT a POSIX filesystem: not all applications port to this storage system without modifications
- A good match for upcoming NGS data storage needs
7/17
Two Large Hadoop Installations
- Yahoo! (2009): 4000 nodes, 16 PB raw disk, 64 TB RAM, 32K cores
- Facebook (2010): 2000 nodes, 21 PB storage, 64 TB RAM, 22.4K cores
A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, H. Liu: Data warehousing and analytics infrastructure at Facebook. SIGMOD Conference 2010: 1013-1020. http://doi.acm.org/10.1145/1807167.1807278
8/17
Hadoop-BAM
- Open-source library that enables Hadoop to natively process BAM (Binary Alignment/Map) files in parallel
- Based on parallelizing parts of Picard, the Java implementation of samtools, on top of Hadoop
- As a proof of concept, has so far been used for sorting and visualizing BAM files (coverage summaries) using up to 180 cores in parallel
- Can easily be extended to other BAM file processing in parallel: many avenues for future research
9/17
Hadoop-BAM (cnt.)
Technical challenges:
- The BAM file needs to be split into 64 MB parts that can be processed independently in parallel by different machines
- This is difficult as the BAM file format is compressed: a compressed block boundary must be found first
- Furthermore, compression block boundaries are not aligned with record boundaries: we find the correct record boundaries by speculatively trying to parse the file starting at different offsets until a correct offset is found
- The BAM format is quite difficult, but not impossible, to work with in parallel
- http://sourceforge.net/projects/hadoop-bam/
10/17
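The first step, finding a compressed block boundary inside a split, can be sketched as follows. BAM files are BGZF-compressed: a series of gzip blocks whose headers carry a "BC" extra subfield with the block size (per the SAM/BAM specification). This is an illustrative scan, not Hadoop-BAM's actual code, and a real implementation would additionally verify a candidate by decompressing it and speculatively parsing BAM records:

```python
import struct

# A BGZF block starts with a gzip header whose FEXTRA flag is set
# (bytes 1f 8b 08 04) and whose extra field contains the 'BC'
# subfield holding the compressed block size.
BGZF_MAGIC = b"\x1f\x8b\x08\x04"

def find_bgzf_block(buf, start=0):
    """Scan forward from `start` for a plausible BGZF block boundary
    in `buf`; return its offset, or -1 if none is found."""
    i = buf.find(BGZF_MAGIC, start)
    while i != -1:
        # Fixed gzip header is 12 bytes: magic(4), MTIME(4), XFL, OS,
        # then XLEN(2) at offset 10, followed by the extra subfields.
        if i + 18 <= len(buf):
            xlen = struct.unpack_from("<H", buf, i + 10)[0]
            # 'BC' subfield: SI1='B', SI2='C', SLEN=2, BSIZE(2).
            if xlen >= 6 and buf[i + 12:i + 14] == b"BC":
                return i
        i = buf.find(BGZF_MAGIC, i + 1)
    return -1
```

Each Hadoop worker can run such a scan from the start of its own 64 MB split, so no coordination between workers is needed to agree on block boundaries.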
CSC Genome Browser
- Bioinformatics is the largest customer group of CSC (in user numbers)
- Interactive browsing of read coverage summaries in a BAM file, with Google Earth -style zooming in and out
- Single BAM files can be 100 GB+, with interactive visualization at different zoom levels
- Preprocessing with a new library, Hadoop-BAM, to compute summary data for the higher zoom levels
- Benchmarked to scale up to 15 x 12 = 180 cores running in parallel for the visualization
11/17
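The summary data behind the coarser zoom levels amounts to aggregating per-base read depth into fixed-size blocks. A minimal sketch of this precomputation (illustrative only; the real pipeline does it as a Hadoop job over BAM records):

```python
def coverage(reads, genome_length):
    """Per-base read depth from (start, end) half-open read intervals."""
    depth = [0] * genome_length
    for start, end in reads:
        for pos in range(start, min(end, genome_length)):
            depth[pos] += 1
    return depth

def summarize(depth, block_size):
    """Aggregate per-base depth into fixed-size summary blocks (mean
    depth per block): the kind of data a genome browser draws when
    zoomed out too far to show individual bases."""
    return [sum(depth[i:i + block_size]) / block_size
            for i in range(0, len(depth), block_size)]

depth = coverage([(0, 4), (2, 6)], 8)    # [1, 1, 2, 2, 1, 1, 0, 0]
# One summary per zoom level, computed in a single pass over the depths:
summaries = {b: summarize(depth, b) for b in (2, 4, 8)}
# e.g. summarize(depth, 4) == [1.5, 0.5]
```

Computing several block sizes at once over the same input matches the "five summary-block sizes at once" runs in the experiments below.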
Genome Browser GUI 12/17
Read aggregation: problem (cnt.) 13/17
Triton Cloud Testbed
- Cloud computing testbed at Aalto University
- 112 AMD Opteron 2.6 GHz compute nodes with 12 cores and 32-64 GB memory each, totalling 1344 cores
- InfiniBand and 1 Gbit Ethernet networks
- 30 TB+ local disk
- 40 TB+ fileserver work space
14/17
Experiments
- One 50 GB (compressed) BAM input file from the 1000 Genomes Project
- Run on the Triton cluster with 1-15 compute nodes of 12 cores each
- Four repetitions for each worker-node count
- Two types of runs: sorting according to starting position ("sorted") and read aggregation using five summary-block sizes at once ("summarized")
15/17
Mean speedup
[Figure: two plots of mean speedup vs. number of workers (1, 2, 4, 8, 15): the 50 GB sorted run and the 50 GB summarized run (B = 2, 4, 8, 16, 32). Each plot shows the ideal speedup line and curves for input file import, sorting/summarizing, output file export, and total elapsed time.]
16/17
Conclusions
- Parallel processing does not improve runtimes if data transfer in and out of a single fileserver does not scale
- The large cloud players use commodity PCs plus distributed, scalable, fault-tolerant filesystems to increase storage performance and keep prices down
- The NGS storage and processing requirements are a good match for MapReduce and its open-source Apache Hadoop implementation
- Integrating alignment tools, filtering of BAM files, and parallel ad hoc database queries are on our TODO list
- Version 3.0 of Hadoop-BAM is available: http://sourceforge.net/projects/hadoop-bam/
- We are looking for challenges in parallelizing NGS data processing tasks
17/17