
White Paper

HADOOP IN THE LIFE SCIENCES: An Introduction

Abstract

This introductory white paper reviews the Apache Hadoop™ technology, its components MapReduce and the Hadoop Distributed File System (HDFS), and its adoption in the Life Sciences, with an example in Genomics data analysis.

December 2012

Copyright 2012 EMC Corporation. All Rights Reserved.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice. The information in this publication is provided "as is." EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

Part number H10574.1

Table of Contents

Audience
Executive Summary
Hadoop: an Introduction
Genomics example: CrossBow
Enterprise-Class Hadoop on EMC Isilon
Conclusion
References

Audience

This white paper introduces the new data processing and analysis paradigm, Hadoop™, within the context of its usage in the Life Sciences, specifically Genomics Sequencing. It is intended for audiences with a basic knowledge of storage and computing technology and a rudimentary understanding of DNA sequencing and the bioinformatics analysis associated with it.

Executive Summary

Life Sciences data will soon reach the ExaByte (10^18 bytes, EB) scale. This is Big Data: as a reference point, all words ever spoken by all human beings, when transcribed, amount to about 5 EB. A recent article titled "Will Computers Crash Genomics?" [1] points to exponential growth of the total genomics sequencing market capacity, as outlined in Figure 1 below: 10 Tera base-pairs (10^12 bp per Tbp) per day, with an astounding 5x year-on-year growth rate (500%). The human genome is approximately 3 billion base pairs long, a base pair (bp) comprising nucleotide bases in G-C or A-T pairs.

Figure 1: Genomics Growth

Each base-pair represents a total of about 100 bytes (of raw, analyzed and interpreted data). The genomics market capacity in 2010, in storage terms (from Fig. 1), was therefore about 200 PetaBytes (PB), growing to about 1 ExaByte (EB) by late 2012. This deluge is overwhelming the technologies attempting to handle Big Data in the life sciences, and proteomics (the study of proteins) and imaging data are in the early stages of a similar exponential rise. It is not just the volume of the data, but also its velocity and variability, that make this a challenge requiring scale-out technologies: technologies that grow simply and painlessly as data center and business needs grow. Within the past year, one computing and storage framework has matured into a contender to handle this tsunami of Big Data: Hadoop.

Life Sciences workflows require a High Performance Computing (HPC) infrastructure to process and analyze the data to determine the variations in the genome, and storage at the proper scale to retain this data. With Next Generation Sequencing (NGS) workflows generating up to 2 TeraBytes (TB) of data per run per week per sequencer, not including the raw images, the need for scale-out storage that integrates easily with HPC is a line-item requirement. EMC Isilon has provided the scale-out storage for nearly all the workflows of all the DNA sequencer instrument manufacturers in the market today, at more than 150 customers. Since 2008, the EMC Isilon OneFS storage platform has built a Life Sciences installed base of more than 65 PetaBytes (PB).
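As a back-of-the-envelope check on these figures (a sketch using only the roughly 100 bytes per base pair quoted above), the sequencing rate in Figure 1 translates to about a PetaByte of new data per day:

\[
10\ \text{Tbp/day} \times 100\ \tfrac{\text{bytes}}{\text{bp}} \;=\; 10^{13}\ \tfrac{\text{bp}}{\text{day}} \times 10^{2}\ \tfrac{\text{bytes}}{\text{bp}} \;=\; 10^{15}\ \tfrac{\text{bytes}}{\text{day}} \;=\; 1\ \text{PB/day}
\]

At roughly a PetaByte per day, the jump from about 200 PB in 2010 to about 1 EB by late 2012 is simple arithmetic rather than hyperbole.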

As genomics has very large, semi-structured, file-based data and is modeled on post-process streaming data access, with I/O patterns that can be parallelized, it is ideally suited for Hadoop. Hadoop consists of two main components, a file system and a compute framework: the Hadoop Distributed File System (HDFS) and MapReduce respectively. The Hadoop ecosystem consists of many open source tools, as shown in Figure 2 below:

Figure 2: Hadoop Components

To make Hadoop storage scale-out and truly distributed, the EMC Isilon OneFS file system features connectivity to the Hadoop Distributed File System (HDFS) just like any other shared file system protocol: NFS, CIFS or SMB [3]. This allows for co-location of the data storage with its compute nodes, using the standard higher-level Java application programming interface (API) to build MapReduce jobs.

Hadoop: an Introduction

Hadoop was created by Doug Cutting of the Apache Lucene project [4], initially as the Nutch Distributed File System (NDFS), inspired in 2004 by the Google File System data infrastructure and the MapReduce [5] application layer. Hadoop is an Apache Foundation project comprising a MapReduce layer for data analysis and a Hadoop Distributed File System (HDFS) layer, written in the Java programming language, to distribute and scale the MapReduce data. The Hadoop MapReduce framework runs on the compute cluster using the data stored on the HDFS. MapReduce "jobs" provide key/value based processing in a highly parallelized fashion. Since the data is distributed over the cluster, a MapReduce job can be split up to run many parallel processes over the data stored on the cluster. The Map parts of MapReduce run only on the data they can see, that is, the data blocks on the particular machine each is running on. The Reduce brings together the output from the Maps. The result is a system that provides a highly parallel batch processing capability. The system scales well: adding more hardware increases its storage capability or decreases the time a MapReduce job takes to run. The partitioning of the storage and compute framework into master and worker node types is outlined in Figure 3 below:

Figure 3: Hadoop Cluster

Hadoop is a Write Once Read Many (WORM) system with no random writes, which makes it faster than HPC and storage integrated separately. The life sciences have been at the forefront of the technology adoption curve: one of the earliest use cases of the Sun GridEngine [6] HPC was the DNA sequence comparison BLAST [16] search. Standard Hadoop interfaces are available via Java, C, FUSE and WebDAV [7]. The R (statistical language) Hadoop interface, RHIPE [8], is also popular in the life sciences community. The HDFS layer has a Name Node, the controller, which provides data locality, and uses a share-nothing architecture, a scheme of distributed, independent nodes [7]. From a platform perspective, the OneFS HDFS interface is compatible with Apache Hadoop, EMC GreenPlum [3] and Cloudera. In a traditional Hadoop implementation, the HDFS Name Node is a single point of failure, since it is the sole keeper of all the metadata for all the data that lives in the filesystem; the OneFS HDFS interface resolves this by distributing the name node data [3]. HDFS creates 3x replicas for redundancy; OneFS drastically reduces the need for a 3x copy. A good example of the MapReduce key-value pair process, analyzing the count of specific words across documents [9], is shown in Figure 4 below:

Figure 4: Hadoop example: word count across documents

Hadoop is not suited for low-latency, in-process use cases like real-time, spectral or video analysis, or for large numbers of small files (<8KB). When small files have to be used, the Hadoop Archive (HAR) can be used to bundle them for processing.

Life sciences organizations have been among Hadoop's earliest adopters. Following the publication of the first Apache Hadoop project [10] in January 2008, the first large-scale MapReduce project was initiated by the Broad Institute, resulting in the comprehensive Genome Analysis Tool Kit (GATK) [11]. The Hadoop CrossBow project [12] from Johns Hopkins University came soon after. Other projects are Cloud-based: they include CloudBurst, Contrail, Myrna and CloudBLAST [13]. An interesting implementation is the NERSC (Department of Energy) Flash-based Hadoop cluster within the Magellan Science Cloud [14].
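For readers who want to see what the word-count flow of Figure 4 looks like against the standard higher-level Java API mentioned earlier, below is the stock word-count job from the Apache Hadoop MapReduce tutorial, lightly commented. The input and output HDFS paths are supplied on the command line; nothing here is specific to this paper.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split local to this node.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word gathered from all the mappers.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```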

Genomics example: CrossBow

Figure 5: Crossbow example: SNP calls across DNA fragments

The Hadoop word count across documents example in Fig. 4 can be extended to DNA sequencing: count single-base changes across millions of short DNA fragments and across hundreds of samples. A Single Nucleotide Polymorphism (SNP) occurs when one nucleotide (A, T, C or G) varies in the DNA sequence of members of the same biological species. Next Generation Sequencers (NGS) like the Illumina HiSeq can produce on the order of 200 Giga base-pairs in a single one-week run at 60x human genome coverage, meaning that each base was present in an average of 60 reads. The larger the coverage, the more statistically significant the result. This data requires specialized software algorithms called short read aligners. CrossBow [12] is a combination of several algorithms that provide SNP calling and short read alignment, which are common tasks in NGS. Figure 5 outlines the steps necessary to process genome data to look for SNPs. The Map-Sort-Reduce process is ideally suited to a Hadoop framework. The cluster shown in Figure 5 is a traditional N-node Hadoop cluster.

1. The Map step is the short read alignment algorithm, Bowtie (based on the Burrows-Wheeler Transform, BWT). Multiple instances of Bowtie run in parallel in Hadoop. The input tuples (ordered lists of elements) are the sequence reads and the output tuples are the alignments of the short reads.

2. The Sort step apportions the alignments according to a primary key (the genome partition) and sorts based on a secondary key (the offset within that partition). The data here are the sorted alignments.

3. The Reduce step calls SNPs for each reference genome partition. Many parallel instances of the algorithm SOAPsnp (Short Oligonucleotide Analysis Package for SNP) run in the Hadoop cluster. Input tuples are the sorted alignments for a partition and the output tuples are SNP calls. Results are stored via HDFS, then archived in SOAPsnp format.

Enterprise-Class Hadoop on EMC Isilon

As the previous examples demonstrate, the data and analysis scalability required for Genomics is ideally suited for Hadoop. EMC Isilon's OneFS distributes the Hadoop Name Node to provide high availability and load balancing, thereby eliminating the single point of failure. The Isilon NAS storage solution provides a highly efficient single file system/single volume, scalable up to 20 PB. Data can be staged from other protocols to HDFS using OneFS as a staging gateway. EMC Isilon provides Enterprise-grade data services to the Hadoop infrastructure via SnapshotIQ and SyncIQ for advanced backup and disaster recovery capabilities. The equation for Hadoop scalability can be represented as:

Big(Data + Analytics) = Hadoop + EMC Isilon

These advantages are summarized in Fig. 6 below:

Figure 6: Hadoop advantages with EMC Isilon

When combined with the EMC GreenPlum Analytics appliance and solution [17], the Hadoop architecture becomes a complete Enterprise package.
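Returning to the Crossbow flow above: the essence of the Map-Sort-Reduce pattern is that alignments are grouped by a primary key (the genome partition) and ordered by a secondary key (the offset) before a per-partition reducer runs. The toy, in-memory Java sketch below illustrates just that shape; the Alignment type and the stub "SNP caller" are illustrative stand-ins, not Crossbow's actual code.

```java
import java.util.*;
import java.util.stream.*;

// Toy, in-memory illustration of the Crossbow-style Map-Sort-Reduce flow.
public class MapSortReduceSketch {

    // An "alignment": the genome partition it falls in, its offset, and the base observed.
    record Alignment(int partition, long offset, char base) {}

    public static void main(String[] args) {
        // Map: in Crossbow this is Bowtie aligning short reads; here, canned output.
        List<Alignment> mapped = List.of(
                new Alignment(0, 1500, 'A'), new Alignment(0, 1500, 'G'),
                new Alignment(0, 1500, 'G'), new Alignment(1, 220, 'T'));

        // Sort: group by primary key (partition), order by secondary key (offset).
        Map<Integer, List<Alignment>> grouped = mapped.stream()
                .sorted(Comparator.comparingLong(Alignment::offset))
                .collect(Collectors.groupingBy(Alignment::partition,
                        TreeMap::new, Collectors.toList()));

        // Reduce: one SNP-calling pass per partition (SOAPsnp in Crossbow; a stub here).
        grouped.forEach((partition, alignments) ->
                System.out.println("partition " + partition + " -> SNPs at offsets "
                        + callSnps(alignments)));
    }

    // Stub "SNP caller": report offsets where more than one distinct base was seen.
    static List<Long> callSnps(List<Alignment> alignments) {
        Map<Long, Set<Character>> basesAtOffset = new TreeMap<>();
        for (Alignment a : alignments)
            basesAtOffset.computeIfAbsent(a.offset(), k -> new HashSet<>()).add(a.base());
        return basesAtOffset.entrySet().stream()
                .filter(e -> e.getValue().size() > 1)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```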

Conclusion

What began as an internal project at Google in 2004 has now matured into a scalable framework for two computing paradigms that are particularly suited to the life sciences: parallelization and distribution. The post-processing streaming data patterns for text strings, clustering and sorting (the core process patterns in the life sciences) are ideal workflows for Hadoop. The CrossBow example discussed above aligned Illumina NGS reads for SNP calling over a 35x coverage of the human genome in under 3 hours using a 40-node Hadoop cluster, an order of magnitude better than traditional HPC technology for parallel processes.

Even though Hadoop implementations on Public Cloud instances are popular, several issues have led most large institutions to maintain their own data repositories internally: large data transfers from on-premise storage to the Cloud; data regulations and security; data availability; data redundancy; and HPC throughput. This is especially true as genome sequencing moves into the clinic for diagnostic testing. The convergence of these issues is evidenced by the mirroring of the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI) on the DNAnexus SRA Cloud [15]; the DNAnexus business model is slowly evolving into a full offsite data and analysis model via Hadoop. The Hybrid Cloud model (a source data mirror between Private Cloud and Community Cloud) with Hadoop as a Service (HaaS) is the current state of the art.

Hadoop's advantages far outweigh its challenges; it is ready to become the life sciences analytics framework of the future. The EMC Isilon platform is bringing that future to you today.

References

1. Pennisi, E., "Will Computers Crash Genomics?", Science, 11 February 2011: Vol. 331, no. 6018, pp. 666-668.
2. Editorial, "Challenges and Opportunities," Science, 11 February 2011: Vol. 331, no. 6018, p. 692.
3. "Hadoop on EMC Isilon Scale-Out NAS," EMC White Paper, Part Number h10528.
4. Cafarella, M. and Cutting, D., "Building Nutch: Open Source Search," ACM Queue, vol. 2, no. 2, April 2004.
5. Dean, J. and Ghemawat, S., "MapReduce: Simplified Data Processing on Large Clusters," OSDI conference proceedings, 2004.
6. Vasiliu, B., "Integrating BLAST with Sun GridEngine," July 2003, http://developers.sun.com/solaris/articles/integrating_blast.html, last visited Dec 2011.
7. White, Tom, Hadoop: The Definitive Guide, 2nd Edition, O'Reilly, Oct 2010.
8. RHIPE: http://ml.stat.purdue.edu/rhipe/, last visited Dec 2011.

9. MapReduce example: http://markusklems.files.wordpress.com/2008/07/mapreduce.png, last visited Dec 2011.
10. "Hadoop wins Terabyte sort benchmark," Apr 2008 and Apr 2009, http://sortbenchmark.org/yahoohadoop.pdf, http://sortbenchmark.org/yahoo2009.pdf, last accessed Dec 2011.
11. McKenna, A., et al., "The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data," Genome Research, 20:1297-1303, July 2010.
12. Langmead, B., Schatz, M.C., et al., "Human SNPs from short reads in hours using cloud computing," Poster Presentation, WABI, Sep 2009, http://www.cbcb.umd.edu/~mschatz/posters/crossbow_wabi_sept2009.pdf, last accessed Dec 2011.
13. Taylor, R.C., "An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics," BMC Bioinformatics 2010, 11(Suppl 12):S1, http://www.biomedcentral.com/1471-2105/11/s12/s1, last accessed Dec 2011.
14. Ramakrishnan, L., "Evaluating Cloud Computing for HPC Applications," DoE NERSC, http://www.nersc.gov/assets/events/magellannersclunchtalk.pdf, last accessed Dec 2011.
15. "DNAnexus to mirror SRA database in Google Cloud," BioIT World, p. 41, http://www.bio-itworld.com/uploadedfiles/bio-IT_World/1111BITW_download.pdf, last visited Dec 2011.
16. Altschul, S.F., et al., "Basic local alignment search tool," J Mol Biol, 215(3): 403-410, October 1990.
17. Lockner, J., "EMC's Enterprise Hadoop Solution: Isilon Scale-out NAS and GreenPlum HD," White Paper, The Enterprise Strategy Group, Inc. (ESG), February 2012.