SHAHZAD AND AHSAN (2014), FUUAST J. BIOL., 4(1): 113-118

COMPARISON OF BIG DATA ANALYTICS TOOLS: A BIOINFORMATICS CASE STUDY

MUHAMMAD SHAHZAD 1 AND KAMRAN AHSAN 2
1 Department of Computer Science, PAF Karachi Institute of Economics & Technology, Karachi, Pakistan
2 Federal Urdu University of Arts Science & Technology, Karachi, Pakistan

Abstract

Due to the exponential growth of genomic sequence data in the biological sciences, accessing and analyzing these data has become an immense challenge for biomedical practitioners and researchers. We are awash with data today in many application areas, particularly in healthcare systems. These collections are growing not only in volume, toward the Exabyte and Zettabyte scale, but also in variety and velocity. Scientists continuously face decisions about what to store and what to discard, and about how to analyze data and extract information within optimal time. In the life and biological sciences, next-generation sequencing methods have driven the generation of biological Big Data. The diversity of omics information, including genomes, transcriptomes and epigenomes, will take us to the Yottabyte (10^24) data scale within the coming few years. These radical changes in the generation and acquisition of Big Data open challenges for capturing, curating, storing, searching, sharing, transferring, visualizing and analyzing information. As big data solutions, MPI running on HPC clusters and MapReduce running on Hadoop clusters have been used. This paper investigates three recent bioinformatics tools built on Hadoop. We adopt a comparative methodology focusing on two functions: mapping and dealing with sequence files. For the mapping function, we give insight into the alignment of reads with respect to a reference genome sequence. For dealing with sequence files, the supported sequence file formats are discussed.
This research will facilitate researchers in the biological sciences in choosing appropriate bioinformatics Hadoop tools for their scientific investigations.

Introduction

Over the past two decades, magnificent advances have been made in biomedicine and biology, covering topics from human disease to the evolution of microbial ecology. During this period, and especially in the past few years, sequencing costs have been reduced tremendously, even faster than Moore's Law, while sequencing capacity has increased at a faster pace. These advances have facilitated landmark accomplishments in genomics: genome sequences are now available for the main model organisms used in biological research, including roundworms, yeast, bacteria, mouse and fruit flies, as well as for humans. To understand more about human variation and the basis of many human diseases, DNA sequencing must be applied to large-scale data sets. For decades, bioinformaticians have been working on biological problems, especially handling and providing solutions for huge amounts of omics data covering genomes, transcriptomes and epigenomes. Fast sequencing and large-scale genome data analysis have been desired for a long time. To date, many solutions have been introduced and used, including the message passing interface (MPI) and graphics processing units (GPUs), which are pioneers of parallel and high-performance computing. More recently, cloud computing has emerged as a parallel distributed framework (Vaquero, 2008). Compared with previous distributed computing architectures, cloud computing offers traditional services at different levels: software as a service, platform as a service and infrastructure as a service.
Next-generation sequencing (Editorial, 2010) (Baker, 2010) poses one of the most recent challenges in computational biology, requiring high-performance computing for bioinformatics big data analysis. Although MPI has the capability to cope with these challenges, researchers from biological backgrounds face programming complications due to its low-level programming model. Hadoop (Hadoop, 2014) (White, 2009) is an open-source software framework written in Java and developed under the Apache Software Foundation, inspired by Google's MapReduce and distributed file system. It is used to easily write applications that process petabyte-scale data sets in parallel on thousands of nodes of commodity Linux hardware. MapReduce (Dean, 2008) is the core of Hadoop and performs two distinct, separate tasks. The first action, map, takes a set of data as input and translates it into another form of data set, generating a tuple (key/value pair) for each individual element. The second action, reduce, takes the map's output as input and combines the generated tuples into a smaller set of tuples. As the name MapReduce suggests, the map action is always executed before the reduce action.
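The map/shuffle/reduce flow described above can be simulated in plain Python. The sketch below is illustrative only (the function names are ours, not Hadoop's API); the toy job counts k-mers across a set of DNA reads, with each mapper emitting (k-mer, 1) pairs and each reducer summing the counts for one key:

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the user-defined map function to every input record,
    emitting (key, value) tuples."""
    for record in records:
        yield from mapper(record)

def shuffle_phase(pairs):
    """Group all values by key, as Hadoop's shuffle/sort step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the user-defined reduce function to each key's value list."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Toy job: count 3-mers in a set of DNA reads.
def kmer_mapper(read, k=3):
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1

def count_reducer(key, values):
    return sum(values)

reads = ["ACGTAC", "GTACGT"]
counts = reduce_phase(shuffle_phase(map_phase(reads, kmer_mapper)), count_reducer)
# counts["GTA"] == 2: the 3-mer GTA occurs once in each read
```

In real Hadoop each phase runs on different cluster nodes and the shuffle moves data over the network; the data flow, however, is exactly this map, then group-by-key, then reduce.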
This paper explores different Hadoop tools used for bioinformatics big data, offering researchers alternatives for different scenarios. The rest of this paper is organized as follows. In Section 2, we discuss each of our selected Hadoop bioinformatics tools individually, including their advantages and applicability. Section 3 describes the gap analysis, which also recommends a suitable tool for each particular scenario. In Section 4, we present our conclusion and plans for future work.

Data Handling for Solving Biological Problems:

Due to the complex nature of biological data, it must be handled with big data tools so that more accurate, authoritative, reliable and explicit conclusions can be drawn. Sze et al. (2012) extended the study of microbiomes in the human lung characteristic of asthma and chronic obstructive pulmonary disease (COPD) using surface brushings and bronchoalveolar lavage fluid. Lung tissue samples were used to obtain DNA from 8 non-smokers, 8 smokers with COPD, 8 cystic fibrosis patients and 8 very severe COPD patients. Polymerase chain reaction amplification of 16S rRNA gene fragments was used for bacterial community analysis. Terminal restriction fragment length polymorphism analysis and pyrotag sequencing were used to assess bacterial community composition, together with quantitative polymerase chain reaction (qPCR). This study delivers the quantification and identification of bacteria in human lung tissues. The Kruskal-Wallis non-parametric test, together with Bonferroni correction for multiple comparisons, was used to investigate the qPCR and phyla analyses of the pyrotag sequencing data. The terminal restriction fragment length polymorphism method was used to produce fingerprints of an unidentified microbial community.
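The Bonferroni correction used in the study above is simple to state: each raw p-value is multiplied by the number of comparisons, and the result is capped at 1. A minimal illustrative sketch (our own, not the study's code):

```python
def bonferroni(p_values):
    """Bonferroni correction for multiple comparisons: multiply each raw
    p-value by the number of tests, capping the adjusted value at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Three raw p-values from three comparisons:
raw = [0.004, 0.03, 0.20]
adjusted = bonferroni(raw)   # roughly [0.012, 0.09, 0.6]
```

The correction controls the family-wise error rate at the cost of statistical power, which is why it is usually paired, as here, with a small number of planned comparisons.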
The sample size and data analyzed in the above-mentioned study are within the computational capacity of available systems, but if the same data were collected from individuals from different parts of the world, of different genders, from different climatic regions and from different working environments, then the complexity, heterogeneity and highly dynamic schema of millions of tissue samples would be beyond the computational power of available high-performance systems. Such data would require parallel computation across multiple machines. In another example, whole-genome and whole-exome sequencing to detect genomic alterations in breast cancers was performed by Banerji (2012). The dRanger algorithm was applied to the 22 cases with paired tumor/normal whole-genome sequencing data. A total of 108 samples (17 with both whole-genome and whole-exome sequencing, 86 with whole-exome sequencing only, and 5 with whole-genome sequencing only) passed the initial qualification metrics and library construction and successfully achieved the desired sequencing depth (about 100x for whole-exome sequencing; about 30x for whole-genome sequencing) on the Illumina sequencing platform. Additional mutation calling was done to identify germline mutation events. Although, compared to other studies, they took a large sample size, if the size were increased further, such as into the thousands, the results would be more accurate and more consistent, since the standard error shrinks as the sample size grows (in inverse proportion to its square root). Moreover, if the required data were gathered and stored permanently over the years, they could be very rich, fruitful and productive. The collected data would help researchers gain deeper insight into breast carcinoma, and would serve not only current patients but also healthy individuals seeking to sidestep the hazard of the disease.
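The sample-size argument above rests on the standard error of the mean, which shrinks in proportion to the square root of the sample size. A small illustrative sketch using only the Python standard library:

```python
import math
import statistics

def standard_error(sample):
    """Standard error of the mean: sample standard deviation divided by
    the square root of the sample size, so it shrinks as n grows."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

small = [98, 102, 101, 99]
large = small * 25   # same spread of values, 100 observations instead of 4
# standard_error(large) is several times smaller than standard_error(small)
```

This is why results from thousands of samples are expected to be more consistent than those from dozens, exactly as argued for the breast-cancer study above.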
But problems will arise if these data are collected on a large scale. Piecing together and analyzing genomic data requires the computational power of high-performance machines with parallel execution of algorithms for proper sequence alignment. As a third example, Pragman et al. (2012) presented the largest analysis of COPD (chronic obstructive pulmonary disease) patients at the time, comprising thirty-two samples of 454 pyrosequencing from ambulatory patients with moderate or severe disease. In total, about 460,000 sequences were obtained, averaging 14,451 sequences per sample after trimming and quality-control filtering. Principal coordinate analysis (PCoA) using Fast UniFrac was done to confirm the similarities between samples, which also validates the results. They concluded that age is a more crucial factor than COPD severity for increased microbial diversity. If the same process were performed with an even larger data set, the results would be more accurate and worthwhile for the healthcare community.

Comparison of Bioinformatics Hadoop Tools:

CloudBurst: CloudBurst was proposed in 2009 to map short reads against a reference genome using the MapReduce framework (Schatz, 2009). Its use of the seed-and-extend alignment technique increases the efficiency of mapping single-end NGS (next-generation sequencing) data to reference genomes. CloudBurst's alignment algorithm maps reads with any number of mismatches or differences. With Hadoop, it scales linearly with the number of reads and speeds up linearly with the cluster size. These characteristics allow CloudBurst to replace RMAP in a
pipeline data analysis with the same results, but with much greater performance, because of Hadoop (Smith, 2009) (Hadoop, 2014), the open-source implementation of the distributed programming framework MapReduce. In a MapReduce job, the input file(s) are automatically partitioned into pieces, depending on the size of each piece and the desired number of mappers. Each mapper (shown in the figure below as m1 and m2) applies a user-defined function to its portion of the input and generates key/value pairs. The shuffle part of MapReduce builds a list of values associated with each key (shown as k1, k2 and kn). Each reducer (shown as r1 and r2) computes a user-defined function over its subset of the keys and the associated lists of values, creating the set of output files.

Fig. 1. Diagrammatic overview of MapReduce.

This performance of CloudBurst is made possible by the efficiency of Hadoop. The distributed programming framework makes it straightforward to create highly scalable applications, automatically providing parallel and distributed computing in many respects. Hadoop's capability to deliver high performance, even in the face of enormously large datasets, is a perfect match for many problems in bioinformatics and computational biology. Future work on CloudBurst includes incorporating quality values into the scoring and mapping algorithms and improving support for paired reads. Exploring the integration of CloudBurst into an RNA-seq analysis pipeline, which could also represent gene splice sites, is another important direction. Algorithms without hash tables, such as BWT-based short-read aligners, can also use Hadoop to parallelize execution and use HDFS.
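The seed-and-extend technique that CloudBurst builds on can be sketched in a few lines. This toy version (our own simplification, not CloudBurst's implementation) uses the first bases of each read as an exact-match seed, then extends the full read at every seed hit while counting mismatches:

```python
def seed_and_extend(read, reference, seed_len=8, max_mismatches=2):
    """Minimal seed-and-extend aligner: find exact occurrences of the
    read's first seed_len bases in the reference, then extend the full
    read at each seed hit, keeping hits within the mismatch budget.
    Returns a list of (position, mismatch_count) tuples."""
    seed = read[:seed_len]
    hits = []
    start = reference.find(seed)
    while start != -1:
        window = reference[start:start + len(read)]
        if len(window) == len(read):
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.append((start, mismatches))
        start = reference.find(seed, start + 1)
    return hits

ref = "TTACGTACGTTGCAACGTACGTAGCA"
read = "ACGTACGTAG"
hits = seed_and_extend(read, ref, seed_len=8, max_mismatches=2)
# hits == [(2, 1), (14, 0)]: one alignment with a single mismatch,
# one exact alignment
```

CloudBurst applies the same idea in parallel: mappers emit seeds from the reads and the reference as keys, the shuffle brings matching seeds together, and reducers perform the extension, which is why its running time scales linearly with the number of reads.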
Hadoop-BAM: Hadoop-BAM is a library that uses the Hadoop distributed computing framework for scalable handling of aligned NGS data. It works as a middle layer between analysis applications and BAM files (Li, 2009) processed with Hadoop. A Binary Alignment/Map (BAM) file (.bam) is the binary form of a SAM file; a Sequence Alignment/Map (SAM) file (.sam) is a tab-delimited text file containing sequence alignment data (Integrative Genomics Viewer, 2014). Aligned data are usually stored in an indexed BAM file, which is later used for further analysis such as SNP genotyping. With increasing data sizes, many MapReduce frameworks have been used to provide useful components for next-generation sequencing data-analysis pipelines. However, they lack the capability for efficient parallel processing of BAM files. Hadoop-BAM (Niemenmaa, 2012) solves this data-access issue by providing an API for implementing map and reduce functions over BAM files. It is built on top of the Picard SAM JDK, so tools based on the Picard API can easily be converted to support large-scale distributed processing. Hadoop-BAM requires Hadoop version 0.20.2 (Hadoop, 2014), currently the only version supported by the application, and uses Picard version 1.27 (Picard, 2014) to read and write BAM files. Fig. 2 shows preprocessed data in the Chipster genome browser (Kallio, 2011), giving an interactive high-level overview of the coverage profile.
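Because SAM is a plain tab-delimited layout, its eleven mandatory fields are easy to read directly. The minimal parser below is our own illustration of the format (it is not Hadoop-BAM's or Picard's API, and it ignores header lines and optional tags):

```python
def parse_sam_line(line):
    """Parse one alignment line of a SAM file into its 11 mandatory
    tab-delimited fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR,
    RNEXT, PNEXT, TLEN, SEQ, QUAL)."""
    names = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
             "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(names, fields[:11]))
    record["FLAG"] = int(record["FLAG"])    # bitwise alignment flags
    record["POS"] = int(record["POS"])      # 1-based leftmost position
    record["MAPQ"] = int(record["MAPQ"])    # mapping quality
    return record

# A hypothetical alignment line for illustration:
line = "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAG\tIIIIIIIIII"
rec = parse_sam_line(line)
# rec["RNAME"] == "chr1", rec["POS"] == 100
```

BAM stores exactly these fields in a compressed binary encoding, which is what makes naive record-by-record splitting hard and motivates Hadoop-BAM's API for parallel access.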
Fig. 2. Coverage profile shown in the Chipster genome browser.

Myrna: Myrna is our third selected Hadoop-based tool, used for computing differential gene expression from large RNA-seq datasets. It integrates short-read alignment with normalization, interval calculations, statistical modeling and aggregation in a single computational pipeline. After the alignment step, Myrna computes coverage for genes, coding regions and exons, and computes differential expression using either non-parametric permutation tests or parametric tests. Finally, results are returned in the form of per-gene P-values and Q-values, a raw count table, a reads-per-kilobase-of-exon-model-per-million-mapped-reads (RPKM) table, coverage plots for significant genes that can be directly incorporated into publications (Fig. 3), and other diagnostic plots (Langmead, 2010).

Fig. 3. Myrna pipeline.
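The RPKM measure in Myrna's output normalizes a gene's raw read count by both transcript length and library size, making values comparable across genes and samples. A minimal sketch of the standard formula:

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads Per Kilobase of exon model per Million mapped reads:
    RPKM = read_count / (exon length in kb * mapped reads in millions)."""
    kilobases = exon_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)

# A gene with 500 reads over a 2,000 bp exon model,
# in a library of 10 million mapped reads:
value = rpkm(500, 2_000, 10_000_000)   # 25.0
```

Without this double normalization, longer genes and more deeply sequenced samples would always look more highly expressed.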
Myrna is built on the Hadoop/MapReduce model. There are three implementation and execution modes for Myrna. The first is to run Myrna on the cloud using Amazon Elastic MapReduce; the second is on a Hadoop cluster; and the third is on a single computer without Hadoop. Cloud mode requires an appropriate user account and authorization roles to be set up beforehand, but no extra or special software installation: before Myrna runs, the appropriate software is automatically installed on the EC2 instances. In Hadoop mode, Myrna needs a functioning Hadoop cluster, with R/Bioconductor and Bowtie installed on every node. In singleton mode, R/Bioconductor and Bowtie must be installed on the computer, but Hadoop is not required.

Discussion

Biomedical researchers and bioinformaticians are continuously confronted with large-scale data. These biological datasets are generated not only by next-generation sequencing but also by dedicated databases and resources. For processing and analyzing these data sets, among the available distributed computing approaches, the Hadoop framework has recently been found to be the most appropriate technology for handling biological data (Almeida, 2012). The tools discussed above have their own pros and cons; each can be very useful in one respect and insufficient in another. CloudBurst, with Hadoop's features, has the ability to map next-generation data to the human genome and other reference genomes. It can map reads with any number of mismatches because of its seed-and-extend alignment technique, and its running time is linear in the number of reads mapped.
With Hadoop, highly scalable application development and high-performance deliverables in computational biology are achievable, but there are still gaps in CloudBurst that need to be closed. Several other tools, such as SOAP, MAQ, RMAP and ZOOM, have richer mapping/alignment features than CloudBurst. There is also a need to incorporate quality-scoring algorithms and mapping quality values; with these features, paired-read mapping would become more effective. Another severe limitation of CloudBurst is its lack of support for FASTQ-format input and paired-end reads. Hadoop-BAM is a library that integrates BAM (Binary Alignment/Map) files with analysis applications. It solves the issue of accessing compact data such as BAM files by providing an API for implementing map and reduce functions on the Hadoop framework, and it can be used for any kind of analysis based on BAM files. For simpler and more efficient access, Hadoop-BAM needs a user-friendly, well-organized query language for the Hadoop environment so that one can work on BAM files conveniently. The Hadoop-BAM community is also planning SAMtools-like functionality. Another strict limitation of Hadoop-BAM is that it cannot work with BED files. With these value-added features, Hadoop-BAM could offer more scalability and avoid moving data in and out of Hadoop between analysis steps. Myrna is used for analyzing differential gene expression using cloud computing. Its efficient computational pipeline allows sophisticated statistical methods. It has been tested on publicly available RNA-Seq data comprising billions of reads, and the tests showed excellent results in finding differential expression across large sets of genes.
However, the current version of Myrna still lacks alignment of reads across exon junctions. Losing junction reads can weaken the expression signal, a serious gap in Myrna that needs to be addressed. Another gap is that Myrna trims all input reads to a fixed length before the alignment step. The three modes of running Myrna (on the cloud with Amazon, on a Hadoop cluster, and in singleton mode) also have their own strengths and weaknesses. Although cloud-based Myrna has greater scalability than the other two modes, transferring data to the cloud is quite inconvenient for users; users with local resources may therefore prefer Hadoop mode on a local cluster or singleton mode. Coping with these issues is another challenge for the Myrna community, and adopting these capabilities would improve Myrna's performance and characteristics. In the short term, all the traditional resources of bioinformatics, ranging from tools to databases to literature, will need to be restructured, redesigned or re-studied to support Hadoop technology and services. This kind of comparative study of bioinformatics Hadoop tools will certainly assist biologists in conducting more concrete investigations, and will also guide the attention of researchers from both computer science and the biological sciences toward high-performance computing.
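The fixed-length trimming limitation noted above is straightforward to picture. This toy sketch (our own illustration, not Myrna's code) trims every read to a common length, by default the length of the shortest read:

```python
def trim_to_fixed_length(reads, length=None):
    """Trim every read down to one fixed length so all reads are
    uniform before alignment; defaults to the shortest read's length."""
    if length is None:
        length = min(len(r) for r in reads)
    return [r[:length] for r in reads]

trimmed = trim_to_fixed_length(["ACGTACGT", "ACGTA", "ACGTACG"])
# all reads trimmed to 5 bases: ["ACGTA", "ACGTA", "ACGTA"]
```

The cost of such trimming is that the discarded tail bases can carry alignment signal, which is part of the weakness discussed above.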
Conclusion

In this paper we have evaluated CloudBurst, Hadoop-BAM and Myrna as modern bioinformatics Hadoop tools in the field of distributed and parallel computing, using a comparative methodology as a test case. This study finds that no system is preferred in all situations; each system has its own strengths and weaknesses.

Acknowledgement

The authors would like to thank all the faculty members of the Department of Computer Science, Federal Urdu University, and the International Center for Chemical and Biological Sciences, HEJ, University of Karachi, for their useful suggestions and comments.

References

Almeida, J.S., Grüneberg, A., Maass, W. and Vinga, S. (2012). Fractal MapReduce decomposition of sequence alignment. Algorithms for Molecular Biology, 7(1): 12.
Baker, M. (2010). Next-generation sequencing: adjusting to data overload. Nature Methods, 7(7): 495-499.
Editorial (2010). Gathering clouds and a sequencing storm: Why cloud computing could broaden community access to next-generation sequencing. Nature Biotechnology, 28(1).
Banerji, S., Cibulskis, K., Rangel-Escareno, C., Brown, K.K., Carter, S.L., Frederick, A.M., ... and Meyerson, M. (2012). Sequence analysis of mutations and translocations across breast cancer subtypes. Nature, 486(7403): 405-409.
Hadoop (2014). [Online] Available from: http://hadoop.apache.org/ [Accessed 28th July 2014].
Integrative Genomics Viewer (2014). [Online] Available from: http://www.broadinstitute.org/igv/bam [Accessed 28th July 2014].
Dean, J. and Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1): 107-113.
Kallio, M.A., Tuimala, J.T., Hupponen, T., Klemelä, P., Gentile, M., Scheinin, I., ... and Korpelainen, E.I. (2011). Chipster: user-friendly analysis software for microarray and other high-throughput data. BMC Genomics, 12(1): 507.
Langmead, B., Hansen, K.D. and Leek, J.T. (2010).
Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biology, 11(8): R83.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., ... and Durbin, R. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16): 2078-2079.
Niemenmaa, M., Kallio, A., Schumacher, A., Klemelä, P., Korpelainen, E. and Heljanko, K. (2012). Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics, 28(6): 876-877.
Picard (2014). [Online] Available from: http://picard.sourceforge.net/ [Accessed 28th July 2014].
Schatz, M.C. (2009). CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11): 1363-1369.
Pragman, A.A., Kim, H.B., Reilly, C.S., Wendt, C. and Isaacson, R.E. (2012). The lung microbiome in moderate and severe chronic obstructive pulmonary disease. PLoS ONE, 7(10): e47305.
Smith, A.D., Chung, W.Y., Hodges, E., Kendall, J., Hannon, G., Hicks, J., ... and Zhang, M.Q. (2009). Updates to the RMAP short-read mapping software. Bioinformatics, 25(21): 2841-2842.
Vaquero, L.M., Rodero-Merino, L., Cáceres, J. and Lindner, M. (2008). A break in the clouds: towards a cloud definition. ACM SIGCOMM Computer Communication Review, 39(1): 50-55.
Sze, M.A., Dimitriu, P.A., Hayashi, S., Elliott, W.M., McDonough, J.E., Gosselink, J.V., ... and Hogg, J.C. (2012). The lung tissue microbiome in chronic obstructive pulmonary disease. American Journal of Respiratory and Critical Care Medicine, 185(10): 1073-1080.
White, T. (2009). Hadoop: The Definitive Guide. Sebastopol: O'Reilly Media.