COMPARISON OF BIG DATA ANALYTICS TOOLS: A BIOINFORMATICS CASE STUDY
SHAHZAD AND AHSAN (2014), FUUAST J. BIOL., 4(1):

MUHAMMAD SHAHZAD 1 AND KAMRAN AHSAN 2
1 Department of Computer Science, PAF Karachi Institute of Economics & Technology, Karachi, Pakistan
2 Federal Urdu University of Arts, Science & Technology, Karachi, Pakistan

Abstract

Due to the exponential growth of genomic sequence data in the biological sciences, accessing and analyzing it has become an immense challenge for biomedical practitioners and researchers. We are awash in data today across many application areas, particularly healthcare systems. These data sets not only reach exabyte and zettabyte volumes but also arrive in many varieties and at high velocities. Scientists continuously face decisions about what to store, what to discard, and how to analyze and extract information within an acceptable time. In the life and biological sciences, next generation sequencing methods are closely bound up with the generation of biological Big Data. This diversity of omics information, including genomes, transcriptomes and epigenomes, will take us to the yottabyte (10^24 bytes) data scale in the coming few years. These radical changes in the generation and acquisition of Big Data pose open challenges for the capture, curation, storage, searching, sharing, transfer, visualization and analysis of information. As Big Data solutions, MPI running on HPC systems and MapReduce running on Hadoop clusters have been used. This paper investigates three recent bioinformatics tools used on Hadoop. We adopt a comparative methodology built around two functions: mapping and dealing with sequence files. For the mapping function, we give insight into the alignment of reads against a reference genome sequence. For dealing with sequence files, we discuss support for different sequence file formats.
This research will help researchers in the biological sciences choose appropriate bioinformatics Hadoop tools for their scientific investigations and findings.

Introduction

Over the past two decades, remarkable advances have been made in biomedicine and biology, covering topics from human disease to the evolution of microbial ecology. During this period, and especially in the past few years, sequencing costs have fallen tremendously, even faster than Moore's Law, while sequencing capacity has increased at a rapid pace. These advances have enabled landmark achievements in genomics, notably genome sequences for the main model organisms used in biological research, including roundworms, yeast, bacteria, mice and fruit flies, as well as for humans. To learn more about human variation and the basis of many human diseases, DNA sequencing needs to be applied to large-scale data sets.

For several decades, bioinformaticians have worked on biological problems, especially on handling the huge volume of omics data covering genomes, transcriptomes and epigenomes. Fast sequencing and analysis of large-scale genome data has long been desired. To date, many solutions have been introduced, including the message passing interface (MPI) and graphics processing units (GPUs), which are pioneers in parallel and high performance computing. One recent advance in parallel distributed frameworks is cloud computing (Vaquero, 2008). Compared with earlier distributed computing architectures, cloud computing offers the traditional services at different levels: software as a service, platform as a service and infrastructure as a service.
Next generation sequencing (Editorial, 2010) (Baker, 2010) is among the most recent challenges for bioinformatics researchers in computational biology, and it calls for high performance computing to analyze bioinformatics big data. Although MPI can cope with these challenges, researchers from a biological background find it complicated to program. Hadoop (Hadoop, 2014) (White, 2009) is an open-source software framework written in Java, developed as an Apache project and modeled on the MapReduce design published by Google. It makes it easier to write applications that process petabyte-scale data sets in parallel on thousands of nodes of commodity Linux hardware. MapReduce (Dean, 2008) is the core of Hadoop and performs two distinct tasks. The first is the map operation, which takes a set of data as input and translates it into another data set in which individual elements are broken down into tuples (key/value pairs). The second is the reduce operation, which takes the map's output as its input and combines those tuples into a smaller set of tuples. As the name MapReduce implies, the map operation is always executed before the reduce operation.
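The map, shuffle and reduce steps described above can be sketched on a single machine. The example below is a minimal illustration, not Hadoop code: it counts k-mers in a few short reads, and the function names (map_phase, shuffle_phase, reduce_phase, run_job) and the choice of k-mer counting as the task are assumptions made for this sketch only.

```python
from collections import defaultdict


def map_phase(read, k=3):
    """Map: emit one (k-mer, 1) pair for every k-mer in a single read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a single count."""
    return key, sum(values)


def run_job(reads, k=3):
    """Run map, then shuffle, then reduce, in that order (hypothetical driver)."""
    pairs = [pair for read in reads for pair in map_phase(read, k)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["ACGTAC", "GTACGT"]))
```

In a real Hadoop job the map and reduce functions run on different nodes and the framework performs the shuffle; here all three steps run in one process purely to show the data flow.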
This paper explores different Hadoop tools used for bioinformatics big data, giving researchers alternatives to choose from in different scenarios. The rest of this paper is organized as follows. In Section 2, we discuss each of our selected Hadoop bioinformatics tools individually, including their advantages and applicability. Section 3 describes the gap analysis and recommends a suitable tool for each particular scenario. In Section 4, we present our conclusion and plans for future work.

Data Handling for Solving Biological Problems:

Because of the complex nature of biological data, it needs to be handled with big data tools so that more accurate, authoritative, reliable and explicit conclusions can be drawn. Sze et al. (2012) extend earlier findings, based on surface brushings and bronchoalveolar lavage fluid, of microbiomes in the human lung characteristic of asthma and chronic obstructive pulmonary disease (COPD). Lung tissue samples were used to obtain DNA from 8 non-smokers, 8 smokers with COPD, 8 patients with cystic fibrosis and 8 patients with very severe COPD. Polymerase chain reaction (PCR) amplification of 16S rRNA gene fragments is used for bacterial community analysis. Terminal restriction fragment length polymorphism (T-RFLP) analysis and pyrotag sequencing, together with quantitative polymerase chain reaction (qPCR), are used to assess bacterial community composition. The study delivers the quantification and identification of bacteria in human lung tissue. The Kruskal-Wallis nonparametric test, together with Bonferroni correction for multiple comparisons, is used to analyze the qPCR data and the phyla found in the pyrotag sequencing data. The T-RFLP method is used to produce fingerprints of an unidentified microbial community.
The sample size and data analyzed in the study above are within the computational capacity of available systems. If the same data were collected from individuals in different parts of the world, of different genders, from different climatic regions and from different working environments, the complexity, heterogeneity and highly dynamic schema of millions of tissue samples would exceed the computational power of available high performance systems. Such data would require parallel computation across multiple machines.

In another example, Banerji et al. (2012) performed whole-genome and whole-exome sequencing to detect genomic alterations in breast cancers. The dRanger algorithm was applied to the 22 cases with paired tumor/normal whole-genome sequencing data. A total of 108 samples (17 with both whole-genome and whole-exome sequencing, 86 with whole-exome sequencing only, and 5 with whole-genome sequencing only) passed initial qualification metrics and library construction and achieved the desired sequencing depth (100x for whole-exome sequencing; 30x for whole-genome sequencing) on the Illumina sequencing platform. Additional mutation calling was performed to identify germline mutation events. Although this sample is large compared with other studies, a larger one, for example thousands of samples, would give more accurate and more consistent results, because the standard error is inversely proportional to the square root of the sample size. Moreover, if the required data were gathered and stored permanently over the years, it would become a rich and productive resource. The collected data would give researchers more insight into breast carcinoma, and would serve not only current patients but also healthy individuals seeking to avoid the disease.
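As a rough illustration of the sample-size argument above, the sketch below assumes the standard error of a mean, which falls with the square root of the sample size. The standard deviation and the larger cohort size are hypothetical values chosen for the example; only the cohort of 108 samples comes from the study discussed above.

```python
import math


def standard_error(sample_sd, n):
    """Standard error of the mean for a sample of size n."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return sample_sd / math.sqrt(n)


if __name__ == "__main__":
    sd = 10.0  # hypothetical standard deviation of some measured quantity
    for n in (108, 1080, 10800):  # 108 is the cohort size; the others are hypothetical
        print(n, round(standard_error(sd, n), 3))
```

Under this assumption, a tenfold increase in sample size shrinks the standard error by a factor of about 3.16, so the gain in precision from ever larger cohorts is real but sublinear.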
A problem arises, however, if this data is collected on a large scale. Piecing together and analyzing genomic data requires high performance machines executing algorithms in parallel to perform sequence alignment properly.

As a third example, Pragman et al. (2012) presented the largest analysis of COPD (chronic obstructive pulmonary disease) patients, comprising 454 pyrosequencing of thirty-two samples from ambulatory patients with moderate or severe disease. In total, 460,000 sequences were obtained, averaging 14,451 sequences per sample after trimming and quality control filtering. Principal coordinate analysis (PCoA) using Fast UniFrac was performed to confirm the similarities between the samples, which also validates the result. They concluded that age is a more important factor than COPD severity in increased microbial diversity. If the same process were performed on an even larger data set, the result would be more accurate and more valuable for the healthcare community.

Comparison of Bioinformatics Hadoop Tools:

CloudBurst:

CloudBurst was introduced in 2009 to map short reads against a reference genome using the MapReduce framework (Schatz, 2009). Its seed-and-extend alignment technique increases the efficiency of mapping single-end next generation sequence (NGS) data to reference genomes. CloudBurst's alignment algorithm can map reads with any number of mismatches or differences. With the inclusion of Hadoop, its running time scales linearly with the number of reads, and it speeds up linearly as the cluster size increases. These characteristics allow CloudBurst to replace RMAP in a data analysis pipeline with the same results but much greater performance, thanks to Hadoop, the open-source implementation of the MapReduce distributed programming framework (Smith, 2009) (Hadoop, 2014).

In a MapReduce job, the input files are automatically partitioned into pieces, depending on the size of each piece and the desired number of mappers. Each mapper (m1 and m2 in Fig. 1) applies a user-defined function to its portion of the input and generates key/value pairs. The shuffle phase builds a list of the values associated with each key (k1, k2 and kn in Fig. 1). Each reducer (r1 and r2 in Fig. 1) evaluates a user-defined function on its subset of the keys and their associated lists of values to create the set of output files.

Fig. 1. Diagrammatic overview of MapReduce.

CloudBurst's high performance is made possible by Hadoop. This distributed programming framework makes it straightforward to create highly scalable applications, since it automatically provides many aspects of parallel and distributed computing. Hadoop's ability to deliver high performance, even on enormously large datasets, is a good match for many kinds of problems in bioinformatics and computational biology. Future work on CloudBurst includes incorporating quality values into the mapping and scoring algorithms and improving support for paired reads. Exploring the integration of CloudBurst into an RNA-seq analysis pipeline, which would also need to model gene splice sites, is another important direction. Algorithms that do not use a hash table, such as BWT-based short-read aligners, can also use Hadoop and the HDFS to parallelize execution.
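The seed-and-extend idea described for CloudBurst can be illustrated with a simplified, single-machine sketch. This is not CloudBurst's implementation: it handles mismatches only (no insertions or deletions), runs without Hadoop, and all function names are assumptions for this example. It relies on the pigeonhole argument that a read aligned with at most k mismatches, when cut into k + 1 non-overlapping seeds, must contain at least one seed that matches the reference exactly.

```python
from collections import defaultdict


def build_seed_index(reference, seed_len):
    """Index every substring of length seed_len in the reference by position."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def map_read(read, reference, max_mismatches):
    """Return sorted (start, mismatches) alignments with at most max_mismatches."""
    seed_len = len(read) // (max_mismatches + 1)
    if seed_len == 0:
        raise ValueError("read is too short for the requested mismatch budget")
    index = build_seed_index(reference, seed_len)
    hits = set()
    for s in range(max_mismatches + 1):
        offset = s * seed_len
        seed = read[offset:offset + seed_len]
        # Seed step: exact lookup of this seed in the reference index.
        for ref_pos in index.get(seed, []):
            start = ref_pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            # Extend step: compare the whole read against the implied window.
            window = reference[start:start + len(read)]
            mismatches = count_mismatches(read, window)
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)


if __name__ == "__main__":
    print(map_read("ACGTAAGC", "GGGACGTTAGCATTT", 1))
```

In a MapReduce setting, the seed lookup would naturally form the map side (emitting seeds as keys) and the extension the reduce side, which is the kind of split the text above describes; the sketch keeps both in one function for brevity.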
Hadoop-BAM:

Hadoop-BAM is a new library that uses the Hadoop distributed computing framework for scalable handling of aligned NGS data. It works as a middle layer that integrates analysis applications with BAM files (Li, 2009) processed using Hadoop. A Binary Alignment/Map file (.bam) is the binary form of a SAM file. A Sequence Alignment/Map (SAM) file (.sam) is a tab-delimited text file that contains sequence alignment data (Integrative Genomics Viewer, 2014). The aligned data is usually stored in an indexed BAM file, which is later used for further analysis such as SNP genotyping. As data sizes have grown, many MapReduce frameworks have been used to provide useful parts of next generation sequencing data analysis pipelines. They do not, however, allow efficient parallel processing of BAM files. Hadoop-BAM (Niemenmaa, 2012) solves this data access problem by providing an API for implementing map-reduce functions over BAM files. It is built on top of the Picard SAM JDK, so tools based on the Picard API can easily be converted to support large-scale distributed processing. Hadoop-BAM requires a specific version of Hadoop (Hadoop, 2014) installed on the system, as currently only this version is supported by the application. Hadoop-BAM uses version 1.27 of Picard (Picard, 2014) to read and write BAM files. Fig. 2 shows preprocessed data in the Chipster genome browser (Kallio, 2011), giving an interactive high-level overview of a coverage profile.
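To make the tab-delimited SAM layout mentioned above concrete, the following sketch parses one alignment line into its eleven mandatory fields. It is plain Python and does not use Hadoop-BAM or Picard; the SamRecord type, the helper names and the example record are assumptions made for illustration.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory fields of a SAM alignment line."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line):
    """Parse one non-header SAM line; optional tag fields are ignored."""
    if line.startswith("@"):
        raise ValueError("header lines start with '@' and are not alignments")
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10],
    )


def is_unmapped(record):
    """Bit 0x4 of the FLAG field marks an unmapped read."""
    return bool(record.flag & 0x4)


if __name__ == "__main__":
    # Hypothetical record: an 8-base read aligned at position 100 of chr1.
    example = "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTTAGC\tIIIIIIII"
    print(parse_sam_line(example))
```

BAM stores the same records in compressed binary blocks, which is why splitting a BAM file across mappers needs library support of the kind Hadoop-BAM provides rather than simple line splitting.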
Fig. 2. Diagrammatic overview of MapReduce.

Myrna:

Myrna is our third selected Hadoop-based tool; it computes differential gene expression from large RNA-seq datasets. It integrates short read alignment with interval calculations, normalization, aggregation and statistical modeling in a single computational pipeline. After alignment, Myrna computes coverage for genes, coding regions and exons, and tests for differential expression using either parametric or non-parametric permutation tests. Results are returned as per-gene P-values and Q-values for differential expression, a raw count table, an RPKM table (reads per kilobase of exon model per million mapped reads), coverage plots for significant genes that can be directly incorporated into publications (Fig. 3), and other diagnostic plots (Langmead, 2010).

Fig. 3. Myrna Pipeline.
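The RPKM measure named above follows directly from its definition: reads per kilobase of exon model per million mapped reads. The sketch below computes it for a single gene; the function name and the example numbers are hypothetical and are not taken from Myrna's code or results.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads.

    read_count: reads assigned to the gene
    exon_length_bp: total exon length of the gene model, in base pairs
    total_mapped_reads: all mapped reads in the sample
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("lengths and read totals must be positive")
    # Dividing by (length / 1e3) and by (total / 1e6) multiplies by 1e9 overall.
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Hypothetical gene: 500 reads on a 2,000 bp exon model, 10 million mapped reads.
    print(rpkm(500, 2000, 10_000_000))
```

The normalization makes expression values comparable across genes of different lengths and samples of different sequencing depths, which is why it appears alongside the raw count table in the outputs listed above.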
Myrna is built on the Hadoop/MapReduce model and can be run in three modes. The first is on the cloud, using Amazon Elastic MapReduce. The second is on a Hadoop cluster. The third is on a single computer, without Hadoop. In Cloud mode, the user needs an appropriate account and authorization roles to set up the system beforehand. No extra or special software installation is required in Cloud mode; before Myrna is run, the appropriate software is automatically installed on the EC2 instances. In Hadoop mode, Myrna needs a functioning Hadoop cluster, with R/Bioconductor and Bowtie installed on every node. In singleton mode, R/Bioconductor and Bowtie must be installed on the computer, but Hadoop is not needed.

Discussion

Biomedical researchers and bioinformaticians are continuously confronted with large-scale data. These biological datasets are generated not only by next generation sequencing but also from specific databases and resources. For processing and analyzing these data sets, the Hadoop framework has recently been found to be the most appropriate of the available distributed computing technologies for handling biological data (Almeida, 2012). The tools discussed above each have their own pros and cons; each can be very useful in one respect and insufficient in another.

CloudBurst, with its Hadoop features, can map next generation sequence data to the human genome and other reference genomes. It can map reads with any number of mismatches because of its seed-and-extend alignment technique. The running time of CloudBurst is linear in the number of reads mapped.
With the inclusion of Hadoop, highly scalable application development and high performance in computational biology are achievable. There are still gaps in CloudBurst, however, that need to be addressed. Many other tools, such as SOAP, MAQ, RMAP and ZOOM, have richer mapping and alignment features than CloudBurst. Quality-aware scoring algorithms and mapping quality values also need to be incorporated. With these features, paired-read mapping would become more effective. Another severe limitation of CloudBurst is that it has no support for FASTQ-format input or paired-end reads.

Hadoop-BAM is a library that provides integration between BAM (Binary Alignment/Map) files and analysis applications. It solves the problem of accessing compact data such as BAM files by providing an API for implementing map-reduce functions on the Hadoop framework. It can be used for any kind of analysis based on BAM files. To make Hadoop-BAM simpler and more efficient to use, a user-friendly and well-organized query language for the Hadoop environment is needed, so that users can work with BAM files conveniently. The Hadoop-BAM community is also planning Samtools-like functionality. Another strict limitation of Hadoop-BAM is that it cannot work with BED files. With these value-added features, Hadoop-BAM could offer more scalability and avoid moving data in and out of Hadoop between analysis steps.

Myrna is used to analyze differential expression between genes using cloud computing. Its efficient computational pipeline allows its statistical methods to be more sophisticated. It has been tested on publicly available RNA-Seq data, analyzing billions of reads simultaneously. The tests showed very good results in finding differential expression across a large set of genes.
The current version of Myrna, however, still cannot align reads across exon junctions. Failing to align junction reads may lose expression signal, which is a serious gap in Myrna that needs to be addressed. Another gap is that Myrna trims all input so that it processes fixed-length reads before alignment. The three modes that Myrna supports (on the cloud with Amazon, on a Hadoop cluster, and singleton mode) also have their own strengths and weaknesses. Although Myrna on the cloud scales better than the other two modes, transferring data to the cloud is quite inconvenient for users. Preference is therefore given to running with local resources, in singleton mode or on a local Hadoop cluster. Coping with these issues is another challenge for the Myrna community. Adopting these capabilities in the future will improve Myrna's performance.

In the short term, all the traditional resources of bioinformatics, from tools to databases to literature, will need to be restructured, redesigned or re-examined to support Hadoop technology and services. A comparative study of bioinformatics Hadoop tools such as this one will help biologists reach more concrete findings in biology, and will also draw the attention of more researchers from both computer science and the biological sciences towards high-performance computing.
Conclusion

In this paper we have evaluated CloudBurst, Hadoop-BAM and Myrna for their use as modern bioinformatics Hadoop tools in the field of distributed and parallel computing. We used a comparative methodology as the test case. This research finds that no system is preferred in all situations; every system has its own strengths and weaknesses.

Acknowledgement

The authors would like to thank all the faculty members of the Department of Computer Science, Federal Urdu University and International Center for Chemical and Biological Sciences, HEJ University of Karachi, for their useful suggestions and comments.

References

Almeida, J.S., Grüneberg, A., Maass, W. and Vinga, S. (2012). Fractal MapReduce decomposition of sequence alignment. Algorithms for Molecular Biology, 7(1): 12.
Baker, M. (2010). Next-generation sequencing: adjusting to data overload. Nature Methods, 7(7).
Banerji, S., Cibulskis, K., Rangel-Escareno, C., Brown, K.K., Carter, S.L., Frederick, A.M., and Meyerson, M. (2012). Sequence analysis of mutations and translocations across breast cancer subtypes. Nature, 486(7403).
Dean, J. and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1).
Editorial (2010). Gathering clouds and a sequencing storm: why cloud computing could broaden community access to next-generation sequencing. Nature Biotechnology, 28(1).
Hadoop (2014). [Online] Available from: [Accessed 28th July 2014].
Integrative Genomics Viewer (2014). [Online] Available from: [Accessed 28th July 2014].
Kallio, M.A., Tuimala, J.T., Hupponen, T., Klemelä, P., Gentile, M., Scheinin, I.,... and Korpelainen, E.I. (2011). Chipster: user-friendly analysis software for microarray and other high-throughput data. BMC Genomics, 12(1): 507.
Langmead, B., Hansen, K.D., and Leek, J.T. (2010). Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biology, 11(8): R83.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N.,... and Durbin, R. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16).
Niemenmaa, M., Kallio, A., Schumacher, A., Klemelä, P., Korpelainen, E., and Heljanko, K. (2012). Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics, 28(6).
Picard (2014). [Online] Available from: [Accessed 28th July 2014].
Pragman, A.A., Kim, H.B., Reilly, C.S., Wendt, C., and Isaacson, R.E. (2012). The lung microbiome in moderate and severe chronic obstructive pulmonary disease. PLoS ONE, 7(10).
Schatz, M.C. (2009). CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11).
Smith, A.D., Chung, W.Y., Hodges, E., Kendall, J., Hannon, G., Hicks, J.,... and Zhang, M.Q. (2009). Updates to the RMAP short-read mapping software. Bioinformatics, 25(21).
Sze, M.A., Dimitriu, P.A., Hayashi, S., Elliott, W.M., McDonough, J.E., Gosselink, J.V.,... and Hogg, J.C. (2012). The lung tissue microbiome in chronic obstructive pulmonary disease. American Journal of Respiratory and Critical Care Medicine, 185(10).
Vaquero, L.M., Rodero-Merino, L., Cáceres, J. and Lindner, M. (2008). A break in the clouds: towards a cloud definition. ACM SIGCOMM Computer Communication Review, 39(1).
White, T. (2009). Hadoop: The Definitive Guide. Sebastopol: O'Reilly Media.
More informationWhite Paper. Version 1.2 May 2015 RAID Incorporated
White Paper Version 1.2 May 2015 RAID Incorporated Introduction The abundance of Big Data, structured, partially-structured and unstructured massive datasets, which are too large to be processed effectively
More informationVersion 5.0 Release Notes
Version 5.0 Release Notes 2011 Gene Codes Corporation Gene Codes Corporation 775 Technology Drive, Ann Arbor, MI 48108 USA 1.800.497.4939 (USA) +1.734.769.7249 (elsewhere) +1.734.769.7074 (fax) www.genecodes.com
More informationComparing Methods for Identifying Transcription Factor Target Genes
Comparing Methods for Identifying Transcription Factor Target Genes Alena van Bömmel (R 3.3.73) Matthew Huska (R 3.3.18) Max Planck Institute for Molecular Genetics Folie 1 Transcriptional Regulation TF
More informationBig Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014
White Paper Big Data Executive Overview WP-BD-10312014-01 By Jafar Shunnar & Dan Raver Page 1 Last Updated 11-10-2014 Table of Contents Section 01 Big Data Facts Page 3-4 Section 02 What is Big Data? Page
More informationRemoving Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data
Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang The Ohio State University HiCOMB 2014 May 19 th, Phoenix, Arizona 1 Outline
More informationUCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production
Page 1 of 6 UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production February 05, 2010 Newsletter: BioInform BioInform - February 5, 2010 By Vivien Marx Scientists at the department
More informationGeneProf and the new GeneProf Web Services
GeneProf and the new GeneProf Web Services Florian Halbritter florian.halbritter@ed.ac.uk Stem Cell Bioinformatics Group (Simon R. Tomlinson) simon.tomlinson@ed.ac.uk December 10, 2012 Florian Halbritter
More informationOutline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
More informationHuman Genome Organization: An Update. Genome Organization: An Update
Human Genome Organization: An Update Genome Organization: An Update Highlights of Human Genome Project Timetable Proposed in 1990 as 3 billion dollar joint venture between DOE and NIH with 15 year completion
More informationEoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille
Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille Journées SUCCES Stéphane Le Crom (UPMC IBENS) stephane.le_crom@upmc.fr Paris November 2013 The Sanger DNA sequencing method Sequencing
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationOpen source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
More informationCloud-Based Big Data Analytics in Bioinformatics: A Review
Cloud-Based Big Data Analytics in Bioinformatics: A Review Cephas MAWERE 1, Kudakwashe ZVAREVASHE 2, Thamari SENGUDZWA 3, Tendai PADENGA 4 1 Harare Institute of Technology, School of Industrial Sciences
More informationShouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center
Computational Challenges in Storage, Analysis and Interpretation of Next-Generation Sequencing Data Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center Next Generation Sequencing
More informationBig Data Challenges in Bioinformatics
Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Pla?orms Jordi Torres Jordi.Torres@bsc.es Talk outline! We talk about Petabyte?
More informationBIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics
BIG DATA & ANALYTICS Transforming the business and driving revenue through big data and analytics Collection, storage and extraction of business value from data generated from a variety of sources are
More informationL1: Introduction to Hadoop
L1: Introduction to Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General
More informationData Mining in the Swamp
WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all
More informationHigh Performance Compu2ng Facility
High Performance Compu2ng Facility Center for Health Informa2cs and Bioinforma2cs Accelera2ng Scien2fic Discovery and Innova2on in Biomedical Research at NYULMC through Advanced Compu2ng Efstra'os Efstathiadis,
More informationAssociate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationScalable Cloud Computing
Keijo Heljanko Department of Information and Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 1/44 Business Drivers of Cloud Computing Large data centers allow for economics
More informationFile S1: Supplementary Information of CloudDOE
File S1: Supplementary Information of CloudDOE Table of Contents 1. Prerequisites of CloudDOE... 2 2. An In-depth Discussion of Deploying a Hadoop Cloud... 2 Prerequisites of deployment... 2 Table S1.
More informationBuilding Bioinformatics Capacity in Africa. Nicky Mulder CBIO Group, UCT
Building Bioinformatics Capacity in Africa Nicky Mulder CBIO Group, UCT Outline What is bioinformatics? Why do we need IT infrastructure? What e-infrastructure does it require? How we are developing this
More informationHealthcare data analytics. Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw
Healthcare data analytics Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw Outline Data Science Enabling technologies Grand goals Issues Google flu trend Privacy Conclusion Analytics
More informationImplement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
More informationHadoop. http://hadoop.apache.org/ Sunday, November 25, 12
Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using
More informationRecognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework
Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM s BSIOTR (W),
More informationNoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
More informationTwister4Azure: Data Analytics in the Cloud
Twister4Azure: Data Analytics in the Cloud Thilina Gunarathne, Xiaoming Gao and Judy Qiu, Indiana University Genome-scale data provided by next generation sequencing (NGS) has made it possible to identify
More informationGene Resequencing with Myrna on Intel Distribution of Hadoop
Gene Resequencing with Myrna on Intel Distribution of Hadoop V1.00 Intel Corporation Authors Abhi Basu Contributors Terry Toy Gaurav Kaul I Page 1 TABLE OF CONTENTS GENE RESEQUENCING WITH MYRNA ON INTEL
More informationAnalysing Large Web Log Files in a Hadoop Distributed Cluster Environment
Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,
More informationHigh Throughput Sequencing Data Analysis using Cloud Computing
High Throughput Sequencing Data Analysis using Cloud Computing Stéphane Le Crom (stephane.le_crom@upmc.fr) LBD - Université Pierre et Marie Curie (UPMC) Institut de Biologie de l École normale supérieure
More informationAccelerate > Converged Storage Infrastructure. DDN Case Study. ddn.com. 2013 DataDirect Networks. All Rights Reserved
DDN Case Study Accelerate > Converged Storage Infrastructure 2013 DataDirect Networks. All Rights Reserved The University of Florida s (ICBR) offers access to cutting-edge technologies designed to enable
More informationBig data in cancer research : DNA sequencing and personalised medicine
Big in cancer research : DNA sequencing and personalised medicine Philippe Hupé Conférence BIGDATA 04/04/2013 1 - Titre de la présentation - nom du département émetteur et/ ou rédacteur - 00/00/2005 Deciphering
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
More informationSchool of Nursing. Presented by Yvette Conley, PhD
Presented by Yvette Conley, PhD What we will cover during this webcast: Briefly discuss the approaches introduced in the paper: Genome Sequencing Genome Wide Association Studies Epigenomics Gene Expression
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationData Analysis & Management of High-throughput Sequencing Data. Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute
Data Analysis & Management of High-throughput Sequencing Data Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute Current Issues Current Issues The QSEQ file Number files per
More informationIntroduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationUsing the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova
Using the Grid for the interactive workflow management in biomedicine Andrea Schenone BIOLAB DIST University of Genova overview background requirements solution case study results background A multilevel
More informationSurfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics
Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Dr. Liangxiu Han Future Networks and Distributed Systems Group (FUNDS) School of Computing, Mathematics and Digital Technology,
More informationThe Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationHadoop Technology for Flow Analysis of the Internet Traffic
Hadoop Technology for Flow Analysis of the Internet Traffic Rakshitha Kiran P PG Scholar, Dept. of C.S, Shree Devi Institute of Technology, Mangalore, Karnataka, India ABSTRACT: Flow analysis of the internet
More informationBIG DATA TECHNOLOGY. Hadoop Ecosystem
BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big
More informationEuro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences
Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences WP11 Data Storage and Analysis Task 11.1 Coordination Deliverable 11.2 Community Needs of
More informationParallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model
More informationInternational Journal of Innovative Research in Computer and Communication Engineering
FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,
More informationCloudDOE: A User-Friendly Tool for Deploying Hadoop Clouds and Analyzing High-Throughput Sequencing Data with MapReduce
CloudDOE: A User-Friendly Tool for Deploying Hadoop Clouds and Analyzing High-Throughput Sequencing Data with MapReduce Wei-Chun Chung 1,2,3, Chien-Chih Chen 1,2, Jan-Ming Ho 1,3, Chung-Yen Lin 1, Wen-Lian
More informationBIG DATA IN BUSINESS ENVIRONMENT
Scientific Bulletin Economic Sciences, Volume 14/ Issue 1 BIG DATA IN BUSINESS ENVIRONMENT Logica BANICA 1, Alina HAGIU 2 1 Faculty of Economics, University of Pitesti, Romania olga.banica@upit.ro 2 Faculty
More informationOpenCB development - A Big Data analytics and visualisation platform for the Omics revolution
OpenCB development - A Big Data analytics and visualisation platform for the Omics revolution Ignacio Medina, Paul Calleja, John Taylor (University of Cambridge, UIS, HPC Service (HPCS)) Abstract The advent
More informationGeneric Log Analyzer Using Hadoop Mapreduce Framework
Generic Log Analyzer Using Hadoop Mapreduce Framework Milind Bhandare 1, Prof. Kuntal Barua 2, Vikas Nagare 3, Dynaneshwar Ekhande 4, Rahul Pawar 5 1 M.Tech(Appeare), 2 Asst. Prof., LNCT, Indore 3 ME,
More informationApplication Development. A Paradigm Shift
Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the
More informationVolume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image
More informationA Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
More informationAn Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
More informationManaging Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges
Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges Prerita Gupta Research Scholar, DAV College, Chandigarh Dr. Harmunish Taneja Department of Computer Science and
More informationCloudflow A Framework for MapReduce Pipeline Development in Biomedical Research
Cloudflow A Framework for MapReduce Pipeline Development in Biomedical Research Lukas Forer 1,2, Enis Afgan 3,4, Hansi Weißensteiner 1,2, Davor Davidović 3, Günther Specht 2, Florian Kronenberg 1, Sebastian
More informationMapReducing a Genomic Sequencing Workflow
MapReducing a Genomic Sequencing Workflow Luca Pireddu CRS4 Pula, CA, Italy luca.pireddu@crs4.it Simone Leo CRS4 Pula, CA, Italy simone.leo@crs4.it Gianluigi Zanetti CRS4 Pula, CA, Italy gianluigi.zanetti@crs4.it
More information