COMPARISON OF BIG DATA ANALYTICS TOOLS: A BIOINFORMATICS CASE STUDY

SHAHZAD AND AHSAN (2014), FUUAST J. BIOL., 4(1):

COMPARISON OF BIG DATA ANALYTICS TOOLS: A BIOINFORMATICS CASE STUDY

MUHAMMAD SHAHZAD 1 AND KAMRAN AHSAN 2

1 Department of Computer Science, PAF Karachi Institute of Economics & Technology, Karachi, Pakistan
2 Federal Urdu University of Arts Science & Technology, Karachi, Pakistan

Abstract

Owing to the exponential growth of genomic sequence data in the biological sciences, accessing and analyzing these data has become an immense challenge for biomedical practitioners and researchers. We are awash in data today in many application areas, particularly in healthcare systems. These data sets are growing not only in volume, toward the exabyte and zettabyte range, but also in variety and velocity. Scientists continuously face decisions about what to store, what to discard, and how to analyze data and extract information within an acceptable time. In the life and biological sciences, next generation sequencing methods have been highly affected by the generation of biological Big Data. This diversity of omics information, including genomes, transcriptomes and epigenomes, will take us to the yottabyte (10^24 bytes) data scale in the coming few years. These radical changes in the generation and acquisition of Big Data pose open challenges for the capture, curation, storage, searching, sharing, transfer, visualization and analysis of information. As big data solutions, MPI running on HPC systems and MapReduce running on Hadoop clusters have been used. This paper investigates three recent bioinformatics tools built on Hadoop. We adopt a comparative methodology focused on two functions: mapping and dealing with sequence files. For mapping, we give insight into the alignment of reads against a reference genome sequence. For dealing with sequence files, we discuss support for different sequence file formats.
This research will help researchers in the biological sciences choose appropriate Hadoop-based bioinformatics tools for their scientific investigations.

Introduction

Over the past two decades, magnificent advances have been made in biomedicine and biology, covering topics from human disease to the evolution of microbial ecology. During this period, and particularly in the past few years, sequencing costs have fallen tremendously, even faster than Moore's Law would predict, while sequencing capacity has grown at an ever faster pace. These advances have enabled landmark achievements in genomics, namely the genome sequences of the main model organisms used in biological research, including roundworms, yeast, bacteria, mouse and fruit flies, as well as that of the human. To learn more about human variation and the basis of many human diseases, it is desirable to apply DNA sequencing to large-scale data sets. For a few decades, bioinformaticians have been working continuously on biological problems, especially handling and providing solutions for huge amounts of omics data covering genomes, transcriptomes and epigenomes. Fast sequencing and computation for large-scale genome data analysis have been desired for a long time. To date, many solutions have been introduced and used, including the message passing interface (MPI) and graphics processing units (GPUs), which are pioneers in parallel and high performance computing. Recently, one of the latest advancements in parallel distributed frameworks is cloud computing (Vaquero, 2008). Compared with previous distributed computing architectures, cloud computing offers all traditional services at different levels: software as a service, platform as a service and infrastructure as a service.
Next generation sequencing (Editorial, 2010) (Baker, 2010) is among the most recent challenges in computational biology for bioinformatics researchers, as it needs high performance computing for bioinformatics big data analysis. Although MPI can cope with these challenges, it is difficult to program, and researchers from a biological background find it complicated. Hadoop (Hadoop, 2014) (White, 2009) is an open-source software framework written in Java; it is developed as an Apache Software Foundation project and was inspired by Google's published MapReduce and Google File System designs. It makes it easy to write applications that process petabyte-scale data sets in parallel on thousands of nodes of commodity Linux hardware. MapReduce (Dean, 2008) is the core of Hadoop and performs two distinct and separate tasks. The first is map, which takes a set of data as input and translates it into another set of data in which individual elements are broken down into tuples (key/value pairs). The second is reduce, which takes the map's output as input and combines those tuples into a smaller set of tuples. As the name MapReduce implies, the map task is always executed before the reduce task.
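The map and reduce tasks described above can be illustrated with a small in-memory sketch. It is not Hadoop code and does not use any Hadoop API: the function names (`map_kmers`, `shuffle`, `reduce_counts`, `run_job`) and the choice of k-mer counting as the example job are assumptions made purely for illustration.

```python
from collections import defaultdict


def map_kmers(read, k=3):
    """Map step: emit one (k-mer, 1) key/value pair per position in the read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: collapse the list of values for one key into a single total."""
    return key, sum(values)


def run_job(reads, k=3):
    """Run map, then shuffle, then reduce, in that fixed order."""
    mapped = [pair for read in reads for pair in map_kmers(read, k)]
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["ACGTAC", "GTACGT"]))
```

On a real cluster the map calls would run on different nodes and the framework would perform the shuffle; here everything runs in one process so that the data flow is easy to follow.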

This paper explores different Hadoop tools used for bioinformatics big data, giving researchers alternatives to choose from in different scenarios. The rest of this paper is organized as follows. In Section 2, we discuss each of our selected Hadoop bioinformatics tools individually, including their advantages and applicability. Section 3 presents the gap analysis, which also recommends a suitable tool for particular scenarios. In Section 4, we present our conclusion and plans for future work.

Data Handling for Solving Biological Problems:

Because of the complex nature of biological data, it needs to be handled with big data tools so that more accurate, authoritative, reliable and explicit conclusions can be drawn. Sze et al. (2012) extend earlier findings, based on surface brushings and bronchoalveolar lavage fluid, of microbiomes in the human lung characteristic of asthma and chronic obstructive pulmonary disease (COPD). DNA was obtained from lung tissue samples from 8 non-smokers, 8 smokers with COPD, 8 patients with cystic fibrosis and 8 patients with very severe COPD. Polymerase chain reaction (PCR) amplification of 16S rRNA gene fragments was used for bacterial community analysis. Terminal restriction fragment length polymorphism (T-RFLP) analysis and pyrotag sequencing were used to assess bacterial community composition, alongside quantitative polymerase chain reaction (qPCR). The study delivers the quantification and identification of bacteria in human lung tissue. The Kruskal-Wallis nonparametric test with Bonferroni correction for multiple comparisons was used to analyze the qPCR data and the phyla found in the pyrotag sequencing data. The T-RFLP method is used to produce fingerprints of an unidentified microbial community.
The sample size and data analyzed in the above study are within the computational capacity of available systems. But if the same data were collected from individuals in different parts of the world, of different genders, and from different climatic regions and working environments, the complexity, heterogeneity and highly dynamic schema of millions of tissue samples would exceed the computational power of available high performance systems. Such data would require parallel computation across multiple machines. In another example, Banerji et al. (2012) performed whole-genome and whole-exome sequencing to detect genomic alterations in breast cancers. The dRanger algorithm was applied to the 22 cases with paired tumor/normal whole-genome sequencing data. A total of 108 samples (17 with both whole-genome and whole-exome sequencing, 86 with whole-exome sequencing only, and 5 with whole-genome sequencing only) passed initial qualification metrics and library construction and achieved the desired sequencing depth (100× for whole-exome sequencing; 30× for whole-genome sequencing) on the Illumina sequencing platform. Additional mutation calling was done to identify germline mutation events. Although they used a large sample size compared with other studies, increasing it further, for instance into the thousands, would give more accurate and more consistent results, since the standard error shrinks as the sample size grows (in proportion to the inverse square root of the sample size). Moreover, if the required data are gathered and stored permanently over the years, they can become very rich and productive. The collected data will give researchers more insight into breast carcinoma, and will serve not only current patients but also healthy individuals seeking to avoid the disease.
But a problem arises if this data is collected on a large scale. Piecing together and analyzing genomic data requires the computational power of high performance machines, with parallel execution of algorithms to perform sequence alignment properly. As a third example, Pragman et al. (2012) presented the largest analysis of COPD (chronic obstructive pulmonary disease) patients, comprising thirty-two samples from ambulatory patients with moderate or severe disease analyzed by 454 pyrosequencing. In total about 460,000 sequences were obtained, averaging 14,451 sequences per sample after trimming and quality control filtering. Principal coordinate analysis (PCoA) using Fast UniFrac was done to confirm the similarities between the samples obtained, which also validates the result. They concluded that age is a more crucial factor than the severity of COPD in increased microbial diversity. If the same process were performed with an even larger data set, the result obtained would be more accurate and worthwhile for the healthcare community.

Comparison of Bioinformatics Hadoop Tools:

CloudBurst: It was proposed in 2009 to map short reads against a reference genome using the MapReduce framework (Schatz, 2009). Its use of the seed-and-extend alignment technique increases the efficiency of mapping single-end NGS (next generation sequencing) data to reference genomes. CloudBurst's alignment algorithm can map reads with any number of mismatches or differences. With Hadoop, its running time scales linearly with the number of reads, and its speedup is linear in the size of the cluster. These characteristics allow CloudBurst to replace RMAP in a

data analysis pipeline with the same results but substantially better performance, because it builds on Hadoop, the open-source implementation of the MapReduce distributed programming framework (Smith, 2009) (Hadoop, 2014). In a MapReduce job, the input files are automatically partitioned into pieces, depending on the size of each piece and the desired number of mappers. Each mapper (m1 and m2 in Fig. 1) applies a user-defined function to its portion of the input and generates key/value pairs. The shuffle phase builds the list of values associated with each key (k1, k2 and kn in Fig. 1). Each reducer (r1 and r2 in Fig. 1) evaluates a user-defined function on its subset of the keys and the associated lists of values to create the set of output files.

Fig. 1. Diagrammatic overview of MapReduce.

CloudBurst's high performance is made possible by the efficiency of Hadoop. This distributed programming framework makes it straightforward to create highly scalable applications, with many aspects of parallel and distributed computing provided automatically. Hadoop's capability to deliver high performance, even in the face of enormously large datasets, is a good match for many kinds of problems in bioinformatics and computational biology. Future work on CloudBurst is to incorporate quality values into the mapping and scoring algorithms and to improve support for paired reads. Exploring the integration of CloudBurst into an RNA-seq analysis pipeline, which could also model gene splice sites, is another important direction for future work. Algorithms that do not use a hash table, such as BWT-based short-read aligners, can also use Hadoop and HDFS to parallelize execution.
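To make the seed-and-extend idea concrete, the following is a minimal single-machine sketch, not CloudBurst's actual implementation. It assumes mismatches only (no insertions or deletions) and chooses seeds by the pigeonhole argument: if a read is cut into k+1 non-overlapping seeds, an alignment with at most k mismatches must contain at least one seed that matches exactly. All function names are hypothetical.

```python
from collections import defaultdict


def build_seed_index(reference, seed_len):
    """Index every substring of length seed_len by its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def seed_and_extend(read, reference, max_mismatches):
    """Return sorted (start, mismatches) pairs for alignments within the limit."""
    seed_len = len(read) // (max_mismatches + 1)
    if seed_len == 0:
        raise ValueError("read is too short for this many mismatches")
    index = build_seed_index(reference, seed_len)
    hits = set()
    for s in range(max_mismatches + 1):
        offset = s * seed_len
        seed = read[offset:offset + seed_len]
        # Seed step: exact lookup of the seed in the reference index.
        for ref_pos in index.get(seed, []):
            start = ref_pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            # Extend step: compare the whole read against the candidate window.
            window = reference[start:start + len(read)]
            mismatches = count_mismatches(read, window)
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)


if __name__ == "__main__":
    print(seed_and_extend("CGTAGCAT", "AAACGTTGCATTT", max_mismatches=1))
```

In a MapReduce setting the seed lookups would be distributed by emitting seeds as keys, so that reads and reference positions sharing a seed meet in the same reducer; the sketch keeps everything in one dictionary for readability.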
Hadoop-BAM: It is a new library that uses the Hadoop distributed computing framework for the scalable handling of aligned NGS data. It acts as a middle layer between analysis applications and BAM files (Li, 2009) that are processed using Hadoop. A Binary Alignment/Map file (.bam) is the binary form of a SAM file. A Sequence Alignment/Map (SAM) file (.sam) is a tab-delimited text file that contains sequence alignment data (Integrative Genomics Viewer, 2014). The aligned data is usually stored in an indexed BAM file, which is later used for further analysis such as SNP genotyping. As data sizes have grown, many MapReduce frameworks have been used to provide useful components for NGS data analysis pipelines. However, they are not capable of efficient parallel processing of BAM files. Hadoop-BAM (Niemenmaa, 2012) solves this data access problem by providing an API for implementing map and reduce functions that operate directly on BAM files. It is built on top of the Picard SAM JDK, so tools based on the Picard API can be easily converted to support large-scale distributed processing. Hadoop-BAM requires a specific version of Hadoop (Hadoop, 2014) to be installed on the system, as currently only that version is supported by the library. Hadoop-BAM uses version 1.27 of Picard (Picard, 2014) to read and write BAM files. Fig. 2 shows preprocessed data in the Chipster genome browser (Kallio, 2011), giving an interactive high-level overview of the coverage profile.
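Since a SAM file is tab-delimited text, a single alignment line can be read with very little code. The sketch below parses the eleven mandatory SAM fields using only the Python standard library; it does not use Hadoop-BAM or Picard, it ignores optional tags, and the names `SamRecord`, `parse_sam_line`, `read_sam` and `is_unmapped` are illustrative assumptions.

```python
from typing import Iterable, Iterator, NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory fields of one SAM alignment line."""
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position, 0 if unavailable
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line on tabs and convert the numeric fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)


def read_sam(lines: Iterable[str]) -> Iterator[SamRecord]:
    """Yield records, skipping header lines (which start with '@') and blanks."""
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue
        yield parse_sam_line(line)


def is_unmapped(record: SamRecord) -> bool:
    """Bit 0x4 of the FLAG field marks an unmapped read."""
    return bool(record.flag & 0x4)


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.6\tSO:coordinate",
        "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
        "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTGGGG\tIIIIIIII",
    ]
    for rec in read_sam(example):
        state = "unmapped" if is_unmapped(rec) else "mapped"
        print(rec.qname, rec.rname, rec.pos, state)
```

BAM stores the same records in compressed binary blocks, which is why splitting a BAM file across mappers needs a dedicated library, whereas a text SAM file can be split on line boundaries.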

Fig. 2. Diagrammatic overview of MapReduce.

Myrna: Our third selected Hadoop-based tool computes differential gene expression from large RNA-seq datasets. It integrates short read alignment with normalization, interval calculations, statistical modeling and aggregation in a single computational pipeline. After alignment, Myrna computes coverage for genes, coding regions and exons, and differential expression using either non-parametric permutation tests or parametric tests. Results are returned as per-gene P-values and Q-values for differential expression, a raw count table, an RPKM table (reads per kilobase of exon model per million mapped reads), coverage plots for significant genes that can be directly incorporated into publications (Fig. 3), and other diagnostic plots (Langmead, 2010).

Fig. 3. Myrna Pipeline.
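The RPKM value mentioned above follows directly from its definition: the read count for a gene, divided by the exon length in kilobases and by the number of mapped reads in millions. The sketch below applies that textbook definition; it is not taken from Myrna, whose own normalization may differ, and the function names and the use of the summed counts as the total of mapped reads are assumptions for illustration.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


def rpkm_table(counts, exon_lengths_bp):
    """Build a gene -> RPKM table, taking the total as the sum of all counts."""
    total = sum(counts.values())
    return {gene: rpkm(count, exon_lengths_bp[gene], total)
            for gene, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical gene: 500 reads, 2,000 bp of exon, 10 million mapped reads.
    print(rpkm(500, 2000, 10_000_000))
```

Dividing by exon length makes long and short genes comparable within a sample, and dividing by the total number of mapped reads makes samples of different sequencing depth comparable with each other.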

Myrna is built on the Hadoop/MapReduce model. There are three modes in which Myrna can be run. The first is on the cloud using Amazon Elastic MapReduce. The second is on a Hadoop cluster. The third is on a single computer without Hadoop (singleton mode). In Cloud mode, the user needs an appropriate account and authorization roles set up beforehand. No special software installation is required in Cloud mode; the appropriate software is automatically installed on the EC2 instances before Myrna is run. In Hadoop mode, Myrna needs a functioning Hadoop cluster, with R/Bioconductor and Bowtie installed on every node. In singleton mode, R/Bioconductor and Bowtie must be installed on the computer, but Hadoop is not needed.

Discussion

Biomedical researchers and bioinformaticians are continuously confronted with large-scale data. These biological datasets are generated not only by next generation sequencing but also from specific databases and resources. For processing and analyzing these data sets, among all distributed computing approaches, the Hadoop framework has recently been found to be the most appropriate technology for handling biological data (Almeida, 2012). The tools discussed above each have their own pros and cons; each can be very useful in one respect and insufficient in another. CloudBurst, with Hadoop, can map next generation sequencing data to the human genome and other reference genomes. It can map reads with any number of mismatches because of its seed-and-extend alignment technique. The running time of CloudBurst is linear in the number of reads mapped.
With Hadoop, highly scalable application development and high performance in computational biology are achievable. But there are still gaps in CloudBurst that need to be addressed. Many other tools, such as SOAP, MAQ, RMAP and ZOOM, have richer mapping and alignment features than CloudBurst. There is also a need to incorporate quality values into the scoring and mapping algorithms. Adopting these features would make paired-read mapping more effective. Another severe limitation of CloudBurst is that it has no support for FASTQ-format input or paired-end reads. Hadoop-BAM is a library that integrates BAM (Binary Alignment/Map) files with analysis applications. It solves the problem of accessing compact data such as BAM files by providing an API for implementing map and reduce functions in the Hadoop framework. Moreover, it can be used for any kind of analysis based on BAM files. For simpler and more efficient use of Hadoop-BAM, a user-friendly and well-organized query language for the Hadoop environment is needed so that one can work on BAM files conveniently. The Hadoop-BAM community is also planning Samtools-like functionality. Another strict limitation of Hadoop-BAM is that it cannot work with BED files. With these value-added features, Hadoop-BAM could offer more scalability and avoid moving data in and out of Hadoop between analysis steps. Myrna is used for differential expression analysis of genes using cloud computing. Its efficient computational pipeline allows sophisticated statistical methods to be applied. It has been tested on publicly available RNA-Seq data covering billions of reads simultaneously. The tests showed good results in finding differential expression in large gene data sets.
But the current version of Myrna still cannot align reads across exon junctions. Missing these junction reads may lose expression signal, which is a serious gap in Myrna that needs to be addressed. Another gap is that Myrna trims all input reads to a fixed length before alignment. The three modes of Myrna (on the cloud with Amazon, on a Hadoop cluster, and singleton mode) also have their own strengths and weaknesses. Although Myrna in Cloud mode has greater scalability than the other two modes, transferring data to the cloud is quite inconvenient for users. Therefore preference is given to running on local resources, in Hadoop or singleton mode. This is another challenge for the Myrna community to address. Future adoption of these capabilities will improve Myrna's performance. In the short term, all the traditional resources of bioinformatics, ranging from tools to databases to literature, will need to be restructured, redesigned or re-implemented and re-studied to support Hadoop technology and services. A comparative study of this kind on bioinformatics Hadoop tools will help biologists reach more concrete findings in biology and will also encourage researchers from both computer science and the biological sciences to direct their attention towards high-performance computing.

Conclusion

In this paper we have evaluated CloudBurst, Hadoop-BAM and Myrna as modern Hadoop-based bioinformatics tools in the field of distributed and parallel computing. We used a comparative methodology as the test case. This research finds that no system is preferred in all situations; every system has its own strengths and weaknesses.

Acknowledgement

The authors would like to thank all the faculty members of the Department of Computer Science, Federal Urdu University and International Center for Chemical and Biological Sciences, HEJ University of Karachi, for their useful suggestions and comments.

References

Almeida, J.S., Grüneberg, A., Maass, W. and Vinga, S. (2012). Fractal MapReduce decomposition of sequence alignment. Algorithms for Molecular Biology, 7(1): 12.
Baker, M. (2010). Next-generation sequencing: adjusting to data overload. Nature Methods, 7(7),
Editorial, (2010). Gathering clouds and a sequencing storm: Why cloud computing could broaden community access to next-generation sequencing. Nature Biotechnology, 28(1).
Banerji, S., Cibulskis, K., Rangel-Escareno, C., Brown, K.K., Carter, S.L., Frederick, A.M., and Meyerson, M. (2012). Sequence analysis of mutations and translocations across breast cancer subtypes. Nature, 486(7403):
Hadoop, (2014). [Online] Available from: [Accessed 28th July 2014].
Integrative Genomics Viewer (2014). [Online] Available from: [Accessed 28th July 2014].
Dean, J. and Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1)
Kallio, M.A., Tuimala, J.T., Hupponen, T., Klemelä, P., Gentile, M., Scheinin, I.,... and Korpelainen, E.I. (2011). Chipster: user-friendly analysis software for microarray and other high-throughput data. BMC Genomics, 12(1): 507.
Langmead, B., Hansen, K.D., and Leek, J.T. (2010). Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biology, 11(8), R83.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N.,... and Durbin, R. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16),
Niemenmaa, M., Kallio, A., Schumacher, A., Klemelä, P., Korpelainen, E., and Heljanko, K. (2012). Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics, 28(6):
Picard, (2014). [Online] Available from: [Accessed 28th July 2014].
Schatz, M.C. (2009). CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11),
Pragman, A.A., Kim, H.B., Reilly, C.S., Wendt, C., and Isaacson, R.E. (2012). The lung microbiome in moderate and severe chronic obstructive pulmonary disease. PLoS ONE, 7(10): e
Smith, A.D., Chung, W.Y., Hodges, E., Kendall, J., Hannon, G., Hicks, J.,... and Zhang, M.Q. (2009). Updates to the RMAP short-read mapping software. Bioinformatics, 25(21):
Vaquero, L.M., Rodero-Merino, L., Cáceres, J. and Lindner, M. (2008). A break in the clouds: towards a cloud definition. ACM SIGCOMM Computer Communication Review, 39(1),
Sze, M.A., Dimitriu, P.A., Hayashi, S., Elliott, W.M., McDonough, J.E., Gosselink, J.V.,... and Hogg, J.C. (2012). The lung tissue microbiome in chronic obstructive pulmonary disease. American Journal of Respiratory and Critical Care Medicine, 185(10):
White, T. (2009). Hadoop: The Definitive Guide. Sebastopol: O'Reilly Media.


More information

PREDA S4-classes. Francesco Ferrari October 13, 2015

PREDA S4-classes. Francesco Ferrari October 13, 2015 PREDA S4-classes Francesco Ferrari October 13, 2015 Abstract This document provides a description of custom S4 classes used to manage data structures for PREDA: an R package for Position RElated Data Analysis.

More information

GC3 Use cases for the Cloud

GC3 Use cases for the Cloud GC3: Grid Computing Competence Center GC3 Use cases for the Cloud Some real world examples suited for cloud systems Antonio Messina Trieste, 24.10.2013 Who am I System Architect

More information

Next Generation Sequencing: Adjusting to Big Data. Daniel Nicorici, Dr.Tech. Statistikot Suomen Lääketeollisuudessa 29.10.2013

Next Generation Sequencing: Adjusting to Big Data. Daniel Nicorici, Dr.Tech. Statistikot Suomen Lääketeollisuudessa 29.10.2013 Next Generation Sequencing: Adjusting to Big Data Daniel Nicorici, Dr.Tech. Statistikot Suomen Lääketeollisuudessa 29.10.2013 Outline Human Genome Project Next-Generation Sequencing Personalized Medicine

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
