Large-scale DNA Sequence Analysis in the Cloud: A Stream-based Approach
Romeo Kienzler 1, Rémy Bruggmann 2, Anand Ranganathan 3, Nesime Tatbul 1

1 Department of Computer Science, ETH Zurich, Switzerland [email protected], [email protected]
2 Bioinformatics, Department of Biology, University of Berne, Switzerland [email protected]
3 IBM T.J. Watson Research Center, NY, USA [email protected]

Abstract. Cloud computing technologies have made it possible to analyze big data sets in scalable and cost-effective ways. DNA sequence analysis, where very large data sets are now generated at reduced cost using Next-Generation Sequencing (NGS) methods, is an area that can greatly benefit from cloud-based infrastructures. Although existing solutions show nearly linear scalability, they pose significant limitations in terms of data transfer latencies and cloud storage costs. In this paper, we propose to tackle the performance problems that arise from having to transfer large amounts of data between clients and the cloud using a streaming data management architecture. Our approach provides an incremental data processing model that can hide data transfer latencies while maintaining linear scalability. We present an initial implementation and evaluation of this approach for SHRiMP, a well-known software package for NGS read alignment, based on the IBM InfoSphere Streams computing platform deployed on Amazon EC2.

Keywords: DNA sequence analysis, Next-Generation Sequencing (NGS), NGS read alignment, cloud computing, data stream processing, incremental data processing

1 Introduction

Today, huge amounts of data are being generated at ever increasing rates by a wide range of sources, from networks of sensing devices to social media and special scientific devices such as DNA sequencing machines and astronomical telescopes.
These data sets present both an exciting opportunity, enabling intelligent applications such as detecting and preventing diseases or spotting business trends, and a major challenge in managing their capture, transfer, storage, and analysis. Recent advances in cloud computing technologies have made it possible to analyze very large data sets in scalable and cost-effective ways. Various platforms and frameworks, such as MapReduce [2], [4], have been proposed to exploit cloud infrastructures for this problem. Most of
these solutions are primarily designed for batch processing of data stored in a distributed file system. While such a design supports scalable and fault-tolerant processing very well, it can pose limitations when transferring data. More specifically, large amounts of data have to be uploaded into the cloud before processing starts, which not only causes significant data transfer latencies, but also adds to cloud storage costs [19], [26].

In this short paper, we investigate the performance problems that arise from having to transfer large amounts of data in and out of the cloud, based on a real data-intensive use case from bioinformatics, for which we propose a stream-based approach as a promising solution. Our key idea is that data transfer latencies can be hidden by providing an incremental data processing architecture, similar in spirit to pipelined query evaluation models in traditional database systems [15]. It is important, though, that this is done in a way that also supports linear scalability through parallel processing, which is an indispensable requirement for handling data- and compute-intensive workloads in the cloud. More specifically, we propose to use a stream-based data management architecture, which not only provides an incremental and parallel data processing model, but also facilitates in-memory processing: since data is processed on the fly, intermediate data need not be materialized on disk (unless explicitly needed by the application), which can further reduce end-to-end response time and cloud storage costs.

The rest of this paper is organized as follows: In Section 2, we describe our use case for large-scale DNA sequence analysis, which has been the main motivation for the work presented in this paper. We present our stream-based solution approach in Section 3, including an initial implementation and evaluation of our use case based on the IBM InfoSphere Streams computing platform [5] deployed on Amazon EC2 [1].
Finally, we conclude with a discussion of future work in Section 4.

2 Large-scale DNA Sequence Analysis

Determining the order of the nucleotide bases in DNA molecules and analyzing the resulting sequences have become essential in biological research and applications. Since the 1970s, the Sanger method (also known as the dideoxy or chain terminator method) had been the standard technique [22]. With this method, it is possible to read about 80 kilo base pairs (kbp) per instrument-day at a total cost of $150. The Next-Generation Sequencing (NGS) methods, invented in 2004, dramatically increased this per-day bp throughput, and with it the amount of data that needs to be stored and processed [27]. Compared with the Sanger method above, with NGS the cost for sequencing 80 kbp has fallen to less than $0.01, and the sequencing is done in less than 10 seconds. Table 1 shows an overview of the speed and cost of three different NGS technologies compared to the Sanger method. The higher throughput and lower cost of these technologies have led to the generation of very large datasets that need to be efficiently analyzed. As stated by Stein [25]:
              Sanger       Roche 454   Illumina 2k   SOLiD 5
read length   —            —           —             —
GB per day    —            —           —             —
cost per GB   $2,000,000   $20,000     $75           $75

Table 1. Compared to the Sanger method, NGS methods have significantly higher throughput at a fraction of the cost.

"Genome biologists will have to start acting like the high energy physicists, who filter the huge datasets coming out of their collectors for a tiny number of informative events and then discard the rest."

NGS is used to sequence DNA in an automated, high-throughput process. DNA molecules are fragmented into pieces of 100 to 800 bp, and digital versions of the DNA fragments are generated. These fragments, called reads, originate from random positions of the DNA molecules. In re-sequencing experiments, the reads are mapped back to a reference genome (e.g., human) [19], or, without a reference genome, they can be assembled de novo [23]. However, de novo assembly is more complex due to the short read length as well as potential repetitive regions in the genome. In re-sequencing experiments, polymorphisms between the analyzed DNA and the reference genome can be observed. A polymorphism of a single bp is called a Single Nucleotide Polymorphism (SNP) and is recognized as the main cause of human genetic variability [9]. Figure 1 shows an example, with a reference genome in the top row and two SNPs identified on the analyzed DNA sequences depicted below. As stated by Fernald et al., once NGS technology becomes available at a clinical level, it will become part of the standard healthcare process to check patients' SNPs before medical treatment (a.k.a. personalized medicine) [12]:

"We are on the verge of the genomic era: doctors and patients will have access to genetic data to customize medical treatment."

Aligning NGS reads to genomes is computationally intensive. Li et al. give an overview of algorithms and tools currently in use [19].
To align reads containing SNPs, probabilistic algorithms have to be used, since finding an exact match between reads and a given reference is not sufficient in the presence of polymorphisms and sequencing errors. Most of these algorithms are based on a basic pattern called seed and extend [8], where small matching regions between reads and the reference genome are identified first (seeding) and then further extended. Additionally, to be able to identify seeds that contain SNPs, a special algorithm that allows for a certain number of differences during seeding needs to be used [16]. Unfortunately, this adaptation further increases the computational complexity. For example, on a small cluster used by FGCZ [3] (25 nodes with a total of 232 CPU compute cores and 800 GB main memory), a single genome alignment process can take up to 10 hours.
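The seed-and-extend pattern described above can be illustrated with a minimal sketch. This is a toy aligner, not SHRiMP's actual algorithm: it uses exact k-mer seeds with several seed offsets per read (a simple stand-in for the spaced seeds of [16]), and the reference, reads, and parameters below are hypothetical.

```python
from collections import defaultdict

def align_read(read, reference, k=8, max_mismatches=2):
    """Toy seed-and-extend: index k-mers of the reference (seeding),
    then verify each candidate position by counting mismatches (extending).
    Returns the best (position, mismatches) pair, or None."""
    # Seeding: build an exact k-mer index of the reference genome.
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)

    best = None
    # Take several non-overlapping seeds from the read, so that a SNP
    # falling inside one seed does not eliminate the true position.
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            # Extending: count mismatches over the full read length.
            mismatches = sum(a != b for a, b in
                             zip(read, reference[start:start + len(read)]))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best

ref = "ACGTTGCATTGACCAGTACGGTACCATGGA"
print(align_read("TTGACCAGTACG", ref))   # exact match -> (8, 0)
print(align_read("TTGACCAGAACG", ref))   # one SNP-like mismatch -> (8, 1)
```

Allowing mismatches during extension is what lets reads with SNPs still map; allowing differences already during seeding, as in spaced seeds, is what makes the real algorithms considerably more expensive.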
Fig. 1. SNP identification: The top row shows a subsequence of the reference genome. The following rows are aligned NGS reads. Two SNPs can be identified: T is replaced by C (7th column) and C is replaced by T (25th column). In one read (line 7), a sequencing error can be observed, where A has been replaced by G (last column).

Read alignment algorithms have been shown to have great potential for linear scalability [24]. However, sequencing throughput increases faster than computational power and storage size [25]. As a result, although NGS machines are becoming cheaper, using dedicated compute clusters for read alignment is still a significant investment. Fortunately, even small labs can do the alignment by using cloud resources [11]. Li et al. state that cloud computing might be a possible solution for small labs, but also raise concerns about data transfer bottlenecks and storage costs [19]. Thus, existing cloud-based solutions such as CloudBurst [24] and Crossbow [17], as well as the cloud-enabled version of Galaxy [14], share a common disadvantage: before processing starts, large amounts of data have to be uploaded into the cloud, potentially causing significant data transfer latency and storage costs [26]. In this work, our main focus is to develop solutions for the performance problems that stem from having to transfer large amounts of data in and out of the cloud for data-intensive use cases such as the one described above. If we roughly capture the overall processing time with the function

f(n, s) = c·s + s/n,
where n is the number of CPU cores, s is the problem size, and c is a constant capturing the data transfer rate between a client and the cloud, our main goal is to bring down the first component (c·s) in this formula. Furthermore, we would like to do so in a way that supports linear scalability. In the next section, we present the solution that we propose, together with an initial evaluation study which indicates that our approach is a promising one.

3 A Stream-based Approach

In the following, we first present our stream-based approach in general terms, and then describe how we applied it to a specific NGS read alignment use case, together with the results of a preliminary evaluation study.

3.1 Incremental Data Processing with an SPE

We propose to use a stream-based data management platform in order to reduce the total processing time of data-intensive applications deployed in the cloud by eliminating their data transfer latencies. Our main motivation is to exploit the incremental and in-memory data processing model of Stream Processing Engines (SPEs), such as the Borealis engine [7] or the IBM InfoSphere Streams (or Streams for short) engine [13]. SPEs have been primarily designed to provide low-latency processing over continuous streams of time-sensitive data from push-based data sources. Applications are built by defining directed acyclic graphs, where nodes represent operators and edges represent the dataflows between them. Operators transform data between their inputs and outputs, working on finite chunks of data sequences (a.k.a. sliding windows). SPEs provide query algebras with a well-defined set of commonly used operators, which can be easily extended with custom, user-defined operators. There are also special operators/adapters for supporting access to a variety of data sources, including files, sockets, and databases.
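The operator-graph model can be illustrated with plain Python generators standing in for SPE operators. This is a conceptual sketch only (a real SPE such as Streams compiles the dataflow graph and distributes it across nodes); the operator names below are hypothetical, but the key property is the same: each tuple flows through the chain in memory, incrementally, without being materialized between stages.

```python
def source(lines):
    """Source operator: emits tuples as they arrive (push-based input)."""
    for line in lines:
        yield line

def op_filter(stream, predicate):
    """Filter operator: forwards only tuples satisfying the predicate."""
    for item in stream:
        if predicate(item):
            yield item

def op_window(stream, size):
    """Tumbling-window operator: groups tuples into finite chunks,
    the generator analogue of an SPE's windowed processing."""
    window = []
    for item in stream:
        window.append(item)
        if len(window) == size:
            yield window
            window = []
    if window:
        yield window  # flush the final partial window

# Operators composed into a dataflow graph; nothing is written to disk.
stream = source(["a", "", "b", "c", "", "d"])
stream = op_filter(stream, lambda s: s != "")
print(list(op_window(stream, 2)))   # [['a', 'b'], ['c', 'd']]
```

Because generators are lazy, downstream operators pull items one at a time, so the first results appear before the input is fully consumed, which is exactly the property that lets an SPE hide transfer latency.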
Once an application is defined, SPEs take care of all system-level requirements needed to execute it in a correct and efficient way, such as interprocess communication, data partitioning, operator distribution, fault tolerance, and dynamic scaling.

In our approach, we do not provide any new algorithms; rather, we provide an SPE-based platform to bring existing algorithms/software into the cloud in a way that lets them work with their input data in an incremental fashion. One generic way of doing this is to use the command line tools provided by most of these software packages. For example, in the NGS software packages that we looked at, we have so far seen two types of command line tools: those that are able to read from and write to standard Unix pipes, and those that cannot. We build custom streaming operators by wrapping the Unix processes. If standard Unix pipe communication is supported, one thread provides the Unix process with incoming data streams, while results are read by a second thread. Otherwise, data is written in chunks to files residing on an in-memory file system; for each chunk, the Unix process is run once, and the produced output data is read and passed on to the next operator as a data stream.

(The problem size for NGS read alignment depends on a number of factors, including the number of reads to be aligned, the size of the reference genome, and the fuzziness of the alignment algorithm.)

Figure 2 and Figure 3 contrast how data-intensive applications are typically deployed in the cloud today vs. how they could be deployed using our approach. Although the figures illustrate our NGS read alignment use case specifically, the architectural and conceptual differences apply in general.

Fig. 2. Using SHRiMP on multiple nodes as a standalone application requires splitting the raw NGS read data into equal-sized chunks, transferring them to multiple cloud nodes, running SHRiMP in parallel, copying the results back to the client, and finally merging them into a single file.

Fig. 3. With our stream-based approach, the client streams the reads into the cloud, where they instantly get mapped to a reference genome, and results are immediately streamed back to the client.
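The pipe-based wrapping described above, one thread feeding stdin while another reads stdout, can be sketched as follows. This is an illustrative helper, not the actual Streams operator API; `cat` stands in for a pipe-capable aligner, and the helper name and read data are hypothetical.

```python
import subprocess
import threading

def stream_through(command, input_chunks):
    """Wrap a pipe-capable Unix tool as a streaming operator: one thread
    feeds stdin while the caller reads stdout incrementally."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)

    def feed():
        for chunk in input_chunks:
            proc.stdin.write(chunk)   # push input as it arrives
        proc.stdin.close()            # signal end-of-stream to the tool

    threading.Thread(target=feed, daemon=True).start()
    # The caller's thread reads results as they are produced, so output
    # starts flowing before all input has been written.
    for line in proc.stdout:
        yield line
    proc.wait()

# 'cat' stands in for an aligner that reads/writes standard Unix pipes.
reads = [b"read1 ACGT\n", b"read2 TTGA\n"]
print([line for line in stream_through(["cat"], reads)])
```

Using two threads avoids the classic deadlock where a single thread blocks writing to a full stdin pipe while the child blocks writing to an unread stdout pipe. For tools without pipe support, the same wrapper shape applies, except each chunk is written to an in-memory file system and the process is re-run per chunk.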
Fig. 4. Operator and dataflow graph for our stream-based incremental processing implementation of SHRiMP.

3.2 Use Case Implementation

We now describe how we implemented our approach for a well-known NGS read alignment software package called SHRiMP [21], using IBM InfoSphere Streams [5] as our SPE and Amazon EC2 [1] as our cloud computing platform. Figure 4 shows a detailed dataflow graph of our implementation. A client application implemented in Java compresses and streams raw NGS read data into the cloud, where a master Streams node first receives it. At the master node, the read stream gets uncompressed by an Uncompress operator and is then fed into a TCPSource operator. In order to be able to run parallel instances of SHRiMP for increased scalability, the TCPSource operator feeds the stream into a ThreadedSplit operator. ThreadedSplit is aware of the data input rates that can be handled by its downstream operators, and can therefore provide an optimal load distribution. The number of substreams that ThreadedSplit generates determines the number of processing (i.e., slave) nodes in the compute cluster, each of which runs a SHRiMP instance. SHRiMP instances are created by instantiating a custom Streams operator using standard Unix pipes. The resulting aligned read data (in the form of SAM output [6]) on the different SHRiMP nodes is merged at the master node using a Merge operator. Then a TCPSink operator passes the output stream to a Compress operator, which ensures that results are sent back to the client application in compact form, where they are uncompressed again before being presented to the user. The whole chain, including the compression stages, is fully incremental.

3.3 Initial Evaluation

In this section, we present an initial evaluation of our approach on the implemented use case in terms of scalability, costs, and ease of use. For scalability, we have run an experiment that compares the two architectures shown in Figure 2 and Figure 3.
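The split/align/merge pattern of Section 3.2 can be sketched with a shared work queue, where each worker stands in for a SHRiMP node. This is a single-process toy, not the Streams ThreadedSplit/Merge operators: the real ThreadedSplit balances by downstream input rates, whereas here faster workers simply pull more items from the queue; the function names, the trivial `str.upper` "alignment", and the sorting of merged output (used only to make the result deterministic) are all hypothetical.

```python
import queue
import threading

def threaded_split_merge(items, workers, process):
    """Distribute items over worker threads via a shared queue
    (split), then collect their results (merge)."""
    work, done = queue.Queue(), queue.Queue()
    for item in items:
        work.put(item)

    def worker():
        while True:
            try:
                item = work.get_nowait()   # pull-based: fast workers take more
            except queue.Empty:
                return
            done.put(process(item))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Merge step; sorted here only so the output order is deterministic.
    return sorted(done.get() for _ in range(done.qsize()))

# A trivial transformation stands in for one SHRiMP instance per worker.
reads = ["acgt", "ttga", "ccat", "ggta"]
print(threaded_split_merge(reads, workers=2, process=str.upper))
```

The pull-based queue gives the same effect the paper attributes to ThreadedSplit: load naturally shifts toward whichever downstream instance can currently accept more input.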
In the experiment, we have aligned reads of Streptococcus suis, an important pathogen of pigs, against its reference genome. Doing this on
a single Amazon EC2 m1.large instance takes around 28 minutes. In order to project this to the analysis of more complex organisms (like humans), we have scaled all our results up by a factor of 60 (e.g., 28 hours instead of 28 minutes). In all cases, data is compressed before being transferred into the cloud. To serve as a reference point, assuming a broadband Internet connection, transferring the compressed input data set into the cloud takes about 90 minutes.

Fig. 5. At a cluster size of 4 nodes and above, the stream-based solution incurs less total processing time than the standalone application, because the data transfer time always adds to the curve of the standalone application.

Scalability. Figure 5 shows the result of our scalability experiment. The bottom flat line corresponds to the data transfer time of 90 minutes for our specific input dataset. This time is included in the SHRiMP standalone curve, where input data has to be uploaded into the cloud in advance. The stream-based approach, on the other hand, does not transfer any data in advance and thus does not incur this additional latency. Both approaches show linear scalability in total processing time as the number of Amazon EC2 nodes is increased. Up to 4 nodes, the standalone approach takes less processing time. However, as the cluster size increases beyond this value, the relative effect of the initial data transfer latency for the standalone approach starts to show, reaching almost a 30-minute difference in processing time over the stream-based approach for the 16-node setup. We expect this difference to become even more significant as the input dataset size and the cluster size further increase.

Costs. As our solution allows data processing to start as soon as the data arrives in the cloud, we can show that the constant c in the formula f(n, s) = c·s + s/n introduced in the previous section can be brought to nearly zero, leading to
f(n, s) ≈ s/n

for the overall data processing time. Since we have shown linear scale-out, we can calculate the CPU cost as

p(n, s) = n · f(n, s) ≈ n · (s/n) = s.

Since the cost ends up depending only on the problem size, one can minimize the processing time f(n, s) by maximizing n without any significant effect on the cost. Data transfer and storage costs are relatively small in comparison to the CPU cost; therefore, we have ignored them in this initial cost analysis. Nevertheless, it is not difficult to see that these costs will also decrease with our stream-based approach.

Ease of Use. Our client, a command line tool, behaves exactly the same way as the command line tool of any read alignment software package. Therefore, existing data processing chains can be sped up by simply replacing the existing aligner with our client, without changing anything else. Even flexible and more complex bioinformatics data processing engines (e.g., Galaxy [14] or Pegasus [10]) can be transparently enhanced by simply replacing the original data processing stages with our solution.

4 Conclusions and Future Work

In this paper, we proposed a stream-based approach to bringing data- and CPU-intensive applications into the cloud without transferring data in advance. We applied this idea to a large-scale DNA sequence analysis use case and showed that overall processing time can be significantly reduced, while providing linear scalability, reduced monetary costs, and ease of use. We would like to extend this work along several directions. At the moment, only SHRiMP [21] and Bowtie [18] have been enabled to run on our system. We would like to explore other algorithms (e.g., SNP callers [20]) that can benefit from our solution. As some of these presuppose sorted input, this will be an additional challenge that we need to handle.
Furthermore, we would like to take a closer look at recent work on turning MapReduce into an incremental framework and compare those approaches with our stream-based approach. Last but not least, we will explore how fault-tolerance techniques in stream processing can be utilized to make our solution more robust and reliable.

Acknowledgements. This work has been supported in part by an IBM faculty award.

References

1. Amazon Elastic Compute Cloud
2. Apache Hadoop
3. Functional Genomics Center Zurich
4. Google MapReduce
5. IBM InfoSphere Streams
6. The SAM Format Specification. samtools.sourceforge.net/sam1.pdf
7. Abadi, D., Ahmad, Y., Balazinska, M., Çetintemel, U., Cherniack, M., Hwang, J., Lindner, W., Maskey, A., Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., Zdonik, S.: The Design of the Borealis Stream Processing Engine. In: Conference on Innovative Data Systems Research (CIDR 05), Asilomar, CA (January 2005)
8. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic Local Alignment Search Tool. Journal of Molecular Biology 215(3) (October 1990)
9. Collins, F.S., Guyer, M., Chakravarti, A.: Variations on a Theme: Cataloging Human DNA Sequence Variation. Science 278(5343) (November 1997)
10. Deelman, E., Mehta, G., Singh, G., Su, M., Vahi, K.: Pegasus: Mapping Large-scale Workflows to Distributed Resources. Workflows for e-Science (2007)
11. Dudley, J.T., Butte, A.J.: In Silico Research in the Era of Cloud Computing. Nature Biotechnology 28(11) (2010)
12. Fernald, G.H., Capriotti, E., Daneshjou, R., Karczewski, K.J., Altman, R.B.: Bioinformatics Challenges for Personalized Medicine. Bioinformatics 27(13) (July 2011)
13. Gedik, B., Andrade, H., Wu, K.L., Yu, P.S., Doo, M.: SPADE: The System S Declarative Stream Processing Engine. In: ACM SIGMOD Conference, Vancouver, BC, Canada (June 2008)
14. Goecks, J., Nekrutenko, A., Taylor, J., Galaxy Team: Galaxy: A Comprehensive Approach for Supporting Accessible, Reproducible, and Transparent Computational Research in the Life Sciences. Genome Biology 11(8) (2010)
15. Graefe, G.: Query Evaluation Techniques for Large Databases. ACM Computing Surveys 25(2) (June 1993)
16. Keich, U., Ming, L., Ma, B., Tromp, J.: On Spaced Seeds for Similarity Search. Discrete Applied Mathematics 138(3) (April 2004)
17. Langmead, B., Schatz, M.C., Lin, J., Pop, M., Salzberg, S.L.: Searching for SNPs with Cloud Computing. Genome Biology 10(11) (2009)
18.
Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and Memory-efficient Alignment of Short DNA Sequences to the Human Genome. Genome Biology 10(3) (2009)
19. Li, H., Homer, N.: A Survey of Sequence Alignment Algorithms for Next-Generation Sequencing. Briefings in Bioinformatics 11(5) (September 2010)
20. Li, R., Li, Y., Fang, X., Yang, H., Wang, J., Kristiansen, K., Wang, J.: SNP Detection for Massively Parallel Whole-Genome Resequencing. Genome Research 19(6) (June 2009)
21. Rumble, S.M., Lacroute, P., Dalca, A.V., Fiume, M., Sidow, A., Brudno, M.: SHRiMP: Accurate Mapping of Short Color-space Reads. PLOS Computational Biology 5(5) (May 2009)
22. Sanger, F., Coulson, A.R.: A Rapid Method for Determining Sequences in DNA by Primed Synthesis with DNA Polymerase. Journal of Molecular Biology 94(3) (May 1975)
23. Schatz, M., Delcher, A., Salzberg, S.: Assembly of Large Genomes Using Second-generation Sequencing. Genome Research 20(9), 1165 (2010)
24. Schatz, M.C.: CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics 25(11) (June 2009)
25. Stein, L.D.: The Case for Cloud Computing in Genome Informatics. Genome Biology 11(5) (2010)
26. Viedma, G., Olias, A., Parsons, P.: Genomics Processing in the Cloud. International Science Grid This Week, genomics-processing-cloud (February 2011)
27. Voelkerding, K.V., Dames, S.A., Durtschi, J.D.: Next-Generation Sequencing: From Basic Research to Diagnostics. Clinical Chemistry 55(4) (February 2009)
Support for data-intensive computing with CloudMan
Support for data-intensive computing with CloudMan Y. Kowsar 1 and E. Afgan 1,2 1 Victorian Life Sciences Computation Initiative (VLSCI), University of Melbourne, Melbourne, Australia 2 Centre for Informatics
An Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database
An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct
A Service for Data-Intensive Computations on Virtual Clusters
A Service for Data-Intensive Computations on Virtual Clusters Executing Preservation Strategies at Scale Rainer Schmidt, Christian Sadilek, and Ross King [email protected] Planets Project Permanent
How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time
SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Putting Genomes in the Cloud with WOS TM. ddn.com. DDN Whitepaper. Making data sharing faster, easier and more scalable
DDN Whitepaper Putting Genomes in the Cloud with WOS TM Making data sharing faster, easier and more scalable Table of Contents Cloud Computing 3 Build vs. Rent 4 Why WOS Fits the Cloud 4 Storing Sequences
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
A Design of Resource Fault Handling Mechanism using Dynamic Resource Reallocation for the Resource and Job Management System
A Design of Resource Fault Handling Mechanism using Dynamic Resource Reallocation for the Resource and Job Management System Young-Ho Kim, Eun-Ji Lim, Gyu-Il Cha, Seung-Jo Bae Electronics and Telecommunications
Scaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant
A Tutorial in Genetic Sequence Classification Tools and Techniques
A Tutorial in Genetic Sequence Classification Tools and Techniques Jake Drew Data Mining CSE 8331 Southern Methodist University [email protected] www.jakemdrew.com Sequence Characters IUPAC nucleotide
Managing and Conducting Biomedical Research on the Cloud Prasad Patil
Managing and Conducting Biomedical Research on the Cloud Prasad Patil Laboratory for Personalized Medicine Center for Biomedical Informatics Harvard Medical School SaaS & PaaS gmail google docs app engine
UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production
Page 1 of 6 UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production February 05, 2010 Newsletter: BioInform BioInform - February 5, 2010 By Vivien Marx Scientists at the department
SURVEY ON SCIENTIFIC DATA MANAGEMENT USING HADOOP MAPREDUCE IN THE KEPLER SCIENTIFIC WORKFLOW SYSTEM
SURVEY ON SCIENTIFIC DATA MANAGEMENT USING HADOOP MAPREDUCE IN THE KEPLER SCIENTIFIC WORKFLOW SYSTEM 1 KONG XIANGSHENG 1 Department of Computer & Information, Xinxiang University, Xinxiang, China E-mail:
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
Cloud-Based Big Data Analytics in Bioinformatics: A Review
Cloud-Based Big Data Analytics in Bioinformatics: A Review Cephas MAWERE 1, Kudakwashe ZVAREVASHE 2, Thamari SENGUDZWA 3, Tendai PADENGA 4 1 Harare Institute of Technology, School of Industrial Sciences
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
Fault Tolerance in Hadoop for Work Migration
1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Cloud-based Analytics and Map Reduce
1 Cloud-based Analytics and Map Reduce Datasets Many technologies converging around Big Data theme Cloud Computing, NoSQL, Graph Analytics Biology is becoming increasingly data intensive Sequencing, imaging,
Towards Integrating the Detection of Genetic Variants into an In-Memory Database
Towards Integrating the Detection of Genetic Variants into an 2nd International Workshop on Big Data in Bioinformatics and Healthcare Oct 27, 2014 Motivation Genome Data Analysis Process DNA Sample Base
Fast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer [email protected] Assistants: Henri Terho and Antti
MINIMIZING STORAGE COST IN CLOUD COMPUTING ENVIRONMENT
MINIMIZING STORAGE COST IN CLOUD COMPUTING ENVIRONMENT 1 SARIKA K B, 2 S SUBASREE 1 Department of Computer Science, Nehru College of Engineering and Research Centre, Thrissur, Kerala 2 Professor and Head,
Mining Large Datasets: Case of Mining Graph Data in the Cloud
Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
http://www.paper.edu.cn
5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission
An Oracle White Paper November 2010. Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics
An Oracle White Paper November 2010 Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics 1 Introduction New applications such as web searches, recommendation engines,
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Improving Current Hadoop MapReduce Workflow and Performance
Improving Current Hadoop MapReduce Workflow and Performance Hamoud Alshammari Department of Computer Science, CT, USA Jeongkyu Lee Department of Computer Science, CT, USA Hassan Bajwa Department of Electrical
W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract
W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the
Big Data Primer. 1 Why Big Data? Alex Sverdlov [email protected]
Big Data Primer Alex Sverdlov [email protected] 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.
Introduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392. Research Article. E-commerce recommendation system on cloud computing
Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 E-commerce recommendation system on cloud computing
Accelerating Hadoop MapReduce Using an In-Memory Data Grid
Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for
Data-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan [email protected] Abstract Every day, we create 2.5 quintillion
Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
Implementing Graph Pattern Mining for Big Data in the Cloud
Implementing Graph Pattern Mining for Big Data in the Cloud Chandana Ojah M.Tech in Computer Science & Engineering Department of Computer Science & Engineering, PES College of Engineering, Mandya [email protected]
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
Big Data Analytics Hadoop and Spark
Big Data Analytics Hadoop and Spark Shelly Garion, Ph.D. IBM Research Haifa 1 What is Big Data? 2 What is Big Data? Big data usually includes data sets with sizes beyond the ability of commonly used software
International Journal of Innovative Research in Computer and Communication Engineering
FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next
A Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
Next Generation Sequencing: Technology, Mapping, and Analysis
Next Generation Sequencing: Technology, Mapping, and Analysis Gary Benson Computer Science, Biology, Bioinformatics Boston University [email protected] http://tandem.bu.edu/ The Human Genome Project took
Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN
Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current
CSE-E5430 Scalable Cloud Computing Lecture 11
CSE-E5430 Scalable Cloud Computing Lecture 11 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 30.11-2015 1/24 Distributed Coordination Systems Consensus
Introduction to NGS data analysis
Introduction to NGS data analysis Jeroen F. J. Laros Leiden Genome Technology Center Department of Human Genetics Center for Human and Clinical Genetics Sequencing Illumina platforms Characteristics: High
An Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect
Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham University of Texas at Dallas, Dallas TX 75080, USA Abstract.
