HADOOP IN THE LIFE SCIENCES:


White Paper

HADOOP IN THE LIFE SCIENCES: An Introduction

Abstract

This introductory white paper reviews the Apache Hadoop™ technology, its components MapReduce and the Hadoop Distributed File System (HDFS), and its adoption in the Life Sciences, with an example in Genomics data analysis.

December 2012

Copyright 2012 EMC Corporation. All Rights Reserved.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice. The information in this publication is provided "as is." EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

Part number H

Table of Contents

Audience
Executive Summary
Hadoop: an Introduction
Genomics example: CrossBow
Enterprise-Class Hadoop on EMC Isilon
Conclusion
References

Audience

This white paper introduces the new data processing and analysis paradigm, Hadoop™, within the context of its usage in the life sciences, specifically Genomics Sequencing. It is intended for audiences with a basic knowledge of storage and computing technology and a rudimentary understanding of DNA sequencing and the bioinformatics analysis associated with it.

Executive Summary

Life Sciences data will reach the ExaByte (10^18 bytes, EB) scale soon. This is Big Data. As a reference point, all words ever spoken by all human beings, when transcribed, amount to about 5 EB of data. In a recent article titled "Will Computers Crash Genomics?" [1], the analysis points to exponential growth of the total genomics sequencing market capacity, as outlined in Figure 1 below: 10 Tera base-pairs (Tbp; 1 Tbp = 10^12 bp) per day, with an astounding 5x year-on-year growth rate (500%). The human genome is approximately 3 billion base pairs long; a base pair (bp) comprises DNA molecules in G-C or A-T pairs.

Figure 1: Genomics Growth

Each base-pair represents a total of about 100 bytes (of raw, analyzed and interpreted data). Therefore the genomics market capacity in 2010 storage terms (from Fig. 1) was about 200 PetaBytes (PB), with the capacity growing to about 1 ExaByte (EB) by late [...]. This volume is overwhelming the technologies that attempt to handle the deluge of Big Data in the life sciences. Proteomics (the study of proteins) and imaging data are in the early stages of this exponential rise.

It is not just the data storage volume, but also its velocity and variability, that make this a challenge requiring scale-out technologies that grow simply and painlessly as the data center and business needs grow. Within the past year, one computing and storage framework has matured into a contender to handle this tsunami of Big Data: Hadoop.

Life Sciences workflows require a High Performance Computing (HPC) infrastructure to process and analyze the data and determine the variations in the genome, together with storage at the proper scale to retain this data. With Next Generation (genome) Sequencing (NGS) workflows generating up to 2 TeraBytes (TB) of data per run per week per sequencer, not including the raw images, scale-out storage that integrates easily with HPC is a line-item requirement.

EMC Isilon has provided the scale-out storage for nearly all the workflows for all the DNA sequencer instrument manufacturers in the market today, at more than 150 customers. Since 2008, the EMC Isilon OneFS storage platform has built a Life Sciences installed base of more than 65 PetaBytes (PB).
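
As a rough check of the storage arithmetic in the Executive Summary, the following Python sketch projects capacity from the quoted figures only (about 200 PB in 2010, 5x year-on-year growth, about 100 bytes per base pair, a 3 billion bp genome). It assumes decimal units (1 EB = 1000 PB) and strictly compounding annual growth; the constant and function names are illustrative and do not come from the cited article.

```python
# Figures quoted in the Executive Summary; decimal units are an assumption.
CAPACITY_2010_PB = 200            # approximate genomics capacity in 2010 (PB)
GROWTH_PER_YEAR = 5               # 5x year-on-year growth
PB_PER_EB = 1000                  # assumed: 1 EB = 1000 PB
BYTES_PER_BASE_PAIR = 100         # raw + analyzed + interpreted data per bp
HUMAN_GENOME_BP = 3_000_000_000   # approximately 3 billion base pairs


def projected_capacity_pb(years_after_2010: int) -> int:
    """Capacity in PB after compounding the 5x growth for the given years."""
    return CAPACITY_2010_PB * GROWTH_PER_YEAR ** years_after_2010


def years_until(target_pb: int) -> int:
    """Smallest whole number of years after 2010 to reach target_pb."""
    years = 0
    while projected_capacity_pb(years) < target_pb:
        years += 1
    return years


def genome_footprint_gb() -> int:
    """Data footprint of one human genome in GB (10^9 bytes)."""
    return HUMAN_GENOME_BP * BYTES_PER_BASE_PAIR // 10**9


if __name__ == "__main__":
    print(projected_capacity_pb(1))   # capacity in PB one year after 2010
    print(years_until(PB_PER_EB))     # years needed to reach 1 EB
    print(genome_footprint_gb())      # GB per genome at 100 bytes per bp
```

Under these assumptions, one year of 5x growth takes 200 PB to 1000 PB, that is, about 1 EB, and a single human genome at 100 bytes per base pair corresponds to roughly 300 GB of data.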

As genomics has very large, semi-structured, file-based data and is modeled on post-process streaming data access and I/O patterns that can be parallelized, it is ideally suited for Hadoop. Hadoop consists of two main components: a file system and a compute system, the Hadoop Distributed File System (HDFS) and the MapReduce framework respectively. The Hadoop ecosystem consists of many open source tools, as shown in Figure 2 below:

Figure 2: Hadoop Components

To make the Hadoop storage scale-out and truly distributed, the EMC Isilon OneFS file system features connectivity to the Hadoop Distributed File System (HDFS) just like any other shared file system protocol: NFS, CIFS or SMB [3]. This allows for the data co-location of the storage with its compute nodes using the standard higher-level Java application programming interface (API) to build MapReduce jobs.

Hadoop: an Introduction

Hadoop was created by Doug Cutting of the Apache Lucene project [4], initially as the Nutch Distributed File System (NDFS), which was inspired by Google's BigTable data infrastructure and the MapReduce [5] application layer. Hadoop is an Apache Foundation derivative comprising a MapReduce layer for data analysis and a Hadoop Distributed File System (HDFS) layer, written in the Java programming language, to distribute and scale the MapReduce data.

The Hadoop MapReduce framework runs on the compute cluster using the data stored on the HDFS. MapReduce 'jobs' aim to provide a key/value based processing ability in a highly parallelized fashion. Since the data is distributed over the cluster, a MapReduce job can be split up to run many parallel processes over the data stored on the cluster. The Map parts of MapReduce only run on the data they can see, that is, the data blocks on the particular machine each is running on. The Reduce brings together the output from the Maps.

The result is a system that provides a highly parallel batch processing capability. The system scales well, since you just need to add more hardware to increase its storage capacity or decrease the time a MapReduce job takes to run. The partitioning of the storage and compute framework into master and worker node types is outlined in Figure 3 below:

Figure 3: Hadoop Cluster

Hadoop is a Write Once Read Many (WORM) system with no random writes. This makes Hadoop faster than HPC and storage integrated separately. The life sciences have been at the forefront of the technology adoption curve: one of the earliest use cases of the Sun GridEngine [6] HPC was the DNA sequence comparison BLAST [16] search.

Standard Hadoop interfaces are available via Java, C, FUSE and WebDAV [7]. The R (statistical language) Hadoop interface, RHIPE [8], is also popular in the life sciences community.

The HDFS layer has a Name Node, the controller, which provides data locality, and uses the share-nothing architecture, a distributed scheme based on independent nodes [7]. From a platform perspective, the OneFS HDFS interface is compatible with Apache Hadoop, EMC GreenPlum [3] and Cloudera.

In a traditional Hadoop implementation, the HDFS Name Node is a single point of failure, since it is the sole keeper of all the metadata for all the data that lives in the filesystem; the OneFS HDFS interface resolves this by distributing the name node data [3]. HDFS creates a 3x replica for redundancy; OneFS drastically reduces the need for a 3x copy.

A good example of the MapReduce algorithm's key-value pair process, a count of specific words across documents [9], is shown in Figure 4 below:

Figure 4: Hadoop example, word count across documents

Hadoop is not suited for low-latency, in-process use cases like real-time, spectral or video analysis, or for large numbers of small files (<8KB). When small files have to be used, the Hadoop Archive (HAR) can be used to archive small files for processing.

Since Hadoop's early days, life sciences organizations have been among its earliest adopters. Following the publication of the first Apache Hadoop project [10] in January 2008, the first large-scale MapReduce project was initiated by the Broad Institute, resulting in the comprehensive Genome Analysis Tool Kit (GATK) [11]. The Hadoop CrossBow project [12] from Johns Hopkins University came soon after. Other projects are Cloud-based: they include CloudBurst, Contrail, Myrna and CloudBLAST [13]. An interesting implementation is the NERSC (Department of Energy) Flash-based Hadoop cluster within the Magellan Science Cloud [14].
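
The word-count flow of Figure 4 can be sketched in a few lines of Python. This is a minimal single-process illustration of the map, shuffle and reduce phases, not the Hadoop Java API; on a real cluster the map calls would run in parallel on the nodes that hold each data block. All function names here are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) key/value pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for a single word."""
    return key, sum(values)


def word_count(documents):
    """Run map over every document, then shuffle and reduce the results."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["the cat", "the dog the"]))
```

Each document is mapped independently, which is what allows the work to be spread across nodes; only the shuffle step needs to bring pairs with the same key together before the reduce.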

Genomics example: CrossBow

Figure 5: Crossbow example, SNP calls across DNA fragments

The Hadoop word count across documents example in Fig. 4 can be extended to DNA sequencing: count single base changes across millions of short DNA fragments and across hundreds of samples. A Single Nucleotide Polymorphism (SNP) occurs when one nucleotide (A, T, C or G) varies in the DNA sequence of members of the same biological species. Next Generation Sequencers (NGS) like the Illumina HiSeq can produce data on the order of 200 Giga base pairs in a single one-week run for a 60x human genome coverage; this means that each base was present on an average of 60 reads. The larger the coverage, the more statistically significant the result. This data requires specialized software algorithms called short read aligners.

CrossBow [12] is a combination of several algorithms that provide SNP calling and short read alignment, which are common tasks in NGS. Figure 5 explains the steps necessary to process genome data to look for SNPs. The Map-Sort-Reduce process is ideally suited for a Hadoop framework. The cluster shown in Figure 5 is a traditional N-node Hadoop cluster.

1. The Map step is the short read alignment algorithm, called BoWTie (Burrows Wheeler Transform, BWT). Multiple instances of BoWTie are run in parallel in Hadoop. The input tuples (a tuple is an ordered list of elements) are the sequence reads and the output tuples are the alignments of the short reads.
2. The Sort step apportions the alignments according to a primary key (the genome partition) and sorts based on a secondary key (which is the offset for that partition). The data here are the sorted alignments.
3. The Reduce step calls SNPs for each reference genome partition. Many parallel instances of the algorithm SOAPsnp (Short Oligonucleotide Analysis Package for SNP) run in the Hadoop cluster. Input tuples are sorted alignments for a partition and the output tuples are SNP calls.

Results are stored via HDFS, then archived in SOAPsnp format.

Enterprise-Class Hadoop on EMC Isilon

As demonstrated by the previous examples, the data and analysis scalability required for Genomics is ideally suited for Hadoop. EMC Isilon's OneFS distributes the Hadoop Name Node to provide high availability and load balancing, thereby eliminating the single point of failure. The Isilon NAS storage solution provides a highly efficient single file system/single volume, scalable up to 20 PB. Data can be staged from other protocols to HDFS using OneFS as a staging gateway. EMC Isilon provides Enterprise Grade data services to the Hadoop infrastructure via SnapshotIQ and SyncIQ for advanced backup and disaster recovery capabilities.

The equation for Hadoop scalability can be represented as:

Big(Data + Analytics) = Hadoop EMC:Isilon

These advantages are summarized in Fig. 6 below:

Figure 6: Hadoop advantages with EMC Isilon

When combined with the EMC GreenPlum Analytics appliance and solution [17], the Hadoop architecture becomes a complete Enterprise package.
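
The three CrossBow steps described in the genomics example (Map, Sort, Reduce) can be illustrated with a toy Python sketch. It is not CrossBow, Bowtie or SOAPsnp: the aligner is a naive scan that tolerates one mismatch, the SNP caller is a simple majority vote, alignments are emitted per base rather than per read, and the reference string, partition size and all function names are hypothetical values chosen for illustration.

```python
from collections import Counter, defaultdict

REFERENCE = "ACGTACGTTAGC"   # hypothetical toy reference genome
PARTITION_SIZE = 6           # hypothetical partition width, in bases


def align(read, max_mismatches=1):
    """Naive stand-in for a short read aligner: return the first reference
    offset where the read matches with at most max_mismatches, else None."""
    for start in range(len(REFERENCE) - len(read) + 1):
        window = REFERENCE[start:start + len(read)]
        if sum(a != b for a, b in zip(read, window)) <= max_mismatches:
            return start
    return None


def map_step(reads):
    """Map: align each read and emit (partition, offset, base) tuples."""
    for read in reads:
        start = align(read)
        if start is None:
            continue  # unaligned reads are dropped in this toy version
        for i, base in enumerate(read):
            position = start + i
            yield position // PARTITION_SIZE, position % PARTITION_SIZE, base


def sort_step(tuples):
    """Sort: bin by primary key (partition), order by secondary key (offset)."""
    partitions = defaultdict(list)
    for partition, offset, base in tuples:
        partitions[partition].append((offset, base))
    return {p: sorted(items) for p, items in sorted(partitions.items())}


def reduce_step(partition, alignments, min_depth=2):
    """Reduce: call a SNP where the majority base differs from the reference."""
    pileup = defaultdict(Counter)
    for offset, base in alignments:
        pileup[offset][base] += 1
    calls = []
    for offset in sorted(pileup):
        position = partition * PARTITION_SIZE + offset
        base, _ = pileup[offset].most_common(1)[0]
        depth = sum(pileup[offset].values())
        if depth >= min_depth and base != REFERENCE[position]:
            calls.append((position, REFERENCE[position], base))
    return calls


def call_snps(reads):
    """Run Map, Sort and Reduce in sequence and collect all SNP calls."""
    calls = []
    for partition, alignments in sort_step(map_step(reads)).items():
        calls.extend(reduce_step(partition, alignments))
    return calls


if __name__ == "__main__":
    # Two reads carry a C where the reference has T at position 7.
    print(call_snps(["TACGCT", "CGCTAG", "ACGTAC"]))
```

Because each partition is reduced independently, the reduce calls could run in parallel, one per genome partition, which is the property that makes this workflow a natural fit for a Hadoop cluster.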

Conclusion

What began as an internal project at Google in 2004 has now matured into a scalable framework for two computing paradigms that are particularly suited for the life sciences: parallelization and distribution. The post-processing streaming data patterns for text strings, clustering and sorting, the core process patterns in the life sciences, are ideal workflows for Hadoop. The CrossBow example discussed above aligned Illumina NGS reads for SNP calling over a 35x coverage of the human genome in under 3 hours using a 40-node Hadoop cluster, an order of magnitude better than traditional HPC technology for parallel processes.

Even though Hadoop implementations on Public Cloud instances are popular, several issues have resulted in most large institutions maintaining their own data repositories internally: large data transfers from on-premise storage to the Cloud; data regulations and security; data availability; data redundancy; and HPC throughput. This is especially true as genome sequencing moves into the Clinic for diagnostic testing. The convergence of these issues is evidenced by the mirroring of the Short Read sequence Archive (SRA) at the National Center for Biotechnology Information (NCBI) on the DNANexus SRA Cloud [15]; its business model is slowly evolving into a full data and analysis offsite model via Hadoop. The Hybrid Cloud model (a source data mirror between Private Cloud and Community Cloud) with Hadoop as a Service (HaaS) is the current state of the art.

Hadoop's advantages far outweigh its challenges; it is ready to become the life sciences analytics framework of the future. The EMC Isilon platform is bringing that future to you today.

References

1. Pennisi, E; Science 11 February 2011: Vol. 331 no pp
2. Editorial, Challenges and Opportunities, Science 11 February 2011: Vol. 331 no pp
3. Hadoop on EMC Isilon Scale Out NAS: EMC White Paper, Part Number h
4. Cafarella, M and Cutting D, Building Nutch, Open Source Search, ACM Queue vol. 2, no. 2, April
5. Dean J and Ghemawat S, "MapReduce: Simplified Data Processing on Large Clusters", OSDI conference proceedings,
6. Vasiliu B, Integrating BLAST with Sun GridEngine, July 2003, last visited Dec
7. White, Tom: Hadoop: The Definitive Guide, 2nd Edition, Published by O'Reilly, Oct
8. RHIPE: last visited Dec
9. MapReduce example: last visited Dec
10. Hadoop wins Terabyte sort benchmark, Apr 2008, Apr 2009, last accessed Dec
11. McKenna A, et al, "The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data", Genome Research, 20: , July
12. Langmead B, Schatz MC, et al, Human SNPs from short reads in hours using cloud computing, Poster Presentation, WABI Sep 2009, last accessed Dec
13. Taylor RC, "An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics", BMC Bioinformatics 2010, 11(Suppl 12):S1, last accessed Dec
14. Ramakrishnan L, Evaluating Cloud Computing for HPC Applications, DoE NeRSC, last accessed Dec
15. DNAnexus to mirror SRA database in Google Cloud, BioIT World, Page 41, IT_World/1111BITW_download.pdf, last visited Dec
16. Altschul SF, et al, "Basic local alignment search tool", J Mol Biol 215 (3): , October
17. Lockner J., "EMC's Enterprise Hadoop Solution: Isilon Scale-out NAS and GreenPlum HD", White Paper, The Enterprise Strategy Group, Inc (ESG), February


The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

DataSafe Solutions. Protect your valuable genomic data

DataSafe Solutions. Protect your valuable genomic data DataSafe Solutions Protect your valuable genomic data Central and secure storage of next-generation sequencing (NGS) data is critical to the success of your organization. The ability to store and protect

More information

Apache Hadoop FileSystem and its Usage in Facebook

Apache Hadoop FileSystem and its Usage in Facebook Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs

More information

What is Analytic Infrastructure and Why Should You Care?

What is Analytic Infrastructure and Why Should You Care? What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group grossman@uic.edu ABSTRACT We define analytic infrastructure to be the services,

More information

EMC BACKUP MEETS BIG DATA

EMC BACKUP MEETS BIG DATA EMC BACKUP MEETS BIG DATA Strategies To Protect Greenplum, Isilon And Teradata Systems 1 Agenda Big Data: Overview, Backup and Recovery EMC Big Data Backup Strategy EMC Backup and Recovery Solutions for

More information

EMC ISILON SCALE-OUT STORAGE PRODUCT FAMILY

EMC ISILON SCALE-OUT STORAGE PRODUCT FAMILY SCALE-OUT STORAGE PRODUCT FAMILY Storage made simple ESSENTIALS Simple storage designed for ease of use Massive scalability with easy, grow-as-you-go flexibility World s fastest-performing NAS Unmatched

More information

Distributed Computing and Hadoop in Statistics

Distributed Computing and Hadoop in Statistics Distributed Computing and Hadoop in Statistics Xiaoling Lu and Bing Zheng Center For Applied Statistics, Renmin University of China, Beijing, China Corresponding author: Xiaoling Lu, e-mail: xiaolinglu@ruc.edu.cn

More information

EMC ISILON SCALE-OUT NAS FOR IN-PLACE HADOOP DATA ANALYTICS

EMC ISILON SCALE-OUT NAS FOR IN-PLACE HADOOP DATA ANALYTICS White Paper EMC ISILON SCALE-OUT NAS FOR IN-PLACE HADOOP DATA ANALYTICS Abstract This white paper shows that storing data in EMC Isilon scale-out network-attached storage optimizes data management for

More information

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved. EMC Federation Big Data Solutions 1 Introduction to data analytics Federation offering 2 Traditional Analytics! Traditional type of data analysis, sometimes called Business Intelligence! Type of analytics

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

NextGen Infrastructure for Big DATA Analytics.

NextGen Infrastructure for Big DATA Analytics. NextGen Infrastructure for Big DATA Analytics. So What is Big Data? Data that exceeds the processing capacity of conven4onal database systems. The data is too big, moves too fast, or doesn t fit the structures

More information

Reduction of Data at Namenode in HDFS using harballing Technique

Reduction of Data at Namenode in HDFS using harballing Technique Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu vgkorat@gmail.com swamy.uncis@gmail.com Abstract HDFS stands for the Hadoop Distributed File System.

More information

EMC ISILON SCALE-OUT STORAGE PRODUCT FAMILY

EMC ISILON SCALE-OUT STORAGE PRODUCT FAMILY SCALE-OUT STORAGE PRODUCT FAMILY Unstructured data storage made simple ESSENTIALS Simple storage management designed for ease of use Massive scalability of capacity and performance Unmatched efficiency

More information

EMC ISILON AND ELEMENTAL SERVER

EMC ISILON AND ELEMENTAL SERVER Configuration Guide EMC ISILON AND ELEMENTAL SERVER Configuration Guide for EMC Isilon Scale-Out NAS and Elemental Server v1.9 EMC Solutions Group Abstract EMC Isilon and Elemental provide best-in-class,

More information

WHITE PAPER. www.fusionstorm.com. Get Ready for Big Data:

WHITE PAPER. www.fusionstorm.com. Get Ready for Big Data: WHitE PaPER: Easing the Way to the cloud: 1 WHITE PAPER Get Ready for Big Data: How Scale-Out NaS Delivers the Scalability, Performance, Resilience and manageability that Big Data Environments Demand 2

More information

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12

Hadoop. http://hadoop.apache.org/ Sunday, November 25, 12 Hadoop http://hadoop.apache.org/ What Is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using

More information

IBM Global Technology Services September 2007. NAS systems scale out to meet growing storage demand.

IBM Global Technology Services September 2007. NAS systems scale out to meet growing storage demand. IBM Global Technology Services September 2007 NAS systems scale out to meet Page 2 Contents 2 Introduction 2 Understanding the traditional NAS role 3 Gaining NAS benefits 4 NAS shortcomings in enterprise

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON BIG DATA MANAGEMENT AND ITS SECURITY PRUTHVIKA S. KADU 1, DR. H. R.

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

Red Hat Storage Server

Red Hat Storage Server Red Hat Storage Server Marcel Hergaarden Solution Architect, Red Hat marcel.hergaarden@redhat.com May 23, 2013 Unstoppable, OpenSource Software-based Storage Solution The Foundation for the Modern Hybrid

More information

Analyze Human Genome Using Big Data

Analyze Human Genome Using Big Data Analyze Human Genome Using Big Data Poonm Kumari 1, Shiv Kumar 2 1 Mewar University, Chittorgargh, Department of Computer Science of Engineering, NH-79, Gangrar-312901, India 2 Co-Guide, Mewar University,

More information

Microsoft Big Data Solutions. Anar Taghiyev P-TSP E-mail: b-anarta@microsoft.com;

Microsoft Big Data Solutions. Anar Taghiyev P-TSP E-mail: b-anarta@microsoft.com; Microsoft Big Data Solutions Anar Taghiyev P-TSP E-mail: b-anarta@microsoft.com; Why/What is Big Data and Why Microsoft? Options of storage and big data processing in Microsoft Azure. Real Impact of Big

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Enabling High performance Big Data platform with RDMA

Enabling High performance Big Data platform with RDMA Enabling High performance Big Data platform with RDMA Tong Liu HPC Advisory Council Oct 7 th, 2014 Shortcomings of Hadoop Administration tooling Performance Reliability SQL support Backup and recovery

More information

The Trials and Tribulations and ultimate success of parallelisation using Hadoop within the SCAPE project

The Trials and Tribulations and ultimate success of parallelisation using Hadoop within the SCAPE project The Trials and Tribulations and ultimate success of parallelisation using Hadoop within the SCAPE project Alastair Duncan STFC Pre Coffee talk STFC July 2014 SCAPE Scalable Preservation Environments The

More information

Big Application Execution on Cloud using Hadoop Distributed File System

Big Application Execution on Cloud using Hadoop Distributed File System Big Application Execution on Cloud using Hadoop Distributed File System Ashkan Vates*, Upendra, Muwafaq Rahi Ali RPIIT Campus, Bastara Karnal, Haryana, India ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations

More information

Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014

Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014 White Paper Big Data Executive Overview WP-BD-10312014-01 By Jafar Shunnar & Dan Raver Page 1 Last Updated 11-10-2014 Table of Contents Section 01 Big Data Facts Page 3-4 Section 02 What is Big Data? Page

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

Introduction to Hadoop

Introduction to Hadoop 1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

EMC SOLUTION FOR SPLUNK

EMC SOLUTION FOR SPLUNK EMC SOLUTION FOR SPLUNK Splunk validation using all-flash EMC XtremIO and EMC Isilon scale-out NAS ABSTRACT This white paper provides details on the validation of functionality and performance of Splunk

More information

Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

More information

MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

More information

High Performance Computing with Hadoop WV HPC Summer Institute 2014

High Performance Computing with Hadoop WV HPC Summer Institute 2014 High Performance Computing with Hadoop WV HPC Summer Institute 2014 E. James Harner Director of Data Science Department of Statistics West Virginia University June 18, 2014 Outline Introduction Hadoop

More information

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing Wayne W. Eckerson Director of Research, TechTarget Founder, BI Leadership Forum Business Analytics

More information

A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce

A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce Dimitrios Siafarikas Argyrios Samourkasidis Avi Arampatzis Department of Electrical and Computer Engineering Democritus University of Thrace

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction

More information