Hadoop's Rise in Life Sciences



Exploring EMC Isilon scale-out storage solutions

By John Russell, Contributing Editor, Bio-IT World. Produced by Cambridge Healthtech Media Group.

By now the Big Data challenge is familiar to the entire life sciences community. Modern high-throughput experimental technologies generate vast data sets that can only be tackled with high-performance computing (HPC). Genomics, of course, is the leading example. At the end of 2011, global annual sequencing capacity was estimated at 13 quadrillion bases and growing rapidly [1]. It's worth noting that a single base pair typically represents about 100 bytes of data (raw, analyzed, and interpreted).

The need to manage and analyze these massive data sets, not just in life sciences but throughout all of science and industry, has spurred many new approaches to HPC infrastructure and led to important IT advances, particularly in distributed computing. While there isn't a single right answer, one approach, the Hadoop storage and compute framework, is emerging as a compelling contender for coping with the data deluge in the life sciences.

Created in 2004 by Doug Cutting (who famously named it after his son's stuffed elephant) and elevated to a top-level Apache Foundation project in 2008, Hadoop is intended to run large-scale distributed data analysis on commodity clusters. Cutting was initially inspired by a paper [2] from Google Labs describing Google's BigTable infrastructure and MapReduce application layer. (For a detailed perspective, see Ronald Taylor's "An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics" [3].)

Broadly, Hadoop uses a file system (the Hadoop Distributed File System, HDFS) and framework software (MapReduce) to break extremely large data sets into chunks, to distribute/store (Map) those chunks to the nodes of a cluster, and to gather (Reduce) results following computation (a minimal code example follows below). Hadoop's distinguishing feature is that it automatically stores the chunks of data on the same nodes on which they will be processed. This strategy of co-locating data and processing power (proximity computing) significantly accelerates performance: in April 2008 a Hadoop program, running on a 910-node cluster, broke a world record by sorting a terabyte of data in less than 3.5 minutes [4].

[1] "DNA Sequencing Caught in Deluge of Data," New York Times, Nov. 30, 2011, http://www.nytimes.com/2011/12/01/business/dna-sequencing-caught-in-deluge-of-data.html?_r=1&ref=science
[2] OSDI '04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004, http://research.google.com/archive/mapreduce.html
[3] "An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics," http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3040523/
[4] "Hadoop wins Terabyte sort benchmark," Apr. 2009, http://sortbenchmark.org/yahoohadoop.pdf, last accessed Dec. 2011
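To make the Map and Reduce steps concrete, below is a minimal word-count job written against the standard Hadoop Java MapReduce API (assuming a Hadoop 2.x or later client). It is an illustrative sketch, not code from this paper; the input and output paths are supplied on the command line and are hypothetical.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in this task's input split.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          ctx.write(word, ONE);
        }
      }
    }
  }

  // Reduce: sum the counts gathered for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);  // local pre-aggregation
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The map phase emits a (token, 1) pair per word, Hadoop sorts and groups the pairs by key across the cluster, and the reduce phase sums each group: the same chunk, distribute/store and gather pattern described above.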

Part of the improved performance stems from MapReduce's key:value programming model, which speeds up and scales out parallelized job execution better than many alternatives, such as the GridEngine architecture for high-performance computing (HPC). (One of the earliest use cases of the Sun GridEngine HPC stack was the BLAST search for DNA sequence comparison [5].) The MapReduce layer is a batch query processor with a dynamic data schema and linear scaling for unstructured or semi-structured data. Its data is not normalized (normalization being the decomposition of data into smaller, structured relationships), so higher-level interpreted languages such as Ruby and Python, as well as a compiled language like C++, provide convenient ways to express programs as MapReduce jobs. Standard Hadoop interfaces are available via Java, C, FUSE and WebDAV (a small Java example appears below). RHIPE, the Hadoop interface to the R statistical language, is also popular in the life sciences community.

It turns out that Hadoop, a fault-tolerant, shared-nothing architecture in which tasks must have no dependence on each other, is an excellent choice for many life sciences applications. This is largely because so much life sciences data is semi-structured or unstructured, file-based, and ideally suited for embarrassingly parallel computation. Moreover, the use of commodity hardware (e.g., a Linux cluster) keeps costs down, and little or no hardware modification is required [6].

Not surprisingly, life sciences organizations were among Hadoop's earliest adopters. The first large-scale MapReduce project was initiated by the Broad Institute (in 2008) and resulted in the comprehensive Genome Analysis Toolkit (GATK) [7]. The Hadoop Crossbow project from Johns Hopkins University came soon after [8].

[5] Altschul SF, et al., "Basic local alignment search tool," J Mol Biol 215(3): 403-410, October 1990.
[6] "An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics," http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3040523/
[7] McKenna A, et al., "The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data," Genome Research 20:1297-1303, July 2010.
[8] http://bowtie-bio.sourceforge.net/crossbow/index.shtml
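As a small, hedged illustration of the standard Java interface mentioned above, the sketch below lists a directory through Hadoop's FileSystem API. The /seq/runs directory is a hypothetical example; the client picks up its HDFS endpoint (fs.defaultFS) from the core-site.xml on its classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsList {
  public static void main(String[] args) throws Exception {
    // Connects to whatever HDFS endpoint fs.defaultFS names.
    FileSystem fs = FileSystem.get(new Configuration());
    // List a (hypothetical) directory of sequencing runs.
    for (FileStatus status : fs.listStatus(new Path("/seq/runs"))) {
      System.out.printf("%s\t%d bytes%n", status.getPath(), status.getLen());
    }
  }
}
```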

Here are a few current Hadoop-based bioinformatics applications [9]:

Crossbow: whole-genome resequencing analysis; SNP genotyping from short reads.
Contrail: de novo assembly from short sequencing reads.
Myrna: ultrafast short-read alignment and differential gene expression analysis from large RNA-seq data sets.
PeakRanger: cloud-enabled peak caller for ChIP-seq data.
Quake: quality-aware detection and correction of sequencing errors.
BlastReduce: high-performance short-read mapping.
CloudBLAST: Hadoop implementation of NCBI's BLAST.
MrsRF: algorithm for analyzing large evolutionary trees.

(For a more detailed example of Hadoop in operation, see the sidebar "Genomics Example: Calling SNPs with Crossbow.")

[9] "Got Hadoop?," Genome Technology, Sept. 2011, http://www.genomeweb.com/informatics/got-hadoop

Genomics Example: Calling SNPs with Crossbow

Next-generation sequencers (NGS) like the Illumina HiSeq can produce on the order of 200 billion base pairs (200 Gbp) in a single one-week run at 60x human genome coverage, meaning each base is present in an average of 60 reads. The larger the coverage, the more statistically significant the result. Because sequence reads are much shorter than in traditional Sanger sequencing, the data requires specialized software algorithms called short-read aligners. Crossbow is a combination of several algorithms that provide SNP calling and short-read alignment, two common NGS tasks. Figure 1 (alongside in the original layout) shows the steps necessary to process genome data to look for SNPs.

The Map-Sort-Reduce process is ideally suited to a Hadoop framework. The cluster shown is a traditional N-node Hadoop cluster, with all the usual Hadoop features (HDFS, program management, fault tolerance) available.

The Map step is the short-read alignment algorithm, Bowtie (named for the Burrows-Wheeler Transform, BWT). Multiple instances of Bowtie run in parallel in Hadoop. The input tuples (ordered lists of elements) are the sequence reads, and the output tuples are the alignments of the short reads.

The Sort step apportions the alignments according to a primary key (the genome partition) and sorts them on a secondary key (the offset within that partition). The data here are the sorted alignments.

The Reduce step calls SNPs for each reference genome partition. Many parallel instances of the algorithm SOAPsnp (Short Oligonucleotide Analysis Package for SNP) run in the cluster. Input tuples are the sorted alignments for a partition, and output tuples are SNP calls. Results are stored via HDFS, then archived in SOAPsnp format. A minimal code sketch of this partition-and-sort pattern follows.
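The sidebar's Map-Sort-Reduce pattern can be sketched with Hadoop's composite-key idiom: key each alignment by (genome partition, offset), partition on the genome partition alone, and let Hadoop's shuffle sort deliver each region's alignments in offset order to its reducer. The skeleton below illustrates that idiom only; it is not Crossbow's actual source code, and the tab-separated input format, class names and SNP-calling placeholder are all assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Map-Sort-Reduce skeleton in the Crossbow style. Hypothetical input format:
// one alignment per line, "partition<TAB>offset<TAB>alignmentRecord".
public class PartitionSortSketch {

  // Map: parse an alignment and emit a composite key "partition:offset".
  public static class AlignmentMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    private final Text composite = new Text();
    private final Text alignment = new Text();

    @Override
    protected void map(LongWritable pos, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t", 3);
      if (f.length < 3) return;                 // skip malformed records
      // Zero-pad the offset so lexicographic order equals numeric order.
      composite.set(String.format("%s:%012d", f[0], Long.parseLong(f[1])));
      alignment.set(f[2]);
      ctx.write(composite, alignment);
    }
  }

  // Route on the genome partition only, so one reducer owns a whole region.
  public static class GenomePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReducers) {
      String genomePartition = key.toString().split(":", 2)[0];
      return (genomePartition.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
  }

  // Keys arrive sorted, so alignments stream in offset order; a real
  // pipeline would hand them to a SNP caller (e.g. SOAPsnp) here.
  public static class SnpReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      for (Text v : values) ctx.write(key, v);  // placeholder for SNP calling
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "partition-sort-sketch");
    job.setJarByClass(PartitionSortSketch.class);
    job.setMapperClass(AlignmentMapper.class);
    job.setPartitionerClass(GenomePartitioner.class);
    job.setReducerClass(SnpReducer.class);
    job.setNumReduceTasks(4);                   // arbitrary example value
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```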

After several years of steady development in academic environments, Hadoop is now poised for rapid commercialization and broader uptake in biopharma and healthcare. Early adoption has been strongest among next-generation sequencing (NGS) centers, where NGS workflows can generate 2 terabytes (TB) of data per run, per week, per sequencer, not including the raw images. For these organizations, scale-out storage that integrates with HPC is a line-item requirement.

EMC Isilon, long a leader in scale-out NAS storage solutions, understands these challenges and has provided the scale-out storage behind nearly all of the workflows of the DNA sequencer instrument manufacturers in the market today, at more than 150 customers. Since 2008, the EMC Isilon OneFS storage platform has grown to an overall installed base of more than 65 petabytes (PB). Recently, EMC introduced the industry's first scale-out NAS system with native Hadoop support (via HDFS).

The EMC Isilon OneFS file system now provides connectivity to the Hadoop Distributed File System (HDFS) just like any other shared file system protocol: NFS, CIFS or SMB [10]. This allows data on the storage cluster to be co-located with its compute nodes, using the standard higher-level Java application programming interface (API) to build MapReduce jobs (a minimal configuration sketch appears below). EMC has gone one step further by combining its OneFS-based NAS solution with EMC Greenplum HD, a powerful analytics platform, to create a Hadoop appliance. Together, the two offerings relieve users of the burden of cobbling together various open source Hadoop components, which sometimes proves problematic.

"Hadoop meets all the tenets of Jim Gray's Laws of Data Engineering [11], which have not changed in 15 years," says Sanjay Joshi, CTO, Life Sciences, EMC Isilon Storage Division. Those tenets: scientific computing is very data intensive, with no real limits; the solution is a scale-out architecture with distributed data access; and bring computation to the data, rather than data to the computation.

[10] "Hadoop on EMC Isilon Scale-Out NAS," EMC White Paper, Part Number h10528
[11] Jim Gray, "Scalable Computing," presentation at Nortel: Microsoft Research, April 1999
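In practice, pointing a Hadoop client at an HDFS-speaking storage cluster is a configuration change: the fs.defaultFS property names the cluster endpoint. The sketch below is a hedged illustration, not EMC's documented procedure; the SmartConnect-style zone hostname is hypothetical, and in a real deployment the property would normally be set once in core-site.xml rather than in code.

```java
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsEndpointCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical hostname for the storage cluster's HDFS endpoint.
    conf.set("fs.defaultFS", "hdfs://isilon-zone.example.com:8020");
    FileSystem fs = FileSystem.get(conf);
    // Round-trip a small file to confirm the endpoint answers.
    Path probe = new Path("/tmp/hdfs-probe.txt");
    try (OutputStream out = fs.create(probe)) {
      out.write("hello from an HDFS client\n".getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("probe file exists: " + fs.exists(probe));
  }
}
```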

"Isilon built the industry's first scale-out storage architecture. Now, with its native and enterprise-ready HDFS protocol via OneFS and Greenplum HD, EMC brings simplicity to Big Data in science," says Joshi.

EMC Isilon OneFS combines the three layers of traditional storage architectures (file system, volume manager and RAID) into one unified software layer, creating a single intelligent distributed file system that runs on one storage cluster. Important advantages of OneFS for Hadoop are:

Scalable: Capacity scales linearly, from 18 TB to 16 PB in a single file system and a single global namespace. Scale out as needs grow, independent of the compute layer.

Predictable: Dynamic content balancing runs as nodes are added or upgraded or capacity changes; the process is simple and adds no management time.

Available: OneFS protects data against power loss, node or disk failures, loss of quorum and storage rebuilds by distributing data, metadata and parity across all nodes, making OneFS self-healing. It also eliminates the single point of failure represented by the Hadoop NameNode.

Efficient: Compared with the roughly 50% efficiency of traditional RAID systems, OneFS delivers over 80% storage efficiency, independent of CPU compute or cache. The efficiency is achieved by tiering nodes into three types (shown in the figure alongside in the original layout) and by the pools within those node types. It also extends to replication: the 3x copies Hadoop normally requires are reduced to better-than-80%-efficient 1x storage via EMC Isilon's HDFS protocol.

Enterprise-ready: Storage clusters are administered through an intuitive Web-based UI. Connectivity to your process is through standard file protocols: CIFS, SMB, NFS, FTP/HTTP, iSCSI and HDFS. Standardized authentication and access control (AD, LDAP and NIS) are available at scale. Performance-based storage tiers ("tiers without fears") reside in one global namespace, connected via a dedicated back-end network.

CONCLUSION

What began as an internal project at Google in 2004 has matured into a scalable framework for two computing paradigms that are particularly suited to the life sciences: parallelization and distribution. Indeed, the post-processing of streaming data for text-string matching, clustering and sorting, the core process patterns in the life sciences, makes an ideal Hadoop workflow. Case in point: the Crossbow example cited earlier aligned Illumina NGS reads for SNP calling over 35x coverage of the human genome in under 3 hours using a 40-node Hadoop cluster, an order of magnitude better than traditional HPC technology for parallel processes.

The EMC Isilon OneFS distributed file system handles the Hadoop Distributed File System (HDFS) just like any other shared file system, and it shields Hadoop's single point of failure: the NameNode. The hybrid cloud model (a source-data mirror) with Hadoop as a Service (HaaS) is the current state of the art.

For more information, visit EMC Isilon at http://www.emc.com/isilon.

Summary of Hadoop Attributes

Overview:
Write Once, Read Many times (WORM).
Co-locates data with compute; uses a higher-level architecture with a Java API.
HDFS is a distributed file system that runs on large clusters.

Advantages:
Uses the MapReduce framework, a batch query processor that scales linearly.
EMC Isilon OneFS implements HDFS and eliminates the single point of failure, the NameNode.
Standard programming-language development: Java, Ruby, Python and C++ can all create MapReduce jobs; FUSE and WebDAV interfaces provide architectural flexibility.

Challenges:
The HDFS block size is 128 MB (and can be increased), so large numbers of small files (<8 KB) reduce performance; use Hadoop Archives (HAR) instead, as sketched after this list.
Data coherency and latency remain issues for large-scale implementations.
Not suited to low-latency, in-process use cases such as real-time, spectral or video analysis.
Data transfer from genome-sequencing sources to Hadoop clusters in the cloud remains an issue; the current business model mirrors the data between the source and the cloud, then applies a Hadoop-as-a-Service model to the mirrored data.
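For the small-files challenge noted above, the stock workaround is the hadoop archive tool, which packs many small files into one HAR file that MapReduce jobs can read through the har:// scheme. A brief sketch follows; the paths are hypothetical examples.

```sh
# Pack a directory of many small files into a single Hadoop Archive (HAR).
hadoop archive -archiveName reads.har -p /data small_reads /data/archives

# Jobs and fs commands can then read through the har:// scheme.
hadoop fs -ls har:///data/archives/reads.har
```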