Putting Genomes in the Cloud with WOS TM. ddn.com. DDN Whitepaper. Making data sharing faster, easier and more scalable





Table of Contents

Cloud Computing
Build vs. Rent
Why WOS Fits the Cloud
Storing Sequences Ahead

By Mike May, PhD. Produced by Bio-IT World and the Cambridge Healthtech Media Custom Publishing Group

In 2003, the Human Genome Project unveiled the roughly 25,000 genes encoded in human DNA. Nonetheless, the three billion nucleotides (the building blocks of DNA) unscrambled in that project give only a glimpse into the growing complexity and utility of genome science. For decades, the U.S. National Institutes of Health, for example, has curated a sequence database called GenBank. In 1982, GenBank included 680,338 bases, or nucleotides, and that number rocketed to more than 106 billion bases by 2009.

New technology, however, already produces even higher rates of data collection. For example, the HiSeq 2000 from Illumina can sequence 200 gigabases (GB) in a run that lasts just eight days. Likewise, the GS FLX Titanium series from 454 Life Sciences, a Roche Company, sequences a billion bases in a day. So in a few months, a GS FLX could produce the bases collected in GenBank over decades. Given this rate of information growth, researchers in genomics, which can be used to advance biofuels, develop treatments for disease and more, require improved technologies to store and share information.

Cloud Computing

Today's life sciences companies and research institutions need high-performance computing and storage. In the November-December 2009 issue of Bio-IT World, which was a special report on cloud computing for life sciences, Guy Coates, group leader for informatics systems at the Wellcome Trust Sanger Institute, said, "We have these very spiky, very agile, very diverse workloads." In addition, this institute sequences about 500 GB a week. Issues such as these led Coates and his colleagues to consider cloud computing.

Moreover, in the June 2009 issue of PLoS Computational Biology, informatics experts Brent G. Richter and David P. Sexton gave an idea of how much computer storage a modern genomics institute needs.
In discussing data from Illumina's Solexa Genome Analyzer II (GAII), they write: "approximately 115,200 Tiff formatted files are produced per run, each at about 8 megabytes (MB) in size. This is approximately 1 terabyte (TB) of data..." If a research team keeps all of this raw data, wrote Richter and Sexton, "a mere 10-20 sequencing runs could overwhelm any storage and archiving system available to individual investigators."

Cloud computing can add storage as needed. Furthermore, a cloud system lets researchers share data worldwide. This is particularly useful for global pharmaceutical companies.

Beyond storage, cloud computing can also provide analysis, and groups are already building applications that live on the cloud. For instance, scientists at the University of Maryland created CloudBurst and Crossbow, which are cloud-based programs to map sequence data and resequence whole genomes, respectively. In addition, Cycle Computing's CycleCloud provides high-performance computing based on Amazon's Web Services, and this includes application sets that can be used in genomics.
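The data volumes quoted in this section can be checked with simple arithmetic. The sketch below uses only figures from the text (GenBank's 2009 size, the GS FLX daily output, and the Richter and Sexton file counts); the function names are illustrative, and the use of decimal units (1 TB = 1,000,000 MB) is an assumption.

```python
# Back-of-the-envelope checks using only figures quoted in the text.
# Assumption: decimal units, so 1 TB = 1,000,000 MB.

GENBANK_BASES_2009 = 106e9     # "more than 106 billion bases by 2009"
GS_FLX_BASES_PER_DAY = 1e9     # "a billion bases in a day"

TIFF_FILES_PER_RUN = 115_200   # files per GAII run (Richter and Sexton)
MB_PER_TIFF = 8                # "about 8 megabytes (MB)" per file
MB_PER_TB = 1_000_000          # assumption: decimal terabytes


def days_to_match_genbank() -> float:
    """Days for one GS FLX to sequence as many bases as GenBank held in 2009."""
    return GENBANK_BASES_2009 / GS_FLX_BASES_PER_DAY


def raw_tb_per_run() -> float:
    """Raw image data produced by one GAII run, in terabytes."""
    return TIFF_FILES_PER_RUN * MB_PER_TIFF / MB_PER_TB


if __name__ == "__main__":
    print(f"GS FLX days to match GenBank: {days_to_match_genbank():.0f}")
    print(f"Raw data per run: {raw_tb_per_run():.2f} TB")
    for runs in (10, 20):
        print(f"{runs} runs: {runs * raw_tb_per_run():.1f} TB")
```

Under these assumptions, one run comes to about 0.92 TB, which matches the "approximately 1 terabyte" in the quotation, and 106 days is consistent with "a few months."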

Some cloud options also provide a scalable amount of computing capacity. For instance, Amazon's Elastic Compute Cloud lets users select the CPU configuration.

Build vs. Rent

To move data to a cloud, genomics scientists face one crucial decision: build it (private cloud) or rent it (public cloud). To rent storage, a scientist can turn to many companies, including Amazon, which offers its Simple Storage Service (S3). This requires only a credit card and an Internet connection. For the first 50 terabytes of storage on S3, Amazon charges $0.15 per gigabyte per month. S3 users also pay for data transfers and operations such as a PUT or COPY on the data. This might work well for ordinary data and computer users, but it gets expensive for life science users who store large data sets.

Alternatively, a genome scientist can buy the storage and build it up as needed. Web Object Scaler (WOS) from DataDirect Networks (DDN), for example, lets users buy hardware that can be built as a private cloud storage system. In short, WOS is a Web services cloud storage architecture designed for scale-out, persistent data storage, enabling rapid data access and global data distribution. WOS systems come as small as 32 terabytes, but can be built into the petabyte range. This system also provides fast access to data with the ability to deliver millions of files per second.

As sequencing gets more economical, perhaps dropping as low as $100 per genome in the next decade, the cost of data storage plays a larger role in the overall economics of this research. In addition, the economics of how scalable infrastructure is managed will directly impact an organization's ability to achieve the economic objectives of genetic science and diagnostics. For a cloud-cost comparison generated by DDN, see the accompanying chart.

Why WOS Fits the Cloud

Most cloud storage systems require managing multiple file systems and the storage beneath them, such as RAID (redundant array of independent disks) groups and SANs (storage area networks).
Instead, WOS starts with a single namespace and sticks with that, no matter how large the cloud gets. For example, WOS units could be placed around the world to provide close access to specific users, but it would all still be managed from one location.

While a user manages a WOS-based genome cloud, policies can be created to put the data in the best spot. For example, it might make sense to create more than one copy of one file and place them on WOS devices located near different groups of users to reduce the latency of file delivery.

A WOS cloud also includes distribution that keeps files safe and always available. While any cloud storage system can recover from a drive failure, WOS, unlike others, goes beyond RAID6 and can rebuild the drive's data in just minutes.

Simplicity also makes WOS a good technology to use for cloud storage. For one thing, DDN has minimized the configuration options and complexity, with just four scale-out storage building block options. A customer can select from two versions of one-node devices (the WOS 1600 or the WOS 1600-HP) or two versions of two-node devices (the WOS 6000 or the WOS 6000HP). These units range in storage capacity from 32 to 120 terabytes. A user can add nodes to increase a cloud's capacity.
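The replica-placement idea described in this section (keeping copies of a file near different groups of users) can be illustrated with a small sketch. This is not the WOS policy engine or its API, which the whitepaper does not describe; the site names, group names, latency figures and function names below are all hypothetical, and "nearest" is assumed to mean lowest measured latency.

```python
# Illustrative only: this is NOT the WOS policy engine or its API.
# Site names, group names and latencies are made up for the example.

from typing import Dict, Set

# latency_ms[group][site] = assumed latency from a user group to a storage site
LatencyTable = Dict[str, Dict[str, float]]


def nearest_site(latencies: Dict[str, float]) -> str:
    """Return the site with the lowest latency for one user group."""
    return min(latencies, key=latencies.get)


def place_copies(latency_ms: LatencyTable) -> Set[str]:
    """Choose the sites that should hold a copy: the nearest site for each group."""
    return {nearest_site(sites) for sites in latency_ms.values()}


if __name__ == "__main__":
    latency_ms = {
        "group_a": {"site_us": 12.0, "site_eu": 95.0, "site_asia": 210.0},
        "group_b": {"site_us": 90.0, "site_eu": 9.0, "site_asia": 180.0},
        "group_c": {"site_us": 8.0, "site_eu": 85.0, "site_asia": 220.0},
    }
    print(sorted(place_copies(latency_ms)))
```

In this toy setup two of the three sites receive a copy, since two groups share the same nearest site. A real policy would also weigh data protection requirements, which this sketch ignores.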

[Chart: Annual & 3-Year Cost Comparison - WOS vs. S3. Vertical axis: cost, $0 to $3,500,000. Horizontal axis: Year 1, Year 2, Year 3, Total 3yr Investment. Series: S3, WOS.]

This shows an initial storage of 100 terabytes growing to 1 petabyte over a period of three years. It assumes a moderate amount of reads from the existing data. The WOS pricing is fully burdened, including data center costs, connectivity and labor. Over only the three-year period, WOS will save more than $1.5 million compared to S3.

To make two nodes, say a site in your company and one in a companion company, a user starts by setting up IP addresses for the nodes and naming them. Then, says Chris Williams, DDN's WOS product manager, "You set the policies for data protection and data replication which defines how and where the data is to be stored, and you are ready to go."

Storing Sequences Ahead

DDN already helped one customer build a cloud storage system specifically for genome research. Although the customer's name cannot be released, Williams provides hypothetical background on such a scenario. "If you have 20 companies buying equipment to sequence genomes and analyze them," he says, "they might also want to share the resulting data." He adds, "It's to everybody's advantage." Imagine that someone has a DNA sample from a study of an unusual cancer; data from that person might help someone else learn something about fighting that cancer. The WOS system is also local for the users, so they can complete the research faster because they do not experience the I/O penalties of a purely Internet cloud like S3.

In the next few years, sequencers will keep generating more data and generating it faster. To analyze and store that data, academic researchers and industrial groups interested in genomics will turn increasingly to cloud options. As they do so, they must compare the costs of a public versus a private cloud. In addition, the final choice must depend on economics and performance.
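For readers who want to run their own public-cloud estimate, the sketch below computes a storage-only charge for the growth scenario in the chart caption (100 terabytes growing to 1 petabyte over three years) at the S3 price quoted earlier. It is not DDN's cost model. Several simplifications are assumptions: capacity grows linearly month by month, the $0.15 per gigabyte per month rate is applied to every gigabyte even though the text quotes it only for the first 50 terabytes, 1 TB is taken as 1,000 GB, and transfer and request fees are ignored. The function names are illustrative.

```python
# Storage-only estimate for the rent option, using the S3 price quoted in the text.
# Assumptions (not from the whitepaper): linear monthly growth, a flat
# $0.15/GB/month for all capacity, 1 TB = 1,000 GB, no transfer or request fees.

from typing import List

PRICE_PER_GB_MONTH = 0.15
GB_PER_TB = 1_000


def monthly_capacity_tb(start_tb: float, end_tb: float, months: int) -> List[float]:
    """Capacity held in each month, rising linearly from start_tb to end_tb."""
    if months < 2:
        raise ValueError("months must be at least 2")
    step = (end_tb - start_tb) / (months - 1)
    return [start_tb + step * m for m in range(months)]


def storage_cost(start_tb: float, end_tb: float, months: int) -> float:
    """Total storage charge in dollars over the whole period."""
    capacities = monthly_capacity_tb(start_tb, end_tb, months)
    return sum(tb * GB_PER_TB * PRICE_PER_GB_MONTH for tb in capacities)


if __name__ == "__main__":
    total = storage_cost(start_tb=100, end_tb=1_000, months=36)
    print(f"Estimated 3-year S3 storage charge: ${total:,.0f}")
```

Under these assumptions the total is about $2.97 million (an average of 550 TB held for 36 months). A fair comparison with a private cloud would also need the private side's hardware, data center, connectivity and labor costs, which are not given in the text.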

About Us

DataDirect Networks (DDN) is the world's largest privately held information storage company. We are the leading provider of data storage and processing solutions and services that enable content-rich and high growth IT environments to achieve the highest levels of systems scalability, efficiency and simplicity. DDN enables enterprises to extract value and deliver results from their information. Our customers include the world's leading online content and social networking providers, high performance cloud and grid computing, life sciences, media production organizations and security & intelligence organizations. Deployed in thousands of mission critical environments worldwide, DDN's solutions have been designed, engineered and proven in the world's most scalable data centers, to ensure competitive business advantage for today's information powered enterprise.

For more information, go to www. or call +1-800-TERABYTE.

Version 10/11