DDN Whitepaper

Putting Genomes in the Cloud with WOS™: Making Data Sharing Faster, Easier and More Scalable
Table of Contents

Cloud Computing
Build vs. Rent
Why WOS Fits the Cloud
Storing Sequences Ahead
By Mike May, PhD. Produced by Bio-IT World and the Cambridge Healthtech Media Custom Publishing Group

In 2003, the Human Genome Project unveiled the roughly 25,000 genes that make up human DNA. Nonetheless, the three billion nucleotides, the building blocks of DNA, unscrambled in that project give only a glimpse into the growing complexity and utility of genome science. For decades, the U.S. National Institutes of Health, for example, has curated a sequence database called GenBank. In 1982, GenBank included 680,338 bases, or nucleotides, and that number rocketed to more than 106 billion bases by 2009. New technology, however, already produces even higher rates of data collection. For example, the HiSeq 2000 from Illumina can sequence 200 gigabases (Gb) in a run that lasts just eight days. Likewise, the GS FLX Titanium series from 454 Life Sciences, a Roche company, sequences a billion bases in a day. So in a few months, a GS FLX could produce the number of bases collected in GenBank over decades. Given this rate of information growth, researchers in genomics, a field that can be used to advance biofuels, develop treatments for disease and more, require improved technologies to store and share information.

Cloud Computing

Today's life sciences companies and research institutions need high-performance computing and storage. In the November-December 2009 issue of Bio-IT World, which featured a special report on cloud computing for life sciences, Guy Coates, group leader for informatics systems at the Wellcome Trust Sanger Institute, said, "We have these very spiky, very agile, very diverse workloads." In addition, his institute sequences about 500 gigabases a week. Issues such as these led Coates and his colleagues to consider cloud computing. Moreover, in the June 2009 issue of PLoS Computational Biology, informatics experts Brent G. Richter and David P. Sexton gave an idea of how much computer storage a modern genomics institute needs.
In discussing data from Illumina's Solexa Genome Analyzer II (GAII), they write: "approximately 115,200 Tiff formatted files are produced per run, each at about 8 megabytes (MB) in size. This is approximately 1 terabyte (TB) of data..." If a research team keeps all of this raw data, wrote Richter and Sexton, a mere 10-20 sequencing runs could overwhelm any storage and archiving system available to individual investigators.

Cloud computing can add storage as needed. Furthermore, a cloud system lets researchers share data worldwide, which is particularly useful for global pharmaceutical companies. Beyond storage, cloud computing can also provide analysis, and groups are already building applications that live on the cloud. For instance, scientists at the University of Maryland created CloudBurst and Crossbow, cloud-based programs that map sequence data and resequence whole genomes, respectively. In addition, Cycle Computing's CycleCloud provides high-performance computing based on Amazon Web Services, and this includes application sets that can be used in genomics.
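As a quick sanity check, Richter and Sexton's per-run figures can be multiplied out directly; this sketch uses only the file count and file size they quote:

```python
# Rough data volume per Illumina GAII run, per Richter and Sexton's figures.
FILES_PER_RUN = 115_200   # TIFF image files produced per run
MB_PER_FILE = 8           # approximate size of each file, in megabytes

tb_per_run = FILES_PER_RUN * MB_PER_FILE / 1_000_000  # MB -> TB (decimal units)
print(f"~{tb_per_run:.2f} TB per run")                # ~0.92 TB, i.e. roughly 1 TB
print(f"~{tb_per_run * 20:.1f} TB after 20 runs")     # ~18.4 TB of raw images
```

At roughly a terabyte per run, it is easy to see how 10-20 retained runs would overwhelm the storage available to an individual investigator.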
Some cloud options also provide a scalable amount of computing capacity. For instance, Amazon's Elastic Compute Cloud lets users select the CPU configuration.

Build vs. Rent

To move data to a cloud, genomics scientists face one crucial decision: build it (private cloud) or rent it (public cloud). To rent storage, a scientist can turn to many companies, including Amazon, which offers its Simple Storage Service (S3). This requires only a credit card and an Internet connection. For the first 50 terabytes of storage on S3, Amazon charges $0.15 per gigabyte per month. S3 users also pay for data transfers and for operations such as a PUT or COPY on the data. This might work well for ordinary data and computer users, but it gets expensive for life science users who store large data sets.

Alternatively, a genome scientist can buy the storage and build it up as needed. Web Object Scaler (WOS) from DataDirect Networks (DDN), for example, lets users buy hardware that can be built into a private cloud storage system. In short, WOS is a Web services cloud storage architecture designed for scale-out, persistent data storage, enabling rapid data access and global data distribution. The WOS systems come as small as 32 terabytes but can be built into the petabyte range. The system also provides fast access to data, with the ability to deliver millions of files per second.

As sequencing gets more economical, perhaps dropping as low as $100 per genome in the next decade, the cost of data storage plays a larger role in the overall economics of this research. In addition, the economics of how scalable infrastructure is managed will directly impact an organization's ability to achieve the economic objectives of genetic science and diagnostics. For a cloud-cost comparison generated by DDN, see the accompanying chart.

Why WOS Fits the Cloud

Most cloud storage systems require managing multiple file systems, such as RAIDs (redundant arrays of independent disks) and SANs (storage area networks).
Instead, WOS starts with a single namespace and sticks with it, no matter how large the cloud gets. For example, WOS units could be placed around the world to provide close access to specific users, yet all of them would still be managed from one location. While managing a WOS-based genome cloud, a user can create policies to put the data in the best spot. For example, it might make sense to create more than one copy of a file and place the copies on WOS devices located near different groups of users to reduce the latency of file delivery.

A WOS cloud also includes distribution that keeps files safe and always available. While any cloud storage system can recover from a drive failure, WOS, unlike others, goes beyond RAID6 and can rebuild the drive's data in just minutes.

Simplicity also makes WOS a good technology for cloud storage. For one thing, DDN has minimized configuration options and complexity, offering just four scale-out storage building blocks. A customer can select from two versions of one-node devices, the WOS 1600 or the WOS 1600-HP, or two versions of two-node devices, the WOS 6000 or the WOS 6000-HP. These units range in storage capacity from 32 to 120 terabytes. A user can add nodes to increase a cloud's capacity.
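The replica-placement idea described above, keeping copies of a file on the devices closest to the groups that read it, can be sketched in a few lines. The site names and latency figures here are purely hypothetical illustrations, not part of any WOS interface:

```python
# Hypothetical sketch of a replica-placement policy: given copies of a file
# at several sites, serve each user group from the lowest-latency replica.
# Site names and round-trip latencies below are made up for illustration.
ZONE_LATENCY_MS = {
    ("boston", "boston"): 2,
    ("boston", "london"): 80,
    ("london", "boston"): 80,
    ("london", "london"): 3,
}

def nearest_zone(user_site, replica_zones):
    """Pick the replica zone with the lowest latency for a given user site."""
    return min(replica_zones, key=lambda z: ZONE_LATENCY_MS[(user_site, z)])

replicas = ["boston", "london"]          # two copies of the file, per the policy
print(nearest_zone("london", replicas))  # london: the local copy wins
```

The point of the policy is exactly what the sketch shows: each group of users reads from a nearby copy, so latency stays low even though the namespace is managed centrally.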
[Chart: Annual & 3-Year Cost Comparison, WOS vs. S3, plotting costs from $0 to $3,500,000 for Year 1, Year 2, Year 3 and the total three-year investment.] The chart shows an initial store of 100 terabytes growing to 1 petabyte over a period of three years, and it assumes a moderate amount of reads from the existing data. The WOS pricing is fully burdened, including data center costs, connectivity and labor. Over just the three-year period, WOS saves more than $1.5 million compared to S3.

To link two nodes, say one at a site in your company and one at a companion company, a user starts by setting up IP addresses for the nodes and naming them. Then, says Chris Williams, DDN's WOS product manager, "You set the policies for data protection and data replication, which define how and where the data is to be stored, and you are ready to go."

Storing Sequences Ahead

DDN has already helped one customer build a cloud storage system specifically for genome research. Although the customer's name cannot be released, Williams provides a hypothetical background for such a scenario. "If you have 20 companies buying equipment to sequence genomes and analyze them," he says, "they might also want to share the resulting data." He adds, "It's to everybody's advantage." Imagine that someone has a DNA sample from a study of an unusual cancer; data from that person might help someone else learn something about fighting that cancer. The WOS system is also local for the users, so they can complete the research faster because they do not experience the I/O penalties of a purely Internet cloud like S3.

In the next few years, sequencers will keep generating more data, and generating it faster. To analyze and store that data, academic researchers and industrial groups interested in genomics will turn increasingly to cloud options. As they do so, they must compare the costs of a public versus a private cloud, and the final choice must depend on both economics and performance.
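The S3 side of the cost comparison can be roughly reproduced from the numbers given earlier. This sketch assumes linear capacity growth from 100 TB to 1 PB over 36 months and treats Amazon's published first-tier price of $0.15 per gigabyte per month as a flat rate; transfer and request fees, and the WOS side of the comparison, are left out because their inputs are not given:

```python
# Estimate the S3 storage bill for capacity growing linearly from 100 TB to
# 1,000 TB (1 PB) over 36 months, at a flat $0.15 per GB-month (the first-tier
# rate). Transfer charges and PUT/COPY request charges are excluded.
PRICE_PER_GB_MONTH = 0.15
START_TB, END_TB, MONTHS = 100, 1000, 36

total = 0.0
for month in range(MONTHS):
    tb = START_TB + (END_TB - START_TB) * month / (MONTHS - 1)  # linear growth
    total += tb * 1000 * PRICE_PER_GB_MONTH                     # TB -> GB
print(f"3-year S3 storage cost: ${total:,.0f}")                 # ~$2,970,000
```

Even this storage-only estimate lands near $3 million over three years, the same order as the S3 total shown in the chart, before any transfer or request charges are added.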
About DDN

DataDirect Networks (DDN) is the world's largest privately held information storage company. We are the leading provider of data storage and processing solutions and services that enable content-rich and high-growth IT environments to achieve the highest levels of systems scalability, efficiency and simplicity. DDN enables enterprises to extract value and deliver results from their information. Our customers include the world's leading online content and social networking providers, as well as high-performance cloud and grid computing, life sciences, media production, and security and intelligence organizations. Deployed in thousands of mission-critical environments worldwide, DDN's solutions have been designed, engineered and proven in the world's most scalable data centers to ensure competitive business advantage for today's information-powered enterprise. For more information, go to www. or call +1-800-TERABYTE.

Version 10/11