Large Scale Storage Solutions for Bioinformatics and Genomics Projects Phillip Smith Unix System Administrator, Bioinformatics Group sysadmin@bio.indiana.edu The Center for Genomics and Bioinformatics Indiana University, Bloomington http://cgb.indiana.edu/
Overview Environment as it is today Types of data being stored and typical dataset sizes Where and how the data is being stored Current storage capabilities Problem areas De-centralized vs. centralized storage Data availability and redundancy Backups Long-term data archiving and future retrieval Research and development, and future implementation Evaluate the new technology paradigms (SAN, NAS, etc.) Set up a test bed to try these technologies in our environment Enabling new software services, like electronic lab notebooks Summary Review Related examples (1 TB SAN in CS, Whitehead) Questions and Comments
Bits and Bytes A quick overview of computer storage terminology: One bit (b) = 0 or 1 One byte (B) = 8 bits One kilobyte (KB) = 1024 bytes (2^10) One megabyte (MB) = 1024 KB (2^20) One gigabyte (GB) = 1024 MB (2^30) One terabyte (TB) = 1024 GB (2^40) One petabyte (PB) = 1024 TB (2^50) Beyond this are exa (2^60), zetta (2^70), and yotta (2^80) Relatively few sites are managing more than a couple of petabytes
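To make the powers of two above concrete, here is a small Python sketch (not part of the original slides) that converts a raw byte count into the largest convenient binary unit. The 80 GB figure in the example is just an illustration.

```python
# Minimal sketch: express a raw byte count in the binary units defined above.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes):
    """Return num_bytes in the largest binary unit that keeps the value < 1024."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024

# Example: roughly the size of the GenBank flat files mentioned later (80 GB)
print(human_readable(80 * 2**30))   # prints "80.0 GB"
```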
Putting the numbers into perspective Some real-world examples 200 petabytes: All printed material 2 petabytes: All U.S. Academic research libraries 400 terabytes: National Climate Data Center (NOAA) database 20 terabytes: The printed collection of the U.S. Library of Congress 2 terabytes: An academic research library 1 terabyte: 50,000 trees made into paper and printed upon 100 gigabytes: A floor of academic journals 4 gigabytes: 1 movie on a DVD 5 megabytes: The complete works of Shakespeare
That's a whole lot of data An estimated 12+ exabytes of data had been generated by the year 2000, representing the entire history of humanity In 2002, that estimate stands at 16+ exabytes, and by 2005 it is expected to approach 24 exabytes Growing at around two exabytes of new data per year, this equates to roughly 250 megabytes for every man, woman, and child on earth Carved out of this are roughly 11,285 terabytes of E-mail, 576,000 terabytes worth of phone calls (in the U.S. alone), and over 150,000 terabytes of snail mail per year Nucleotide sequences are being added to databases at a rate of more than 210 million base pairs (210+ MB) per year, with database content doubling in size approximately every 14 months Statistics are from a 2001 paper by researchers at UC Berkeley
Environment
Data types and sizes Research data comes in all shapes and sizes... Flat file and relational DB datasets GenBank (22 billion base pairs, roughly 80 GB) EMBL (31 billion base pairs, roughly 100 GB) SWISS-PROT (43.6 million amino acids, roughly 416 MB) PIR (96 million amino acids, roughly 645 MB) Dataset indexes for various applications GenBank converted for GCG usage today is approximately 52 GB Microarray and derived numerical/sequence data Microarray images -- each TIFF is roughly 5-10 MB, with 2-3 images per hybridization (the DGRC will generate approximately 15 GB of images per year) Associated 'meta-data' for 15,000 genes * 500 hybridizations is ~ 1 TB/yr Other types of data, such as video E. Martins' lab currently has 61 GB of lizard video, which will more than double by project completion. Source data is 20-80 MB QuickTime files
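As a back-of-the-envelope check on the microarray figures above, the short Python sketch below multiplies the quoted per-image sizes out to a yearly total; the 500 hybridizations per year count mirrors the meta-data example and is used purely for illustration.

```python
# Back-of-the-envelope estimate of yearly microarray storage, using the
# per-image figures quoted above. The hybridizations-per-year count is an
# illustrative assumption, not a CGB projection.
MB = 2**20
TB = 2**40

hybridizations_per_year = 500          # assumption, matches the meta-data example
images_per_hybridization = 3           # slide quotes 2-3 TIFFs per hybridization
mb_per_image = 10                      # slide quotes 5-10 MB per TIFF

image_bytes = hybridizations_per_year * images_per_hybridization * mb_per_image * MB
metadata_bytes = 1 * TB                # slide estimates ~1 TB/yr of derived meta-data

print(f"Images:    {image_bytes / TB:.3f} TB/yr")
print(f"Meta-data: {metadata_bytes / TB:.3f} TB/yr")
print(f"Total:     {(image_bytes + metadata_bytes) / TB:.3f} TB/yr")
```

With the high-end figures this lands at roughly 15 GB of images plus about a terabyte of meta-data per year, consistent with the slide.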
Data types and sizes We also need to consider other sources... Papers and Articles The average scientific paper/article as a PDF file is 2 MB, with a typical research scientist storing between 200-500 of these. E-mail The average incoming mailbox size is 10 MB, which doesn't include archived E-mail Personally generated files MS Word documents (average size is 3 MB, with 10 files per person) MS PowerPoint presentations (average size is 6 MB, with 5 files per person) Miscellaneous images and other files such as .gif, .jpg/.jpeg, text, etc. (average size is 200 KB, with 300 files per person) Assuming these numbers are on the conservative side, the average researcher will amass between 3-5 GB of data, either unique or copied. These data were generated by randomly sampling 500 user accounts on the sunflower system.
How does the data get stored? On removable media Floppy disks (1.44 MB & 2 MB) Zip disks (100 MB & 250 MB) Compact Discs (650-800 MB) Digital Video Disks (2 GB - 4 GB) On hard disk drives IDE (20-200 GB) [320 GB announced and expected in a few months] SCSI (18-180 GB) On filesystems FAT, FAT32, NTFS (Windows) UFS, XFS, JFS, EXT2, EXT3,... (Unix) HFS, HFS+ (MacOS) Via file-sharing protocols CIFS (Windows) NFS (Unix, MacOS X) AppleShare (MacOS 9 and earlier)
Where does the data get stored? The short answer is, all over the place...
Current CGB/Biology Storage Infrastructure What we can store today The sunflower system in its current form can store approximately 26 GB of personal data, 10 GB of E-mail, and 175 GB of research databases (e.g., GenBank). Within the next 3 months, we will bring nearly 1 TB of new storage capacity online. CGB's new Laboratory Information Management System (LIMS), as configured, can store 175 GB of research data. David Kehoe's pondscum project server, as configured, can store 175 GB of research data Other CGB research machines, such as those serving up Flybase, Bio-Mirror, and IUBio, have a combined storage capacity of 1 TB Total remaining Biology research (storage capacity on desktops) is guesstimated at around 6 TB (300 computers * 20 GB drives)
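The sketch below (not from the original slides) simply adds up the capacities listed above to give a single ballpark figure for what CGB/Biology can store today; all of the numbers come from this slide.

```python
# Rough aggregation of the capacities listed above (all figures from the
# slide; GB/TB treated as binary units). Purely illustrative arithmetic.
GB = 2**30
TB = 2**40

capacities = {
    "sunflower (personal + E-mail + databases)": (26 + 10 + 175) * GB,
    "new storage coming online":                  1 * TB,
    "LIMS":                                       175 * GB,
    "pondscum project server":                    175 * GB,
    "Flybase / Bio-Mirror / IUBio servers":       1 * TB,
    "Biology desktops (300 x 20 GB, estimate)":   6 * TB,
}

total = sum(capacities.values())
for name, size in capacities.items():
    print(f"{name:45s} {size / TB:6.2f} TB")
print(f"{'Total':45s} {total / TB:6.2f} TB")
```

The total works out to roughly 8.5 TB, most of it scattered across individual desktops.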
Current UITS Storage Infrastructure What UITS can store today The Common File System (CFS) service has a total of 1.5 TB of online (hard disk based) storage, and is tied into the MDSS system (which is used to back that data up). CFS is meant for small to medium storage requirements. For instance, you might use it to store presentation files you'll want to access from a conference. By default you are given a 100 MB quota, but researchers can request up to a few GB depending on specific needs. The Massive Data Storage System (MDSS) is based on robotic tape libraries, with a combined storage capacity over 500 TB (120 TB located at IUB, 360 TB located at IUPUI) MDSS is meant for large scale, long-term archival storage. Faculty, staff, and graduate students are given default quotas of 500 GB. If your project demands more, UITS will negotiate a higher quota on a per project/cost share basis.
De-centralized vs. Centralized Storage What are the differences? De-centralized storage Direct Attached Storage (DAS), where the storage device(s) connect to an individual machine Hard to manage because it must be done directly from the machine to which it is attached Doesn't scale well (you can only attach so many devices to one machine) It's hard to share this storage with other machines Examples of de-centralized storage include a desktop's hard drive, or several servers with a disk array attached to each one Centralized storage All the storage is connected to one machine or group of machines, and/or to some type of network fabric Easier to manage in the long run, but more complex to implement initially Examples of centralized storage would include a dedicated file sharing server
Data Availability and Redundancy We must make sure that the data is always available, and fault-tolerant Availability Murphy's law is always in full effect. Machines and storage media will ultimately fail at some point, and we can't always predict when problems will occur Since the CGB provides services to the IU Bloomington community (e.g. GeneTraffic, BioWeb) and to the world at large (e.g. Bio-Mirror, Flybase, etc.), we must ensure that the data we store is available 24x7x365 There is an expectation that research and personal data should also be available 24x7x365. People get ANGRY when they can't get their E-mail! Redundancy The answer is to plan for data redundancy, which generally means we mirror data with two or more drives This doubles the cost and halves our available storage capacity, but gives us peace of mind and our customers more reliable service We're relying on disk drive redundancy, and less so on system redundancy, to protect the data (i.e., if an important server dies, we don't have drop-in replacements yet) We have no off-site redundancy. If Jordan or Myers is hit by flood or fire, everything is gone.
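As a quick illustration of the capacity cost of mirroring mentioned above, the sketch below computes usable capacity for a hypothetical two-way mirrored array; the drive count and size are arbitrary examples, not a description of our actual hardware.

```python
# Simple illustration of the capacity cost of mirroring: with two-way
# mirroring (e.g. RAID 1), usable capacity is half of raw capacity.
# Drive count and size are hypothetical examples.
GB = 2**30

drives = 8                 # hypothetical array
drive_size = 180 * GB      # e.g. a large SCSI drive from the earlier slide

raw = drives * drive_size
mirrored_usable = raw // 2          # a two-way mirror keeps two copies of each block

print(f"Raw capacity:    {raw / GB:.0f} GB")
print(f"Usable (mirror): {mirrored_usable / GB:.0f} GB")
```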
Backups You are responsible for backing up your own data CGB and Biology core/research servers We currently offer no guarantee, implied or otherwise, that up-to-date tape backups will be available for all data. We DO make a best effort to backup critical data on our servers, but currently rely on disk redundancy for most data Quite frankly, our existing backup infrastructure is completely inadequate, and we need to do (and will do) better UITS services UITS makes backups of all its core servers, but only provides recovery for up to one month The data you store in your CFS account has a one-day backup by default. You can request that UITS restore files from up to one month, but it will cost you $15 per incident UITS cannot restore files from the MDSS Laptops, Desktops, and Workstations You are responsible for backing up data on your laptop, desktop, or workstation There should be a campus-wide or departmental backup system, but it doesn't exist yet
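For users responsible for their own desktop backups, the following is a minimal, hypothetical Python sketch of an incremental copy to a second disk or network share; the paths are placeholders and this is not a CGB- or UITS-provided tool.

```python
# Hypothetical sketch of a personal incremental backup: copy a file to the
# backup location only when the source copy is missing there or newer.
import os
import shutil

SOURCE = os.path.expanduser("~/research")        # placeholder source directory
DEST   = "/mnt/backup/research"                   # placeholder backup location

for dirpath, dirnames, filenames in os.walk(SOURCE):
    rel = os.path.relpath(dirpath, SOURCE)
    target_dir = os.path.join(DEST, rel)
    os.makedirs(target_dir, exist_ok=True)
    for name in filenames:
        src = os.path.join(dirpath, name)
        dst = os.path.join(target_dir, name)
        # Copy if the backup is missing or older than the source file
        if not os.path.exists(dst) or os.path.getmtime(src) > os.path.getmtime(dst):
            shutil.copy2(src, dst)
```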
Long-Term Data Archiving and Future Retrieval We can't get rid of anything We have enough storage capacity to handle existing data, and new data that is being generated today. But we need to address the long-term storage issues; that is, we must be able to archive today's data tomorrow, while providing enough capacity for tomorrow's new data. To illustrate part of the problem, there are requirements from federal funding agencies, and various laws such as the Freedom of Information Act (FOIA), which require data dissemination for an indefinite period of time. We have online storage and offline storage. Online means that the data is instantly available, while offline means it must first be retrieved from tape or other media, such as CD Offline storage makes future data retrieval difficult, so it would be better to have plenty of online storage
Research and Development
Evaluating New Technologies What other people are using to solve these problems Storage Area Network (SAN) Centralized data storage model All servers connect to the storage devices via a network, similar to the way your computer connects to the campus ethernet Generally this network can move data at extremely high speeds, upwards of 200 MB per second. As a comparison, you can copy files between machines over the campus network at 1 MB to 10 MB per second (theoretical maximum) Highly scalable: we can easily add more storage as we need it, without worrying about how many devices can attach to one machine Network Attached Storage (NAS) Provides the ability for desktops and workstations to access data on the SAN natively and transparently Disk-to-disk backups SAN and NAS technologies enable us to easily back up large amounts of data quickly Archived data can remain online at all times Future retrieval could be as simple as going to a web page and selecting your files
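To put the throughput figures above in perspective, this short sketch estimates how long a large copy would take over the campus network versus a SAN fabric; the 175 GB dataset size is just an example (about the size of our research database store).

```python
# Rough transfer-time comparison using the throughput figures quoted above.
GB = 2**30
MB = 2**20

dataset = 175 * GB    # example dataset size

for label, mb_per_sec in [("campus network (~10 MB/s)", 10),
                          ("SAN fabric (~200 MB/s)", 200)]:
    seconds = dataset / (mb_per_sec * MB)
    print(f"{label:28s} {seconds / 3600:5.1f} hours")
```

At 10 MB/s the copy takes about five hours; at 200 MB/s it takes roughly fifteen minutes.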
Implementation Where do we go from here Set up a test bed Convince a few SAN/NAS vendors to loan us the hardware, so that we can test this out in our existing environment Try out some of our routine day-to-day storage tasks and purposefully try to break things Repeat the process until we have a workable solution Identify projects and funding sources We need feedback from everyone regarding anticipated project storage needs Instead of allocating funds for direct attached storage (such as an extra hard drive in a desktop or workstation), start including money for a chunk of the SAN Put it into production Move existing servers and data into the SAN fabric Create and offer new services Department-wide desktop/workstation file sharing and backup Electronic lab notebooks What else would you like?
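Here is a purely hypothetical sketch of how anticipated project storage needs could be rolled up into a SAN sizing figure once that feedback comes in; the project names, per-project sizes, and growth factor are all placeholders, not real CGB projections.

```python
# Hypothetical roll-up of per-project storage projections into a SAN sizing
# figure. Every number and name below is a made-up placeholder.
GB = 2**30
TB = 2**40

projected_needs_gb = {          # per-project first-year estimates (hypothetical)
    "microarray pipeline": 1200,
    "LIMS":                 400,
    "video analysis":       200,
    "sequence databases":   800,
}
yearly_growth = 1.5             # assume needs grow 50% per year (assumption)
years = 3

total_year_one = sum(projected_needs_gb.values()) * GB
for year in range(1, years + 1):
    sized = total_year_one * yearly_growth ** (year - 1)
    print(f"Year {year}: plan for roughly {sized / TB:.1f} TB")
```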
Summary Simply put, we can NEVER have enough storage! It may seem like we have enough storage now to last for several years, but there will always be more data to store. If the storage exists, you can be guaranteed that someone or something will find a way to fill it up. Research databases continue to grow at an amazing rate. It's not enough to cope with that alone; we still have to come up with enough temporary storage in which to copy, index, and store multiple versions of these multi-gigabyte datasets. Using GCG as an example again, we now have to manipulate over half a terabyte of new and existing data every three months or so. And that's just for one application! With the increased focus on high-throughput sequencing, microarrays, LIMS, electronic lab notebooks, etc., a new ripple has formed in bioinformatics and genomics storage patterns. Pretty soon, we'll have crashing waves... the water will need to go somewhere. In short, we can only see the tip of the iceberg, but it goes much deeper.