Large Scale Storage Solutions for Bioinformatics and Genomics Projects Phillip Smith Unix System Administrator, Bioinformatics Group sysadmin@bio.indiana.edu The Center for Genomics and Bioinformatics Indiana University, Bloomington http://cgb.indiana.edu/
Overview Environment as it is today Types of data being stored and typical dataset sizes Where and how the data is being stored Current storage capabilities Problem areas De-centralized vs. centralized storage Data availability and redundancy Backups Long-term data archiving and future retrieval Research and development, and future implementation Evaluate the new technology paradigms (SAN, NAS, etc.) Set up a test bed to try these technologies in our environment Enabling new software services, like electronic lab notebooks Summary Review Related examples (1 TB SAN in CS, Whitehead) Questions and Comments
Bits and Bytes A quick overview of computer storage terminology: One bit (b) = 0 or 1 One byte (B) = 8 bits One kilobyte (KB) = 1024 bytes (2^10) One megabyte (MB) = 1024 KB (2^20) One gigabyte (GB) = 1024 MB (2^30) One terabyte (TB) = 1024 GB (2^40) One petabyte (PB) = 1024 TB (2^50) Beyond this are exa (2^60), zetta (2^70), and yotta (2^80) Relatively few sites are managing more than a couple of petabytes
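To make the powers of two above concrete, here is a small Python sketch (not part of the original slides) that converts a raw byte count into the largest convenient binary unit. The 80 GB figure in the example is just an illustration.

```python
# Minimal sketch: express a raw byte count in the binary units defined above.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes):
    """Return num_bytes in the largest binary unit that keeps the value < 1024."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024

# Example: roughly the size of the GenBank flat files mentioned later (80 GB)
print(human_readable(80 * 2**30))   # prints "80.0 GB"
```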
Putting the numbers into perspective Some real-world examples 200 petabytes: All printed material 2 petabytes: All U.S. Academic research libraries 400 terabytes: National Climate Data Center (NOAA) database 20 terabytes: The printed collection of the U.S. Library of Congress 2 terabytes: An academic research library 1 terabyte: 50,000 trees made into paper and printed upon 100 gigabytes: A floor of academic journals 4 gigabytes: 1 movie on a DVD 5 megabytes: The complete works of Shakespeare
That's a whole lot of data An estimated 12+ exabytes of data had been generated by the year 2000, representing the entire history of humanity In 2002, that estimate stands at 16+ exabytes, and by 2005 it is expected to approach 24 exabytes Growing at around two exabytes of new data per year, this equates to roughly 250 megabytes for every man, woman, and child on earth Carved out of this are roughly 11,285 terabytes of E-mail, 576,000 terabytes worth of phone calls (in the U.S. alone), and over 150,000 terabytes of snail mail per year Nucleotide sequences are being added to databases at a rate of more than 210 million base pairs (210+ MB) per year, with database content doubling in size approximately every 14 months Statistics are from a 2001 paper by researchers at UC Berkeley
Environment
Data types and sizes Research data comes in all shapes and sizes... Flat file and relational DB datasets GenBank (22 billion base pairs, roughly 80 GB) EMBL (31 billion base pairs, roughly 100 GB) SWISS-PROT (43.6 million amino acids, roughly 416 MB) PIR (96 million amino acids, roughly 645 MB) Dataset indexes for various applications GenBank converted for GCG usage today is approximately 52 GB Microarray and derived numerical/sequence data Microarray images -- each TIFF is roughly 5-10 MB, with 2-3 images per hybridization (the DGRC will generate approximately 15 GB of images per year) Associated 'meta-data' for 15,000 genes * 500 hybridizations is ~ 1 TB/yr Other types of data, such as video E. Martins' lab currently has 61 GB of lizard video, which will more than double by project completion. Source data is 20-80 MB QuickTime files
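As a back-of-the-envelope check on the microarray figures above, the short Python sketch below multiplies the quoted per-image sizes out to a yearly total; the 500 hybridizations per year count mirrors the meta-data example and is used purely for illustration.

```python
# Back-of-the-envelope estimate of yearly microarray storage, using the
# per-image figures quoted above. The hybridizations-per-year count is an
# illustrative assumption, not a CGB projection.
MB = 2**20
TB = 2**40

hybridizations_per_year = 500          # assumption, matches the meta-data example
images_per_hybridization = 3           # slide quotes 2-3 TIFFs per hybridization
mb_per_image = 10                      # slide quotes 5-10 MB per TIFF

image_bytes = hybridizations_per_year * images_per_hybridization * mb_per_image * MB
metadata_bytes = 1 * TB                # slide estimates ~1 TB/yr of derived meta-data

print(f"Images:    {image_bytes / TB:.3f} TB/yr")
print(f"Meta-data: {metadata_bytes / TB:.3f} TB/yr")
print(f"Total:     {(image_bytes + metadata_bytes) / TB:.3f} TB/yr")
```

With the high-end figures this lands at roughly 15 GB of images plus about a terabyte of meta-data per year, consistent with the slide.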
Data types and sizes We also need to consider other sources... Papers and Articles The average scientific paper/article as a PDF file is 2 MB, with a typical research scientist storing between 200-500 of these. E-mail The average incoming mailbox size is 10 MB, which doesn't include archived E-mail Personally generated files MS Word documents (average size is 3 MB, with 10 files per person) MS PowerPoint presentations (average size is 6 MB, with 5 files per person) Miscellaneous images and other files such as .gif, .jpg/.jpeg, text, etc. (average size is 200 KB, with 300 files per person) Assuming these numbers are on the conservative side, the average researcher will amass between 3-5 GB of data, either unique or copied. These data were generated by randomly sampling 500 user accounts on the sunflower system.
How does the data get stored? On removable media Floppy disks (1.44 MB & 2 MB) Zip disks (100 MB & 250 MB) Compact Discs (650-800 MB) Digital Video Disks (2 GB - 4 GB) On hard disk drives IDE (20-200 GB) [320 GB announced and expected in a few months] SCSI (18-180 GB) On filesystems FAT, FAT32, NTFS (Windows) UFS, XFS, JFS, EXT2, EXT3,... (Unix) HFS, HFS+ (MacOS) Via file-sharing protocols CIFS (Windows) NFS (Unix, MacOS X) AppleShare (MacOS 9 and earlier)
Where does the data get stored? The short answer is, all over the place...
Current CGB/Biology Storage Infrastructure What we can store today The sunflower system in its current form can store approximately 26 GB of personal data, 10 GB of E-mail, and 175 GB of research databases (e.g., GenBank). Within the next 3 months, we will bring nearly 1 TB of new storage capacity online. CGB's new Laboratory Information Management System (LIMS), as configured, can store 175 GB of research data. David Kehoe's pondscum project server, as configured, can store 175 GB of research data Other CGB research machines, such as those serving up Flybase, Bio-Mirror, and IUBio, have a combined storage capacity of 1 TB Total remaining Biology research (storage capacity on desktops) is guesstimated at around 6 TB (300 computers * 20 GB drives)
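The sketch below (not from the original slides) simply adds up the capacities listed above to give a single ballpark figure for what CGB/Biology can store today; all of the numbers come from this slide.

```python
# Rough aggregation of the capacities listed above (all figures from the
# slide; GB/TB treated as binary units). Purely illustrative arithmetic.
GB = 2**30
TB = 2**40

capacities = {
    "sunflower (personal + E-mail + databases)": (26 + 10 + 175) * GB,
    "new storage coming online":                  1 * TB,
    "LIMS":                                       175 * GB,
    "pondscum project server":                    175 * GB,
    "Flybase / Bio-Mirror / IUBio servers":       1 * TB,
    "Biology desktops (300 x 20 GB, estimate)":   6 * TB,
}

total = sum(capacities.values())
for name, size in capacities.items():
    print(f"{name:45s} {size / TB:6.2f} TB")
print(f"{'Total':45s} {total / TB:6.2f} TB")
```

The total works out to roughly 8.5 TB, most of it scattered across individual desktops.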
Current UITS Storage Infrastructure What UITS can store today The Common File System (CFS) service has a total of 1.5 TB of online (hard disk based) storage, and is tied into the MDSS system (which is used to back that data up). CFS is meant for small to medium storage requirements. For instance, you might use it to store presentation files you'll want to access from a conference. By default you are given a 100 MB quota, but researchers can request up to a few GB depending on specific needs. The Massive Data Storage System (MDSS) is based on robotic tape libraries, with a combined storage capacity over 500 TB (120 TB located at IUB, 360 TB located at IUPUI) MDSS is meant for large scale, long-term archival storage. Faculty, staff, and graduate students are given default quotas of 500 GB. If your project demands more, UITS will negotiate a higher quota on a per project/cost share basis.
De-centralized vs. Centralized Storage What are the differences? De-centralized storage Direct Attached Storage (DAS), where the storage device(s) connect to an individual machine Hard to manage because it must be done directly from the machine to which it is attached Doesn't scale well (you can only attach so many devices to one machine) It's hard to share this storage with other machines Examples of de-centralized storage include a desktop's hard drive, or several servers with a disk array attached to each one Centralized storage All the storage is connected to one machine or group of machines, and/or to some type of network fabric Easier to manage in the long run, but more complex to implement initially Examples of centralized storage would include a dedicated file sharing server
Data Availability and Redundancy We must make sure that the data is always available, and fault-tolerant Availability Murphy's law is always in full effect. Machines and storage media will ultimately fail at some point, and we can't always predict when problems will occur Since the CGB provides services to the IU Bloomington community (e.g. GeneTraffic, BioWeb) and to the world at large (e.g. Bio-Mirror, Flybase, etc.), we must ensure that the data we store is available 24x7x365 There is an expectation that research and personal data should also be available 24x7x365. People get ANGRY when they can't get their E-mail! Redundancy The answer is to plan for data redundancy, which generally means we mirror data with two or more drives This doubles the cost and halves our available storage capacity, but gives us peace of mind and our customers more reliable service We're relying on disk drive redundancy, and less so on system redundancy, to protect the data (i.e., if an important server dies, we don't have drop-in replacements yet) We have no off-site redundancy. If Jordan or Myers is hit by flood or fire, everything is gone.
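As a quick illustration of the capacity cost of mirroring mentioned above, the sketch below computes usable capacity for a hypothetical two-way mirrored array; the drive count and size are arbitrary examples, not a description of our actual hardware.

```python
# Simple illustration of the capacity cost of mirroring: with two-way
# mirroring (e.g. RAID 1), usable capacity is half of raw capacity.
# Drive count and size are hypothetical examples.
GB = 2**30

drives = 8                 # hypothetical array
drive_size = 180 * GB      # e.g. a large SCSI drive from the earlier slide

raw = drives * drive_size
mirrored_usable = raw // 2          # a two-way mirror keeps two copies of each block

print(f"Raw capacity:    {raw / GB:.0f} GB")
print(f"Usable (mirror): {mirrored_usable / GB:.0f} GB")
```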
Backups You are responsible for backing up your own data CGB and Biology core/research servers We currently offer no guarantee, implied or otherwise, that up-to-date tape backups will be available for all data. We DO make a best effort to backup critical data on our servers, but currently rely on disk redundancy for most data Quite frankly, our existing backup infrastructure is completely inadequate, and we need to do (and will do) better UITS services UITS makes backups of all its core servers, but only provides recovery for up to one month The data you store in your CFS account has a one-day backup by default. You can request that UITS restore files from up to one month, but it will cost you $15 per incident UITS cannot restore files from the MDSS Laptops, Desktops, and Workstations You are responsible for backing up data on your laptop, desktop, or workstation There should be a campus-wide or departmental backup system, but it doesn't exist yet
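For users responsible for their own desktop backups, the following is a minimal, hypothetical Python sketch of an incremental copy to a second disk or network share; the paths are placeholders and this is not a CGB- or UITS-provided tool.

```python
# Hypothetical sketch of a personal incremental backup: copy a file to the
# backup location only when the source copy is missing there or newer.
import os
import shutil

SOURCE = os.path.expanduser("~/research")        # placeholder source directory
DEST   = "/mnt/backup/research"                   # placeholder backup location

for dirpath, dirnames, filenames in os.walk(SOURCE):
    rel = os.path.relpath(dirpath, SOURCE)
    target_dir = os.path.join(DEST, rel)
    os.makedirs(target_dir, exist_ok=True)
    for name in filenames:
        src = os.path.join(dirpath, name)
        dst = os.path.join(target_dir, name)
        # Copy if the backup is missing or older than the source file
        if not os.path.exists(dst) or os.path.getmtime(src) > os.path.getmtime(dst):
            shutil.copy2(src, dst)
```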
Long-Term Data Archiving and Future Retrieval We can't get rid of anything We have enough storage capacity to handle existing data, and new data that is being generated today. But we need to address the long-term storage issues; that is, we must be able to archive today's data tomorrow, while providing enough capacity for tomorrow's new data. To illustrate part of the problem, there are requirements from federal funding agencies, and various laws such as the Freedom of Information Act (FOIA), which require data dissemination for an indefinite period of time. We have online storage and offline storage. Online means that the data is instantly available, while offline means it must first be retrieved from tape or other media, such as CD Offline storage makes future data retrieval difficult, so it would be better to have plenty of online storage
Research and Development
Evaluating New Technologies What other people are using to solve these problems Storage Area Network (SAN) Centralized data storage model All servers connect to the storage devices via a network, similar to the way your computer connects to the campus ethernet Generally this network can move data at extremely high speeds, upwards of 200 MB per second. As a comparison, you can copy files between machines over the campus network at 1 MB to 10 MB per second (theoretical maximum) Highly scalable: we can easily add more storage as we need it, without worrying about how many devices can attach to one machine Network Attached Storage (NAS) Provides the ability for desktops and workstations to access data on the SAN natively and transparently Disk-to-disk backups SAN and NAS technologies enable us to easily back up large amounts of data quickly Archived data can remain online at all times Future retrieval could be as simple as going to a web page and selecting your files
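To put the throughput figures above in perspective, this short sketch estimates how long a large copy would take over the campus network versus a SAN fabric; the 175 GB dataset size is just an example (about the size of our research database store).

```python
# Rough transfer-time comparison using the throughput figures quoted above.
GB = 2**30
MB = 2**20

dataset = 175 * GB    # example dataset size

for label, mb_per_sec in [("campus network (~10 MB/s)", 10),
                          ("SAN fabric (~200 MB/s)", 200)]:
    seconds = dataset / (mb_per_sec * MB)
    print(f"{label:28s} {seconds / 3600:5.1f} hours")
```

At 10 MB/s the copy takes about five hours; at 200 MB/s it takes roughly fifteen minutes.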
Implementation Where do we go from here Set up a test bed Convince a few SAN/NAS vendors to loan us the hardware, so that we can test this out in our existing environment Try out some of our routine day-to-day storage tasks and purposefully try to break things Repeat the process until we have a workable solution Identify projects and funding sources We need feedback from everyone regarding anticipated project storage needs Instead of allocating funds for direct attached storage (such as an extra hard drive in a desktop or workstation), start including money for a chunk of the SAN Put it into production Move existing servers and data into the SAN fabric Create and offer new services Department-wide desktop/workstation file sharing and backup Electronic lab notebooks What else would you like?
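Here is a purely hypothetical sketch of how anticipated project storage needs could be rolled up into a SAN sizing figure once that feedback comes in; the project names, per-project sizes, and growth factor are all placeholders, not real CGB projections.

```python
# Hypothetical roll-up of per-project storage projections into a SAN sizing
# figure. Every number and name below is a made-up placeholder.
GB = 2**30
TB = 2**40

projected_needs_gb = {          # per-project first-year estimates (hypothetical)
    "microarray pipeline": 1200,
    "LIMS":                 400,
    "video analysis":       200,
    "sequence databases":   800,
}
yearly_growth = 1.5             # assume needs grow 50% per year (assumption)
years = 3

total_year_one = sum(projected_needs_gb.values()) * GB
for year in range(1, years + 1):
    sized = total_year_one * yearly_growth ** (year - 1)
    print(f"Year {year}: plan for roughly {sized / TB:.1f} TB")
```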
Summary Simply put, we can NEVER have enough storage! It may seem like we have enough storage now to last for several years, but there will always be more data to store. If the storage exists, you can be guaranteed that someone or something will find a way to fill it up. Research databases continue to grow at an amazing rate. It's not enough to cope with that alone; we still have to come up with enough temporary storage in which to copy, index, and store multiple versions of these multi-gigabyte datasets. Using GCG as an example again, we now have to manipulate over half a terabyte of new and existing data every three months or so. And that's just for one application! With the increased focus on high-throughput sequencing, microarrays, LIMS, electronic lab notebooks, etc., a new ripple has formed in bioinformatics and genomics storage patterns. Pretty soon, we'll have crashing waves... the water will need to go somewhere. In short, we can only see the tip of the iceberg, but it goes much deeper.