Challenges in data acquisition, storage and processing for NIH funded studies

Similar documents
Doing Multidisciplinary Research in Data Science

Putting Genomes in the Cloud with WOS TM. ddn.com. DDN Whitepaper. Making data sharing faster, easier and more scalable


Big Data: Opportunities & Challenges, Myths & Truths 資 料 來 源 : 台 大 廖 世 偉 教 授 課 程 資 料

Using the Bionimbus Protected Data Cloud (PDC): Obtaining Access Credentials FAQ

Backing up to the Cloud

HYBRID ARCHITECTURE IN THE CLOUD

Cloud 101. Mike Gangl, Caltech/JPL, 2015 California Institute of Technology. Government sponsorship acknowledged

Cloud Computing in the Digital Age

QLIKVIEW INTEGRATION TION WITH AMAZON REDSHIFT John Park Partner Engineering

Leveraging Public Clouds to Ensure Data Availability

MozyPro Online Backup. Overview and Sales Best Practices

Volunteer Computing, Grid Computing and Cloud Computing: Opportunities for Synergy. Derrick Kondo INRIA, France

Biometric Evaluation on the Cloud: A Case Study with HumanID Gait Challenge

Big Data Challenges. technology basics for data scientists. Spring Jordi Torres, UPC - BSC

BioHPC Web Computing Resources at CBSU

Hadoop-based Open Source ediscovery: FreeEed. (Easy as popcorn)

NIH Commons Overview, Framework & Pilots - Version 1. The NIH Commons

MediSapiens Ltd. Bio-IT solutions for improving cancer patient care. Because data is not knowledge. 19th of March 2015

Description of Application

Large-Scale Data Processing

Data Storage Options for Research

Barracuda Backup Server. Introduction

Database Fundamentals

Cost-Benefit Analysis of Cloud Computing versus Desktop Grids

Berlin Storage, Backup and Disaster Recovery in the Cloud AWS Customer Case Study: HERE Maps for Life

Transcription. Founder Interview - Panayotis Vryonis Talks About BigStash Cloud Storage. Media Duration: 28:45

Current and Future Economics of Deep Storage: Public vs. Private Cloud

Cloud Computing for Small to Mid Size Businesses. Tech66, LLC William Burleson

MapReduce, Hadoop and Amazon AWS

Part V Applications. What is cloud computing? SaaS has been around for awhile. Cloud Computing: General concepts

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Delivering the power of the world s most successful genomics platform

How To Backup An Exchange Server With 25Gb And More On A Microsoft Smartfiler With A Backup From A Backup To A Backup Point Set On A Flash Drive On A Pc Or Macbook Or Ipad On A Cheap Computer (For A

Secrets to Backups and Disaster Recovery for Architects. How to Prevent Losing Your Data And Losing Your Business

Google Cloud Data Platform & Services. Gregor Hohpe

Best Practices for What Files Do I Put Where? How to navigate the many options for digital storage

Campus Research Data Storage Survey

OneDrive in Office 365

Using Cloud Services. In Canada, eh! Justin J. Ngan, Manager Advisory Services: GIS & Analytics Regional Municipality of Wood Buffalo

New solutions for Big Data Analysis and Visualization

Contents. Introduction. What is the Cloud? How does it work? Types of Cloud Service. Cloud Service Providers. Summary

DOCLITE: DOCKER CONTAINER-BASED LIGHTWEIGHT BENCHMARKING ON THE CLOUD

Cloud Computing for Education Workshop

Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

References. Introduction to Database Systems CSE 444. Motivation. Basic Features. Outline: Database in the Cloud. Outline

Introduction to Database Systems CSE 444

BIG DATA TRENDS AND TECHNOLOGIES

Cloud-Based Big Data Analytics in Bioinformatics

Where do you put 1,000,000,000,000 DNA base pairs? Could you quickly find one CT scan in a million?

Research Data Storage, Sharing, and Transfer Options

Microsoft Azure Cloud on your terms. Start your cloud journey.

Intro to AWS: Storage Services

Making Big Data Possible:

Big Data on Microsoft Platform

2.1.5 Storing your application s structured data in a cloud database

TELSTRA CLOUD SERVICES CLOUD INFRASTRUCTURE PRICING GUIDE SINGAPORE

Bioruptor NGS: Unbiased DNA shearing for Next-Generation Sequencing

There Are Clouds In Your Future. Jeff Barr Amazon Web (Twitter)

The Cloud. JL Cabrera LTEC 4550

Hadoopizer : a cloud environment for bioinformatics data analysis

Cloud storage Megas, Gigas and Teras

The Genealogy Cloud: Which Online Storage Program is Right For You Page , copyright High-Definition Genealogy. All rights reserved.

HIGH-SPEED BRIDGE TO CLOUD STORAGE

The Cost of the Cloud. Steve Saporta CTO, SwipeToSpin Mar 20, 2015

Building Bioinformatics Capacity in Africa. Nicky Mulder CBIO Group, UCT

An Affordable Commodity Network Attached Storage Solution for Biological Research Environments.

Computational infrastructure for NGS data analysis. José Carbonell Caballero Pablo Escobar

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

FREE computing using Amazon EC2

SECURE BACKUP SYSTEM DESKTOP AND MOBILE-PHONE SECURE BACKUP SYSTEM HOSTED ON A STORAGE CLOUD

What happens when Big Data and Master Data come together?

What s New in SharePoint 2016 (On- Premise) for IT Pros

Cloud, Community and Collaboration Airline benefits of using the Amadeus community cloud

Archiving On-Premise and in the Cloud. March 2015

Cloud Computing and Amazon Web Services. CJUG March, 2009 Tom Malaher

Transcription:

UAB Research Computing Day September 15, 2011 Challenges in data acquisition, storage and processing for NIH funded studies Stephen Barnes, PhD Department of Pharmacology & Toxicology and the Targeted Metabolomics and Proteomics Laboratory

Synopsis Proteomic and genomic cats Federal rules for maintaining data from funded grants Economics of storing and transferring data Local Cloud vs Commercial Cloud Media for Cloud storage

How many cats in 2011? If we start with 2 cats (male and female) in 2001 and cats breed every 3 months producing a litter of 4 kittens, how many cats will we have today? At 3 months, we ll have 4 kittens At 6 months, we ll have 8 kittens At 12 months we ll have 32 kittens (expanded by 16) At 5 years, we ll have 16 5 kittens = (2 4*5 ) = one million At 10 years, we ll have 16 10 kittens = 2 40 Transfer to OMG units or as they re better known in IT departments as terabytes

The problem Analysis is exploding Imaging NextGen sequencing Proteomics Metabolomics Have created a world where there are routine TB datasets

Recent increases in speed of DNA sequencing Since 2005, DNA sequencing rates have increased 500-fold and are headed higher. The annualized increase is 4.5-fold, i.e., 22.5-fold every two years, an order of magnitude more than Moore s law in computing. Mardis ER Nature 470:198, 2011

Expense of Deep Sequencing Deep DNA sequencing is leading to Terabytes of data per month If the genomes of the population of the USA were sequenced, this would amount to >1,000 Petabytes of data 1 Terabyte of storage is ~$100 If simple scaling is possible, then it would cost $100 million a year (expected lifetime of the drives) to store the data or approaching $7 billion for the average person s lifetime (assuming no further population increase) And that s without backups, using the data in any way and assuming that Microsoft doesn t structure your saved files!

NIH requirements for data collected from funded grants NIH requires you to make available datasets created from federally supported studies How long should this be after termination of the grant? Federal rules about data state that you must keep them for 3 years http://grants.nih.gov/grants/policy/nihgps_2010/nih gps_ch8.htm#_toc271264950 So, who pays for keeping TB datasets? The investigator doesn t have financial authority once a grant is over So, UAB?

Does the cloud present a viable option? Yes, if we can transfer the data to other computers with greater processing power and cheaper long-term storage, but..

Whither the cloud? Depends on the size and capacity of the pipe from the acquisition computer to the computers in the cloud If we can get data to the cloud, can the software, particularly commercial software, be used for analysis in the cloud? Opportunity for companies to go to a different business model where there is one, always up-todate version of their software in the cloud and users pay a small fee for each time they use it

Principal issues posed at Bio-IT 2011 Can cloud infrastructure can support highthroughput data analysis pipelines designed for next-generation sequencing data? The cloud is a valuable option for small research centers that lack the resources to purchase and maintain inhouse infrastructure For large centers like the Broad Institute (or UAB), which require almost constant compute power to manage and move files ranging from 1 gigabyte to 1 terabyte in size, the cloud, at least for the present, does not seem to be a cost-effective option.

Economics of the cloud Estimated cost of traditional storage is $3.75 per Gigabyte-month Amazon cost $0.15 Gigabyte-month CPU cost estimated at $2.63-$3.33 per CPUhour Microsoft Azure cost - $0.12 CPU-hour The cloud, if you can get there, offers substantial savings From Virtualization and Cloud Computing Digital Realty Trust, February 2011

Tb storage costs in the Cloud $0.15/GB/month $150/TB/month Annual cost $1,800/TB/year Commercially, life-time storage is $3,000 For our group, ONE machine generates 2 TB each month Annualized cost $78,000 Conclusion: Cloud HD storage is not viable

Other models to consider Do we really need to have high speed access to old data? Is a tape back up system viable? Google still uses it as their long-term storage system One tenth of the costs of HD storage This reduces 1 TB storage to $15/month or $180/year

Costs of tape storage http://www.bitbunker.com/pricing If an investigator uploads a 200 GB file, this costs $20. It also costs $20 each time it is downloaded. This will be a cost borne by the investigator.

The robotic tape storage system Download time is consistent with the time to get a coffee or a Coke, or take a quick bathroom break.

Summary and the future Biomedical research is generating volumes of data that strain the current system for data transport and storage There are two options Gaining access to very fast pipes from UAB NextGen and other UAB data generating centers to existing regional fast pipes and on to the commercial cloud Creating very fast pipes from UAB NextGen and other UAB data generating centers to a UAB cloud Storage costs of large data sets has become an economic heavyweight Is a tape system the solution for NIH data? Software may not be transferrable to the cloud Security issues need good solutions for all parties

Acknowledgements David Shealy, PhD Chiquito Crasto, PhD Jonas Almeida, PhD John-Paul Robinson Scott Sweeney Landon Wilson Mikako Kawai Chandrahas Narne NCCAM R21 AT004661 NCRR S10 RR027822