ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE. October 2013




Introduction

As sequencing technologies continue to evolve and genomic data makes its way into clinical use and medical practice, a momentous challenge arises: how to cope with the rapidly increasing volume of complex data. Issues such as data storage, access, transfer, sharing, security, and analysis must be resolved to enable the new era of genomic medicine.

Annai Systems provides several tools to enable and enhance genomic data use: the Annai-GNOS data management platform, GeneTorrent and GTFuse for accelerated file transfer and file mining, the reQuest portal for collaboration and discovery, and the BioCompute Farm for analytical power. These tools can be deployed in concert or independently.

Overview of Annai Platform Components

Annai-GNOS provides a fast, scalable and robust network solution for storing, moving, finding, and securing genomic sequence data and associated metadata. GNOS-enabled repositories can handle multiple petabytes of next-generation sequencing data for fast and flexible storage, search, and retrieval.

GeneTorrent is a data transfer protocol that allows high-speed transfer of data files into and out of a given GNOS-enabled repository. The repository and file transfer capabilities are highly secure and meet government standards, as defined by the Federal Information Security Management Act of 2002 (FISMA).

BioCompute Farm is a virtualized computation environment that provides on-demand compute power specifically optimized to facilitate analysis of genomic data. Users can enjoy high-throughput computing without having to build local high-performance compute platforms or transfer massive data files over the Internet.

reQuest is a web portal that employs a query and networking infrastructure enabling researchers to search, find, and manage downloads from multiple GNOS-enabled data repositories. reQuest's intuitive user interface streamlines the process of exploring and searching genomic data.

GTFuse amplifies GeneTorrent's fast transfer speeds by allowing users to download selected portions of large genomic data files such as those at CGHub. GTFuse allows researchers to find and access sequence data files as swiftly as if they were on the local network. GTFuse's option to select and retrieve a designated subset or region of a BAM file dramatically reduces data transfer times and costs.

2013 ANNAI SYSTEMS ALL RIGHTS RESERVED

A growing number of public and private repositories are emerging as integral parts of the drug discovery and therapeutic treatment process. These data repositories vary greatly in data use, efficiency of data upload, download and access, regulatory compliance, and security configurations. Furthermore, genomic data comes in a wide variety of formats and from various sequencing platforms. As the integration of genomic data with clinical data becomes increasingly required, there is an urgent need for genomic data tools that provide flexible, scalable solutions for a wide diversity of uses.

The Cancer Genomics Hub (CGHub) is a vast repository of cancer genome data accessed freely by hundreds of researchers and clinicians, in both academic and commercial environments. CGHub uses Annai-GNOS to provide highly scalable access to The Cancer Genome Atlas (TCGA) and other cancer genome data sets. CGHub was launched in 2012 at UC Santa Cruz and now holds over 55,000 cancer genome files totaling 675 terabytes. Hundreds of researchers from dozens of institutions rely on CGHub for access to cancer genome data from ten world-class sequencing centers, including the Broad Institute, Washington University, and Baylor College of Medicine. The repository is expected to grow to 5 petabytes in the next few years.

Annai supports both research and clinical settings by providing a powerful and flexible environment that enables users at all levels of IT skill to accomplish genomic data handling and analysis tasks easily.

[Figure 1 diagram labels: Annai BCF; Annai reQuest Research Portal; Annai GNOS (Genome Network Operating System); GNOS Web Services; Annai GTFuse; Federated Authentication; GNOS Repository; Public Genomic Data; GeneTorrent Data Transfer; Private Genomic Data]

FIGURE 1. The Annai-GNOS environment and related peripheral data management tools. The various components of the Annai platform can be deployed together as an integrated whole or independently.

When deployed in full, the Annai-GNOS system boosts productivity, reduces time-to-insight, and ensures data security while facilitating collaboration. Researchers or clinicians can quickly search and extract specific segments from thousands of genomes, work independently or collaborate with a team to analyze the data, and prepare their findings for publication or use in the clinic to guide therapy. The Annai-GNOS platform is designed to accelerate genomic research. A closer examination of its components shows how they work together as a system.

Annai-GNOS: A Platform for High-Performance Genomic Analysis and Data Management

Annai-GNOS integrates the data repository infrastructure and high-speed networking capabilities needed to accommodate large genomic data sets. These data sets are characterized by diverse file formats, extensive metadata, and large file sizes, with individual sequence datasets ranging from 10 gigabytes to more than 1 terabyte (depending on the depth of coverage).

Annai-GNOS allows the entire user community to see the state of data throughout the submission lifecycle, including data that has not yet been approved or submitted for download. Researchers can query the state of data as soon as it is submitted and quickly identify submissions that may require attention due to formatting or other problems, before they are available to users of the repository. Flexible metadata searching greatly simplifies finding the right sequence file, and a highly fault-tolerant design ensures services continue to be available. The GNOS network functionality integrates secure, high-speed network protocols to mobilize petabyte-scale genomic data analysis. Annai-GNOS can also be integrated with federated authentication systems like InCommon and the National Cancer Institute's authorization systems.

Technical Specifications

GNOS features the following capabilities:

- User-programmable metadata format validation engine
- Support for multiple metadata formats, including customer-defined formats and the Sequence Read Archive (SRA) schemas used by NCBI, EBI and DDBJ
- Support for multiple sequence data file types
- Ability to store other file types, such as compressed sequences
- Accelerated file transfer using GeneTorrent and GTFuse
- InCommon (Shibboleth) based, federated user authentication
- Project-based data authorization to control individual researcher access
- Support for commonly used file format standards and analysis tools, including the NCBI SRA metadata format; TCGA v2 BAM and VCF file formats; and GATK, Bowtie, TopHat, Cufflinks and additional tools

The GNOS platform streamlines all aspects of genomic data management and access for researchers and clinicians. Setting up a GNOS repository consists of two steps: 1) data ingestion (duration depends on the state of the data) and 2) data deployment as indexed metadata tags in the GNOS database. Sequence data are entered into the repository using Annai's proprietary GeneTorrent tool and metadata submission API.

Researchers can use the reQuest web portal to quickly and easily explore GNOS-enabled data. For example, a simple search for ovarian cancer in CGHub using reQuest can instantly output the number of ovarian cancer genome files contained in the database and how many are RNA-Seq, exome, or whole genome. The interface also enables the user to drill down quickly to the specific files of interest. The ability to quickly visualize the contents of a GNOS repository is based on searching metadata attributes that are extracted from sequencing files, catalogued and indexed. Query parameters are unlimited, but typically include file type, disease, sample collection date, sequencing platform, date of sequencing, and mapping and alignment tools.

GNOS is suitable for public and/or private genomic databases of translational and basic research centers, pharmaceutical R&D labs, diagnostic companies, and similar organizations generating significant volumes of sequence data. GNOS provides tools to help catalogue, index, upload and download files, and to make the data available for collaboration. GNOS can also be integrated with any data management and transfer method or protocol.
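The metadata search described above (count the ovarian cancer files, then break them down by RNA-Seq, exome, or whole genome) can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout, the field names `disease` and `library_strategy`, the sample records, and the helpers `search` and `count_by_strategy` are all hypothetical and are not the GNOS or reQuest API.

```python
from collections import Counter

# Hypothetical metadata records. A real repository would index
# attributes extracted from the submitted sequencing files.
RECORDS = [
    {"id": "a1", "disease": "ovarian cancer", "library_strategy": "WGS"},
    {"id": "a2", "disease": "ovarian cancer", "library_strategy": "RNA-Seq"},
    {"id": "a3", "disease": "ovarian cancer", "library_strategy": "WXS"},
    {"id": "a4", "disease": "glioblastoma", "library_strategy": "WGS"},
    {"id": "a5", "disease": "ovarian cancer", "library_strategy": "WGS"},
]


def search(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [r for r in records
            if all(r.get(key) == value for key, value in criteria.items())]


def count_by_strategy(records):
    """Count files per library strategy (whole genome, exome, RNA-Seq)."""
    return Counter(r["library_strategy"] for r in records)


# Top-level search, then a drill-down to whole-genome files only.
hits = search(RECORDS, disease="ovarian cancer")
summary = count_by_strategy(hits)
wgs_ids = [r["id"] for r in search(hits, library_strategy="WGS")]
print(len(hits), dict(summary), wgs_ids)
```

Because each search returns a plain list, a drill-down is simply another search over the previous result, which mirrors the explore-then-narrow workflow the portal describes.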
Use Case 1: CGHub Cancer Genome Repository

The University of California Santa Cruz (UCSC) provides CGHub, the world's largest repository of cancer genome data. CGHub is built on GNOS and, after rigorous testing with active TCGA users, was established as the new secure repository for The Cancer Genome Atlas (TCGA) on April 30, 2012.

Use Case 2: Drug Development

Pharmaceutical companies have strict requirements for data protection and security. Corporate policies may mandate keeping data behind a firewall. In this case, an in-house GNOS repository is an optimal solution. After installation by Annai, this type of repository is managed by the company's local experts within its existing high-performance computing infrastructure.

GeneTorrent: Accelerated Secure File Transport

Whole genome sequence data files range from several hundred gigabytes to over one terabyte in size. GeneTorrent enables accelerated transfers of terabyte-scale data. It employs a proprietary variant of the popular BitTorrent protocol to securely transfer files at speeds limited only by the base network bandwidth.

Technical Specifications

GeneTorrent's key functionality is as follows:

- High-fidelity parallel file transfer at up to multiple Gbit/s (speeds as high as 200 Mbps are routinely achieved)
- Highly resilient to in-network and computing failures, with automatic recovery
- Highly secure 256-bit encrypted file transfer

Use Case: Translational and Clinical Research

Translational researchers and clinicians use GeneTorrent to push sequence data, either locally or from an external sequencing lab, into a GNOS repository either installed in their facility or hosted by Annai in the BioCompute Farm. GNOS-enabled repositories can also be hosted on Amazon Web Services (AWS) or in similar cloud environments.

reQuest: One-stop Portal for Data Access, Collaboration and Management

One of the most difficult aspects of genomics research is finding specific data across multiple, growing, and often separate data repositories. Individual files can also be very large, and the metadata extensive and difficult to interpret. The reQuest portal addresses these challenges by providing a single point of access to the contents of all accessible GNOS-enabled repositories. Researchers can employ reQuest's data exploration capabilities to analyze the data trends across available repositories. The portal's Access and Download capabilities allow researchers to drill down to find and download specific data sets.

The Explore, Access, Download, and Collaboration capabilities of reQuest are available to the community through standard web browsers, enabling users to query, retrieve, and monitor download progress without having to install or master complex proprietary tools or query syntax.

Technical Specifications

The following describes reQuest's key functionality:

- Explore: a graphical interface to interrogate and analyze the contents of any Annai-GNOS enabled data repository using data statistics and metadata. This function enables searches based on organization, study, disease, and other key terms to explore the genomic data set.
- Access: a powerful yet user-friendly metadata query building capability that allows the researcher to find and select a set of individual sequence files for download. The download of files can be initiated from the Access area once the desired files are designated. Annai reQuest offers conditional access, as some data repositories, such as the TCGA data hosted on CGHub, require access authorization credentials in order to download sequence files. The status of current and past download requests can be reviewed from a single dashboard.
- Download: users can view the status of each file within their download requests, and a complete history of downloads is maintained to support experiment reproducibility.
- Collaborate: provides public and private collaboration sites to engage with colleagues and share knowledge around common projects and frequently accessed datasets, broadening the community of academic and clinical researchers.

The collaborative capabilities of reQuest facilitate cross-team communication and allow for better distribution of tasks. For example, a team member responsible for defining the experimental parameters could select the appropriate data and pass it to a bioinformatician who is performing the analysis.

[Figure 2 diagram labels: System Management; Data Explorer; Annai reQuest portal; Data Access; Portal Management; Database; Data Download; Metadata Ingest; Operating System; Communications Broker]

FIGURE 2. The reQuest portal helps expedite research through a well-managed, user-friendly portal environment.

GTFuse: Accelerated Data Queries

GTFuse enables researchers to directly access remote sequence data files as if they were on the local file system. GTFuse allows researchers to mount the desired data and immediately run any existing tools, such as SAMtools, to inspect the header and begin accessing specific regions of the sequence data (for example, data from a particular chromosome, gene, or region).

Technical Specifications

The following describes GTFuse's key functionality:

- Mounts a remote file on the local file system
- Provides asynchronous access to files via the GeneTorrent protocol
- No data transfer until the file is accessed by the user on the local file system

[Figure 3 diagram labels: GNOS; Genomic File; GTFuse client; HPC Analysis Clusters; Relevant data within file; GTFuse client; Local Analysis Tools]

FIGURE 3. GTFuse provides the option to search and download the specific genes or regions required instead of the entire file. It requires no tools integration and allows any analysis tool to access data files as if they were local.

Researchers often want to quickly examine specific regions of genomic data in remote repositories without retrieving the entire BAM file or analysis object. Alternatively, researchers may need to read entire files but lack the storage capacity to maintain local copies of large numbers of BAM files. Other tasks are difficult due to the large size of sequence files. For example, a researcher may spend hours downloading BAM files to inspect their headers and determine if there is sufficient coverage depth for their analysis. For all of these scenarios, GTFuse provides a speedy and economical solution by substantially shortening the time researchers spend preparing to undertake the analysis that interests them and by helping to conserve IT resources.

Use Case 1: Asynchronous BAM file access

A researcher wants to use SAMtools to view specific genome data coordinates. The researcher uses GTFuse to open a BAM file and its corresponding BAI file and perform seek operations to read small portions from the BAM file asynchronously.

Use Case 2: Process remote file locally

A researcher avoids using large amounts of local disk storage by mounting a remote BAM file using GTFuse before building a BAI index file locally.
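The "no data transfer until the file is accessed" behaviour can be illustrated with a small on-demand block reader. This is a conceptual sketch only, not GTFuse's implementation: the class `LazyRemoteFile`, the `fetch_block` callback, the 4-byte block size, and the in-memory `REMOTE` stand-in for a repository are all assumptions made for the example, and the real tool works through a mounted file system and the GeneTorrent protocol rather than a Python callback.

```python
class LazyRemoteFile:
    """Read-only view of a remote file that fetches blocks only when read."""

    def __init__(self, fetch_block, size, block_size=4):
        self._fetch_block = fetch_block  # callable: block index -> bytes
        self._size = size
        self._block_size = block_size
        self._cache = {}                 # block index -> bytes already fetched
        self._pos = 0
        self.bytes_transferred = 0       # total bytes pulled from the remote

    def seek(self, offset):
        """Move the read position, clamped to the file bounds."""
        self._pos = max(0, min(offset, self._size))

    def read(self, length):
        """Return up to `length` bytes, fetching only the blocks touched."""
        end = min(self._pos + length, self._size)
        out = bytearray()
        while self._pos < end:
            index = self._pos // self._block_size
            if index not in self._cache:
                block = self._fetch_block(index)
                self._cache[index] = block
                self.bytes_transferred += len(block)
            block = self._cache[index]
            start = self._pos % self._block_size
            take = min(end - self._pos, len(block) - start)
            if take <= 0:  # remote returned a short block; stop reading
                break
            out += block[start:start + take]
            self._pos += take
        return bytes(out)


# Stand-in for a remote repository holding a 40-byte file.
REMOTE = bytes(range(40))


def fetch_block(index, block_size=4):
    """Return one block of the stand-in remote file."""
    return REMOTE[index * block_size:(index + 1) * block_size]


remote_file = LazyRemoteFile(fetch_block, size=len(REMOTE), block_size=4)
transferred_before_read = remote_file.bytes_transferred
remote_file.seek(10)
region = remote_file.read(5)
print(region == REMOTE[10:15], remote_file.bytes_transferred, len(REMOTE))
```

Reading 5 bytes at offset 10 touches only two of the ten blocks, so the transfer cost tracks the region requested rather than the file size. That is the property that makes header inspection or single-region queries on very large BAM files inexpensive.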

BioCompute Farm: Enabling Simple, Streamlined Data Analysis

The BioCompute Farm is a private cloud designed specifically for genomic data analysis. The BioCompute Farm allows collaborators to use an elastic pool of compute servers and run cross-organizational experiments without up-front capital expense, IT development effort, ongoing maintenance, or significant lead time. Local GNOS-enabled compute databases, a pre-installed set of analysis tools, a stored set of reference genomes, and specialized data access greatly simplify genomic data gathering and analysis. These efficiencies reduce the resources and time needed to accomplish complex genomic data analysis.

Researchers can instantly activate virtual machines in the highly secure BioCompute Farm and collaborate with colleagues across the globe. Data input and output is free on the BioCompute Farm. The BioCompute Farm's high-speed network transfer capability removes the need to ship hard disks containing potentially sensitive data between organizations, with the attendant risks and delays. The BioCompute Farm's flexible storage allows researchers to import large volumes of data for analysis and to discard it afterwards. This allows researchers to avoid the difficulties and delays of expanding existing local IT infrastructure to cope with moving and processing large volumes of sequencing data.

[Figure 4 diagram labels: Customer Site; Access Control; Researcher (three); CGHub; Compute Console; reQuest Portal; Transfer Control; Sequence Data; DataCenter Fabric; San Diego Supercomputer Center; Internet; Annai BioCompute Farm; Genome Analysis Tools & GTFuse]

FIGURE 4. The BioCompute Farm offers high-performance computing, storage, and networking resources in a virtualized computing environment.

Technical Specifications

The BioCompute Farm has the following key functionalities:

- High-performance compute power, including 10 Gb/s networking, 100 GB of memory and highly scalable storage capacity, to deliver performance optimized for bioinformatics application needs.
- Users have complete control over their virtual instances. Additional instances, memory and storage capacity can be added as needed. Custom user tracking and reporting can be enabled.
- Instances include bioinformatics and data extraction tools for large-scale and complex genomic analysis. Users can add additional tools and save them for future reuse. Workflows can be set up to launch automatically.

There are two primary uses of the BioCompute Farm. One use is serving clients who need to do analytical research with repositories such as CGHub and do not need to store data at the compute center. Typically, they want to analyze primary sequence data in the BioCompute Farm and pull result datasets back to their local environments. By using GTFuse, researchers can extract the genes or regions of interest instead of bulk copying whole sequence files. This is one of the most significant advantages of GTFuse used in conjunction with the BioCompute Farm. In particular cases where a handful of genes are studied across many genomes, TCGA researchers use up to one hundred times less compute and storage capacity by working only with the actively used TCGA data.

Use Case 1: CGHub BioCompute Farm

The CGHub BioCompute Farm is co-located with CGHub, home of genomic data from The Cancer Genome Atlas, within the San Diego Supercomputer Center. The BioCompute Farm has a 10 Gb/s connection to CGHub and the Internet. Annai's reQuest web portal enables users to rapidly browse the genomic data sets via customized and automated searches, and to bring the desired data into the user applications running in the BioCompute Farm.
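To put the figures quoted in this paper in perspective, the following back-of-the-envelope calculation compares moving a 1-terabyte sequence file at 200 Mbps, moving it over a 10 Gb/s link, and moving only one hundredth of it. It is a rough sketch that assumes decimal units (1 TB = 10^12 bytes) and an ideal, fully sustained line rate with no protocol overhead; the helper name `transfer_seconds` is illustrative.

```python
def transfer_seconds(size_bytes, rate_bits_per_second):
    """Ideal transfer time: payload bits divided by the sustained line rate."""
    return size_bytes * 8 / rate_bits_per_second


TB = 10**12          # decimal terabyte, in bytes
MBPS_200 = 200e6     # 200 Mbps, the routinely achieved speed quoted above
GBPS_10 = 10e9       # 10 Gb/s, the BioCompute Farm link to CGHub

whole_file_wan = transfer_seconds(TB, MBPS_200)
whole_file_colocated = transfer_seconds(TB, GBPS_10)
subset_wan = transfer_seconds(TB // 100, MBPS_200)  # regions of interest only

for label, seconds in [
    ("1 TB at 200 Mbps", whole_file_wan),
    ("1 TB at 10 Gb/s", whole_file_colocated),
    ("1/100 of 1 TB at 200 Mbps", subset_wan),
]:
    print(f"{label}: {seconds:,.0f} s ({seconds / 3600:.2f} h)")
```

Under these assumptions, a whole-file transfer takes about 11 hours at 200 Mbps and roughly 13 minutes at 10 Gb/s, while the hundredfold-smaller subset takes under 7 minutes even at 200 Mbps. This is the arithmetic behind co-locating compute with the repository and extracting only the regions of interest.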

Use Case 2: Private BioCompute Farm

A private BioCompute Farm can be co-located with an in-house GNOS-enabled data repository tailored to meet the particular requirements of a research organization. Annai provides installation, configuration and GeneTorrent training to researchers. Optionally, mapping, alignment, and variant calling tools can also be pre-installed in the BioCompute Farm. Having data analysis capacity co-located with in-house data can substantially reduce costs and speed up genomic data analysis.

Conclusion

Advancing translational research and genomic medicine requires distilling valuable, actionable information from hundreds or thousands of genomic sequence files, which raises a unique set of big data challenges. Responding to these challenges, Annai Systems has developed the Annai-GNOS platform, which drives robust repository operations to meet the real-world needs of users by providing metadata-based indexing, search, and access to multiple distributed data sets; high-speed file transfer; rapid extraction of designated elements from multiple files; and a user-friendly alternative to the command-line interface.

Annai Systems Inc.
www.annaisystems.com
Tel. 408 395-3621
475 Alberto Way, Suite 120
Los Gatos, California, 95032