ENABLING DATA TRANSFER, MANAGEMENT, AND SHARING IN THE ERA OF GENOMIC MEDICINE

October 2013
Introduction

As sequencing technologies continue to evolve and genomic data makes its way into clinical use and medical practice, a momentous challenge arises: how to cope with the rapidly increasing volume of complex data. Issues such as data storage, access, transfer, sharing, security, and analysis must be resolved to enable the new era of genomic medicine. Annai Systems provides several tools to enable and enhance genomic data use: the Annai-GNOS data management platform, GeneTorrent and GTFuse for accelerated file transfer and file mining, the reQuest portal for collaboration and discovery, and the BioCompute Farm for analytical power. These tools can be deployed in concert or independently.

Annai Platform Components

Annai-GNOS provides a fast, scalable, and robust network solution for storing, moving, finding, and securing genomic sequence data and associated metadata. GNOS-enabled repositories can handle multiple petabytes of next-generation sequencing data for fast and flexible storage, search, and retrieval.

GeneTorrent is a data transfer protocol that allows for high-speed transfer of data files into and out of a given GNOS-enabled repository. The repository and file transfer capabilities are highly secure and meet government standards, as defined by the Federal Information Security Management Act of 2002 (FISMA).

BioCompute Farm is a virtualized computation environment that provides on-demand compute power specifically optimized to facilitate analysis of genomic data. Users can enjoy high-throughput computing without having to build local high-performance compute platforms or transfer massive data files over the Internet.

reQuest is a web portal that employs a query and networking infrastructure enabling researchers to search, find, and manage downloads from multiple GNOS-enabled data repositories. reQuest's intuitive user interface streamlines the process of exploring and searching genomic data.
GTFuse amplifies GeneTorrent's fast transfer speeds by allowing users to download selected portions of large genomic data files, such as those at CGHub. GTFuse allows researchers to find and access sequence data files as swiftly as if they were on the local network. GTFuse's option to select and retrieve a designated subset or region of a BAM file dramatically reduces data transfer times and costs.

2013 ANNAI SYSTEMS ALL RIGHTS RESERVED
A growing number of public and private repositories are emerging as integral parts of the drug discovery and therapeutic treatment process. These data repositories vary greatly in data use, efficiency of data upload/download and access, regulatory compliance, and security configurations. Furthermore, genomic data comes in a wide variety of formats and from various sequencing platforms. As the integration of genomic data with clinical data becomes increasingly required, there is an urgent need for genomic data tools that provide flexible, scalable solutions for a wide diversity of uses.

The Cancer Genomics Hub (CGHub) is a vast repository of cancer genome data accessed freely by hundreds of researchers and clinicians, in both academic and commercial environments. CGHub uses Annai-GNOS to provide highly scalable access to The Cancer Genome Atlas (TCGA) and other cancer genome data sets. CGHub was launched in 2012 at UC Santa Cruz and now holds over 55,000 cancer genome files totaling 675 terabytes. Hundreds of researchers from dozens of institutions rely on CGHub for access to cancer genome data from ten world-class sequencing centers, including the Broad Institute, Washington University, and Baylor College of Medicine. The repository is expected to grow to 5 petabytes in the next few years.

Annai supports both research and clinical settings by providing a powerful and flexible environment that enables users at all levels of IT skill to easily accomplish genomic data handling and analysis tasks.

[Figure 1 diagram. Labels: AnnaiBCF, Annai reQuest Research Portal, AnnaiGNOS Genome Network Operating System, GNOS Web Services, AnnaiGTFuse, Federated Authentication, GNOS Repository, Public Genomic Data, Private Genomic Data, GeneTorrent Data Transfer]

FIGURE 1. The Annai-GNOS environment and related peripheral data management tools. The various components of the Annai platform can be deployed together as an integrated whole or independently.
When deployed in full, the Annai-GNOS system boosts productivity, reduces time-to-insight, and ensures data security while facilitating collaboration. Researchers or clinicians can quickly search and extract specific segments from thousands of genomes, work independently or collaborate with a team to analyze the data, and prepare their findings for publication or use in the clinic to guide therapy.

The Annai-GNOS platform is designed to accelerate genomic research. A closer examination of its components shows how they work together as a system with unique and comprehensive capabilities.
Annai-GNOS: A Platform for High-Performance Genomic Analysis and Data Management

Annai-GNOS is a unique integration of the data repository infrastructure and high-speed networking capabilities needed to accommodate large genomic data sets. These data sets are characterized by diverse file formats, extensive metadata, and large file sizes, with individual sequence datasets ranging from 10 gigabytes to more than 1 terabyte (depending on the depth of coverage).

Annai-GNOS allows the entire user community to see the state of data throughout the submission lifecycle, including data that has not yet been approved or submitted for download. Researchers can query the state of data as soon as it is submitted and quickly identify submissions that may require attention due to formatting or other problems, before they are available to users of the repository. Flexible metadata searching greatly simplifies finding the right sequence file, and a highly fault-tolerant design ensures services continue to be available.

The GNOS network functionality integrates secure, high-speed network protocols to mobilize petabyte-scale genomic data analysis. Annai-GNOS can also be integrated with federated authentication systems such as InCommon and the National Cancer Institute's authorization systems.

Technical Specifications

GNOS features the following capabilities:
- User-programmable metadata format validation engine
- Support for multiple metadata formats, including customer-defined formats and the Sequence Read Archive (SRA) schemas used by NCBI, EBI, and DDBJ
- Support for multiple sequence data file types
- Ability to store other file types, such as compressed sequences
- Accelerated file transfer using GeneTorrent and GTFuse
- InCommon (Shibboleth)-based federated user authentication
- Project-based data authorization to control individual researcher access
- Support for commonly used file format standards and analysis tools, including the NCBI SRA metadata format; the TCGA v2 BAM and VCF file formats; and GATK, Bowtie, TopHat, Cufflinks, and additional tools

The GNOS platform streamlines all aspects of genomic data management and access for researchers and clinicians. Setting up a GNOS repository consists of two steps: 1) data ingestion (duration depends on the state of the data) and 2) data deployment as indexed metadata tags in the GNOS database. Sequence data are entered into the repository using Annai's proprietary GeneTorrent tool and metadata submission API.

Researchers can use the reQuest web portal to quickly and easily explore GNOS-enabled data. For example, a simple search for ovarian cancer in CGHub using reQuest can instantly report the number of ovarian cancer genome files contained in the database and how many are RNA-Seq, exome, or whole genome. The interface also lets the user drill down quickly to the specific files of interest.

The ability to quickly visualize the contents of a GNOS repository is based on searching metadata attributes that are extracted from sequencing files, catalogued, and indexed. Query parameters are unlimited, but typically include file type, disease, sample collection date, sequencing platform, date of sequencing, and mapping and alignment tools.

GNOS is suitable for public and/or private genomic databases of translational and basic research centers, pharmaceutical R&D labs, diagnostic companies, and similar organizations generating significant volumes of sequence data. GNOS provides tools to help catalogue, index, upload, and download files, and to make the data available for collaboration. GNOS can also be integrated with any data management and transfer method or protocol.
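The metadata-driven search described above (matching indexed attributes, then counting files by sequencing type) can be illustrated with a minimal sketch. This is not the GNOS or reQuest API: the record fields, the `search` and `facet_counts` helpers, and the sample catalog are all hypothetical and only mirror the query parameters named in the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceFileRecord:
    """Hypothetical indexed metadata record; field names are illustrative."""
    file_id: str
    disease: str
    library_strategy: str  # e.g. "RNA-Seq", "exome", "whole genome"
    platform: str
    file_type: str


def search(records, **criteria):
    """Return the records whose attributes equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


def facet_counts(records, field):
    """Count records per distinct value of one metadata attribute."""
    return Counter(getattr(r, field) for r in records)


# Invented sample catalog, for illustration only.
CATALOG = [
    SequenceFileRecord("f1", "ovarian cancer", "RNA-Seq", "Illumina", "BAM"),
    SequenceFileRecord("f2", "ovarian cancer", "exome", "Illumina", "BAM"),
    SequenceFileRecord("f3", "ovarian cancer", "exome", "SOLiD", "BAM"),
    SequenceFileRecord("f4", "ovarian cancer", "whole genome", "Illumina", "BAM"),
    SequenceFileRecord("f5", "glioblastoma", "exome", "Illumina", "BAM"),
]

ovarian = search(CATALOG, disease="ovarian cancer")
by_strategy = facet_counts(ovarian, "library_strategy")
print(len(ovarian), dict(by_strategy))

# Drilling down is just a search with more criteria.
ovarian_exome_illumina = search(
    CATALOG, disease="ovarian cancer", library_strategy="exome", platform="Illumina"
)
```

With this sample catalog the search finds four ovarian cancer files (one RNA-Seq, two exome, one whole genome), and the drill-down narrows them to a single file. A production repository would answer such queries from a database index rather than a list scan; the sketch only shows the shape of the query.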
Use Case 1: CGHub Cancer Genome Repository

The University of California Santa Cruz (UCSC) provides CGHub, the world's largest repository of cancer genome data. CGHub is built on GNOS and, after rigorous testing with active TCGA users, was established as the new secure repository for The Cancer Genome Atlas (TCGA) on April 30, 2012.

Use Case 2: Drug Development

Pharmaceutical companies have strict requirements for data protection and security. Corporate policies may mandate keeping data behind a firewall. In this case, an in-house GNOS repository is an optimal solution. After installation by Annai, this type of repository is managed by the company's local experts within its existing high-performance computing infrastructure.
GeneTorrent: Accelerated Secure File Transport

Whole genome sequence data files range from several hundred gigabytes to over one terabyte in size. GeneTorrent enables accelerated transfers of terabyte-scale data. It employs a proprietary variant of the popular BitTorrent protocol to securely transfer files at speeds limited only by the base network bandwidth.

Technical Specifications

GeneTorrent's key functionality is as follows:
- High-fidelity parallel file transfer at up to multi-Gbit/s (speeds as high as 200 Mbps are routinely achieved)
- Highly resilient to in-network and computing failures, with automatic recovery
- Highly secure 256-bit encrypted file transfer

Use Case: Translational and Clinical Research

Translational researchers and clinicians use GeneTorrent to push sequence data, either locally or from an external sequencing lab, into a GNOS repository either installed in their facility or hosted by Annai in the BioCompute Farm. GNOS-enabled repositories can also be hosted on Amazon Web Services (AWS) or in similar cloud environments.

reQuest: One-stop Portal for Data Access, Collaboration and Management

One of the most difficult aspects of genomics research is finding specific data across multiple, growing, and often separate data repositories. Individual files can also be very large, and the metadata extensive and difficult to interpret. The reQuest portal addresses these challenges by providing a single point of access to the contents of all accessible GNOS-enabled repositories. Researchers can use reQuest's data exploration capabilities to analyze data trends across available repositories. The portal's Access and Download capabilities allow researchers to drill down to find and download specific data sets.
The Explore, Access, Download, and Collaboration capabilities of reQuest are available to the community through standard web browsers, enabling users to query, retrieve, and monitor download progress without having to install or master complex proprietary tools or query syntax.

Technical Specifications

The following describes reQuest's key functionality:
- Explore: a graphical interface to interrogate and analyze the contents of any Annai-GNOS-enabled data repository using data statistics and metadata. This function enables searches based on organization, study, disease, and other key terms to explore the genomic data set.
- Access: a powerful yet user-friendly metadata query-building capability that allows the researcher to find and select a set of individual sequence files for download. The download of files can be initiated from the Access area once the desired files are designated. Annai reQuest offers conditional access, as some data repositories, such as the TCGA data hosted on CGHub, require access authorization credentials in order to download sequence files. The status of current and past download requests can be reviewed from a single dashboard.
- Download: users can view the status of each file within their download requests, and a complete history of downloads is maintained to support experiment reproducibility.
- Collaborate: provides public and private collaboration sites to engage with colleagues and share knowledge around common projects and frequently accessed datasets, broadening the community of academic and clinical researchers.
The collaborative capabilities of reQuest facilitate cross-team communication and allow for better distribution of tasks. For example, a team member responsible for defining the experimental parameters could select the appropriate data and pass it to a bioinformatician who is performing the analysis.

[Figure 2 diagram. Labels: System Management, Data Explorer, Annai reQuest portal, Data Access, Portal Management, Database, Data Download, Metadata Ingest, Operating System, Communications Broker]

FIGURE 2. reQuest Portal helping to expedite research through a well-managed, user-friendly portal environment.

GTFuse: Accelerated Data Queries

GTFuse enables researchers to directly access remote sequence data files as if they were on the local file system. GTFuse allows researchers to mount the desired data and immediately run any existing tools, such as SAMtools, to inspect the header and begin accessing specific regions of the sequence data (for example, data from a particular chromosome, gene, or region).

Technical Specifications

The following describes GTFuse's key functionality:
- Mounts remote file on local file system
- Provides asynchronous access to files via GeneTorrent protocol
- No data transfer until file is accessed by the user on local file system

[Figure 3 diagram. Labels: GNOS, Genomic File, Relevant data within file, GTFuse client, HPC Analysis Clusters, Local Analysis Tools]

FIGURE 3. GTFuse provides the option to search and download the specific genes or regions required instead of the entire file. It requires no tools integration and allows any analysis tool to access data files as if they were local.

Researchers often want to quickly examine specific regions of genomic data in remote repositories without retrieving the entire BAM file or analysis object. Alternatively, researchers may need to read entire files but do not have the storage capacity to maintain local copies of large numbers of BAM files. Other tasks are difficult due to the large size of sequence files.
For example, a researcher may spend hours downloading BAM files to inspect their headers and determine if there is sufficient coverage depth for their analysis. For all of these scenarios, GTFuse provides a speedy and economical solution by substantially shortening the time researchers spend preparing for the analysis that interests them and by helping to conserve IT resources.

Use Case 1: Asynchronous BAM file access

A researcher wants to use SAMtools to view specific genome data coordinates. The researcher uses GTFuse to open a BAM file and its corresponding BAI file and perform seek operations to read small portions from the BAM file asynchronously.

Use Case 2: Process remote file locally

A researcher avoids using large amounts of local disk storage by mounting a remote BAM file using GTFuse before building a BAI index file locally.
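The on-demand behavior described for GTFuse (no data moves until a tool reads it, and seeks fetch only the portions touched) can be sketched with a small simulation. This is not the GTFuse implementation or its interface: the `LazyRemoteFile` class, its block size, and the in-memory stand-in for the remote file are assumptions made purely for illustration, and `transfer_seconds` is a simple bandwidth calculation, not a measured result.

```python
class LazyRemoteFile:
    """Simulates a remote file whose blocks are fetched only when read.

    Hypothetical illustration: `_fetch_block` stands in for whatever
    network transfer a real client would perform.
    """

    def __init__(self, remote_bytes: bytes, block_size: int = 4096):
        self._remote = remote_bytes      # stand-in for data held by the repository
        self._block_size = block_size
        self._cache = {}                 # block index -> bytes already fetched
        self.bytes_transferred = 0       # running total of simulated transfer

    def _fetch_block(self, index: int) -> bytes:
        # Fetch a block once; later reads of the same block hit the cache.
        if index not in self._cache:
            start = index * self._block_size
            block = self._remote[start:start + self._block_size]
            self.bytes_transferred += len(block)
            self._cache[index] = block
        return self._cache[index]

    def read(self, offset: int, length: int) -> bytes:
        """Read `length` bytes at `offset`, fetching only the blocks touched."""
        end = min(offset + length, len(self._remote))
        if length <= 0 or offset >= end:
            return b""
        first = offset // self._block_size
        last = (end - 1) // self._block_size
        data = b"".join(self._fetch_block(i) for i in range(first, last + 1))
        skip = offset - first * self._block_size
        return data[skip:skip + (end - offset)]


def transfer_seconds(num_bytes: int, megabits_per_second: float) -> float:
    """Idealized transfer time at a sustained rate (no protocol overhead)."""
    return num_bytes * 8 / (megabits_per_second * 1_000_000)


remote = bytes(range(256)) * 4096        # 1 MiB stand-in for a large BAM file
f = LazyRemoteFile(remote, block_size=4096)
header = f.read(0, 100)                  # inspect the start of the file
region = f.read(500_000, 10_000)         # seek to a region of interest
print(f.bytes_transferred, len(remote))
```

In this simulation the two reads move 16,384 bytes (four 4 KiB blocks) out of 1,048,576, which is the effect the text attributes to region-level access. For scale, the same idealized arithmetic gives 40,000 seconds (roughly 11 hours) to move a full 1 TB file at the 200 Mbps rate this paper quotes for GeneTorrent, which is why avoiding whole-file downloads matters.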
BioCompute Farm: Enabling Simple, Streamlined Data Analysis

The BioCompute Farm is a private cloud designed specifically for genomic data analysis. It allows collaborators to use an elastic pool of compute servers and run cross-organizational experiments without up-front capital expense, IT development effort, ongoing maintenance, or significant lead time. Local GNOS-enabled compute databases, a pre-installed set of analysis tools, a stored set of reference genomes, and specialized data access greatly simplify genomic data gathering and analysis.

The BioCompute Farm's unique efficiencies reduce the resources and time needed to accomplish complex genomic data analysis. Researchers can instantly activate virtual machines in our highly secure BioCompute Farm and collaborate with colleagues across the globe. Data input and output are free on the BioCompute Farm. The BioCompute Farm's high-speed network transfer capability removes the need to ship hard disks containing potentially sensitive data between organizations, with the attendant risks and delays.

The BioCompute Farm's flexible storage allows researchers to import large volumes of data for analysis and to discard it afterwards. This allows researchers to avoid the difficulties and delays of expanding existing local IT infrastructure to cope with moving and processing large volumes of sequencing data.

[Figure 4 diagram. Labels: Customer Site, Access Control, Researcher, CGHub, Compute Console, reQuest Portal, Transfer Control, Sequence Data, DataCenter Fabric, San Diego Supercomputer Center, Internet, Annai BioCompute Farm, Genome Analysis Tools & GTFuse]

FIGURE 4. The BioCompute Farm offers high-performance computing, storage, and networking resources in a virtualized computing environment.

Technical Specifications

The BioCompute Farm has the following key functionalities:
- High-performance compute power, including 10G networking, 100 GB of memory, and highly scalable storage capacity, to deliver performance optimized for bioinformatics application needs.
- Users have complete control over their virtual instances. Additional instances, memory, and storage capacity can be added as needed. Custom user tracking and reporting can be enabled.
- Instances include bioinformatics and data extraction tools for large-scale and complex genomic analysis. Users can add additional tools and save them for future reuse. Workflows can be set up to launch automatically.

There are two primary uses of the BioCompute Farm. One is serving clients who need to do analytical research with repositories such as CGHub and do not need to store data at the compute center. Typically, they want to analyze primary sequence data in the BioCompute Farm and pull result datasets back to their local environments. By using GTFuse, researchers can extract the genes or regions of interest instead of bulk copying whole sequence files. This is one of the most significant advantages of GTFuse used in conjunction with the BioCompute Farm. In some cases where a handful of genes are studied across many genomes, TCGA researchers use up to one hundred times less compute and storage capacity by working only with the actively used TCGA data.

Use Case 1: CGHub BioCompute Farm

The CGHub BioCompute Farm is co-located with CGHub, home of genomic data from The Cancer Genome Atlas, within the San Diego Supercomputer Center. The BioCompute Farm has a 10 Gb/s connection to CGHub and the Internet.
Annai's reQuest web portal enables users to rapidly browse the genomic data sets via customized and automated searches, and to bring the desired data into the user applications running in the BioCompute Farm.
Use Case 2: Private BioCompute Farm

A private BioCompute Farm can be co-located with an in-house GNOS-enabled data repository tailored to meet the particular requirements of a research organization. Annai provides installation, configuration, and GeneTorrent training to researchers. Optionally, mapping, alignment, and variant calling tools can also be pre-installed in the BioCompute Farm. Having data analysis capacity co-located with in-house data can substantially reduce costs and speed up genomic data analysis.

Conclusion

Advancing translational research and genomic medicine requires distilling valuable, actionable information from hundreds or thousands of genomic sequence files, and it raises a unique set of big data challenges. Responding to these challenges, Annai Systems has developed the Annai-GNOS platform, which drives robust repository operations to meet the real-world needs of users by providing metadata-based indexing, search queries, access to multiple distributed data sets, high-speed file transfer, rapid extraction of designated elements from multiple files, and a user-friendly alternative to the command-line interface.

Annai Systems Inc.
www.annaisystems.com
Tel. 408 395-3621
475 Alberto Way, Suite 120
Los Gatos, California, 95032