Automated and Scalable Data Management System for Genome Sequencing Data




Michael Mueller
NIHR Imperial BRC Informatics Facility, Faculty of Medicine, Hammersmith Hospital Campus

Continuously falling costs have resulted in widespread adoption of next-generation sequencing (NGS) technologies.

[Figure: sequencing costs per genome; annotation: NGS technology becomes commercially available]

NIHR Imperial BRC Translational Genomics Unit
Support for translational NGS projects
AHSC MRC Genomics; NIHR Imperial BRC Informatics Facility
Library Preparation, Sequencing, Data Analysis

NIHR Imperial BRC Translational Genomics Unit
Integrated support for translational NGS projects
Library Preparation, Sequencing, Data Analysis

Data generation, storage and processing

Translational Genomics Unit
  Illumina HiSeq 2500: up to 8TB of raw data per run
  Illumina MiSeq: up to 100GB of raw data per run
  Dell PowerEdge Server: 40TB storage

High Performance Computing Service
  PC cluster (cx1): dedicated access to 320 cores, 20TB + 80TB storage
  SGI Altix UV (ax3): shared access to 384 cores, 28TB storage

Data Centre
  HP ProLiant Server: 15TB storage

Data generation, storage and processing

Stages: data generation; data processing & analysis; data dissemination & archiving

Locations: DNA sequencer and temporary storage (sequencing facility); cluster (high performance computing service); tape and encrypted server (data centre); public repository (European Genome-Phenome Archive)

Data volumes: raw data 8TB; sequencing reads (fastq) 500GB; mapped reads (BAM) 200GB; compressed reads (CRAM) 100GB

Data retention times: 2-3 weeks; 3 years; >3 years

Workflow steps:
(1) transfer of raw data from local storage server to HPC Service
(2) extraction of sequencing read information from raw data
(3) mapping of sequencing reads to reference genome sequence
(4) reference-based compression of mapped read data
(5) encryption of compressed read data
(6) local archiving of encrypted read data on tape
(7) remote archiving at public repository
(8) local dissemination of mapped read data
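The data volumes quoted for the workflow imply a substantial reduction at each processing step. The following is a minimal sketch of that arithmetic, not part of the original slides. It assumes decimal units (1 TB = 1000 GB) and assumes that the 200GB and 100GB figures belong to the BAM and CRAM stages respectively, which is how the figures appear to line up; the names `STAGES`, `reduction_factors` and `overall_reduction` are illustrative only.

```python
# Approximate per-run data volumes in GB, as read from the workflow slide.
# Assumptions: 1 TB = 1000 GB; 200GB is paired with BAM and 100GB with CRAM.
STAGES = [
    ("raw data", 8000),
    ("sequencing reads (fastq)", 500),
    ("mapped reads (BAM)", 200),
    ("compressed reads (CRAM)", 100),
]


def reduction_factors(stages):
    """Return (from_stage, to_stage, factor) for each consecutive pair of stages."""
    return [
        (a_name, b_name, a_gb / b_gb)
        for (a_name, a_gb), (b_name, b_gb) in zip(stages, stages[1:])
    ]


def overall_reduction(stages):
    """Return the ratio between the first and the last stage volume."""
    return stages[0][1] / stages[-1][1]


if __name__ == "__main__":
    for src, dst, factor in reduction_factors(STAGES):
        print(f"{src} -> {dst}: {factor:g}x smaller")
    print(f"overall: {overall_reduction(STAGES):g}x")
```

Under these assumptions the archived CRAM data is roughly 80 times smaller than the raw instrument output, which is consistent with raw data being kept only briefly while the compressed reads are archived long term.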

Data Management System Requirements

Ensure data integrity and security
Avoid unnecessary data replication
Facilitate data access for analysis and sharing
Allow for a high degree of automation
Scalable
Comply with College and funder data preservation requirements

integrated Rule-Oriented Data System (iRODS)

"...software middleware that manages a highly controlled collection of distributed digital objects, while enforcing user-defined Management Policies across the multiple storage locations"

integrated Rule-Oriented Data System

Open source under a BSD license
Developed by the Data Intensive Cyber Environments research group (University of North Carolina at Chapel Hill (UNC) and the University of California, San Diego (UCSD)) and collaborators
Uniform interface to distributed and heterogeneous data storage resources
Implements a logical namespace and maintains a metadata catalogue (iCAT) on data objects
Features a highly configurable rule engine to manage the processing, sharing, replication, transfer, and preservation of distributed data collections
Various end-user client applications: command line (i-commands), web client, iRODS Explorer for Windows, Java and PHP client APIs
Used for the management of genome sequencing data at the Wellcome Trust Sanger Institute
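To illustrate what a logical namespace with a metadata catalogue provides, the sketch below models the idea in plain Python. It is a toy model and does not use or reproduce the iRODS API: the classes `DataObject` and `ToyCatalogue`, the resource names and the paths are all hypothetical. The point it shows is that one logical path can map to several physical replicas, and that objects can be found by metadata rather than by storage location.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """One logical data object (toy model, not an iRODS class)."""
    logical_path: str
    replicas: dict = field(default_factory=dict)  # resource name -> physical path
    metadata: dict = field(default_factory=dict)  # attribute -> value


class ToyCatalogue:
    """Maps logical paths to replicas and metadata, loosely like a catalogue."""

    def __init__(self):
        self._objects = {}

    def register(self, logical_path, resource, physical_path):
        """Record that a replica of logical_path lives on the given resource."""
        obj = self._objects.setdefault(logical_path, DataObject(logical_path))
        obj.replicas[resource] = physical_path
        return obj

    def annotate(self, logical_path, attribute, value):
        """Attach an attribute/value pair to an already registered object."""
        self._objects[logical_path].metadata[attribute] = value

    def find(self, attribute, value):
        """Return the sorted logical paths whose metadata matches attribute == value."""
        return sorted(
            path
            for path, obj in self._objects.items()
            if obj.metadata.get(attribute) == value
        )


if __name__ == "__main__":
    cat = ToyCatalogue()
    # Hypothetical paths and resource names, for illustration only.
    cat.register("/zone/run1/sample1.cram", "hpc_disk", "/data/a/0001.cram")
    cat.register("/zone/run1/sample1.cram", "tape", "vol7:0001")
    cat.annotate("/zone/run1/sample1.cram", "project", "demo")
    print(cat.find("project", "demo"))
```

A client only ever refers to the logical path; where the replicas physically reside is a matter for the catalogue.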

integrated Rule-Oriented Data System

Peer-to-peer server architecture
Rule Engine (source: wiki.irods.org)

Management policies expressed as computer-actionable rules
Management procedures expressed as sets of remotely executable micro-services
Rules control execution of micro-services
State information generated by micro-services is stored in the metadata catalogue (iCAT)
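The relationship between rules, micro-services and the catalogue can be sketched as a small event-driven loop. This is a conceptual toy in plain Python under stated assumptions, not the iRODS rule language or its micro-service interface: `ToyRuleEngine`, the event name `on_ingest`, and the two micro-services `checksum` and `record_size` are hypothetical names chosen for illustration.

```python
import hashlib


class ToyRuleEngine:
    """Rules map an event to an ordered list of micro-services (toy model)."""

    def __init__(self):
        self.rules = {}      # event name -> list of micro-service callables
        self.catalogue = {}  # logical path -> state recorded by micro-services

    def add_rule(self, event, *micro_services):
        """Define which micro-services run, and in what order, for an event."""
        self.rules[event] = list(micro_services)

    def trigger(self, event, logical_path, payload):
        """Run the micro-services for an event and store the state they return."""
        state = self.catalogue.setdefault(logical_path, {})
        for micro_service in self.rules.get(event, []):
            state.update(micro_service(payload))
        return state


def checksum(payload: bytes) -> dict:
    """Hypothetical micro-service: compute a SHA-256 digest of the data."""
    return {"sha256": hashlib.sha256(payload).hexdigest()}


def record_size(payload: bytes) -> dict:
    """Hypothetical micro-service: record the size of the data in bytes."""
    return {"size_bytes": len(payload)}


if __name__ == "__main__":
    engine = ToyRuleEngine()
    engine.add_rule("on_ingest", checksum, record_size)
    print(engine.trigger("on_ingest", "/zone/run1/sample1.fastq", b"ACGT"))
```

The policy (which checks run on ingest) lives in the rule, the procedures live in the micro-services, and the resulting state is kept per data object, mirroring the separation described above.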

Dalia Kasperaviciute, Bioinformatician, Informatics Facility
Alona Sosinsky, Bioinformatician, Informatics Facility
Anna Zekavati, Head
Emmanuel Savage, Technician
Jorge Ferrer, Theme Leader Genetics & Genomics, NIHR Imperial BRC
Simon Burbidge, Manager, Imperial HPC Service
Steve Lawlor, Manager, ICT Computer Room