Managing Next Generation Sequencing Data with irods



Similar documents
The National Consortium for Data Science (NCDS)

INTEGRATED RULE ORIENTED DATA SYSTEM (IRODS)

Technology solutions for managing and computing on largescale biomedical data

Data Management using irods

Automated and Scalable Data Management System for Genome Sequencing Data

irods for Big Data Management in Research Driven Organizations Charles Schmitt CTO & Director of Informatics RENCI

Integrated Rule-based Data Management System for Genome Sequencing Data

Technical. Overview. ~ a ~ irods version 4.x

Object storage in Cloud Computing and Embedded Processing

Personalized Medicine and IT

Balancing Big Data for Security, Collaboration and Performance

Data management challenges in todays Healthcare and Life Sciences ecosystems

Policy Policy--driven Distributed driven Distributed Data Management (irods) Richard M arciano Marciano marciano@un

Seagate HPC /Big Data Business Tech Talk. December 2014

irods and Metadata survey Version 0.1 Date March Abhijeet Kodgire 25th

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering

irods Policy-Driven Data Preservation Integrating Cloud Storage and Institutional Repositories

Hadoop & its Usage at Facebook

Introduc)on of Pla/orm ISF. Weina Ma

irods Overview Intro to Data Grids and Policy-Driven Data Management!!Leesa Brieger, RENCI! Reagan Moore, DICE & RENCI!

WOS Cloud. ddn.com. Personal Storage for the Enterprise. DDN Solution Brief

Practical Solutions for Big Data Analytics

Hadoop & its Usage at Facebook

Building Bioinformatics Capacity in Africa. Nicky Mulder CBIO Group, UCT

WHITEPAPER. Why Dependency Mapping is Critical for the Modern Data Center

TRANSFORMING DATA PROTECTION

EMC IRODS RESOURCE DRIVERS

Texas Digital Government Summit. Data Analysis Structured vs. Unstructured Data. Presented By: Dave Larson

Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data

The software platform for storing, preserving and sharing very large data sets.

Globus Genomics Tutorial GlobusWorld 2014

A Survey of Shared File Systems

DDN updates object storage platform as it aims to break out of HPC niche

Data Pipeline with Kafka

irods in complying with Public Research Policy

Boas Betzler. Planet. Globally Distributed IaaS Platform Examples AWS and SoftLayer. November 9, IBM Corporation

W H I T E P A P E R E X E C U T I V E S U M M AR Y S I T U AT I O N O V E R V I E W. Sponsored by: EMC Corporation. Laura DuBois May 2010

U"lizing the SDSC Cloud Storage Service

The BIG Data Era has. your storage! Bratislava, Slovakia, 21st March 2013

IBM InfoSphere Guardium Data Activity Monitor for Hadoop-based systems

OpenCB a next generation big data analytics and visualisation platform for the Omics revolution

NERSC Archival Storage: Best Practices

Hadoop Distributed File System. Dhruba Borthakur June, 2007

Unlocking Hadoop for Your Rela4onal DB. Kathleen Technical Account Manager, Cloudera Sqoop PMC Member BigData.

Distributed File Systems An Overview. Nürnberg, Dr. Christian Boehme, GWDG

How To Manage Research Data At Columbia

LabArchives Electronic Lab Notebook:

Let s Talk Digital An Approach to Managing, Storing, and Preserving Time-Based Media Art Works. DAMS Branch Manager Smithsonian Institution, OCIO

irods Overview Introduction to Data Grids, Policy-Driven Data Management, and Enterprise irods

Managing a local Galaxy Instance. Anushka Brownley / Adam Kraut BioTeam Inc.

Alternative Deployment Models for Cloud Computing in HPC Applications. Society of HPC Professionals November 9, 2011 Steve Hebert, Nimbix

irods at CC-IN2P3: managing petabytes of data

1 P a g e Delivering Self -Service Cloud application service using Oracle Enterprise Manager 12c

Building and Managing

Seagate Cloud Systems & Solutions

IBM Campaign Version-independent Integration with IBM Engage Version 1 Release 3 April 8, Integration Guide IBM

Three data delivery cases for EMBL- EBI s Embassy. Guy Cochrane

Report of the DTL focus meeting on Life Science Data Repositories

Big Data in BioMedical Sciences. Steven Newhouse, Head of Technical Services, EMBL-EBI

SIF 3: A NEW BEGINNING

INTRODUCTION APPLICATION DEPLOYMENT WITH ORACLE VIRTUAL ASSEMBLY

Data Management & Storage for NGS

Object Storage, Cloud Storage, and High Capacity File Systems

Clodoaldo Barrera Chief Technical Strategist IBM System Storage. Making a successful transition to Software Defined Storage

<Insert Picture Here> Refreshing Your Data Protection Environment with Next-Generation Architectures

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Apache Hadoop FileSystem and its Usage in Facebook

Hitachi Content Platform as a Continuous Integration Build Artifact Storage System

With Red Hat Enterprise Virtualization, you can: Take advantage of existing people skills and investments

Best Practices Series: Cure Your Management Headaches in IBM Lotus Notes/Domino Environments

High Performance Computing OpenStack Options. September 22, 2015

Top Ten Questions. to Ask Your Primary Storage Provider About Their Data Efficiency. May Copyright 2014 Permabit Technology Corporation

Privileged Administra0on Best Prac0ces :: September 1, 2015

Addressing Storage Management Challenges using Open Source SDS Controller

An Oracle White Paper July Oracle ACFS

Transcription:

Managing Next Generation Sequencing Data with irods Presented by Dan Bedard // danb@renci.org at the 9 th International Conference on Genomics Shenzhen, China September 12, 2014 Managing NGS Data with irods // ICG-9 1

Background: Problem Statement Next Generation Sequencing (NGS) results in lots of data Several GB/genome, for each processing step. cp feels safer than mv We don t know what we don t know. Managing NGS Data with irods // ICG-9 2

Background: Problem Statement Next Generation Sequencing (NGS) results in lots of data Several GB/genome, for each processing step. cp feels safer than mv We don t know what we don t know. We need to store everything. We need to know where it came from. We need to be able to find it. Well we doesn t include everybody. We need to secure it but we still need to share it. Managing NGS Data with irods // ICG-9 3

Background: Problem Statement Next Generation Sequencing (NGS) results in lots of data Several GB/genome, for each processing step. cp feels safer than mv We don t know what we don t know. We need to store everything. We need to know where it came from. We need to be able to find it. Well we doesn t include everybody. We need to secure it but we still need to share it. And time is of the essence! Managing NGS Data with irods // ICG-9 4

Background How do the world s preeminent bioinformatics centers manage their data? Beijing Genomics Institute (BGI) The Wellcome Trust Sanger Institute (WTSI) The Broad Institute The International Neuroinformatics Coordinating Facility (INCF) The iplant Collaborative UNC Lineberger Comprehensive Cancer Center Uppsala Genome Center Public Health England Life Science Industrial Users Managing NGS Data with irods // ICG-9 5

Agenda What is irods? How are People Using It? Reference Implementation for NGS Managing NGS Data with irods // ICG-9 6

What is irods? irods is the underlying technology for the world s preeminent genomic research ins:tutes. irods is an infinitely configurable data janitor. irods is the kind of technology you need irods to is open host source everyone s data grid unstructured middleware for data. irods is a Data Discovery powerful data migra:on tool. irods is the technology that Workflow Automation underpins the iplant Data Store. irods is a data preserva:on Secure Collaboration technology. Data Virtualization irods is a fundamental technology for CineGrid. irods is a tool for providing fine- grained privacy and security controls. irods is extensible: irods has command- line clients, APIs for numerous programming languages, and web clients. irods supports new plug- ins for storage resources, authen:ca:on mechanisms, microservices, and network prot Managing NGS Data with irods // ICG-9 7

What is irods? irods is free the to use underlying technology for the world s free to modify sits between preeminent genomic research ins:tutes. irods is an infinitely free to contribute the files and the user configurable data janitor. irods is the kind of technology you need irods to is open host source everyone s data grid unstructured middleware for data. irods is a Data Discovery powerful data migra:on ß tool. metadata irods is the technology that Workflow Automation ß policies: any condition; any action underpins the iplant Data Store. irods is a data preserva:on Secure Collaboration ß sharing without losing control technology. Data Virtualization irods is a fundamental ß file system flexibility technology for CineGrid. irods is a tool for providing fine- grained privacy and security controls. irods is extensible: irods has command- line clients, APIs for numerous programming languages, and web clients. irods supports new plug- ins for storage resources, authen:ca:on mechanisms, microservices, and network prot ß ß Managing NGS Data with irods // ICG-9 9

irods: Ready for Enterprise Product of nearly 20 years of research and development, funded by DARPA, DOE, NASA, NSF, NARA, and NOAA. Starting with irods 4.0, the entire codebase has been reviewed and restructured for enterprise use. Each change is verified with a test case in a continuous integration suite Pre-compiled binary packages are available for several Linux distributions and multiple database management systems. Managing NGS Data with irods // ICG-9 10

irods: The irods Consortium Founded to ensure that irods continues to be free open source software. Four levels of membership, with increasing levels of involvement Participation in technical planning and governance Contact, co-marketing, sales support Discretionary staff hours Stakeholders who recognize the value of sustaining irods development. Currently: RENCI The DICE Center DataDirect Networks Seagate The Wellcome Trust Sanger Institute EMC Corporation Additionally, the irods Consortium provides professional integration services, training, and support on a contract basis to irods users. Learn more at irods.org/consortium or contact us at info@irods.org Managing NGS Data with irods // ICG-9 11

Use Case: Wellcome Trust Sanger Institute Large Scale Genomics Research Sequenced 1/3 of the human genome (largest single contributor) Active cancer, malaria, pathogen, and genomic variation studies All data publicly available through websites, FTP, direct database access, programmatic APIs 2 PB of data managed by irods Managing NGS Data with irods // ICG-9 12

Use Case: Wellcome Trust Sanger Institute Using irods for... Data Discovery Metadata for tracking origin and processing history Example attribute fields Users query and access data largely from local compute clusters Users access irods locally via the CLI attribute: library attribute: total_reads attribute: type attribute: lane attribute: is_paired_read attribute: study_accession_number attribute: library_id attribute: sample_accession_number attribute: sample_public_name attribute: manual_qc attribute: tag attribute: sample_common_name attribute: md5 attribute: tag_index attribute: study_title attribute: study_id attribute: reference attribute: sample attribute: target attribute: sample_id attribute: id_run attribute: study attribute: alignment Copyright Wellcome Trust Sanger Ins:tute. Used with permission. Managing NGS Data with irods // ICG-9 13

Use Case: Wellcome Trust Sanger Institute Using irods for... Data Virtualiza1on with Workflow Automa1on Seamless data replica:on, automa:c checksumming, policy- based data resource selec:on Oracle RAC Cluster IRODS server IRES servers Data lands by preference on green room storage, when available. Replicated, with checksums, to red room storage. Read access served by both rooms. Integrity verified in flight. SAN attached luns from various vendors Copyright Wellcome Trust Sanger Ins:tute. Used with permission. Managing NGS Data with irods // ICG-9 14

Use Case: Wellcome Trust Sanger Institute Using irods for... Secure Collabora1on Selec:vely sharing data between workgroups; isola:on for maintenance opera:ons; op:ons for defining policy on a per- group basis Sanger 1 Portal zone (provides Kerberised access) Federation using head zone accounts /seq /uk10k /humgen /Archive Copyright Wellcome Trust Sanger Ins:tute. Used with permission. Managing NGS Data with irods // ICG-9 15

Use Case: The Broad Institute Harvard-MIT collaboration focused on cross-disciplinary challenges in biology and medicine. Small pilot program using irods to archive 9TB of data. Managing NGS Data with irods // ICG-9 16

Use Case: The Broad Institute Using irods for... Data Discovery and Workflow Automa1on Metadata automa:cally generated from original file system, used to enforce policy and verify integrity. Policy 1 Validate, checksum, replicate, compress Policy 2 Users cannot delete files Policy 3 Purge files by expira:on date Copyright Distributed Bio LLC. Used with permission. Managing NGS Data with irods // ICG-9 17

Use Case: UNC Lineberger Comprehensive Cancer Center One of the leading cancer centers in the US Highly collaborative research between UNC departments Managing NGS Data with irods // ICG-9 18

Use Case: UNC Lineberger Cancer Research Center Secure Medical Workspace Sequencing center Primary Processing, Storage, and Archive i Isilon UNC Kure HPC Using irods for... Data Virtualiza1on with Workflow Automa1on Automa:cally staging data for HPC and interpreta:on; using hardware from mul:ple vendors; complex access control Dell Genomic Sciences TopSail Hadoop i irods Grid Hadoop compu5ng, SAN scratch storage i RENCI Croatan Big Data DDN, Quantum Addi5onal Processing, Secondary Archive RENCI BlueRidge HPC Managing NGS Data with irods // ICG-9 19

Use Cases: The Upshot irods is finding a permanent home at NGS sites because of: Metadata! Vendor neutrality Not subject to storage vendor lock-in Mitigates risk of vendor termination Open source Mitigate risk of developer termination Flexibility Policy enforcement: any trigger, any action Storage virtualization: layers-deep replication; localß à cloud User permissions Sharing between workgroups Managing NGS Data with irods // ICG-9 20

What s Next? NGS Reference Implementation Ini:aliza:on Sequencing Forma`ng and Cleaning Quality Control Standard Analy:cal Processing Querying Interpreta:on Consulta:on Addi:onal Ac:on (ex. Treatment) Archive/Replica:on Managing NGS Data with irods // ICG-9 21

What s Next? NGS Reference Implementation Ini:aliza:on Sequencing Forma`ng and Cleaning Quality Control Standard Analy:cal Processing Querying Interpreta:on Consulta:on Addi:onal Ac:on (ex. Treatment) Archive/Replica:on irods will apply sample IDs and results (or links to results) of processing and interpreta:on Managing NGS Data with irods // ICG-9 22

What s Next? NGS Reference Implementation Ini:aliza:on Sequencing Forma`ng and Cleaning Quality Control Standard Analy:cal Processing Querying Interpreta:on Consulta:on Addi:onal Ac:on (ex. Treatment) Archive/Replica:on irods will apply sample IDs and results (or links to results) of automated processing irods will kick off each process in the pipeline, or launch a workflow engine for more complex tasks. Managing NGS Data with irods // ICG-9 23

What s Next? NGS Reference Implementation Ini:aliza:on Sequencing Forma`ng and Cleaning Quality Control irods will apply sample IDs and results (or links to results) of automated processing irods will kick off each process in the pipeline, or launch a workflow engine for more complex tasks. Standard Analy:cal Processing Querying irods will automa:cally compile reports upon schedule or request Interpreta:on Consulta:on Addi:onal Ac:on (ex. Treatment) Archive/Replica:on Managing NGS Data with irods // ICG-9 24

What s Next? NGS Reference Implementation Ini:aliza:on Sequencing Forma`ng and Cleaning Quality Control irods will apply sample IDs and results (or links to results) of automated processing irods will kick off each process in the pipeline, or launch a workflow engine for more complex tasks. Standard Analy:cal Processing Querying Interpreta:on Consulta:on irods will automa:cally compile reports upon schedule or request irods will stage files for processing, evalua:on on a secure workspace, and archiving Addi:onal Ac:on (ex. Treatment) Archive/Replica:on Managing NGS Data with irods // ICG-9 25

What s Next? NGS Reference Implementation Ini:aliza:on Sequencing Forma`ng and Cleaning Quality Control irods will apply sample IDs and results (or links to results) of automated processing irods will kick off each process in the pipeline, or launch a workflow engine for more complex tasks. Standard Analy:cal Processing Querying Interpreta:on Consulta:on Addi:onal Ac:on (ex. Treatment) irods will automa:cally compile reports upon schedule or request irods will stage files for processing, evalua:on on a secure workspace, and archiving irods will search on metadata Archive/Replica:on Managing NGS Data with irods // ICG-9 26

What s Next? NGS Reference Implementation Ini:aliza:on Sequencing Forma`ng and Cleaning Quality Control irods will apply sample IDs and results (or links to results) of automated processing irods will kick off each process in the pipeline, or launch a workflow engine for more complex tasks. Standard Analy:cal Processing Querying Interpreta:on Consulta:on Addi:onal Ac:on (ex. Treatment) Archive/Replica:on irods will automa:cally compile reports upon schedule or request irods will stage files for processing, evalua:on on a secure workspace, and archiving irods will search on metadata irods will manage complex, dynamic user permissions across mul:ple workgroups Managing NGS Data with irods // ICG-9 27

What s Next? Create the reference genomics implementation Document it in a cookbook, so other NGS centers can adapt and implement systems Examples: Replication, Policy-based storage selection, User interface API, Access control policies, Archiving policies Managing NGS Data with irods // ICG-9 28

What s Next? NGS Reference Implementation Ini:aliza:on Sequencing Forma`ng and Cleaning Quality Control irods will apply sample IDs and results (or links to results) of automated processing irods will kick off each process in the pipeline, or launch a workflow engine for more complex tasks. Standard Analy:cal Processing Querying Interpreta:on Consulta:on Addi:onal Ac:on (ex. Treatment) Archive/Replica:on irods will automa:cally compile reports upon schedule or request irods will stage files for processing, evalua:on on a secure workspace, and archiving irods will search on metadata irods will manage complex, dynamic user permissions across mul:ple workgroups Managing NGS Data with irods // ICG-9 29

What s Next? We need partners LIMS integration System integrators Data processing with native I/O Storage/computing appliance vendors Hosts for training Users to shape the problem space and evaluate our solution Get involved Contact info@irods.org Follow us on Github: https://github.com/irods/irods Read our blog: http://irods.org/controlyourdata/ Join the conversation on irods Chat: https://groups.google.com/forum/#!forum/irod-chat Managing NGS Data with irods // ICG-9 30

Acknowledgments Thank you to: DARPA, DOE, NASA, NSF, NARA, and NOAA for supporting the creation of irods irods Consortium Members, for their continued support John Constable, WTSI Chris Smith, Distributed Bio Sai Balu, Lineberger Cancer Research Center Genomics researchers, for coming up with interesting problems Managing NGS Data with irods // ICG-9 31

Attribution and License Slide 26: The Joy of Cookbooks, Tom Taker, http://tinyurl.com/m273mtl, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Except where specified otherwise, this work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Managing NGS Data with irods // ICG-9 32

Thank you! Dan Bedard irods Market Development Manager danb@renci.org Managing NGS Data with irods // ICG-9 33