irods in complying with Public Research Policy



Similar documents
Object storage in Cloud Computing and Embedded Processing

WOS OBJECT STORAGE PRODUCT BROCHURE DDN.COM Full Spectrum Object Storage

HadoopTM Analytics DDN

Clarifications of EPSRC expectations on research data management.

WOS. High Performance Object Storage

Automated and Scalable Data Management System for Genome Sequencing Data

Long term retention and archiving the challenges and the solution

Canadian Astronomy Data Centre. Séverin Gaudet David Schade Canadian Astronomy Data Centre

THE EMC ISILON STORY. Big Data In The Enterprise. Copyright 2012 EMC Corporation. All rights reserved.

Solution Brief: Archiving Avid Interplay Projects using NLT and XenData

Big Data and the Earth Observation and Climate Modelling Communities: JASMIN and CEMS

DDN updates object storage platform as it aims to break out of HPC niche

EPSRC Research Data Management Compliance Report

UW-IT Backups & Archives

Technical. Overview. ~ a ~ irods version 4.x

ANY SURVEILLANCE, ANYWHERE, ANYTIME

Object Oriented Storage and the End of File-Level Restores

Archiving On-Premise and in the Cloud. March 2015

Storage Design for High Capacity and Long Term Storage. DLF Spring Forum, Raleigh, NC May 6, Balancing Cost, Complexity, and Fault Tolerance

Managed GRID or why NFSv4.1 is not enough. Tigran Mkrtchyan for dcache Team

Improving Time to Results for Seismic Processing with Paradigm and DDN. ddn.com. DDN Whitepaper. James Coomer and Laurent Thiers

EOFS Workshop Paris Sept, Lustre at exascale. Eric Barton. CTO Whamcloud, Inc Whamcloud, Inc.

Policy Policy--driven Distributed driven Distributed Data Management (irods) Richard M arciano Marciano marciano@un

Why long time storage does not equate to archive

With DDN Big Data Storage

Data Management using irods

AUTOMATED DATA RETENTION WITH EMC ISILON SMARTLOCK

Integrated Rule-based Data Management System for Genome Sequencing Data

Solution Brief: XenData Digital Video Archives in a Dalet Environment

EMC IRODS RESOURCE DRIVERS

ETERNUS CS High End Unified Data Protection

Globus and the Centralized Research Data Infrastructure at CU Boulder

CMIP6 Data Management at DKRZ

SURFsara Data Services

Research Data Storage and the University of Bristol

Data Management Planning

EMC NETWORKER AND DATADOMAIN

The Design and Implementation of the Zetta Storage Service. October 27, 2009

PADS GPFS Filesystem: Crash Root Cause Analysis. Computation Institute

Archive Storage Infrastructure At the Library of Congress

A Survey of Shared File Systems

Solution Brief: Creating Avid Project Archives

GDCMTM. Global DataCenter Management. nlytetm. nlyte the next generation datacenter management system

Call: Disaster Recovery/Business Continuity (DR/BC) Services From VirtuousIT

EMC BACKUP AND RECOVERY SOLUTIONS

Building a Scalable Big Data Infrastructure for Dynamic Workflows

DDN in Seismic Workflows

EMC arhiviranje. Lilijana Pelko Primož Golob. Sarajevo, Copyright 2008 EMC Corporation. All rights reserved.

A Physics Approach to Big Data. Adam Kocoloski, PhD CTO Cloudant

irods at CC-IN2P3: managing petabytes of data

WOS 360 FULL SPECTRUM OBJECT STORAGE

The BIG Data Era has. your storage! Bratislava, Slovakia, 21st March 2013

Data storage services at CC-IN2P3

ClearPath Storage Update Data Domain on ClearPath MCP

Big Data Analytics Service Definition G-Cloud 7

Scale and Availability Considerations for Cluster File Systems. David Noy, Symantec Corporation

Growth of Unstructured Data & Object Storage. Marcel Laforce Sr. Director, Object Storage

Lecture 2 CS An example of a middleware service: DNS Domain Name System

Open Source Sales Force Automation (SFA) in the Cloud SaaS

Managing Microsoft Office SharePoint Server Content with Hitachi Data Discovery for Microsoft SharePoint and the Hitachi NAS Platform

Oracle Data Protection Concepts

INTEGRATED RULE ORIENTED DATA SYSTEM (IRODS)

Our School Backup A trusted, safe and secure remote backup solution for the UK education sector.

The IntelliMagic White Paper: Storage Performance Analysis for an IBM Storwize V7000

BUILDING A SCALABLE BIG DATA INFRASTRUCTURE FOR DYNAMIC WORKFLOWS

How To Manage An Electronic Discovery Project

Introduction to NetApp Infinite Volume

Oracle Content Management and Archiving

Preview of a Novel Architecture for Large Scale Storage

The Hartree Centre helps businesses unlock the potential of HPC

OpenAIRE Research Data Management Briefing paper

Scaling Out With Apache Spark. DTL Meeting Slides based on

EMC BACKUP MEETS BIG DATA

Transcription:

irods User Group 2015 irods in complying with Public Research Policy Vic Cornell Senior Storage Consultant

Overview Compliance overview UK examples Imperial College MedBio Requirements Architecture irods integration irods capabilities Proposed Workflow Challenges and Unknowns 2

From: EPSRC Data Management Policy Research organisations will ensure that appropriately structured metadata describing the research data they hold is published (normally within 12 months of the data being generated) and made freely accessible on the internet. Where the research data referred to in the metadata is a digital object it is expected that the metadata will include use of a robust digital object identifier (For example as available through the DataCite organisation). Research organisations will ensure that EPSRC-funded research data is securely preserved for a minimum of 10 years from the date that any researcher privileged access period expires or, if others have accessed the data, from last date on which access to the data was requested by a third party 3

MRC (Medical Research Council UK) From MRC Research data policy: All research must have a Data Management Plan Speaks of: o Managing, storing and curating data. o Metadata standards and data documentation o Data preservation strategy and standards These must all be specified and adhered to. 4

Imperial College MedBio The Imperial College Bioinformatics Support is a part of the Imperial College Centre for Integrative Systems Biology and Bioinformatics. The mission of the Imperial College Centre for Bioinformatics is to promote and co-ordinate worldclass research and training in Bioinformatics within Imperial College and to provide state-of-the-art Bioinformatics support to members of Imperial College for their research. 5

Imperial College MedBio Conduct a large number of studies with respect to Systems Biology and Bioinformatics Range of data sources from Internal o Next Generation Sequencers o Very high resolution microscopes. Big Data o Phenome study systems produce 7GB data every 15 minutes. o They have 10 of them and they run for 2 weeks/month. o Maybe ½ PB Year? External Datasets Staff! Current staff are overloaded with IT tasks and don t have time to embrace new methods. Too many workflows to follow. More staff being recruited but its hard to find people with the right skills 6

Storage Solution South Kensington DDN SFA 12K20 180 4TB Disks 7 x WOS 7000 420 4TB Disks 1 x Tape Library DDN WOS Bridge GridScaler (GPFS) GridNAS TSM with Space Manager for GPFS Infinity Slough DDN SFA SFA 7700 180 4TB Disks 7 x WOS 7000 420 4TB Disks 1 x Tape Library DDN WOS Bridge GridScaler (GPFS) GridNAS TSM with Space Manager for GPFS 7

MedBio Architecture WOS Replication for Tier2 WOS Core Object Store Tier2 Tier2 WOS Core Object Store Tier 1 Pool NAS NFS/CIFS Tier1 NFS/CIFS NAS Tier 1 Pool WOS BRIDGE Tier 2 Cache GPFS Filesystems GPFS Filesystems Tier 2 Cache WOS BRIDGE Tier 3 Cache Client Nodes Tier 3 Cache TSM TSM TSM database Block Storage GPFS Block Storage TSM database Tape Tier3 NAS Tier3 Tape Library Library TSM 8 2012 DataDirect Networks. All Rights All Rights Reserved. Reserved.

MedBio POC irods Architecture WOS Replication for Tier2 WOS Core Object Store Tier2 Tier2 WOS Core Object Store Tier 1 Pool NAS NFS/CIFS Tier1 NFS/CIFS NAS Tier 1 Pool WOS BRIDGE Tier 2 Cache GPFS Filesystems GPFS Filesystems Tier 2 Cache WOS BRIDGE Tier 3 Cache Client Nodes Imperial MedBIO Architecture Tier 3 Cache TSM TSM TSM database TSM database Tier3 Block Storage irods GPFS irods Block Storage Tier3 Tape irods Database NAS irods Database Tape Library IRODS Pool irods IRODS Pool Library TSM 9 2012 DataDirect Networks. All Rights All Rights Reserved. Reserved.

10 Example Imperial MedBIO Workflow Load data from sequencer into Tier2 Tier 2 Associate metadata from Sequencer Make data immutable. Add in data from LIMS (Laboratory Management system) Register data with irods Publish Data via AIMS (Academic Information Management System) irods

irods for Compliance? Good: Rules based engine which associates metadata with data. Allows time based policies for data retention. Can execute rules based on complex metadata queries to catch all required cases. Policy enforcement points Can be used to implement data management and data retention policy. Opportunities for data harvesting. Once established can become a matter of record and boilerplate for subsequent projects Still need to have resilient storage underneath. 11

irods for Compliance? Questions: Can irods make data *really* immutable? At what level is this best done? End-to-end metadata harvesting can it manage an integration with LIMS. o Chain of custody o Chain of provenance. Publishing: Integration with AIMS? o Which one? Scale this is a 7PB facility with file counts to scale can Imperial do this in one zone? If not how will they manage federation so it works. Database stability, availability and recoverability. o Who sets the standards? 12

Q&A

THANK YOU

Links http://www.mrc.ac.uk/research/research-policy-ethics/data-sharing/datamanagement-plans/ https://documentation.tgac.ac.uk/display/copo/copo+documents http://www3.imperial.ac.uk/bioinfsupport 15