Why long time storage does not equate to archive

Why long time storage does not equate to archive
Jos van Wezel, HUF Toronto 2015
Steinbuch Centre for Computing (SCC)
KIT: University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association
www.kit.edu

(probably) not an archive

open data

Archive tasks at KIT
- State of Baden-Württemberg: universities, museums, state archives, libraries; currently 9 PB / 1 billion files; estimated growth 0.5 PB/year
- HLRS Stuttgart: finalised projects from e.g. climatology, CFD, molecular dynamics, engineering; currently 1.5 PB / 7.5 million files; estimated growth 0.5 PB/year
- GridKa (German LHC Tier 1 center): support of all 4 LHC experiments; currently 16 PB / 13 million files; estimated growth 1 PB/year
- Roadmap: EU projects (Human Brain Project, EUDAT, DARIAH), Helmholtz Data Center, bwHPC-C5
- KIT has been designated as the central site for long-time digital data storage in the State of Baden-Württemberg
- Move archived data out of TSM to HPSS; except for GridKa, the numbers above are to be doubled for archive storage

Long time storage business cases
A. Project store
- Temporary storage for large data, 3 to 4 years
- Non-active data that active experiments depend on
- Simple upload interface
B. Good scientific practice store; all of A, plus:
- At least 10 years, conformity with good scientific practice
- Minimal set of metadata, requires PIDs
- Expiration, integrity checks
C. Long term store, archive; all of B, plus:
- Forever, or for as long as is needed
- Interaction with the community for curation
- Premium: conformity with OAIS, certification of the repository (DSA, ISO 16363)
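
A minimal sketch of how these three service classes might be captured as machine-readable policy, e.g. for an SLA or ingest workflow. The Python names and fields are illustrative assumptions, not an existing KIT or EUDAT schema:

from dataclasses import dataclass

@dataclass
class ServiceClass:
    """Hypothetical description of a long time storage service class."""
    name: str
    min_retention_years: float     # float("inf") stands for "forever / as long as is needed"
    requires_pid: bool             # persistent identifiers for every object
    integrity_checks: bool         # expiration handling and periodic fixity verification
    curation: bool                 # interaction with the community: pruning, formats, deletions
    oais_certified: bool = False   # DSA / ISO 16363 certification of the repository

PROJECT_STORE = ServiceClass("A: project store", 4, False, False, False)
GOOD_PRACTICE_STORE = ServiceClass("B: good scientific practice store", 10, True, True, False)
ARCHIVE = ServiceClass("C: long term store / archive", float("inf"), True, True, True,
                       oais_certified=True)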

Archive layer model
- Layers in the model: communities (portals, collections); libraries and museums (content preservation: data exploration, web interface, referencing); data centres (data center interface, technical metadata, bit preservation, disk/tape storage)
- Communities take care of selection, ordering and ingesting of data
- The community, or ideally archivists representing the community, takes care of curation, deletions, pruning, formats
- Libraries and museums are confronted with large data sets and their subsequent handling
- The data centers take care of bit preservation
- Bit preservation is not a done deal
- Access to preservation information is not standardised
- The bits that go in are the bits that come out

Classic view
- Data sources: publications, images, measurements, recordings
- Archive management: content management systems such as DSpace, Fedora, Archivematica, Invenio, EPrints, etc.
- (Archive) storage: HPSS, LTFS, Black Pearl, other

Application access to data in deep archives
- Data interfaces assume a disk interface, i.e. on-line storage: Archivematica, Fedora, Invenio, Fixity, etc., and DB access
- Interfaces still keep to POSIX, with no consideration for off-line or replicated storage; they assume a different storage tier is handled behind the scenes
- No one waits for information to become on-line, unless you know it will come on-line: how to handle asynchronous data access? (see the sketch below)
- Requirements from the provider, i.e. data center, perspective:
  - How to do efficient off-line checksumming (with a user-provided method)
  - Applications should not DOS the (tape) storage systems
  - An upstream reference for how to get from content to the owner of the data
  - Data center exchange: changing resource providers, disaster recovery
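
One way to make asynchronous access concrete is a stage-then-poll client instead of a blocking POSIX open(). This is only a sketch; the class and method names (stage, status, read) are assumptions for illustration, not the API of HPSS, B2SAFE or any particular product:

import time

class DeepArchiveClient:
    """Hypothetical client for a deep, tape-backed archive."""

    def stage(self, path: str) -> str:
        """Ask the archive to bring a file on-line; returns a request id."""
        raise NotImplementedError

    def status(self, request_id: str) -> str:
        """Returns 'online', 'staging' or 'failed'."""
        raise NotImplementedError

    def read(self, path: str) -> bytes:
        """Read a file that is already on-line."""
        raise NotImplementedError

def fetch_when_online(client: DeepArchiveClient, path: str, poll_seconds: int = 60) -> bytes:
    """Stage, poll, then read: the application never blocks inside a POSIX call."""
    request_id = client.stage(path)            # non-blocking: the tape recall is queued
    while client.status(request_id) == "staging":
        time.sleep(poll_seconds)               # pace the polling, do not DOS the tape system
    if client.status(request_id) != "online":
        raise RuntimeError(f"staging failed for {path}")
    return client.read(path)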

Rough edges and requirements
- Storing millions to billions of files
- Data exchange
- Roles and rights, accounting
- Wide range of file sizes
- Assured reliability
- Storage quality

Archive service architecture (EUDAT development)
- Archive service path: B2SAFE (iRODS), CDMI, archive service (REST)
- Archive logic: VIM, metadata (PREMIS), checksums, storage quality, IDM/Auth, Meta DB
- Data path: standard protocols for data transport (*FTP, S3)
- Storage backends: disk storage, LTFS
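
As a rough illustration of the REST archive service path, the request below registers a file together with its checksum and PREMIS-style technical metadata. The endpoint, field names and token handling are invented for this sketch; they are not the actual EUDAT/B2SAFE interface:

import hashlib
import requests  # third-party HTTP client

ARCHIVE_API = "https://archive.example.org/api/v1"   # hypothetical endpoint

def archive_file(path: str, token: str) -> str:
    """Register a file with the archive service and hand over its checksum."""
    with open(path, "rb") as f:
        checksum = hashlib.sha256(f.read()).hexdigest()

    # Minimal, PREMIS-inspired technical metadata; the field names are invented.
    record = {
        "source_url": f"gridftp://dataserver.example.org{path}",  # data path, not the REST path
        "fixity": {"algorithm": "SHA-256", "value": checksum},
        "storage_quality": "archive",        # e.g. two tape copies
        "retention_years": 10,
    }
    response = requests.post(f"{ARCHIVE_API}/objects", json=record,
                             headers={"Authorization": f"Bearer {token}"})
    response.raise_for_status()
    return response.json()["pid"]            # persistent identifier of the new object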

Archive access
- Enhance fixity: support user-provided checksums; lazy checking and periodic integrity verification; we need to be more intelligent if we want to scale in the future; proof-of-concept implementation for HPSS and TSM with PSNC (Poland) and MPCDF (formerly RZG); integrate into the current EUDAT B2SAFE service
- Enable bring-on-line functionality: on-line and off-line data represent two different storage qualities; task in the INDIGO-DataCloud project
- What else? Embargo; deletion and clean-up, to get rid of data
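
To illustrate what lazy checking with a user-provided checksum method could look like (an illustrative sketch, not the HPSS/TSM proof of concept mentioned above): verify only a small random sample of the holdings per cycle, and pace the reads so verification never swamps the storage system.

import hashlib
import random
import time
from typing import Callable, Dict, List

Checksummer = Callable[[str], str]   # maps a file path to a hex digest

def sha256_file(path: str) -> str:
    """Default checksum method; the data owner may supply their own."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def lazy_fixity_cycle(catalogue: Dict[str, str],
                      checksummer: Checksummer = sha256_file,
                      sample_fraction: float = 0.01) -> List[str]:
    """Verify a random sample of the catalogue (path -> expected digest); return failures."""
    if not catalogue:
        return []
    sample = random.sample(list(catalogue), max(1, int(len(catalogue) * sample_fraction)))
    failed = []
    for path in sample:
        if checksummer(path) != catalogue[path]:
            failed.append(path)
        time.sleep(1)   # pace the reads so verification does not swamp the storage system
    return failed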

Conclusions
- A data archive is more than long-term data storage
- HPSS is used in HPC and increasingly as a cold storage engine for many other data sources, i.e. communities
- With some user-space effort, ARCHIVE requirements can be implemented on current HPSS
- Work on HPSS from the bottom and from the top
- Buddy Bland: "We must make it easy for sites to match the types of storage to the requirements of the users. We must make it easy for users to match the types of data to the requirements of the sites."

Thank you for your attention
Contact: jos.vanwezel@kit.edu

Spares

Challenges
- Determining the rate of bit rot and taking countermeasures: we have no metric for archive (bit preservation) quality; this somehow should be part of a site certification
- Intelligent selection of candidates for caching: getting data from a deep (i.e. cost-effective) archive takes time; on-line caches sit in front of archives for fast access; what goes to and what data stays in the cache? Notions: trends, markets, relationships: maybe Google can help?
- Uniform interface for cold storage: an API that goes beyond classic POSIX and takes longer access times into account (see the sketch below)
- Exchange data sets or data volumes with peers, for redundancy or for contract changes: asynchronous operations
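
A cold-storage interface that goes beyond POSIX could expose exactly what stat() and open() cannot tell an application: where an object currently lives, how long access is expected to take, and bulk asynchronous recalls. The shape below is an assumption for illustration, not an existing API:

from dataclasses import dataclass
from enum import Enum
from typing import List

class Locality(Enum):
    ONLINE = "online"        # in a disk cache, immediate access
    NEARLINE = "nearline"    # on tape in the library, minutes
    OFFLINE = "offline"      # tape on a shelf or at a partner site, hours to days

@dataclass
class ObjectInfo:
    object_id: str
    size_bytes: int
    locality: Locality
    estimated_seconds_to_first_byte: int   # what POSIX stat() cannot tell an application

class ColdStorage:
    """Hypothetical cold-storage interface; method names are invented for illustration."""

    def info(self, object_id: str) -> ObjectInfo:
        ...

    def recall(self, object_ids: List[str]) -> str:
        """Queue a bulk, asynchronous recall; returns a request id to poll."""
        ...

    def release(self, object_ids: List[str]) -> None:
        """Tell the system that the cached on-line copies may be dropped again."""
        ...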

Archive interface (image copyright: Jan Potthoff / SCC)
- Work in progress of the RADAR and bwarchive projects
- Revisit the 2008 SNIA XAM standard: asynchronous access to data on off-line storage; XAM comes with an emulator / reference implementation; XAM allows modular adaptation to different back-ends through Vendor Interface Modules (VIMs)
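
The Vendor Interface Module idea is, in the abstract, a plug-in pattern: the archive logic codes against one interface and each back-end registers itself under a name. The sketch below illustrates that pattern generically; it is not the XAM VIM API itself:

from typing import Callable, Dict

class StorageBackend:
    """Abstract back-end; concrete modules would wrap e.g. HPSS, TSM or an object store."""

    def put(self, object_id: str, data: bytes) -> None:
        ...

    def get(self, object_id: str) -> bytes:
        ...

_BACKENDS: Dict[str, Callable[[], StorageBackend]] = {}

def register_backend(scheme: str):
    """Class decorator that registers a back-end under a scheme name, e.g. 'hpss'."""
    def wrap(cls):
        _BACKENDS[scheme] = cls
        return cls
    return wrap

def open_backend(scheme: str) -> StorageBackend:
    # The archive logic never imports a vendor module directly; it only asks by name.
    return _BACKENDS[scheme]()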

What is an Archive? SCC view
Definition of an archive: a long-time file repository containing a massive collection of digital information.
- No loss of data or degradation of quality for an agreed amount of time; SCC ensures proper integrity
- Does not depend on proprietary middleware or storage formats: migration should be possible within some number of years and must be independent of content and owner
- Data is stored in (self-defining) containers: containers are not files, they are bitfiles with a checksum, and they may contain one or more files; the reference to a container is fixed for the lifetime of the container
- The operator (SCC) is not responsible for passing content on to the next generation of owners: data ownership has a lifetime that must be known in advance, but can be renewed
- Data is stored after consent of the data provider and SCC, effectuated in a service level agreement for archives
- The above intentionally excludes searching, finding and global referencing
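
A minimal sketch of what a self-describing container record could carry, following the definition above (the field names are illustrative; this is not SCC's actual container format):

from dataclasses import dataclass, field
from datetime import date
from typing import List

@dataclass
class MemberFile:
    name: str
    size_bytes: int
    sha256: str

@dataclass
class ArchiveContainer:
    """A container is a bitfile with a checksum; it may hold one or more files."""
    container_ref: str                    # fixed for the lifetime of the container
    sha256: str                           # checksum of the container bitfile as a whole
    members: List[MemberFile] = field(default_factory=list)
    owner: str = ""
    ownership_expires: date = date.max    # known in advance, but renewable
    sla_id: str = ""                      # the service level agreement the data falls under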