Why long-term storage does not equate to an archive
Jos van Wezel
HUF Toronto 2015
Steinbuch Centre for Computing (SCC)
KIT - University of the State of Baden-Württemberg and National Laboratory of the Helmholtz Association
www.kit.edu
(probably) not an archive
open data
Archive tasks at KIT
- State of Baden-Württemberg: universities, museums, state archives, libraries; currently 9 PB / 1 billion files; est. growth 0.5 PB / year
- HLRS Stuttgart: finalised projects from e.g. climatology, CFD, molecular dynamics, engineering; currently 1.5 PB / 7.5 million files; est. growth 0.5 PB / year
- GridKa (German LHC Tier 1 centre): support of all 4 LHC experiments; currently 16 PB / 13 million files; est. growth 1 PB / year
- Roadmap: EU projects (Human Brain Project, EUDAT, DARIAH), Helmholtz Data Center, bwHPC-C5
- KIT has been designated as the central site for long-term digital data storage in the State of Baden-Württemberg
- Move archived data out of TSM to HPSS
- Except for GridKa, the numbers are to be doubled for archive storage
Long-term storage business cases
A. Project store
- Temporary storage for large data, 3 to 4 years
- Non-active data that active experiments depend on
- Simple upload interface
B. Good scientific practice store
- All of A, plus:
- At least 10 years, in conformity with good scientific practice
- Minimal set of metadata, requires PIDs (see the sketch after this list)
- Expiration, integrity checks
C. Long-term store, archive
- All of B, plus:
- Forever, or as long as is needed
- Interaction with the community for curation
- Premium: conformity with OAIS, certification of the repository (DSA, ISO 16363)
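A minimal sketch, in Python, of the kind of record case B implies: a persistent identifier, fixity information and an expiration date. The field names and the PID value are invented for illustration, not a KIT schema.

```python
# Hypothetical minimal metadata record for business case B
# (at least 10 years, PID, expiration, integrity checks).
from dataclasses import dataclass
from datetime import date

@dataclass
class ArchiveRecord:
    pid: str             # persistent identifier, e.g. a Handle (placeholder below)
    owner: str           # contact responsible for the data
    checksum: str        # fixity value recorded at ingest
    checksum_algo: str   # e.g. "sha256"
    ingest_date: date
    expires: date        # case B: at least 10 years after ingest

record = ArchiveRecord(
    pid="hdl:11022/0000-0000-0000-0000",   # placeholder, not a real Handle
    owner="jos.vanwezel@kit.edu",
    checksum="9f86d081884c7d65...",        # truncated for readability
    checksum_algo="sha256",
    ingest_date=date(2015, 10, 1),
    expires=date(2025, 10, 1),
)
```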
Archive layer model (communities / portals, collections - libraries, museums - data centres)
- Communities take care of the selection, ordering and ingesting of data
- The community, or ideally archivists representing the community, takes care of curation, deletions, pruning and formats
- Libraries and museums are confronted with large data sets and their subsequent handling; their layer covers content preservation: data exploration, (web) interfaces, referencing
- The data centres take care of bit preservation: the data centre interface, technical metadata, bit preservation, disk/tape storage
- Bit preservation is not a done deal
- Access to preservation information is not standardised
- The bits that go in are the bits that come out
Classic view
- Data sources: publications, images, measurements, recordings
- Archive management: content management systems such as DSpace, Fedora, Archivematica, Invenio, EPrints, etc.
- (Archive) storage: HPSS, LTFS, Black Pearl, others
Application access to data in deep archives
- Data interfaces assume a disk interface, i.e. on-line storage
  - Archivematica, Fedora, Invenio, Fixity etc.: DB access
  - interfaces still keep to POSIX
- No consideration with regard to off-line, or replicated, storage
  - they assume different storage is handled behind the scenes
  - no one waits for information to become on-line, unless you know it will come on-line
  - how to handle asynchronous data access? (see the sketch below)
- Requirements from the provider (i.e. data centre) perspective
  - how to do efficient off-line checksumming (with a user-provided method)
  - applications should not DoS the (tape) storage systems
  - an upstream reference for how to get from content to the owner of the data
  - data centre exchange: changing resource providers, disaster recovery
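A sketch of how an application could cope with asynchronous access instead of blocking on a POSIX read. The service URL and the /stage, /status and /data endpoints are assumptions for illustration, not an existing HPSS or EUDAT interface; the third-party requests library is used for brevity.

```python
# Hypothetical bring-online access pattern against a REST archive service.
import time
import requests

ARCHIVE = "https://archive.example.org/api"   # hypothetical service

def fetch_offline_file(file_id: str, dest: str, poll_s: int = 60) -> None:
    # Ask the archive to stage the file from tape to its disk cache.
    r = requests.post(f"{ARCHIVE}/stage/{file_id}")
    r.raise_for_status()
    # Poll instead of blocking: recalls from tape can take minutes to hours.
    while True:
        state = requests.get(f"{ARCHIVE}/status/{file_id}").json()["state"]
        if state == "online":
            break
        time.sleep(poll_s)   # a generous interval avoids hammering the tape system
    # Only now issue the actual read.
    with requests.get(f"{ARCHIVE}/data/{file_id}", stream=True) as r:
        r.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)
```

The point of the pattern is that the recall request and the read are decoupled, so an application neither stalls on off-line data nor floods the storage system with retries.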
Rough edges and requirements
- storing [m b]illions of files
- data exchange
- roles and rights, accounting
- wide range of file sizes
- assured reliability
- storage quality
EUDAT development: archive service
[Diagram] B2SAFE (iRODS) reaches the archive service over an archive service path (CDMI) and a data path. The archive service (REST) holds the archive logic with VIMs, metadata (PREMIS), checksums and storage quality, tied to IDM/Auth. Standard protocols for data transport (*FTP, S3) connect to the storage backends: meta DB, disk storage, LTFS.
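As an illustration of the "metadata (PREMIS), checksums" box, a sketch of the fixity event the archive logic could write to the meta DB at ingest. The element names loosely follow the PREMIS data dictionary; the function itself is an assumption, not EUDAT code.

```python
# Hypothetical PREMIS-style fixity event recorded at ingest.
import hashlib
from datetime import datetime, timezone

def fixity_event(path: str) -> dict:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return {
        "eventType": "message digest calculation",   # PREMIS event vocabulary
        "eventDateTime": datetime.now(timezone.utc).isoformat(),
        "eventOutcome": h.hexdigest(),
        "messageDigestAlgorithm": "SHA-256",
    }
```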
Archive access
- Enhance fixity
  - support user-provided checksums
  - lazy checking and periodic integrity verification: we need to be more intelligent if we want to scale in the future (see the sketch below)
  - PoC implementation for HPSS and TSM with PSNC (Poland) and MPCDF (was RZG)
  - integrate into the current EUDAT B2SAFE service
- Enable bring-on-line functionality
  - on-line and off-line data represent two different storage qualities
  - task in the INDIGO-DataCloud project
- What else?
  - embargo
  - deletion and clean-up: how to get rid of data
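A sketch of what lazy checking with user-provided checksums could look like: a file is re-verified only when its last check is older than a configurable interval, so a verification pass reads as little as possible. The catalogue layout and the interval are invented for illustration.

```python
# Hypothetical lazy, periodic integrity verification over a file catalogue.
import hashlib
import time

CHECK_INTERVAL = 180 * 24 * 3600   # re-verify at most every ~6 months (invented)

def needs_check(entry: dict, now: float) -> bool:
    return now - entry["last_checked"] > CHECK_INTERVAL

def verify(entry: dict) -> bool:
    h = hashlib.new(entry["algo"])           # honour the user-provided algorithm
    with open(entry["path"], "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest() == entry["checksum"]

def verification_pass(catalogue: list[dict]) -> None:
    now = time.time()
    for entry in catalogue:
        if not needs_check(entry, now):
            continue                          # lazy: skip recently verified files
        entry["ok"] = verify(entry)
        entry["last_checked"] = now
```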
Conclusions
- A data archive is more than long-term data storage
- HPSS is used in HPC and increasingly as a cold storage engine for many other data sources, i.e. communities
- With some user-space effort, ARCHIVE requirements can be implemented on current HPSS
- Work on HPSS from the bottom and from the top
- Buddy Bland: "We must make it easy for sites to match the types of storage to the requirements of the users. We must make it easy for users to match the types of data to the requirements of the sites."
Thank you for your attention
Contact: jos.vanwezel@kit.edu
Spares
Challenges
- Determining the rate of bit rot and taking countermeasures
  - we have no metric for archive (bit preservation) quality
  - this should somehow be part of a site certification
- Intelligent selection of candidates for caching (see the sketch below)
  - getting data from a deep (i.e. cost-effective) archive takes time
  - on-line caches in front of archives for fast access: what goes to, and what stays in, the cache?
  - notions: trends, markets, relationships; maybe Google can help?
- Uniform interface for cold storage
  - an API that goes beyond classic POSIX and takes longer access times into account
- Exchange data sets or data volumes with peers, for redundancy or for contract changes: asynchronous operations
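Purely as a toy illustration of cache-candidate selection: a scoring function that favours recently and frequently accessed data and penalises very large files, since they displace more of the cache. The weighting is invented for the sketch, not a KIT policy.

```python
# Toy heuristic for ranking recall candidates for the disk cache.
import time

def cache_score(last_access: float, access_count: int, size_bytes: int) -> float:
    age_days = (time.time() - last_access) / 86400
    # High score: hot, small data; low score: cold, bulky data.
    return access_count / ((1 + age_days) * (1 + size_bytes / 1e9))
```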
Archive interface (image: Jan Potthoff / SCC)
- Work in progress of the RADAR and bwarchive projects
- Revisit the 2008 SNIA XAM standard
  - asynchronous access to data on off-line storage
  - XAM comes with an emulator / reference implementation
  - XAM allows modular adaptation to different back-ends through Vendor Interface Modules (VIMs)
What is an archive? The SCC view
Definition of an archive: a long-term file repository containing a massive collection of digital information.
- No loss of data or degradation of quality for an agreed amount of time; SCC ensures proper integrity
- Does not depend on proprietary middleware or storage formats: migration should be possible in some number of years and must be independent of content or owner
- Data is stored in (self-defining) containers (see the sketch below)
  - containers are not files; they are bitfiles with a checksum, and they may contain one or more files
  - the reference to the container is fixed for the lifetime of the container
- The operator (SCC) is not responsible for passing content on to the next generation of owners
  - data ownership has a lifetime that must be known in advance, but can be renewed
  - data is stored after consent of the data provider and SCC, effectuated in a service level agreement for archives
- The above intentionally excludes searching, finding and global referencing
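A sketch of what a self-defining container could look like, assuming a tar bitfile plus a JSON manifest that lists the members and a checksum over the whole container. The layout and field names are illustrative, not the SCC container format.

```python
# Hypothetical self-defining container: tar bitfile + checksum manifest.
import hashlib
import json
import tarfile

def build_container(files: list[str], out: str) -> dict:
    # Pack the member files into one bitfile.
    with tarfile.open(out, "w") as tar:
        for path in files:
            tar.add(path)
    # Checksum the container as a whole, not the members.
    h = hashlib.sha256()
    with open(out, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    manifest = {
        "members": files,
        "checksum": h.hexdigest(),   # fixed for the lifetime of the container
        "algorithm": "sha256",
    }
    with open(out + ".manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```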