Recent Developments at WDC Climate: Limitation of Long-term Archiving at DKRZ



Similar documents
Data Formats for Long-term Archiving of Climate Model Data at WDC Climate and DKRZ

The CEOP Model Data Archive at the World Data Center for Climate as part of the CEOP Data Network

CMIP5 Data Management CAS2K13

Big Data Services at DKRZ

Data management and archiving

CMIP6 Data Management at DKRZ

BALTEX II Data Management and Baltic Grid: Status Report

Big Data Research at DKRZ

Concept of long term data archive

Metadata for Data Discovery: The NERC Data Catalogue Service. Steve Donegan

Archiving of Simulations within the NERC Data Management Framework: BADC Policy and Guidelines.

Use of OGC Sensor Web Enablement Standards in the Meteorology Domain. in partnership with

PIT adoption for Climate Data Management

Archive I. Metadata. 26. May 2015

The Metadata in the Data Catalogue DBK

EXCOM 2015 Tromsø, Norway August 2015 SCADM Report (Standing Committee on Antarctic Data Management)

Long Term Preservation of Earth Observation Space Data. Preservation Workflow

Nevada NSF EPSCoR Track 1 Data Management Plan

Making Data Citable The German DOI Registration Agency for Social and Economic Data: da ra

Research Data Management Planning

ETD The application of Persistent Identifiers as one approach to ensure long-term referencing of Online-Theses

REACCH PNA Data Management Plan

CEDA Storage. Dr Matt Pritchard. Centre for Environmental Data Archival (CEDA)

Interagency Science Working Group. National Archives and Records Administration

DA-09-02a Task Report Data Integration and Analysis System

data.bris: collecting and organising repository metadata, an institutional case study

The Center for Research Libraries

Outcomes of the CDS Technical Infrastructure Workshop

escidoc: una plataforma de nueva generación para la información y la comunicación científica

European Soil Data Centre (ESDAC) Marc Van Liedekerke Land Management and Natural Harzards Unit

Data Management in an International Data Grid Project. Timur Chabuk 04/09/2007

PsychData Sharing research data in Psychology

NOAA Environmental Data Management Update for Unidata SAC

The THREDDS Data Repository: for Long Term Data Storage and Access

Environment Canada Data Management Program. Paul Paciorek Corporate Services Branch May 7, 2014

Libraries and Large Data

Network Working Group

EUMETSAT DATA CENTRES AND ARCHIVE AND LONG-TERM DATA PRESERVATION

Data Publication and Paradigm Mapping Solutions

Digital Preservation. OAIS Reference Model

NOAA Environmental Data Management

Government of Canada Program for International Polar Year Data Assembly Centre Request for Proposals

Working with the British Library and DataCite Institutional Case Studies

The World Agroforestry centre Policy Series. Research Data Management Policy

Dr Norman Paskin¹* ABSTRACT 1. INTRODUCTION. To be published in Data Science Journal, 2005 Digital Object Identifiers for scientific data

Enhanced Research Data Management and Publication with Globus

The NERC DataGrid (NDG)

How To Useuk Data Service

Long-term Archiving of Relational Databases with Chronos

DIGITAL STEWARDSHIP SUPPLEMENTARY INFORMATION FORM

Research Data Management Guide

Interactive Data Visualization with Focus on Climate Research

Assessing a Scientific Data Center as a Trustworthy Digital Repository

What is a GEIA Foundation?

AMLIGHT, Simulation Datasets, and Global Data Sharing

Creation of Focused Web Archives for Scientists

Dataset citation and identification Adam Farquhar, PhD

Cite My Data M2M Service Technical Description

A DISTRIBUTED CATALOG AND DATA SERVICES SYSTEM FOR REMOTE SENSING DATA

NASA s Big Data Challenges in Climate Science

Data-Intensive Science and Scientific Data Infrastructure

NASA Earth System Science: Structure and data centers

CommVault Simpana Archive 8.0 Integration Guide

Data Management Considerations for the Data Life Cycle

Data Integration and long-term planning of the Observing Systems as a cross-cutting process in a NMS

Exploitation of ISS scientific data

Quality assessment concept of the World Data Center for Climate and its application to CMIP5 data

DATA CITATION. what you need to know

Report of the DTL focus meeting on Life Science Data Repositories

Data archiving and data policies

PRINCIPLES OF RESEARCH DATA MANAGEMENT AT THE RECTOR S DECISION, 9 SEPTEMBER 2014

Big Data at ECMWF Providing access to multi-petabyte datasets Past, present and future

Long term data Archive Study on new Technologies. GMV, 2011 Property of GMV All rights reserved

Bradford Scholars Digital Preservation Policy

GCOS/GOOS/GTOS JOINT DATA AND INFORMATION MANAGEMENT PLAN

Data Mining, Predictive Analytics with Microsoft Analysis Services and Excel PowerPivot

ATLAS2000 Atlases of the Future in Internet

The Arctic Observing Network and its Data Management Challenges Florence Fetterer (NSIDC/CIRES/CU), James A. Moore (NCAR/EOL), and the CADIS team

Working Paper Series. RatSWD. Working Paper No DataCite: The International Data Citation Initiative. Datasets Programme.

Working with the British Library and DataCite A guide for Higher Education Institutions in the UK

Chapter WAC PRESERVATION OF ELECTRONIC PUBLIC RECORDS

Collaboration in Data Documentation: Developing STARDAT - The Data Archiving Suite

Environmental Data Management:

Data Models For Interoperability. Rob Atkinson

The ORIENTGATE data platform

What objects must be associable with an identifier? 1 Catch plus: continuous access to cultural heritage plus

Geospatial Data Archiving

Development of a Very Flexible Web based Database System for Environmental Research

SURFsara Data Services

Metadata Hierarchy in Integrated Geoscientific Database for Regional Mineral Prospecting

Infrastructure, Standards, and Policies for Research Data Management

Data Management Framework for the North American Carbon Program

Data Management in Science and the Legacy of the International Polar Year

Report on data management and infrastructure

Digital libraries of the future and the role of libraries

INTEROPERABLE IMAGE DATA ACCESS THROUGH ARCGIS SERVER

Data Management Plan. Name of Contractor. Name of project. Project Duration Start date : End: DMP Version. Date Amended, if any

Best Practices for Data Management. RMACC HPC Symposium, 8/13/2014

Globus Research Data Management: Introduction and Service Overview. Steve Tuecke Vas Vasiliadis

Information Migration

Transcription:

Recent Developments at WDC Climate: Limitation of Long-term Archiving at DKRZ Michael Lautenschlager WDC Climate / Max-Planck-Institute for Meteorology, Hamburg Wolfgang Stahl German Climate Computing Centre (DKRZ) Hamburg Computing in Atmospheric Sciences Workshop 2007 September 9th 13th, 2007 in Annecy, France

DKRZ: Earth system model development Simulations of past, present and future climate WDC Climate: Long-term data archiving Inter-disciplinary data dissemination

Next generation of compute server and of climate models leads to a problem at DKRZ: Data production increase with implications for long-term archiving What is the problem? What could be the solution?

Increase in installed compute power motivates finer spatial and temporal model resolution and integration of additional physical and chemical processes into climate models.

Finer spatial (+ temporal) model resolution or Execution of climate model ensemble runs

Not accepted projection Discussion about total media costs starts What could be the solution for mass storage cost balancing?

Data classes Test data from model code development, life cycle: weeks to months Project data from scientific model evaluation and research projects (DKRZ resources at project level), life cycle: 3 5 years Final results as data products for international projects (IPCC) and scientific publications, life cycle: 10 years and longer Data hierarchy levels Temp(orary) scratch discs at compute server Work fixed disc space at project level for evaluation Arch(ive) tape storage space (single copy) with expiration date for project data beyond available disc space Docu(mentation) documented, long-term tape archive (security copy) for data products, focus on interdisciplinary data utilisation, data are fixed and no longer matter of change

Tape space distributon to archive classes at DKRZ begin of 2007: part of the work space on tape because GFS too small docu domain consists of WDCC no expiration dates in arch domain, parts of arch domain belongs to docu but not yet documented

Project based data storage strategy and resource assigment at DKRZ: Separation of project data and long-term archive Expiration date for project data Aware, scientific decision to move data into the longterm archive Data documentation requirements for long-term archive Long-term data archive ( docu hierarchy level) accomplishes the rules for good scientific practice

Data documentation requirements are accomplished by using the WDCC infrastruture CERA-2 metadata model developed in 1999 Catalogue interface: cera.wdc-climate.de Input interface: input.wdc-climate.de CERA-2 metadata content is complete with respect to browse, to discover and to use climate data which are stored in the database system or outside in flat files The WDCC matches international description standards like ISO 19115, Dublin Core or GCMD and is integrated in international data federations Data storage structure assembles storage of climate time series per variable in BLOB data tables. This allows for web-based data catalogue search and data access in small data granules.

WDCC / CERA: General Statistics at 01-08-2007 00:00:10 Database Size (TByte): 305 Number of blobs: 5596121886 (5.6 billion) Number of experiments: 975 Number of datasets: 124188 Total size divided by number of BLOBs gives the average size of data access granules: 60 kb/blob

CERA Data Model Reference Contact Coverage Status Entry Parameter Distribution Data Org Local Adm. Data Access Spatial Reference

Coloured columns correspond to BLOB data tables in WDCC. Collections of matrix rows represents storage in model raw data files (complete model output storage time step by storage time step).

Additionally WDCC offers the primary data publication service Following the STD-DOI concept (Scientific and Technical Data Digital Object Identifier, URL: www.std-doi.de) Important aspects of the publication process are The identification of independent data entities which are suitable for publication at the level of scientific literature, The execution of an elaborated review process for metadata and climate data, The assigment of additional metadata for electronic publication (ISO 690-2) and of persistent identifiers (DOI / URN) and The integration of publication metadata and persistent identifiers into the TIB library catalogue (Technical Information Library, Hannover) so that primary data entities are searchable and citable together with scientific literature. Quality characteristic is presently approved by author, future development should be peer reviewed.

Data infrastructure integrates data stewardship in the long-term archive Bit-stream preservation Quality assurance Usability enabling

Long-term archive data stewardship Bit-stream preservation Secondary tape copies on different tapes and technology at separate location Copy to new tapes after maximum number of tape accesses are reached (Refreshment) Quality assurance Semantic examinations: behavior of a numerical model compared to observations and to other models, part of the scientific evaluation process Syntactic examinations: formal aspects of data archiving and ensurance that data archiving is free of errors as far as possible Consitency between metadata and climate data Completeness of climate data Standard range of values Spatial and temporal data arrangement

Long-term archive data stewardship (continued) Usability enabling Complete and searchable documenation of climate data entities (database tables and flat files) in the catalogue system of the WDCC WDCC offers web-based data access to small data granules (individual entries in BLOB DB tables) Archive technology transfer must be downward compatible to keep old data technically readable Data processing tools and data format access libraries must be migrated to new architectures

Summary DKRZ long-term data archive will still grow but slower than linear with the installed compute power (key increase factors are for compute power * 30, for longterm archive * 10, for WDCC * 5) Improvement of reliability of long-term archive because of more emphasis on data stewardship than on technical data service operations At the end the new data archive concept will result in a completely documented and searchable long-term data archive. (Data without documentation are only numbers)