Database preservation toolkit:

Similar documents
Database Preservation Toolkit: a flexible tool to normalize and give access to databases

European Archival Records and Knowledge Preservation Database Archiving in the E-ARK Project

Why archiving erecords influences the creation of erecords. Martin Stürzlinger scopepartner Vienna, Austria

Digital Preservation. OAIS Reference Model

Data Warehouses in the Path from Databases to Archives

José Carlos Ramalho

2009 ikeep Ltd, Morgenstrasse 129, CH-3018 Bern, Switzerland (

Interagency Science Working Group. National Archives and Records Administration

INTEGRATING RECORDS SYSTEMS WITH DIGITAL ARCHIVES CURRENT STATUS AND WAY FORWARD

SEVENTH FRAMEWORK PROGRAMME THEME ICT Digital libraries and technology-enhanced learning

DIGITAL ARCHIVES & PRESERVATION SYSTEMS

RCAAP: Building and maintaining a national repository network

Semantic Exploration of Archived Product Lifecycle Metadata under Schema and Instance Evolution

Long-term Archiving of Relational Databases with Chronos

Databases in Organizations

Ex Libris Rosetta: A Digital Preservation System Product Description

A DICOM-based Software Infrastructure for Data Archiving

Questionnaire on Digital Preservation in Local Authority Archive Services

bigdata Managing Scale in Ontological Systems

Trends and solutions for archiving in pharmaceutical industry. Didier Coyman / Ömer Yilmaz

Achieving a Step Change in Digital Preservation Capability

Metadata Repositories in Health Care. Discussion Paper

A generic approach for data integration using RDF, OWL and XML

Long-term archiving and preservation planning

Technical concepts of kopal. Tobias Steinke, Deutsche Nationalbibliothek June 11, 2007, Berlin

A Grid Architecture for Manufacturing Database System

A Selection of Questions from the. Stewardship of Digital Assets Workshop Questionnaire

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Organization of VizieR's Catalogs Archival

Information Management Advice 18 - Managing records in business systems Part 1: Checklist for decommissioning business systems

The challenges of becoming a Trusted Digital Repository

TRANSFoRm: Vision of a learning healthcare system

Conceptualizing Policy-Driven Repository Interoperability (PoDRI) Using irods and Fedora

Digital Assets Repository 3.0. PASIG User Group Conference Noha Adly Bibliotheca Alexandrina

Digital preservation a European perspective

UIMA and WebContent: Complementary Frameworks for Building Semantic Web Applications

MultiMimsy database extractions and OAI repositories at the Museum of London

Preservation Action: What, how and when? Hilde van Wijngaarden Head, Digital Preservation Department National Library of the Netherlands

Preserving French Scientific data

dati.culturaitalia.it a Pilot Project of CulturaItalia dedicated to Linked Open Data

EPrints Preservation Update

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Digital Stewardship Education at the Graduate School of Library & Information Science, Simmons College

North Carolina Digital Preservation Policy. April 2014

Database Preservation Case Study: Review

Second EUDAT Conference, October 2013 Data Management Plans and Certification Motivation: increasing importance of Data Management Planning

Digital Archiving Policy

Adding Robust Digital Asset Management to Oracle s Storage Archive Manager (SAM)

DDI Lifecycle: Moving Forward Status of the Development of DDI 4. Joachim Wackerow Technical Committee, DDI Alliance

Geospatial Data Stewardship at an Interdisciplinary Data Center

Introduction. Chapter 1. Introducing the Database. Data vs. Information

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

D5.3.2b Automatic Rigorous Testing Components

UniGR Workshop: Big Data «The challenge of visualizing big data»

Cloud Service Contracts: An Issue of Trust

BUSINESS REQUIREMENTS SPECIFICATION (BRS) Business domain: Archiving and records management. Transfer of digital records

Core Enterprise Services, SOA, and Semantic Technologies: Supporting Semantic Interoperability

GetLOD - Linked Open Data and Spatial Data Infrastructures

HETEROGENEOUS DATA INTEGRATION FOR CLINICAL DECISION SUPPORT SYSTEM. Aniket Bochare - aniketb1@umbc.edu. CMSC Presentation

MUSYOP: Towards a Query Optimization for Heterogeneous Distributed Database System in Energy Data Management

DIGITAL ARCHIVING DISCUSSION PAPER: INFORMING AN APPROACH TO THE LONG-TERM MANAGEMENT AND PRESERVATION OF GOVERNMENT DIGITAL RECORDS

Tibiscus University, Timişoara

Federated, Generic Configuration Management for Engineering Data

Cloud and Big Data Standardisation

Master s Program in Information Systems

DIGITAL PRESERVATION AT THE U.S. GOVERNMENT PRINTING OFFICE: WHITE PAPER. Version July 2008 UNITED STATES GOVERNMENT PRINTING OFFICE

Digital Archiving at the Swiss Federal Archives (SFA)

Semantic Interoperability

The Visualization Pipeline

Archival Data Format Requirements

DSpace: An Institutional Repository from the MIT Libraries and Hewlett Packard Laboratories

Creating the ultimate digital insurance

The Ontological Approach for SIEM Data Repository

Chapter WAC PRESERVATION OF ELECTRONIC PUBLIC RECORDS

Best Archiving Practice Guidance

Transcription:

Nov. 12-14, 2014, Lisbon, Portugal Database preservation toolkit: a flexible tool to normalize and give access to databases DLM Forum: Making the Information Governance Landscape in Europe José Carlos Ramalho jcr@keep.pt KEEP SOLUTIONS // University of Minho

KEEP SOLUTIONS: Projects DigitArq, CRAV (2003..[2008-2012]) RODA (2006..[2008- [) RCAAP (2008- ) PPA (2009) Open source: RODA, KOHA, DSpace, Moodle, etc. Scientific research (european projects) SCAPE: Large scale preservation 4C: the cost for curation e- ark: presented by Kuldar http://www.keep.pt This work was partially supported by the SCAPE Project. The SCAPE project is co- funded by the European Union under FP7 ICT- 2009.4.1 (Grant Agreement number 270137). 2

Database Preservation Toolkit Developed within RODA project Now stand-alone open-source project http://keeps.github.io/db-preservation-toolkit/ Imports and exports between DBMS and DB formats Supports preservation formats: DBML, SIARD

The Past: 2006-2009 RODA: Repositório de Objectos Digitais Autênticos José Carlos Ramalho jcr@di.uminho.pt Miguel Ferreira mferreira@dsi.uminho.pt Rui Castro Rcastro@iantt.pt Luis Faria lfaria@iantt.pt Francisco Barbedo frbarbedo@iantt.pt Luis Corujo lcorujo@iantt.pt Cecília Henriques chenriques@iantt.pt

SCAPE What is RODA? A digital repository created for archives, with the following features: Long term preservation and Autenticity Standards based (OAIS, EAD, PREMIS, METS, etc.) Following TRAC demands (Trustworthy Repositories Audit & Certification) Security policies Service Oriented Architecture (SOA) Nice and simple design Open source DGArq and University of Minho 5

Context RODA (2006-2009) Metadata management (EAD) Digital Object management (...) Digital Preservation Policies and protocols National Project (Archives National Board) CRiB: Preservation Services Digital Repositories (2005-2008) Distributed Migration Service Migration Service supported by knowledge base phd Thesis / U. Minho (Miguel Ferreira)

Context RODA (2006-2009) Metadata management (EAD) Digital Object management (...) Digital Preservation Policies and protocols National Project (Archives National Board) CRiB: Preservation Services Digital Repositories (2005-2008) Distributed Migration Service Migration Service supported by knowledge base phd Thesis / U. Minho (Miguel Ferreira)

7 OAIS ISO 14721

Architecture 8

8 Architecture Being replaced

Data Model 9

Data Model 9

Data Model 9

Data Model 9

Data Model 9

Digital Object Classes 10

Digital Object Classes 10

10 Digital Object Classes Normalization

10 Digital Object Classes Normalization Significant properties

Ingest 11

SIP structure 12

12 SIP structure Object class and format

12 SIP structure Object class and format Place in classification plan

12 SIP structure Object class and format Place in classification plan Descriptive metadata

12 SIP structure Object class and format Place in classification plan Descriptive metadata Preservation metadata

12 SIP structure Object class and format Place in classification plan Descriptive metadata Preservation metadata Technical metadata

12 SIP structure Object class and format Place in classification plan Descriptive metadata Preservation metadata Technical metadata Representation files

12 SIP structure Object class and format Place in classification plan Descriptive metadata Preservation metadata Technical metadata Representation files Identification of root file

Databases

Databases How do we keep archived databases readable and usable in the long term (at acceptable cost)?

Databases How do we keep archived databases readable and usable in the long term (at acceptable cost)? How do we separate the data from a specific database management environment?

Databases How do we keep archived databases readable and usable in the long term (at acceptable cost)? How do we separate the data from a specific database management environment? How can we preserve the original data semantics and structure?

Databases How do we keep archived databases readable and usable in the long term (at acceptable cost)? How do we separate the data from a specific database management environment? How can we preserve the original data semantics and structure? How can we preserve authenticity and provenance of databases?

Databases How do we keep archived databases readable and usable in the long term (at acceptable cost)? How do we separate the data from a specific database management environment? How can we preserve the original data semantics and structure? How can we preserve authenticity and provenance of databases? How can we preserve data while it continues to evolve? (here we can split the problem in two: operational databases and frozen databases)

Databases How do we keep archived databases readable and usable in the long term (at acceptable cost)? How do we separate the data from a specific database management environment? How can we preserve the original data semantics and structure? How can we preserve authenticity and provenance of databases? How can we preserve data while it continues to evolve? (here we can split the problem in two: operational databases and frozen databases) How can we have efficient preservation frameworks, while retaining the ability to query different database versions?

Databases How do we keep archived databases readable and usable in the long term (at acceptable cost)? How do we separate the data from a specific database management environment? How can we preserve the original data semantics and structure? How can we preserve authenticity and provenance of databases? How can we preserve data while it continues to evolve? (here we can split the problem in two: operational databases and frozen databases) How can we have efficient preservation frameworks, while retaining the ability to query different database versions? How can multi-user online access be provided to hundreds of archived databases containing terabytes of data?

Databases How do we keep archived databases readable and usable in the long term (at acceptable cost)? How do we separate the data from a specific database management environment? How can we preserve the original data semantics and structure? How can we preserve authenticity and provenance of databases? How can we preserve data while it continues to evolve? (here we can split the problem in two: operational databases and frozen databases) How can we have efficient preservation frameworks, while retaining the ability to query different database versions? How can multi-user online access be provided to hundreds of archived databases containing terabytes of data? Can we move from a centralized model to a distributed, redundant model of database preservation?

Databases How do we keep archived databases readable and usable in the long term (at acceptable cost)? How do we separate the data from a specific database management environment? How can we preserve the original data semantics and structure? How can we preserve authenticity and provenance of databases? How can we preserve data while it continues to evolve? (here we can split the problem in two: operational databases and frozen databases) How can we have efficient preservation frameworks, while retaining the ability to query different database versions? How can multi-user online access be provided to hundreds of archived databases containing terabytes of data? Can we move from a centralized model to a distributed, redundant model of database preservation? General big questions...

Databases

Databases What are the salient features of a database that should be preserved? (significant properties...)

Databases What are the salient features of a database that should be preserved? (significant properties...) What are the different stages in the database preservation s life cycle?

Databases What are the salient features of a database that should be preserved? (significant properties...) What are the different stages in the database preservation s life cycle? What documentation is preserved together with a database, and in what format?

Databases What are the salient features of a database that should be preserved? (significant properties...) What are the different stages in the database preservation s life cycle? What documentation is preserved together with a database, and in what format? What are the legal encumbrances on database preservation?

Databases What are the salient features of a database that should be preserved? (significant properties...) What are the different stages in the database preservation s life cycle? What documentation is preserved together with a database, and in what format? What are the legal encumbrances on database preservation? What can be learned from traditional archival appraisal for the selection of databases for preservation?

Databases What are the salient features of a database that should be preserved? (significant properties...) What are the different stages in the database preservation s life cycle? What documentation is preserved together with a database, and in what format? What are the legal encumbrances on database preservation? What can be learned from traditional archival appraisal for the selection of databases for preservation? To what extent can the preservation strategies, and procedural policies developed by archivists be adapted for databases?

Databases What are the salient features of a database that should be preserved? (significant properties...) What are the different stages in the database preservation s life cycle? What documentation is preserved together with a database, and in what format? What are the legal encumbrances on database preservation? What can be learned from traditional archival appraisal for the selection of databases for preservation? To what extent can the preservation strategies, and procedural policies developed by archivists be adapted for databases? How can we mesure the quality of preservation strategies when they are applied to data- bases? What are DB significant properties? (quality assurance...)

Databases What are the salient features of a database that should be preserved? (significant properties...) What are the different stages in the database preservation s life cycle? What documentation is preserved together with a database, and in what format? What are the legal encumbrances on database preservation? What can be learned from traditional archival appraisal for the selection of databases for preservation? To what extent can the preservation strategies, and procedural policies developed by archivists be adapted for databases? How can we mesure the quality of preservation strategies when they are applied to data- bases? What are DB significant properties? (quality assurance...) Technical questions...

Databases: goals How do we store them? How do we access them?

Databases: goals How do we store them? How do we access them? RODA questions...

Databases

Databases Normal evolution path: Data => Structure => Semantics

Databases Normal evolution path: Data => Structure => Semantics First prototype: Data Structure Only frozen DBs

Databases Normal evolution path: Data? Data => Structure => Semantics First prototype: Data Structure Only frozen DBs

Databases Normal evolution path: Data? Data => Structure => Semantics Structure? First prototype: Data Structure Only frozen DBs

Databases Normal evolution path: Data? Data => Structure => Semantics Structure? Views? First prototype: Data Structure Only frozen DBs

Databases Normal evolution path: Data? Data => Structure => Semantics Structure? Views? Reports? First prototype: Data Structure Only frozen DBs

Databases Normal evolution path: Data? Data => Structure => Semantics Structure? Views? Reports? First prototype: Data Stored Procedures? Structure Only frozen DBs

Databases Normal evolution path: Data? Data => Structure => Semantics Structure? Views? Reports? First prototype: Data Stored Procedures?... Structure Only frozen DBs

HOW? The need for an intermediate representation Input Formats (M) Output Formats (N)

HOW? The need for an intermediate representation Input Formats (M) Output Formats (N) M*N migrators

IR: DBML Input Formats (M) Output Formats (N) DBML

IR: DBML Input Formats (M) Output Formats (N) DBML For each new output or input format you only need to code one new migrator.

DBML design Principles Hardware independent; Software independent; Easy to process; Descriptive; It should be possible to add metadata; It should be possible to add semantics;

DBML design Principles Hardware independent; Software independent; Easy to process; Descriptive; It should be possible to add metadata; It should be possible to add semantics; XML was the obvious choice

SIP Structure (DB example) KEEPS Employees

SIP Structure (DB example) Manifest KEEPS Employees

SIP Structure (DB example) Manifest Descriptive Metadata (EAD): - producer - colection -... DBML: - Data; - Structure; - Other metadata. KEEPS Employees

SIP Structure (DB example) Manifest Binaries Descriptive Metadata (EAD): - producer - colection -... DBML: - Data; - Structure; - Other metadata. KEEPS Employees

SIP Structure (DB example) Technical Metadata: - color - dimensions -... Manifest Descriptive Metadata (EAD): - producer - colection -... Binaries DBML: - Data; - Structure; - Other metadata. KEEPS Employees

SIP Structure (DB example) Technical Metadata: - color - dimensions -... Manifest Descriptive Metadata (EAD): - producer - colection -... Binaries DBML: - Data; - Structure; - Other metadata. KEEPS Employees

SIP Structure (DB example) Technical Metadata: - color - dimensions -... Manifest Descriptive Metadata (EAD): - producer - colection -... Binaries DBML: - Data; - Structure; - Other metadata. KEEPS Employees

SIP Structure (DB example) Technical Metadata: - color - dimensions -... Manifest Descriptive Metadata (EAD): - producer - colection -... Binaries DBML: - Data; - Structure; - Other metadata. KEEPS Employees

SIP Structure (DB example) Technical Metadata: - color - dimensions -... Manifest Descriptive Metadata (EAD): - producer - colection -... Binaries DBML: - Data; - Structure; - Other metadata. KEEPS Employees

SIP Structure (DB example) Technical Metadata: - color - dimensions -... Manifest Descriptive Metadata (EAD): - producer - colection -... Binaries DBML: - Data; - Structure; - Other metadata. KEEPS Employees

SIP Structure (DB example) Technical Metadata: - color - dimensions -... Manifest Binaries Compressed File Descriptive Metadata (EAD): - producer - colection -... DBML: - Data; - Structure; - Other metadata. KEEPS Employees

DBML

DBML: Structure

DBML: Data

Databases: RODA answers How do we store them? DBML + binaries + technical metadata How do we access them? PhpMyAdmin (hacked version)

Databases: RODA answers How do we store them? DBML + binaries + technical metadata How do we access them? PhpMyAdmin (hacked version) RODA answers...

Some problems Extracting data is easy: SELECT * FROM... Extracting the structure is not: DBMS protect this information; Each DBMS stores it differently; Different versions of the same DBMS can also act differently; We have to prepare/hack the DBMS.

Open Planets: Database Archiving (Copenhagen 2012) Two approaches emerged: Preserve the database and the environment allowing authentic access to the information and any accompanying applications Extract / migrate the raw data and table structure from the original database

Open Planets: Database Archiving (Copenhagen 2012) Two approaches emerged: Preserve the database and the environment allowing authentic access to the information and any accompanying applications Relies on emulation strategy Extract / migrate the raw data and table structure from the original database

Open Planets: Database Archiving (Copenhagen 2012) Two approaches emerged: Preserve the database and the environment allowing authentic access to the information and any accompanying applications Relies on emulation strategy Extract / migrate the raw data and table structure from the original database Relies on migration strategy

Open Planets: Database Archiving (Copenhagen 2012) Two approaches emerged: Preserve the database and the environment allowing authentic access to the information and any accompanying applications Relies on emulation strategy Extract / migrate the raw data and table structure from the original database RODA+DBML Relies on migration strategy

Open Planets: Database Archiving (Copenhagen 2012) Two approaches emerged: Preserve the database and the environment allowing authentic access to the information and any accompanying applications Relies on emulation strategy Extract / migrate the raw data and table structure from the original database RODA+DBML SIARD Relies on migration strategy

Open Planets: Database Archiving (Copenhagen 2012) Two approaches emerged: Preserve the database and the environment allowing authentic access to the information and any accompanying applications Relies on emulation strategy Extract / migrate the raw data and table structure from the original database RODA+DBML SIARD SIARD-DK Relies on migration strategy

Open Planets: Database Archiving (Copenhagen 2012) Two approaches emerged: Preserve the database and the environment allowing authentic access to the information and any accompanying applications Relies on emulation strategy Extract / migrate the raw data and table structure from the original database RODA+DBML SIARD SIARD-DK ADDML Relies on migration strategy

DBML vs SIARD DBML SIARD Data Yes (no segmentation) Yes (with segmentation) Structure Yes (XML) Yes (XSD) Stored Procedures No Yes (XML) Triggers No Yes (XML) Views No Yes (XML)

db-preservation-toolkit architecture

db-preservation-toolkit architecture ADDML

db-preservation-toolkit architecture ADDML ADDML

Pre-ingest / Production / SIP creation

Pre-ingest / Production / SIP creation new

Pre-ingest / Production / SIP creation new new

Pre-ingest / Production / SIP creation new new ADDML

Pre-ingest / Production / SIP creation new new ADDML to be implemented

Access

Access Working on RDF viewer, we don t need to stay stuck in the relational model

New prototype Msc thesis: Database Convert Tool: exploring SIARD OWL model development Triple Store Implementation RDF endpoint implementation SPARQL based access

Future plan: Fast database viewer

Recent work Phd thesis: Graph view dissemination, relational model reverse engineering, algorithm development (from relational to graph model) [http://repositorium.sdum.uminho.pt/handle/ 1822/25655] Msc thesis: SIARD support (DBML replacement) Msc thesis: RDF representation for databases

Questions? José Carlos Ramalho Engineer / Researcher / Teacher jcr@keep.pt / jcr@di.uminho.pt ARQUIVOS BIBLIOTECAS MUSEUS www.keep.pt