Challenges in digital preservation: Relational databases




Mark Brogan and Justin Brown
School of Computer and Information Science
Edith Cowan University, Perth, Western Australia 6050

Abstract

Chen (2001) coined the term "digital preservation paradox" to describe a philosophy of preservation management that recognises change as a cornerstone of managing digital records for permanence. Paradox also surrounds use of the terms "structured" and "unstructured". As more and more so-called unstructured information sources are generated from structured repositories, preservation planning for unstructured sources must expand to encompass the structured information sources from which they are created. This paper reviews strategies and methods for database preservation, including issues and methods at the enterprise level. A case study based on a widely used XML normalisation tool (MS Access) is used to investigate the advantages and disadvantages of XML normalisation as a preservation strategy.

The rise of the structured information source

In computing, the document has undergone profound changes since the emergence of information technology for business applications in the 1960s. In the early stage of its evolution, the document was unitary, proprietary and dumb. The open systems movement challenged the proprietary character of the document and led to a concept of the document as interoperable and accessible outside the creating application context. The unitary character of the document was also challenged by the Web revolution of the 1990s, which created the phenomenon of the compound or so-called virtual document (Reinhardt, 1994). As the importance of metadata descriptions of documents became established in the 1990s, documents acquired metadata wrappers describing authorship, content and change history, and ceased to be dumb. These changes describe trajectories in document engineering over the past thirty years.
Since 2000, the ideas of content re-use and re-purposing have further shaped the evolution of the document. As earlier developments in document engineering added structure to the document, so too has today's content management trajectory. This is most plainly seen in the convergence on Extensible Markup Language (XML) as the underlying technology of what were once unstructured types. For example, with Microsoft Office 2003, Microsoft made the transition to XML as the technology foundation of its Office productivity suite. By using the XML format, an organisation can:

set up an environment for authors to create documents with a consistent look and feel, while at the same time facilitating re-use of content. The Office Open XML format enables an organisation to define templates using an XML schema that is most suitable for its business requirements. This schema can consist of tags that correspond to various sections of a document, such as <Executive Summary/>, <MainBody/>, and <Conclusion/>. Re-use of content is possible because the content is automatically tagged with the appropriate XML code and can be programmatically processed for document assembly, distribution, and conversion (Microsoft, 2006, pp. 15-16).

Parallel developments in the open source community have delivered the Open Document Format (ODF), a competing XML-based open source file format for electronic office documents such as spreadsheets, charts, presentations and word processing documents. Together these developments have placed what were once unstructured information sources on a structured trajectory.

Cognitive dissonance: Structured and unstructured information sources

While the user experience of documents continues to suggest their unitary, unstructured character, the perception of types as being structured or unstructured is increasingly misleading. Such a perception disguises the real trend towards increasingly structured documents, many of which now emanate from technologies more appropriately described as databases or XML databases. Current trends in document engineering are poorly reflected in records programs and digital preservation, where the abiding concern continues to be unstructured types. So why isn't the current trajectory in document engineering recognised? Some of the responsibility for this can be sheeted home to the distinctions made by information and information systems managers between data, information and records. For example, many records managers believe that their sole purpose is to manage records.
Innocent enough, but since records are considered a species of document, and databases repositories of data, the records program is often defined specifically to exclude databases. As risk management discourse increasingly emphasises the importance of data retention policy, records managers and other Information Management (IM) professionals are being drawn, reluctantly, into a dialogue with Information Systems (IS) professionals about structured information sources. Further, as the locus of recordkeeping increasingly switches to structured information sources, realignment of the records program will have to take place, particularly since information systems professionals regard data retention as a minefield and are actively seeking IM perspectives in this area.

Foundation concepts in database archiving

As the focus of retention policy expands to encompass databases, IM and IS professionals are actively contemplating methods and tools for database archiving. Review of the literature shows that database archiving is discarding some of its early ambiguity and taking shape around the core concept of long-term retention of usable digital memory as a core component of corporate

governance (Gartner, 2006). To understand these developments and to successfully deploy business solutions, it is important to understand foundation concepts in database archiving.

These days most databases are relational. A relational database stores data in a series of related tables, each consisting of columns and rows. These entities correspond to the user view of a database. Other frames concern the developer and physical-layer views of the database. These are described in Table 1. The left-hand column describes the physical or file processing model of the database; the middle and right columns, the logical model:

File Processing Environment | Relational db (developer) | Relational db (user)
File                        | Relation                  | Table
Record                      | Tuple                     | Row
Field                       | Attribute                 | Column

Table 1 - Relational database frames

Relations, tuples and attributes are concepts representative of the systems analysis domain that may be encountered in project documentation, but are also commonly used in connection with the theory of databases. In the implementation of databases, these theoretical constructs are translated respectively as tables, rows and columns, which describe the user view. At the file processing level, tables are equivalent to files, rows to records and columns to fields.

Relationships between tables are defined so as to create rules by which data manipulation in one table causes corresponding changes in another. This governance of behaviour is a means by which integrity is ensured within a relational database. A table relationship is created by the establishment of a common key between two or more tables. Where a key first appears it is known as a primary key; where it appears again to establish a relationship it is known as a foreign key.

As part of the evolution of IS thinking about database archiving, the IS view has moved to a notion of archiving familiar to IM professionals.
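The primary/foreign-key mechanism just described can be sketched in a few lines of SQL. The sketch below uses Python's sqlite3 module purely as a convenient stand-in for an enterprise RDBMS, and the table and column names are invented for illustration:

```python
import sqlite3

# In-memory database; SQLite enforces foreign keys only when asked.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")

# "PatientID" first appears here, as a primary key ...
con.execute("""CREATE TABLE Patient (
    PatientID INTEGER PRIMARY KEY,
    Surname   TEXT NOT NULL)""")

# ... and reappears here as a foreign key, establishing the relationship.
con.execute("""CREATE TABLE Screening (
    ScreeningID INTEGER PRIMARY KEY,
    PatientID   INTEGER NOT NULL REFERENCES Patient(PatientID),
    Result      TEXT)""")

con.execute("INSERT INTO Patient VALUES (1, 'Smith')")
con.execute("INSERT INTO Screening VALUES (10, 1, 'clear')")

# The relationship governs behaviour: a screening row pointing at a
# patient who does not exist is rejected, preserving integrity.
try:
    con.execute("INSERT INTO Screening VALUES (11, 99, 'clear')")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

The point of the sketch is the last statement: the common key turns two independent tables into a governed structure, which is exactly what makes tables harder to archive than self-contained files.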
Gartner (2006) refers to database archiving as:

- a critical component of Information Life Cycle Management;
- a tool for implementing data retention policy and meeting compliance requirements; and
- a key component of corporate governance that enables the retention of usable digital memory for long periods.

This is quite a departure from earlier IS thinking, where archiving was identified with Hierarchical Storage Management (HSM and the like), backing up or copying data (data management), or taking data offline. The discussion so far is sufficient to identify how and why database archiving must be different from other forms of archiving. Database archiving differs from file-level archiving because data is stored in tables, and the rows and columns within those tables are all tightly linked. Unlike files, which are self-contained, information in a database is found in rows and columns which depend on other rows and columns; primary and foreign keys are an example. Consequently, a piece of information in a row or a column of data cannot be selectively taken out of the database and moved off to an archive with standard archiving tools. Further, the meaning and evidential value of information in databases can often only be assessed in relation to various metadata sources that describe data organisation. Such sources include data dictionaries, which describe rows, columns and data types, and Entity-Relationship (E-R) diagrams. Structured information sources are also highly system dependent and may undergo near-constant content change. If archiving structured information sources is different from archiving unstructured sources, what possibilities exist for database archiving?

Framing database archiving

On reflection, the following possibilities exist regarding the record(s) within a database (Digital Preservation Testbed, 2003):

- the complete database system (database, Relational Database Management System (RDBMS), and application) together constitutes the digital record;
- the database is the digital record;
- a single row of data stored in a database table (i.e. a tuple) is the digital record;
- data distributed over a number of tables constitutes the digital record;
- information in the database as displayed on screen by the application forms the digital record.
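The difference between the third and fourth of these frames, a single tuple versus data distributed over several tables, can be made concrete. Because rows depend on other rows through keys, extracting a meaningful record usually means a join rather than a copy of one row. A sketch, using sqlite3 as a stand-in and invented tables:

```python
import sqlite3

# Illustrative tables: a Screening row is meaningless without the
# Patient row its foreign key points to.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Patient (PatientID INTEGER PRIMARY KEY, Surname TEXT);
    CREATE TABLE Screening (ScreeningID INTEGER PRIMARY KEY,
                            PatientID INTEGER REFERENCES Patient(PatientID),
                            Result TEXT);
    INSERT INTO Patient   VALUES (1, 'Smith'), (2, 'Jones');
    INSERT INTO Screening VALUES (10, 1, 'clear'), (11, 2, 'recall');
""")

# Copied on its own ("single tuple" frame), the row is (10, 1, 'clear'):
# the '1' is an opaque key with no meaning outside the database.
row = con.execute("SELECT * FROM Screening WHERE ScreeningID = 10").fetchone()

# Under the "data distributed over tables" frame, the extract follows the
# key into the related table, i.e. a join rather than a file copy.
extract = con.execute("""
    SELECT s.ScreeningID, p.Surname, s.Result
    FROM Screening s JOIN Patient p ON p.PatientID = s.PatientID
    WHERE s.ScreeningID = 10
""").fetchone()
print(extract)  # (10, 'Smith', 'clear')
```

This is why standard file-level archiving tools cannot simply lift a row out of a database: the record frame chosen at appraisal determines how much of the surrounding structure must travel with it.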
There is no magic wand that can be applied to determine which of these frames of the record is best matched to discovered value(s). There are methodologies that purport to help, such as DIRKS (NAA, 2001), but it is archivists' understanding of appraisal that informs the decision-making process. Conclusions from appraisal lead to an understanding of the most appropriate frame and the selection of preservation tactics.

Methods and tools for RDBMS archiving

Methods and tools for database archiving assume particular appraisal and disposition outcomes. The following is a survey of methods and tools removed from the independent variable of appraisal outcome. Two methods are distinguished in the literature: active and inactive archiving.
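The first of these, discussed in the next section, reduces to three SQL operations: create an archive table with the same structure as the source, copy the inactive rows across, then delete them from the source. A minimal sketch, again using sqlite3 as a stand-in, with invented table names and an invented retention cut-off:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY,
                         Placed  TEXT, Detail TEXT);
    INSERT INTO Orders VALUES
        (1, '1999-03-01', 'inactive record'),
        (2, '2007-06-15', 'current record');

    -- Archive table with the same columns as the source (a real
    -- implementation would replicate keys and indexes as well).
    CREATE TABLE Orders_Archive AS SELECT * FROM Orders WHERE 0;
""")

cutoff = "2000-01-01"  # illustrative retention boundary

with con:  # one transaction: copy and delete succeed or fail together
    con.execute(
        "INSERT INTO Orders_Archive SELECT * FROM Orders WHERE Placed < ?",
        (cutoff,))
    con.execute("DELETE FROM Orders WHERE Placed < ?", (cutoff,))

# Current queries stay on the source; historical queries are redirected
# to the archive table or instance.
print(con.execute("SELECT COUNT(*) FROM Orders").fetchone()[0])          # 1
print(con.execute("SELECT COUNT(*) FROM Orders_Archive").fetchone()[0])  # 1
```

Wrapping the copy and delete in a single transaction matters: if the delete ran but the copy failed, records would be destroyed rather than archived.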

I. Active Archiving

Moving (archiving) records from one RDBMS table to another table located in the same or a different database instance, while preserving query and other core archival functions, defines an active archiving approach. Microsoft is an exponent of this, the simplest method of database archiving (Microsoft, 2007). With active archiving, records are moved from an existing table to another table with the same structure and organisation as the source table. Structured Query Language (SQL) is used to select, copy and delete records from the source table. The user and application selectively switch context (the database connection) for historical queries, directing them to the archive instance. Figure 1 describes this process: an archive table is created with the same structure and organisation as the source table, and SQL is used to move data from the source table to the archive table.

Figure 1 - Active archiving

The main advantage of such an approach is that it preserves user views and core functionality. By removing inactive records from the current domain of the database, information retrieval efficiency is also improved. Disadvantages of proprietary active archiving include additional licensing costs and the deferral of long-term digital preservation planning.

II. Inactive Archiving (Migration)

Inactive archiving is the migrating (archiving) of records from an existing system to a newer hardware/software environment based on another system. Migration to a new system when an existing relational database system enters the legacy phase of its working life is a highly common form of database archiving. For example, records might be migrated (archived) to:

- a content management system (CMS);
- a new database host (e.g. Access -> SQL Server); or

- a new version of the current database.

Since most enterprise RDBMSs support backwards compatibility and interoperability, first-generation migration can be unproblematic. However, as the underlying information architecture changes over time, subsequent migration may involve loss of fidelity and function. For this reason migration is not usually thought of as a long-term preservation solution for complex database systems.

III. Inactive Archiving (XML Normalisation)

The disadvantages of active archiving and migration suggest a role for open standards, particularly where long-term preservation is contemplated and preservation of database behaviours is not mission critical. XML normalisation, in which records are migrated to XML, has arisen as the standards-based approach of choice. Typically XML normalisation is performed as an end-of-life-cycle activity, when databases enter a so-called legacy phase. XML normalisation involves no licensing costs and is highly vendor independent. A trade-off is that XML normalisation results in a flat file format and the loss of database behaviours such as queries, reports and user views. Because the preservation of referential integrity is problematic with XML normalisation, it is generally not applied as part of an active archiving strategy. Figure 2 describes the consequences of XML normalisation for a case study health care database for breast cancer screening:

Figure 2 - Inactive Archiving (XML Normalisation)

The products of normalisation typically consist of an XML schema that represents the data model and an XML file that contains the table row data. Importation of normalised archive data into a newer production environment usually requires the writing of importation scripts that work by extracting values based on the node tree described in the XML schema.

Normalisation case study

Particularly for long-term preservation, XML normalisation involves clear advantages compared with the other methods discussed. But how good is

normalisation as an archiving method? This question is best answered via case study with current-generation normalisation tools. Beginning with MS Access 2003, Microsoft has supported XML normalisation as a migration pathway for Access databases. As a case study and precursor to a broader investigation, the authors undertook an XML normalisation of the MS Access 2003 database referred to in Figure 2 and evaluated the results. The criteria used were those originally developed by the Digital Preservation Testbed Project in 2003: authenticity; reliability (data integrity); completeness; and digital characteristics comprising context, content, structure, appearance, behaviour and metadata.

Authenticity

This criterion was deemed to be satisfied if sufficient evidence existed of authorship to establish the provenance of the database, the provenance of records held and their use history. In our trial database, this design feature was not supported. Experience gained suggested that if these features had been supported, normalisation might have been used successfully to capture the provenance and use history of records in the database.

Reliability (Data integrity)

This criterion was deemed to be satisfied if the integrity of data had been preserved. On inspection, the criterion was found to be satisfied with reservation: the Date/Time type was translated as a text string inclusive of a time stamp not found in the record values.

Completeness

This condition was also satisfied subject to reservation. All tables and columns were successfully exported as parent and child nodes in XML. However, relationships between tables were lost. Prima facie, there is no way that table relationships can be represented with fidelity in XML.

Structure

Similarly, this criterion was satisfied subject to reservation. Foreign key values were often translated as numeric (foreign key) rather than text values.
A project dictionary would therefore have been required to understand the translation. The root node itself was correctly translated as a complex type, and all parent/child relationships were found to be correct.

Appearance

An XML document viewed in a parser has a very different appearance from a database. XSL transformation might have been used to improve performance against this criterion. However, if recordness is contained within RDBMS-resident views, XML is not a good solution to this problem.
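The importation scripts mentioned earlier work by walking the node tree of the normalised output. A minimal sketch of the idea, using Python's standard ElementTree parser and an invented fragment of exported row data (the element names are hypothetical, not the actual case study output):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of normalised output: each child of the root
# represents a table row, each grandchild a column value.
exported = """
<dataroot>
  <Patient><PatientID>1</PatientID><Surname>Smith</Surname></Patient>
  <Patient><PatientID>2</PatientID><Surname>Jones</Surname></Patient>
</dataroot>
"""

root = ET.fromstring(exported)

# Walk the node tree, extracting column values row by row; in a real
# importation script these tuples would feed parameterised INSERTs.
rows = [{col.tag: col.text for col in row} for row in root.findall("Patient")]
print(rows)

# Everything round-trips as text: datatypes must be re-imposed on import,
# one of the fidelity losses noted in this case study.
assert all(isinstance(v, str) for r in rows for v in r.values())
```

The final assertion illustrates the reservation recorded above: the flat XML carries values only as strings, so datatype and key semantics must be recovered from the accompanying schema or a project dictionary.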

Behaviour

Key aspects of behaviour such as input masks and validation rules were not translated during XML conversion. Likewise, no method exists for the translation of SQL queries and reports. A conclusion from evaluation against this criterion is that XML normalisation is not well suited to the preservation of database behaviour.

Metadata

The principal metadata outcome from the conversion was an XML schema describing the vocabulary and structure of a valid patients.xml file. Data types and cardinality were often successfully captured. Primary keys were mostly correctly identified. Table, row and column structure was successfully translated as complex and simple types in correct hierarchical (parent/child) relationship.

Figure 3 - Breast Cancer Screening Schema

Subject to the reservations described, the case study delivered a strong endorsement of XML normalisation. Digital Preservation Testbed (2003, p. 33) concluded that XML is the most effective strategy for the durable preservation of databases; XML is highly capable of representing the context, content, and structure of databases.

Enterprise applications and tools

While XML offers a durable way of storing the content of an RDBMS as well as its structure, not all enterprise solutions offer out-of-the-box functionality that allows data to be moved from operational to archival mode. As an example, neither of Microsoft's flagship database solutions, MS SQL Server 2000 and 2005, offers the ability to "Save As" to an XML structure directly. While these products allow for extraction of query results directly to XML output, and even the ability to store XML in a new XML datatype column (Juday, 2007), neither provides the ability to save table or view data to a structured XML document with an accompanying XSD schema. To achieve such functionality within the context of MS SQL Server 2000/2005 requires exportation of data to an

intermediate storage solution, such as Microsoft Access, from which it can then be exported to a structured XML file. In the case study, an observed problem with this approach was structural degradation and loss of fidelity in terms of the original RDBMS formats, specifically datatypes and relational links. The left-hand side of Figure 4 below shows a source table of data (in Design view) in MS SQL Server 2000; the figure on the right shows the same table design once it has been exported to Microsoft Access.

Figure 4 - MS SQL Server 2000 to MS Access

As can be seen, export involves a loss of fidelity in terms of datatypes, for which Microsoft Access offers only simplistic support. For example, the primary key identifier with its associated auto-incrementing counter has been lost on the ProductID column. Perhaps more importantly, as can be seen in Figure 5 below, the relational structure of the MS SQL Server 2000 database is lost during the translation to the Microsoft Access format:

Figure 5 - Relational structure and XML Normalisation

In some respects, the loss of datatype information during the translation from an enterprise solution to a more basic desktop environment may not be crucial, especially if the source data is not totally reliant on custom or proprietary datatypes for the operation of any linked applications or business logic. However, the loss of relational structure could be a more pressing issue where that information is needed to re-instate the database to its original structure. In some ways this may not even be an issue, in that MS SQL Server will not import relational structures from a Microsoft Access database anyway, as the Data Definition Language (DDL) implementations are not compatible.

Conclusion

Microsoft SQL Server and Microsoft Access have been used as examples in this discussion of enterprise tools because together they provide some interesting possibilities for database-to-XML archiving. Microsoft Access provides strong import and export facilities for XML and XML schema, with the added ability to attach XSL stylesheets to outputted XML documents for ready display in almost any modern web browser. MS Access datatypes and relational structures can be captured to XML and schema and then re-instated to their original form from the XML archive. However, the limitation of such functionality is that it is an

isolated implementation, with only MS Access data being truly exportable to and importable from an XML archive. While most current enterprise-level RDBMS solutions, such as MS SQL Server, Oracle 10g (Murthy, Liu, Krishnaprasad et al., 2005) and MySQL, have varying levels of internal XML support, none exports directly to XML with any level of fidelity for content and structure. This leaves information managers, and the organisations that they serve, having to look to third-party vendors for tools to implement the RDBMS -> XML -> RDBMS cycle: a cycle described by the need to archive to XML but re-instate back to the enterprise environment. The alternative is an asynchronous archiving paradigm, in which data is archived to XML with the intention that it later be retrievable and readable, but not re-instatable into the RDBMS from which it came.

References

Chen, S.-S. (2001, March). Perspectives: The paradox of digital preservation. IEEE Computer, 1-6.

Digital Preservation Testbed. (2003). From digital volatility to digital permanence: Preserving databases. Retrieved 10 July 2007 from http://www.digitaleduurzaamheid.nl/bibliotheek/docs/volatility-permanence-databases-en.pdf

Gartner. (2006). Archiving: Technology overview. Gartner Research, no. G00137070.

Juday, J. (2007). The fundamentals of the SQL Server 2005 XML datatype. Retrieved from http://www.developer.com/db/article.php/3531196

Microsoft. (2006). Enterprise content management: Breaking the barriers to broad user adoption. Retrieved 11 October 2007 from http://www.microsoftio.com/content/bpio/prospect_and_demand/ecm_wp2.pdf

Microsoft. (2007). Periodically archived records in an Access database.
Retrieved 20 July 2007 from http://office.microsoft.com/en-us/access/ha010345681033.aspx

Murthy, R., Liu, Z., Krishnaprasad, M., Chandrasekar, S., Tran, A., Sedlar, E., Florescu, D., Kotsovolos, S., Agarwal, N., Arora, V., & Krishnamurthy, V. (2005). Towards an enterprise XML architecture. Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pp. 953-957.

National Archives of Australia. (2001). epermanence: DIRKS - a strategic approach to managing business information. Retrieved 11 October 2007 from http://www.naa.gov.au/images/dirks_glossary_tcm2-954.pdf

Reinhardt, A. (1994, August). Managing the new document. Byte. Retrieved 11 October 2007 from http://www.byte.com/art/9408/sec7/art1.htm