Standards Development PROS 14/00x Specification 3: Long term preservation formats
Copyright Statement

State of Victoria 2014

This work is licensed under a Creative Commons Attribution 3.0 Australia licence. You are free to re-use the work under that licence, on the condition that you credit the State of Victoria (through the Public Record Office Victoria) as author. The licence does not apply to any images, photographs, or branding, including the Victorian Coat of Arms, the Victorian Government Logo, and the Public Record Office Victoria logo.

Disclaimer

General

The State of Victoria gives no warranty that the information in this version is correct or complete, error free or contains no omissions. The State of Victoria shall not be liable for any loss howsoever caused whether due to negligence or otherwise arising from the use of this document. This issues paper should not constitute, and should not be read as, a competent legal opinion. Agencies are advised to seek independent legal advice if appropriate.

Records Management Standards Application

The recordkeeping standards issued by PROV apply to all records in all formats, media or systems (including business systems). Agencies are advised to conduct an independent assessment to determine what other records management requirements apply.

State of Victoria 2014 v1.0
Table of Contents

1. Introduction
1.1. Existing standard
1.2. Related documents
1.3. General References
1.4. Acknowledgements
2. VERS Long Term Preservation Formats
2.1. How PROV chose the long term preservation formats
2.2. How to choose one of the long term preservation formats
2.3. Long term preservation formats
2.4. Other formats
Acronyms

The following acronyms are used throughout this document.

PROS    Public Record Office Standard
PROV    Public Record Office Victoria
VEO     VERS Encapsulated Object
VERS    Victorian Electronic Records Strategy
Executive Summary

This specification is part of the Standard for the encapsulation of digital information (PROS 14/xxx). It describes the long term preservation formats that are acceptable to PROV.

Obsolescence of digital formats is a problem for long term digital preservation. When a format becomes obsolete it may be difficult (or impossible) to obtain software to read the format and render the information contained within it.

A long term preservation format is one that is:
- so ubiquitous that software to read the format is likely to be available for the foreseeable future, or
- well documented, so that readers can, if necessary, be implemented from scratch.

We are interested in any suggestions on ways to improve this specification. Please send comments to standards@prov.vic.gov.au.
1. Introduction

This specification lists a set of long term preservation formats that are considered to have a long usable life, and that will minimise the cost of providing access to the content over its life.

All content transferred to PROV must be migrated (if necessary) to one of these formats. If content is migrated, the original format must also be transferred to PROV (unless specifically exempted).

It is recommended that agencies adopt these formats when creating content, as this minimises subsequent migration costs. Agencies are also encouraged to adopt these long term formats for content that they need to keep for a long time.

The formats were chosen on the following criteria:
- Extremely widespread adoption.
- Being the dominant format in a particular category.
- Multiple independent implementations of the software.
- A published formal specification that implementations adhere to.
- Already accepted by the previous version of this standard.

1.1. Existing standard

In order to protect the investment already made by vendors and agencies, PROV will continue to accept VEOs conformant to the existing VERS standard (PROS 99/007 (Version 2)) for the indefinite future. The new format, however, is not backwards compatible with the previous version.

1.2. Related documents

This document should be read in conjunction with:
- PROS 14/00x Standard for the encapsulation of digital information
- PROS 14/00x Specification 1: Constructing VERS Encapsulated Objects
- PROS 14/00x Specification 2: Adding metadata to VEOs
- PROS 14/00x Specification 3: Long term preservation formats

1.3. General References

References to format specifications are included in the body of this specification.

1.4. Acknowledgements

We would like to acknowledge and thank the people who took the time to comment on earlier drafts of this proposal.
Nearly all the comments have been included in this draft, and where this is not possible we have included footnotes to explain the reasons.
2. VERS Long Term Preservation Formats

This specification identifies the formats that are considered low risk for the long term preservation of information. The risk being addressed is that in the future it will not be possible to obtain software to extract and present the information embedded in a digital object.

Over a sufficiently long period of time all formats can be expected to become unreadable. At some point, then, it will be necessary to undertake a preservation action for a format. The goal of this specification is to identify formats for which this preservation action is likely to be a long way in the future and for which, when the action does become necessary, the required tools are available.

2.1. How PROV chose the long term preservation formats

Selection of long term preservation formats is based on the assumption that ultimately any format is likely to fall out of use and objects in that format will require preservation actions. Good long term preservation formats are those that are likely to have a long lifespan before preservation interventions are required, and for which suitable tools can be easily obtained when a preservation action is required. It is important to note that we are not assuming that formats will have an indefinite lifespan.

Characteristics that suggest that a format is likely to have a long lifespan before preservation interventions are required are that:
- The format is in extremely widespread use.
- The format has the dominant market share in its domain.

These two characteristics mean that economics sustains the format. New products in the domain must accurately support the format (otherwise it is extremely difficult for them to gain market share).
The number of instances of the format means that there is an economic incentive for developers to produce readers or migration tools for that format, even if the original vendor ceases support.

An additional benefit of selecting common, dominant formats is that these formats are likely to account for the majority of record content held by an agency.

Characteristics that suggest that tools will be available to undertake the preservation action include:
- A published format specification exists [1].
- Multiple independent implementations exist of format creators/readers [2].

2.2. How to choose one of the long term preservation formats

Requirement to use one of the formats

All record content transferred to PROV must be represented in one of the long term preservation formats in this standard.

Agencies are strongly encouraged to use these long term preservation formats for information that must be retained for more than seven years, even if the information does not have to be transferred to PROV. The longer the information needs to be kept, the more important it is to use one of these long term preservation formats. Use of the formats will aid in ensuring long term access to the content.

[1] It is not necessary that the format be published by a standards body, but this is preferred.
[2] Multiple independent implementations show 1) that there is an economic incentive to read the selected format, and 2) that the format can be accurately implemented by independent vendors.
Avoiding migration

Agencies are strongly encouraged to adopt these long term preservation formats for day to day business use. This will avoid the requirement to subsequently migrate information to a long term preservation format.

Migration from one format to another is to be avoided if at all possible. In general, migration is expensive (to obtain the necessary tools, to carry out the migration, and to conduct the necessary quality assurance to ensure the migration was carried out successfully). Migration also always carries a risk of losing information.

One of the criteria that PROV used when selecting long term preservation formats was to choose formats commonly in use within agencies to avoid, as far as possible, migration.

Format selection

Several long term preservation formats are provided for most categories of information. Agencies and agency staff can choose the most appropriate format for their business needs. In accordance with our policy of avoiding migration, the most appropriate format to choose will be the one in which the business is actually undertaken.

Version selection

Unless otherwise indicated, PROV will accept any version or variant of the selected long term preservation formats. This is because, in most cases, it is difficult for agencies to configure or set up software products to produce particular variants of a format. Further, most software will produce the latest version [3] of a particular format.

PROV reserves the right, however, to not accept records that do not conform to a particular format, even though the production software is claimed to produce valid objects.

Where appropriate, we provide recommendations if particular versions of a format are preferred over others. Agencies are encouraged to adopt the recommended versions where possible.
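As a rough illustration of the guidance on avoiding migration, the following Python sketch checks a file's extension against the formats listed in section 2.3 to suggest whether migration is likely to be needed. The names `LONG_TERM_EXTENSIONS` and `needs_migration` are hypothetical and are not defined by this specification. The sketch also assumes that the file extension is a reliable hint of the format, which is not guaranteed: an extension cannot show, for example, whether a PDF is PDF/A-3 or whether a WAVE file uses an LPCM codec.

```python
from pathlib import Path

# Extensions transcribed from the format list in section 2.3.
# The grouping keys are illustrative labels, not normative identifiers.
LONG_TERM_EXTENSIONS = {
    "document": {".txt", ".pdf", ".doc", ".docx"},
    "web": {".htm", ".html", ".css", ".xml", ".warc"},
    "spreadsheet": {".csv", ".xls", ".xlsx"},
    "presentation": {".ppt", ".pptx", ".pdf"},
    "image": {".jpg", ".tif", ".tiff"},
    "audio": {".mp3", ".mp4", ".wav"},
    "video": {".mp4"},
    "email": {".eml"},
}


def needs_migration(filename: str) -> bool:
    """Return True if the extension is not one listed in section 2.3.

    This is only a first-pass hint (assumption: the extension matches
    the real format); it does not validate the file's content.
    """
    ext = Path(filename).suffix.lower()
    return not any(ext in exts for exts in LONG_TERM_EXTENSIONS.values())


if __name__ == "__main__":
    for name in ["minutes.docx", "letter.rtf", "site-plan.dwg"]:
        status = "migrate" if needs_migration(name) else "keep as is"
        print(f"{name}: {status}")
```

A check of this kind would only flag candidates for migration; where content is migrated, the requirement to include the original format in the VEO still applies.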
Including the original format

Where record content has been migrated to a long term preservation format for the purposes of transfer, a copy of the original, un-migrated, format must also be included in the VEO unless otherwise agreed by PROV.

The requirement to include a copy of the original format guards against the following risks:
- that the migration did not result in an accurate representation of the original record
- that the migration resulted in loss of information
- that a better migration approach may be available in the future

PROV will not generally require a copy in original format where:
- the record is extremely large, and
- the migration process is a routine technology process with little chance of content loss.

[3] At the time the software was created.
A typical example of a situation where the original would not normally be required is the conversion of video or audio to a long term preservation format.

2.3. Long term preservation formats [4] [5]

The following formats are defined as long term preservation formats:

Document and text formats:
- Plain text (.txt)
- Portable Document Format (.pdf). PDF documents should be conformant to PDF/A-1 or PDF/A-2 [6]. PDF/A-3 must not be used.
- Microsoft Word [7] (.doc, .docx)

Web formats:
- HTML (.htm, .html, .css) [8]
- Extensible Markup Language (.xml) [9]
- Web ARChive format (.warc) [10]

Spreadsheet formats:
- Comma separated values (.csv) [11]
- Microsoft Excel [12] (.xls, .xlsx)

Presentation formats:
- Microsoft PowerPoint [13] (.ppt, .pptx)
- Portable Document Format (.pdf). PDF documents should be conformant to PDF/A-1 or PDF/A-2 [14]. PDF/A-3 must not be used.

Image formats:
- JPEG (.jpg)
- JPEG2000 [15] (.jpg)
- Tagged image file format (.tif, .tiff)

[4] PROV has tried to limit the number of alternative equivalent formats, particularly where the alternatives have low use (e.g. non inclusion of RTF, as few documents are in this format and they can easily be migrated with no loss of functionality to Word). Some alternative formats, however, are already in the current version of the standard (e.g. JPEG2000).
[5] Experience with PROS 99/007 has shown that it is extremely difficult for content creators to know what format their software products are creating, or to configure a product to create a particular profile of a format. Further, PROV does not currently have the capability of verifying format versions. It is for these reasons that we do not require specific versions of the listed formats. However, we encourage creators to use the current version of the products, and recommend particular format versions. These recommendations may become requirements over time.
[6] It is recommended that PDF documents be conformant to either PDF/A-1 (ISO 19005-1) or PDF/A-2 (ISO 19005-2).
[7] The Microsoft Word format has been chosen over the OpenOffice Writer format because PROV believes a key characteristic of longevity is economic adoption, not status as an open standard. Selection of the Word format also avoids the cost of migrating the single most common record type within government. While we have selected the Word format, this does not mean that agencies must adopt the Microsoft Office product. Any application that creates valid Word files (e.g. OpenOffice) can be used.
[8] It is recommended that HTML files conform to the HTML 4.01 standard (http://www.w3.org/TR/1999/REC-html401-19991224/) and CSS 2.1 (http://www.w3.org/TR/2011/REC-CSS2-20110607/).
[9] XML files will be readable for the indefinite future, but they may not be interpretable. This is because the meaning of the XML markup is defined in separate standards (e.g. SVG for vector graphics).
[10] Note that WARC is a container format that encapsulates web objects. Each web object (e.g. a web page) in the WARC file must be in one of the long term preservation formats specified in this specification.
[11] See http://tools.ietf.org/html/rfc4180 for a non-normative definition of a CSV file.
[12] The selection of the Microsoft Excel format over the OpenOffice Calc format is for the same reasons as preferring the Word format over the Writer format.
[13] The selection of the Microsoft PowerPoint format over the OpenOffice Impress format is for the same reasons as preferring the Word format over the Writer format.
[14] It is recommended that PDF documents be conformant to either PDF/A-1 (ISO 19005-1) or PDF/A-2 (ISO 19005-2).
[15] JPEG2000 is accepted as a long term preservation format in PROS 99/007 (Version 2.0). For this reason PROV will continue to accept it; however, at the moment (2014) it would fail the economic adoption test.
Audio formats:
- MPEG 1/2 Audio Layer 3 (.mp3)
- MPEG-4 (.mp4)
- WAVE (.wav) using an LPCM codec

Video formats:
- MPEG-4 (.mp4)

Email formats:
- MIME (.eml) [16]

2.4. Other formats

Agencies are encouraged to contact PROV if a common format they use is not in this list [17]. However, before extending this list of formats PROV will work with the agency to determine the likely economic life of the format.

End of Document

[16] The .eml format is used by many email clients, including Microsoft Outlook Express, Lotus Notes, Windows Mail, and Mozilla Thunderbird. Other straight representations of the MIME format can be used.
[17] In particular, we are interested in discussing with agencies appropriate long term preservation formats for specialized types of data (e.g. CAD, GIS).