An Overview of Content Archiving Services in Scholarly Publishing
By Daniela Bone and Peter Burns, Allen Press, Inc.
2011 Supplement no. 1

Digital versions of scholarly publications have become established as valued complements or even alternatives to their print counterparts. Many researchers enjoy the easy search options, paperless format, and almost instant online availability. However, a problem many librarians now face is how to preserve their long-term access to these digital publications. After all, digital content can't simply be kept on shelves; in most cases, once a subscription expires, so too does access to the journal website and all archives. And what happens to the content if the publication or publisher ceases to exist?

Several initiatives have been undertaken to address this problem and offer possible solutions. This article presents several of them. While the problem of digital archiving is still far from resolved, these initiatives are a significant step toward preserving libraries' long-term online access to digital publications.

Light Archives vs. Dark Archives

Content in light archives becomes available to authorized users when the original source is temporarily unavailable. Authorized users are libraries that have, or used to have, a subscription to the preserved content.

Content in dark archives becomes available to authorized users only when a trigger event has occurred that makes other sources, such as the original journal or publisher website, inaccessible. Examples of possible trigger events are discontinuation of a publication or publisher, technical failure, or natural disaster.

Specialized Initiatives

Naturally, it is in the interest of each university library or other institution to preserve its own literature into the future. As a result, several smaller initiatives have begun developing worldwide. Some are restricted to preserving the research of individual universities, others that of geographic regions or specific fields of research. A few examples are listed below; the list is not complete.

National Libraries

Several national libraries are establishing their own systems for long-term preservation of scholarship in their countries. One example is e-depot, created by the National Library of the Netherlands, also known as the Koninklijke Bibliotheek, or the KB. The KB's e-depot is a long-term digital repository funded by the Dutch government. According to the KB website (www.kb.nl/index-en.html), access is controlled by agreements with individual publishers. "In and of itself the KB's e-depot is neither a dark archive nor a light archive," the website says. "The extent to which access is given to the information in the repository is determined by each individual archiving agreement. In these agreements, the KB aims to strike a balance between the justified commercial interests of publishers and its own intrinsic mission to serve its users." [1]

Material in the archive is usually not made available to the general public, but is restricted to on-site use for research purposes. The KB makes exceptions for content that was open access to begin with and for material that is no longer available from another source. Such material is released by trigger events very similar to those used by Portico, a dark archive. [2] As of November 2010, one title had been released because of a trigger event.

Subject-Specific Initiatives

PubMed Central (www.ncbi.nlm.nih.gov/pmc/)

PubMed Central (PMC), an initiative by the National Library of Medicine, focuses on preserving scientific literature from the biomedical and life sciences.
Content deposited in PMC is openly accessible to anyone and is preserved for the long term despite changes in technology. Per its website, PubMed Central aims to fill the role of "a world class library in the digital age." Publications that meet the editorial and technological requirements can elect to supply all or part of their content. Some publications archive only older volumes, with a gap between the newest archived volume and the current volume, often referred to as an embargo. [3]

Biodiversity Heritage Library (www.biodiversitylibrary.org)

The Biodiversity Heritage Library (BHL) digitizes and preserves literature on biodiversity. It is a project by 12 natural history libraries in the United States and the United Kingdom that has expanded to include other collaborations worldwide. Content preserved in BHL is openly available to the scientific community. Publishers wishing to suggest one of their titles for inclusion should contact BHL. [4]

Global Initiatives

Some online archive initiatives extend across subject areas and geographic regions and collaborate with a variety of academic institutions and publishers. The best-known at present are LOCKSS (Lots of Copies Keep Stuff Safe), CLOCKSS (Controlled LOCKSS), and Portico.

LOCKSS (www.lockss.org)

The LOCKSS Program is an initiative by Stanford University Libraries founded in 1998 with support from the National Science Foundation, the Mellon Foundation, the Soros Foundation, and the Library of Congress. It became widely accessible in 2004. The initiative enables libraries to build long-term collections of digital publications comparable to their print collections. It is a light archive; that is, it delivers content to readers only when the publisher's website does not deliver it.

The system is based on the traditional print library model, in which libraries worldwide hold numerous copies of a publication. In that model, if one library's copy is lost, damaged, or destroyed, another location can lend a copy.
Similarly, LOCKSS operates as a decentralized peer-to-peer network in which each participating library stores its own collection locally. At the same time, the more than 200 libraries and other institutions participating in LOCKSS constantly monitor each other's preserved titles digitally and correct or repair content if needed.

A library using the LOCKSS software must meet two conditions for archiving a publication's content:

1. It must have an active subscription to the publication when it begins archiving, or the publication must be openly available. When a subscription lapses, only the content that was accessible during the subscription period will be accessible from that library's LOCKSS Box.
2. LOCKSS must have permission from the content's publisher, through a licensing agreement, to preserve the content in LOCKSS.

"The purpose of LOCKSS is to enable both libraries and publishers to maintain their established roles in digital publishing," says Vicky Reich, director of the LOCKSS Program. This prevents having to outsource content preservation to third parties. To date, more than 450 publishers have given permission to preserve more than 6,600 e-journal titles in LOCKSS. Participation is free for publishers. [5]

To build their collection locally, libraries install open-source LOCKSS software on a local computer, converting it into a so-called LOCKSS Box. Library staff carry out basic administration of these computers, while Stanford University Libraries provides the technical support. Each participating library can access and preserve only the content for which it is authorized; the network's other LOCKSS Boxes simply ensure that it is correct and complete. Libraries are welcome to bring a test LOCKSS Box online at no cost. [6]

Formats

One challenge in digital preservation is the frequent change of data formats. A file saved in a common format today may be impossible to open in 10 years. One solution to this problem is a method called Migration on Access, which is used by LOCKSS: content is stored in its original data format but automatically converted into a current file format upon access. Thus, content continues to be available for reading even if its original file format has become obsolete.
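The Migration on Access idea can be illustrated with a minimal Python sketch. Everything named here (the `ArchivedItem` record, the `CONVERTERS` table, the format labels, and the toy converter) is a hypothetical illustration and not part of the LOCKSS software. It only shows the principle: the stored bytes are never altered, and a converted copy is produced at the moment a reader requests the item.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class ArchivedItem:
    """A preserved file, kept exactly as it was originally published."""
    data: bytes
    fmt: str  # hypothetical format label, e.g. "latin1-text"


def latin1_to_utf8(data: bytes) -> bytes:
    """Toy converter standing in for a real format migration."""
    return data.decode("latin-1").encode("utf-8")


# Hypothetical registry: (stored format, requested format) -> converter.
CONVERTERS: Dict[Tuple[str, str], Callable[[bytes], bytes]] = {
    ("latin1-text", "utf8-text"): latin1_to_utf8,
}


def serve(item: ArchivedItem, current_fmt: str) -> bytes:
    """Return the item in the format readers can open today.

    The archived bytes are left untouched; conversion happens per request.
    """
    if item.fmt == current_fmt:
        return item.data
    converter = CONVERTERS.get((item.fmt, current_fmt))
    if converter is None:
        raise LookupError(f"no converter from {item.fmt} to {current_fmt}")
    return converter(item.data)


stored = ArchivedItem(data="café".encode("latin-1"), fmt="latin1-text")
served = serve(stored, "utf8-text")
```

Because the original is kept as published, a better converter written years later can simply be added to the registry, and no archived file has to be rewritten.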
LOCKSS Alliance

Libraries are encouraged to join the LOCKSS Alliance, which provides premium benefits: LOCKSS Alliance members receive technical support from the staff at Stanford University Libraries and have access to content outside their own collection that is not available to libraries running test LOCKSS Boxes. Fees for joining the LOCKSS Alliance support technical development and maintenance, such as that of LOCKSS Boxes. Pricing is based on the Carnegie classifications, considers the type of academic institution and program size, and ranges from $1,080 to $10,800 annually.

If a publication is no longer available online to a particular library, the library's LOCKSS Box becomes a source of content for that library's community. Library users will continue to find and search publications through search engines. When an article is accessed, it will appear as on its original website, with minor differences: all dynamic elements of the original website, such as advertising, will remain static instead.

CLOCKSS (Controlled LOCKSS)

The CLOCKSS Archive is a global not-for-profit joint venture between libraries and scholarly publishers. As a dark archive, CLOCKSS releases content only following specific trigger events:

The publisher is no longer in business.
The title is no longer offered.
Back issues of the title are no longer available.
General catastrophic failure.

Both LOCKSS and CLOCKSS use LOCKSS software to preserve content, but the two archives are structured differently. While the LOCKSS Program enables libraries to build and preserve local collections, the CLOCKSS Archive is a closed system in which the content is preserved at a limited number of libraries. It preserves a variety of publications from participating publishers and releases them open access to the general scholarly community following trigger events.

Archiving process

Both publishers and libraries can become participants in CLOCKSS. To participate, publishers must grant permission to preserve their content in CLOCKSS and make it available in the unlikely case of a trigger event. Libraries can participate by providing financial support to maintain and improve CLOCKSS over time. They can submit their own collections for preservation as long as the publisher permits it, but unlike LOCKSS participants, they do not store the content locally. The long-term goal is to make preserved publications that no longer have commercial value freely available to the general public and prevent them from being lost.

The content is preserved in 12 CLOCKSS libraries (soon to be 15), which are located in different world regions to minimize the risk from external threats. Per the CLOCKSS website, the preservation process is as follows:

1. To transfer content from publishers' websites to the CLOCKSS Archive, publishers either allow CLOCKSS to crawl their web-published content or deposit the content via an FTP site. Using the LOCKSS technology, the CLOCKSS Archive preserves all file formats.
2. CLOCKSS ingest boxes at Rice, Indiana, and Stanford Universities receive the content from the publishers. The ingest boxes then verify each other's versions of the content and repair any damaged or missing data until all boxes contain a validated version.
3. CLOCKSS preservation boxes at the 12 CLOCKSS libraries collect the validated content from the CLOCKSS ingest boxes. The preservation boxes also carry out continuous audit and repair processes. If data are missing or incomplete in any of the boxes, they are replaced or repaired using data from the other boxes. [7]

When a publication is discontinued or another trigger event occurs, the CLOCKSS Board of Directors votes on whether to release that content from the dark archive. If it is released, the content is copied from the archive, migrated to a current data format if required, and hosted from a web server. Currently, these web servers exist at two CLOCKSS host organizations, the EDINA Data Center at the University of Edinburgh and Stanford University. The content is then openly and freely accessible to the scientific community.

The format of the released content is as similar to the publisher's original files as possible. The page structure from the original website is largely preserved; however, dynamic page elements such as search functions, links, and advertisements no longer work. To date, CLOCKSS has triggered and released three titles.

Portico (www.portico.org)

Portico is a dark archive, in that the content stays in reserve until an unusual event triggers its release to participating institutions. The four types of trigger events, according to Portico's website, include:

1. Publisher ceases operation, and titles are no longer available from any other source.
2. Publisher ceases to publish a title, and it is not available from another publisher or another entity.
3. Back issues are removed from a publisher's offering and are not available elsewhere.
4. Catastrophic failure, such as the failure of a business or a natural disaster, prevents content from being available for a sustained period of time. [8]

If the disruption to availability is temporary, Portico can release the content until the situation is resolved, but Toni Tracy, director of publisher relations for Portico (interview, October 21, 2010), stresses that the service is not intended to be used as an immediate backup solution or a redundant component of a publisher's own storage system. Instead, Portico is an independent third-party repository that provides access only when the status quo simply doesn't exist anymore.
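The difference between the release rule of a dark archive such as Portico and that of a light archive can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not any archive's actual policy engine: the names `ArchiveKind`, `TriggerEvent`, and `archive_serves_content` are hypothetical, the four trigger types paraphrase the list above, and the sketch deliberately ignores governance steps (such as a board vote) and Portico's option to release content during a temporary disruption.

```python
from enum import Enum, auto
from typing import Optional


class ArchiveKind(Enum):
    LIGHT = auto()  # serves content whenever the original source is unavailable
    DARK = auto()   # serves content only after a trigger event


class TriggerEvent(Enum):
    PUBLISHER_CEASED_OPERATION = auto()
    TITLE_DISCONTINUED = auto()
    BACK_ISSUES_REMOVED = auto()
    CATASTROPHIC_FAILURE = auto()


def archive_serves_content(
    kind: ArchiveKind,
    source_available: bool,
    trigger: Optional[TriggerEvent] = None,
    authorized: bool = True,
) -> bool:
    """Decide whether the archive, rather than the publisher, delivers content.

    Assumption: only authorized (participating) libraries are ever served.
    """
    if not authorized:
        return False
    if kind is ArchiveKind.LIGHT:
        # A light archive steps in as soon as the publisher site does not deliver.
        return not source_available
    # A dark archive stays closed until a trigger event has been declared.
    return trigger is not None


light_outage = archive_serves_content(ArchiveKind.LIGHT, source_available=False)
dark_outage = archive_serves_content(ArchiveKind.DARK, source_available=False)
dark_triggered = archive_serves_content(
    ArchiveKind.DARK,
    source_available=False,
    trigger=TriggerEvent.TITLE_DISCONTINUED,
)
```

In this sketch a brief publisher outage is enough for the light archive to serve content, while the dark archive stays closed until one of the trigger events is recorded.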
Trigger events are not just hypothetical: Portico has faced real-world tests several times. Of more than 12,000 journal titles in the Portico archive, four are now available following various trigger events that prompted their release. Because Portico's content is released only to libraries that participate in the service, these titles are not freely available to everyone. But at the end of 2010, more than 700 libraries in over 15 countries were participating in the service, making released titles available to thousands of library patrons worldwide who otherwise would not have access to these journals. Portico also makes content available to participating institutions as the result of post-cancellation access claims if publishers of the canceled titles have opted in to this element of the Portico service.

Libraries pay an annual participation fee to Portico based on their total materials expenditures as defined by the Association of Research Libraries. For the e-journal service, the payment schedule consists of several tiers, ranging from the smallest libraries, with annual expenditures below $150,000, to those spending tens of millions each year. For example, a library with annual materials expenditures between $15 million and $20 million would pay Portico $18,025, while the smallest library might pay as little as $1,500 per year, or 1% of its total library materials expenditure. In 2011, Portico plans to introduce fees that will give libraries the opportunity to participate in its new e-book preservation service, to which publishers have already committed more than 70,000 titles.

Publishers are also expected to contribute to Portico for the assurance that their content will be available into the future. The amount publishers pay is also tiered, based on total annual e-journal revenues or e-book revenues. The largest e-journal publishers contribute $77,250 per year, while the smallest e-journal publisher contributes only $250 per year. Full information about all of Portico's preservation services is available at Portico's website.

It's all in the metadata

The trick to making sure content will be available despite inevitable changes in technology can be summed up in one word: metadata. While Portico preserves the publisher's own metadata, it also goes a step further. "We create Portico-specific preservation metadata," Tracy says. "Our approach is to analyze, understand, and migrate the content up front to ensure that it's in good shape to be accessible over time. It is only after we have conversations with the publisher to understand the content that we migrate the content to an archival version of the NLM DTD [National Library of Medicine Document Type Definitions, the industry standard], and then ingest the content into the archive for long-term management and preservation."

The creation of normalized content specifically for preservation purposes ensures that the information will remain accessible over time, but the look and feel of the articles might be different from what we're used to seeing on the publisher's web version of the title. "We don't make any effort to emulate the publisher's website," Tracy says. "Our approach is to preserve the intellectual content." Since keeping up with web design requires significant investment, Portico prefers to invest primarily in preservation-focused activities.

Although technology is obviously a factor in keeping scholarly research available for future generations, it's not the biggest factor. "The biggest challenge from our viewpoint is getting the community to recognize that the responsibility for preservation has to be shared," Tracy says. "In a print world, preservation was primarily the responsibility of the library. In a digital environment, the responsibility must be shared between the publisher and the library."

Whither print?

This is not to say that print shouldn't be part of a strategic approach to preservation. Ithaka S+R, another service that is part of the not-for-profit organization ITHAKA (which also includes Portico and JSTOR), advises libraries to retain some print copies under certain circumstances. Ithaka S+R has published its recommendations for retaining print copies in the paper "What to Withdraw? Print Collections Management in the Wake of Digitization." [9] The paper draws on the experience ITHAKA has gained from digitizing print collections through JSTOR, which was founded to help libraries manage such transitions. [10]

In this report, Ithaka S+R pairs analysis of the continuing role of print with a quantitative operations research model to offer the library community a framework to guide strategic planning in collections management. According to Ithaka S+R, preservation is the primary remaining role of the print original for journals that are principally accessed in digital form.
This study provides librarians with the following rationales for retaining some copies of these scholarly journals in print format:

Very rare or unique print copies that represent historic or aesthetic value. Such value exists beyond the information contained within the pages and extends to the journal as an object worthy of preservation.

Backup copies to fix scanning errors. Ithaka S+R notes that even scanning processes that focus on high quality can make mistakes that may not be discovered until well after scanning takes place. For example, according to Ithaka S+R's report, half of the error reports JSTOR has received have come within two years of online availability, while 92% come within five years. [9]

Initial scanning standards are inadequate or technology improves. Scanning has evolved over the years and now includes the ability to digitize color images, readable black-and-white text, and grayscale photos. In some cases, these different elements exist on the same page of the original, so the best quality results from the same page being scanned multiple times with the various elements stitched together. As scanning quality continues to improve and the price drops, better results may be attained by rescanning the original.

Loss of digitized material requiring the print originals to be scanned again. Major preservation initiatives such as Portico are unlikely to lose content, but not all content is stored in such a thorough system. Insufficiently preserved digitized materials are far more fragile and subject to loss, according to the Ithaka S+R paper. In addition, the digital preservation processes with which digital resources are managed may not always be clear, leaving the library community unsure how much trust to place in a given digital collection. [9]

Reliability of access, including such terms of access as licensing and pricing.

Scholarly needs, particularly regarding high-quality images.

Local faculty needs and campus politics.
University libraries should take into account the needs of their institution's faculty, who may be reluctant to dispose of print entirely. However, Ithaka S+R expects faculty interest in print preservation to decline.

Digital information is obviously not going away anytime soon. Many publishers and librarians, however, are understandably concerned that it could go away eventually, leaving future historians with a large gap in knowledge about our era. Perhaps there is no way to be certain that current research will still be around in 10, 100, or 1,000 years, but the surest way to guarantee that it will not last would be to do nothing. Preserving digital information in multiple formats and various physical locations, along with judicious use of print archives, gives today's research the best possible chance of still being here tomorrow.

Accessible Archive

JSTOR (www.jstor.org)

In the mid-1990s, university libraries were pressed for space on their shelves while the amount of published scholarship continued to grow. Out of this conundrum grew JSTOR, a non-profit charged with the task of converting print journals to digital format. This relieved the libraries' space crisis, and the resulting digital database helped usher in a new way of providing information to researchers. As the JSTOR website states: "There would also be other benefits: material would never be lost or checked out, small institutions could have access to large collections, and ultimately, trust in digital preservation could help to bring about acceptance of electronic publication." [10]

Unlike Portico, JSTOR makes its collection available immediately to subscribers. Publishers whose journals are included in the collection receive royalty payments based on usage, but not every journal is accepted into the collection. Each journal submitted for consideration will be evaluated on the following criteria, according to the JSTOR website:

Historical significance of the title
Recommendations from scholars and librarians
Citation analysis
Number of institutional subscribers around the world
Relevance to a scholarly audience [11]

As with other aggregators that sell subscriptions or make content freely available, publishers often delay the release of their most recent content in order to prevent cannibalization of their direct institutional subscriptions.

Another feature of JSTOR is its ability to provide individual access to content for publishers and society members.
Users may log in to the members-only area on the society website and then pass effortlessly to the archived content in JSTOR through a link on the page.

Although originally conceived as an archive of backfile journal issues, JSTOR is evolving to include currently published content as part of its Current Scholarship Program. JSTOR currently works with more than 800 publishers who participate in either the back issues program or the newly launched Current Scholarship offering.

Conclusions for Scholarly Publishers

Despite all efforts, no guaranteed method exists yet to preserve digital content for centuries to come. Changes in technology and infrastructure can be unpredictable. Still, societies wishing to preserve their content will most likely benefit from participating in the existing archiving initiatives. At the very least, their content will remain available to the scholarly community longer than it would otherwise.

Current online archives are careful not to interfere with a publication's current online presence. Dark archives become active only when the publication's original website has become permanently unavailable, so there is no competition. Light archives provide preserved content to a limited audience and usually only when the original publication website is temporarily unavailable.

Many online aggregators (www.bioone.org, www.jstor.org) already collaborate with online archives and transfer content included in their collections to Portico or LOCKSS. Societies participating in online aggregations can contact them to find out the options for archiving their content.

Societies should also keep in mind that printed editions of publications have not become obsolete. Many readers want to retain paper because it is their preferred reading format and is easily accessible, or simply because it is an easy way to build a collection that is subject to different risks than a digital library.

References

1. Koninklijke Bibliotheek. Information for International Publishers: The KB's e-depot: A Trustworthy Steward for the Digital Scholarly Record. http://www.kb.nl/dnp/e-depot/operational/suppliers/national_suppliers-en.html. Accessed January 5, 2011.
2. Portico. Preservation Approach. http://www.portico.org/digital-preservation/services/preservation-approach/. Accessed January 5, 2011.
3. PubMed Central. PMC Overview. http://www.ncbi.nlm.nih.gov/pmc/about/intro.html. Accessed January 7, 2011.
4. Biodiversity Heritage Library. Who Are We? http://biodivlib.wikispaces.com/about. Accessed January 7, 2011.
5. LOCKSS. Publishers and Titles. http://www.lockss.org/lockss/publishers_and_titles. Accessed January 7, 2011.
6. LOCKSS. Configuring a LOCKSS Box. http://www.youtube.com/watch?v=0wdcnxrqkai. Accessed January 7, 2011.
7. CLOCKSS. How CLOCKSS Works. http://www.clockss.org/clockss/how_clockss_works. Accessed January 7, 2011.
8. Portico. Access to Archived Content. http://www.portico.org/digital-preservation/the-archivecontent-access/access-to-archived-content/. Accessed January 5, 2011.
9. Schonfeld, R. C., and Housewright, R. September 29, 2009. What to Withdraw? Print Collections Management in the Wake of Digitization. Ithaka S+R. Accessed January 5, 2011.
10. JSTOR. Our History. http://about.jstor.org/about-us/our-history. Accessed January 5, 2011.
11. JSTOR. Publishers & Content Providers. http://about.jstor.org/participate-jstor/publishers. Accessed January 5, 2011.