WEB ARCHIVING AT SCALE
(updated 12/19/14)
by Rosalie Lack, Stephen Abrams, Trisha Cruse

THE VALUE OF WEB CONTENT

Web content holds a critical place in modern library collections. Web archiving is an essential component of a twenty-first century research library. As more and more of our communications and cultural interactions are played out on the web, one cannot imagine researchers of today and of the future studying our time without web content among the primary sources they analyze. Our responsibility is to select, curate, and ensure access to this valuable content, as well as to provide the necessary tools to support researchers in their work.

FUTURE PATH OVERVIEW

Similar to HathiTrust and other at-scale initiatives, web archiving is most effectively managed through collaborations beyond our individual organizations. No one institution can do it all. It is simply not economically or technically sustainable to operate standalone web archiving programs, nor is it efficient to build a new shared infrastructure when one already exists that can be leveraged.

This document provides an overview of the factors that led to CDL's decision to engage in a partnership with the Internet Archive's (IA) Archive-It (AIT) service, in conjunction with the web archiving programs at Harvard and Stanford. In order to realize long-term efficiencies through the use of a vendor solution for core web archiving infrastructure, and to take advantage of AIT's enhanced crawler capacity and curator tools, CDL will work with all current Web Archiving Service (WAS) customers to transfer their existing collections and future collecting activities to AIT. CDL, in partnership with Harvard and Stanford, will also collaborate closely with AIT to create an expanded roster of added-value tools and services that interoperate with AIT. Over time, our collective goal is to increase the number of partnering institutions by lowering the technical barriers to participation (MIT and UCLA have already expressed interest in joining such a collaboration).

BACKGROUND

The core web archiving infrastructure comprises three open source tools: Heritrix (crawling), NutchWax (full-text indexing and search), and OpenWayback (discovery and playback). Heritrix and OpenWayback were originally developed by the Internet Archive.
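To make the division of labor among these tools concrete: the crawler (Heritrix) writes captured content to standard WARC container files, and the indexing layer (NutchWax in WAS; Solr in AIT's public interface, per Appendix B) extracts text from those records for search. The Python sketch below shows the general shape of that indexing step using the warcio library and a Solr core; the Solr URL, core name, and field names are illustrative assumptions, not part of either service's actual configuration.

```python
# Illustrative sketch: walk a WARC file produced by a crawler such as Heritrix
# and push plain text into a Solr core for full-text search.
# Assumptions (not from the source document): a local Solr core named
# "webarchive" with fields url, crawl_date, and content; the warcio and
# pysolr libraries are installed.
import re

import pysolr
from warcio.archiveiterator import ArchiveIterator

TAG_RE = re.compile(rb"<[^>]+>")  # crude tag stripper; real pipelines use proper text extraction

def index_warc(warc_path: str, solr_url: str = "http://localhost:8983/solr/webarchive") -> int:
    solr = pysolr.Solr(solr_url, always_commit=True, timeout=30)
    docs, count = [], 0
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            # 'response' records hold the payload returned by the crawled server
            if record.rec_type != "response":
                continue
            ctype = (record.http_headers.get_header("Content-Type") or "") if record.http_headers else ""
            if "html" not in ctype:
                continue
            payload = record.content_stream().read()
            text = TAG_RE.sub(b" ", payload).decode("utf-8", errors="replace")
            docs.append({
                "id": record.rec_headers.get_header("WARC-Record-ID"),
                "url": record.rec_headers.get_header("WARC-Target-URI"),
                "crawl_date": record.rec_headers.get_header("WARC-Date"),
                "content": " ".join(text.split()),
            })
            count += 1
            if len(docs) >= 100:   # batch the updates to keep requests small
                solr.add(docs)
                docs = []
    if docs:
        solr.add(docs)
    return count

if __name__ == "__main__":
    print(index_warc("example-crawl.warc.gz"), "HTML captures indexed")
```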

The nature and complexity of the web is changing rapidly, adding considerable complexity to web archiving. The mostly HTML web of a few years ago now consists more and more of multimedia, social media, JavaScript, and other technologies that significantly challenge the capture and playback capabilities of the Heritrix crawler and OpenWayback. In web archiving circles, this has resulted in an ever-increasing need to develop new tools to keep up with the changing technology. Some have likened this situation to an arms race in which web archiving providers struggle to upgrade their tools to stay ahead of the changing web.

Added to this complex landscape is the fact that WAS is at a fork in the road. WAS was developed with National Digital Information Infrastructure and Preservation Program (NDIIPP) funds and launched in 2007. For many years, WAS was a cutting-edge tool and a model for other web archiving services. We now face critical upgrades that must be built for WAS to remain relevant, including significant enhancements needed to keep up with the web archiving arms race. If we continue to run WAS, we also face an internal CDL deadline to complete a migration to a new environment by June 2015. Given the upgrade needs and the migration, it is unlikely that we could begin to address significant WAS enhancements before Q4 2015.

To help us decide whether to continue committing significant resources to WAS or to collaborate on an alternate solution, we consulted with peer institutions about how we might collaborate on building a shared, core web archiving infrastructure. Through our discussions, including an in-person meeting organized by CDL with other institutions running web archiving services (Harvard, Stanford, Columbia, the Library of Congress, the University of North Texas, and George Washington University), we came to a consensus that engaging in this arms race (i.e., running core web archiving infrastructure ourselves) is not the best use of our limited resources. We concluded that using a vendor solution for the core web archiving infrastructure not only gives us a more efficient solution (eliminating duplication of effort and resources), but also allows for better co-location of our collections, frees up resources to focus on local tools and services that meet researchers' needs, and allows us to tackle integration of our collections with other format types.

In the end, we determined that AIT offers a more robust and lower-cost solution, and that AIT has proven its commitment to keeping up with web archiving technology. AIT recently developed an alternative crawling tool (Umbra) and can now, in fact, capture more of the web than WAS. AIT has also agreed to work with us on creating the APIs needed to facilitate the development of new added-value web archiving services, such as web archiving tools for researchers and computational analysis of aggregated archive corpora.

KEY BENEFITS

1. AIT immediately provides better functionality. Its crawlers capture and display more content, particularly social media, than is currently possible in WAS (see Appendix B - Feature Comparison).
2. AIT has a major 5.0 release of its curator tools scheduled for spring 2015, which will bring significantly enhanced functionality and usability. Learn more: https://archive-it.org/learn-more/archiveit-5-0-release-schedule
3. AIT has recently added new staff to help ensure timely enhancements to its web archiving tools.
4. Web archive collections will not be siloed from each other.
5. CDL staff will be available to collaborate with Harvard, Stanford, UCLA, and possibly others to create novel services to support researcher needs and collection integration.
6. CDL staff will avoid a significant amount of effort planning and implementing technical upgrades and infrastructure migration.

KEY CHANGES FOR CAMPUSES

1. The cost will initially be higher for all UCs. This is partly due to the one-time transfer fee, but also because WAS prices for UC have been subsidized significantly below market rates. Note that the UC AIT costs are estimates based on prior web archiving activity; each institution will work with AIT to determine its web archiving rate (i.e., AIT subscription rate) going forward. Note also that AIT charges a higher one-time fee per unit of content, whereas WAS charges a lower annual fee per unit of content: AIT will be more expensive at first but less expensive over time. At the end of ten years, more than half of the UC campuses will see lower annual expenditures on AIT than they would have had with comparable activity in WAS.

2. UC web curators will have to learn a new interface. AIT curator tools offer more choices for crawl scope and frequency, in addition to data visualization for reports and capture comparison. AIT has a robust client services team to help with the transition and beyond. CDL will work with AIT staff to implement an automatic migration of all existing WAS crawl specifications into their AIT equivalents, so that when UC curators first use the AIT curatorial interface it will be prepopulated with their existing specifications.

CURRENT WAS PRICING FOR UC

UC-affiliated institutions currently pay the cumulative WAS storage costs for their web archived sites. They do not pay the standard $10,000 annual service fee for WAS that we charge non-UC institutions. Some UCs participated in the original Web-at-Risk grant from 2007 to 2010; these institutions are not charged for any websites captured during that period. Billing for UC institutions starts on January 1, 2010; all campuses are billed in January for the previous calendar year.

PROPOSED AIT PRICING FOR UC

AIT's pricing model is based on an agreed-upon pre-allocation of a certain amount of storage for the upcoming year (e.g., 0.5 terabyte); partners must then stay within the agreed-upon amount to avoid additional charges. NB: partners are not charged for cumulative storage; in any given billing year they are charged only for the new storage capacity used in that year. AIT will be more expensive for most UC campuses in the near term (the next one to five years). However, it will become much less expensive over the long term, since under AIT you pay only once for any given unit of web archived content, whereas under WAS pricing you pay annually for the cumulative amount of web archived content.
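The long-run difference between the two models follows directly from this structure: WAS bills every year on the total accumulated archive, while AIT bills once for each new unit of content. The short Python sketch below makes the crossover concrete; the per-terabyte rates and the annual growth figure are hypothetical placeholders, not actual WAS or AIT prices.

```python
# Illustrative comparison of cumulative-storage billing (WAS-style) versus
# one-time per-unit billing (AIT-style). All dollar rates and the growth
# figure are hypothetical placeholders, not actual service prices.

NEW_DATA_PER_YEAR_TB = 0.5      # hypothetical: a campus archives 0.5 TB/year
WAS_RATE_PER_TB_YEAR = 300.0    # hypothetical: WAS-style annual charge per TB held
AIT_RATE_PER_TB_ONCE = 2000.0   # hypothetical: AIT-style one-time charge per new TB

holdings_tb = 0.0
for year in range(1, 11):
    holdings_tb += NEW_DATA_PER_YEAR_TB
    was_annual = holdings_tb * WAS_RATE_PER_TB_YEAR           # pay on everything held so far
    ait_annual = NEW_DATA_PER_YEAR_TB * AIT_RATE_PER_TB_ONCE  # pay once, on new data only
    marker = "<- AIT cheaper this year" if ait_annual < was_annual else ""
    print(f"year {year:2d}: WAS ${was_annual:7.0f}/yr   AIT ${ait_annual:7.0f}/yr  {marker}")
# The WAS charge grows with total holdings while the AIT charge stays flat,
# so AIT's annual cost eventually falls below WAS's despite being higher at first.
```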

BASELINE PRICING

As part of our transition agreement, AIT will give all UCs an ongoing 20% consortial discount (assuming we have six or more campuses on board). In addition, all UCs will receive a 50% discount in Year 1 and a 25% discount in Year 2.

A few UC organizations (e.g., IRLE, the Museum of Vertebrate Zoology) archive web content at a very low rate (less than 10 GB/year). For these organizations, which we will clearly identify moving forward, AIT will offer its Friends of Archive-It (FOA) rate of no charge for crawls, as long as their annual crawls remain this small.

There will also be a one-time data transfer fee for all current WAS web archived sites. This one-time fee includes ongoing access and storage. It is prorated at $1,000/TB, with a flat rate of $500 for FOA organizations.

VARIABLE FEES

Some of these variables are under the control of the libraries and may result in reduced costs. A rough worked example of how the baseline and variable terms combine follows this list.

- Consortial pricing. AIT offers a 10% discount for a consortium of 3-5 partners and a 20% discount for a consortium of 6-11. AIT requires a single point of billing and a common start date.
- AIT subscription. Each campus works directly with AIT to determine how much content it will capture, which in turn defines its AIT subscription rate.
- Current AIT subscriptions. UCLA and UCSF are currently subscribed to AIT. A discussion of how best to integrate these accounts will be needed.
- Data transfer. Each curator is encouraged to review their collections and determine what, if anything, can be weeded before data is transferred. (CDL staff are available to work with each campus.)
- De-duplication savings. AIT offers de-duplication (WAS does not). De-duplication can result in storing 20-45% less content. The savings realized vary depending on crawling activity: for example, if a curator repeatedly captures a site that changes minimally, de-duplication can yield significant savings, since only unique content is stored.
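As a rough illustration of how the fees above combine for one campus, the Python sketch below applies the stated transfer-fee rate and discount schedule. The base subscription price and archive size are made-up figures, and it is an assumption (not stated in this document) that the consortial and Year 1/Year 2 discounts stack multiplicatively and apply to the annual subscription only.

```python
# Rough worked example of the baseline pricing terms for one hypothetical campus.
# The base subscription and archive size are made-up; only the rates marked
# "from the document" come from the text above. Assumption: discounts stack
# multiplicatively and apply to the annual subscription, not the transfer fee.

BASE_SUBSCRIPTION = 8000.0    # hypothetical annual AIT subscription, before discounts
EXISTING_WAS_TB = 1.2         # hypothetical size of the campus's current WAS archive
TRANSFER_FEE_PER_TB = 1000.0  # from the document: prorated at $1,000/TB
CONSORTIAL_DISCOUNT = 0.20    # from the document: 20% with six or more campuses
YEAR_DISCOUNTS = {1: 0.50, 2: 0.25}  # from the document: 50% in Year 1, 25% in Year 2

transfer_fee = EXISTING_WAS_TB * TRANSFER_FEE_PER_TB  # one-time, includes ongoing access/storage
print(f"one-time transfer fee: ${transfer_fee:,.0f}")

for year in (1, 2, 3):
    price = BASE_SUBSCRIPTION * (1 - CONSORTIAL_DISCOUNT) * (1 - YEAR_DISCOUNTS.get(year, 0.0))
    print(f"year {year} subscription: ${price:,.0f}")
```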

SUMMARY OF FINANCIAL IMPACT

NEW COSTS

1. A one-time data transfer fee for all web archived sites. The one-time fee includes ongoing access and storage and is prorated at $1,000/TB, with a minimum rate of $500 for FOA organizations.
2. An annual subscription based on the size of web archived sites.

NEW SAVINGS

1. CDL will absorb the costs of January-December 2014 WAS storage.
2. Assuming consistent web archiving volume, UCs will pay less over time, since WAS charges for cumulative storage and AIT does not.

NEW SOFT SAVINGS

1. Soft savings come from the ability to reallocate staff to other added-value web archiving work, such as Heritrix/Wayback enhancement, computational analysis, visualization, and researcher tools.

TIMELINE

CDL began initial conversations with Kristine Hanna, Archive-It Manager, in October 2014. We are working toward finalizing the pricing and terms of agreement by the end of December 2014. All WAS customers will be notified in January 2015, or in December 2014 if the terms are finalized sooner. (The WAS terms of agreement stipulate 60 days' notice before termination of service.) Partners interested in moving to AIT can begin that process in January 2015; we have already conducted a successful transfer of data from WAS to AIT. WAS will cease capturing new websites in May 2015. We will continue to provide public access to the archived websites at http://was.cdlib.org until December 2015, and possibly longer if needed.

NEXT STEPS

Content/Account Transfer

- Share this summary document with CoUL for discussion
- Determine when and how to share this document with SAG3
- Form a Web Archiving CKG
- CDL works with AIT to draft an agreement
- CDL defines campus billing arrangements
- CDL staff contact each UC campus curator
- Campus staff review and weed collections
- AIT schedules training with campus staff
- AIT works with CDL technical staff to copy existing crawl specifications into the AIT curatorial interface and to transfer data (a sketch of this kind of mapping follows below)
- CDL develops accounting and payment mechanisms for the UC consortium

Ongoing Web Archiving Partnership

- CDL drafts a vision and MOU document
- Pursue funding opportunities to help clarify and define partnership roles and responsibilities and to determine priorities
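To give a sense of what the crawl-specification copy referenced above might involve, the sketch below maps a hypothetical WAS-style export to a flat seed list that a curator tool could import. Neither file format is defined in this document; every field name here is an illustrative placeholder rather than an actual WAS export or Archive-It import schema.

```python
# Purely illustrative: convert a hypothetical WAS-style crawl-spec export (JSON)
# into a flat CSV seed list. The field names "seed_url", "frequency", and "scope"
# are placeholders; they are not the real WAS or Archive-It schemas.
import csv
import json

def export_seed_list(was_spec_path: str, csv_path: str) -> None:
    with open(was_spec_path, encoding="utf-8") as f:
        spec = json.load(f)
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["seed_url", "frequency", "scope"])
        for site in spec.get("sites", []):
            writer.writerow([
                site.get("url", ""),
                site.get("capture_frequency", "quarterly"),  # placeholder default
                site.get("scope", "host"),                    # e.g. whole host vs. single page
            ])

# Example with a hypothetical input file:
# export_seed_list("was_collection_export.json", "seed_list.csv")
```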

APPENDIX A - DETAILS ABOUT WEB ARCHIVING CHALLENGES
(excerpt from a summary of the CDL in-person meeting)

Recurring themes among the challenges raised during the meeting included the resource-intensive nature of web archiving; keeping pace with the dynamic and ever-evolving web; how to better support research; and the siloed nature of web archive collections, both in terms of development and access.

RESEARCH USE

Currently, the traditional mechanism for providing access to web archives via OpenWayback facilitates browsing but provides limited alternative means of interacting with or manipulating web archive data. Though work is being done collectively in the field to articulate the lowest common denominator of data requirements for research, it is clear that we will have an ever-increasing need for large-scale data processing (e.g., for full-text indexing, longitudinal characterization, production of derivatives to feed APIs, and other types of computational analysis; see the sketch at the end of this appendix). This favors collocation of web archive data and processing power.

RESOURCE REQUIREMENTS

There is broad recognition of the resource-intensive nature of web archiving. No single institution (among the group in discussion with CDL) can afford to manage every step of the web archiving lifecycle as efficiently and effectively as each step is already managed by at least one organization in the group, and individually reaching parity with one another would likely be more costly than collaborating.

EVOLVING WEB

As the web continues to change, there must be ongoing investment in web archiving technologies to keep pace. The current AIT tools and techniques can handle web technologies such as AJAX, social media, and streaming video, but this entails custom engineering that is particularly difficult for smaller web archiving programs to sustain. Moreover, our existing collaboration frameworks have not been effective in disseminating technologies and best practices. Additional research into, and development of, new open-source tools is needed for archiving mobile applications and other non-browser-based modes of presenting web content. A more robust collaboration offers the prospect of at least raising participating organizations' capabilities to parity, with the opportunity to further advance state-of-the-art technology and practices for the field as a whole.

COLLECTION DEVELOPMENT AND DISCOVERY

U.S. web archiving organizations have generally pursued individualized approaches to collection development, which means missed opportunities to bring collective curatorial expertise to bear and to minimize resource expenditure on redundant archiving and curation. Affordances for users to discover and explore web archive collections, such as rich descriptive metadata, full-text search, faceting, and thumbnails, are inconsistent depending on the platform used to provide access, and those platforms are not federated with each other to allow a broad-based search of curated web archive content. Closer coordination on collection development and a unified discovery framework could deliver obvious user value.
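The "production of derivatives" mentioned under RESEARCH USE typically starts from the WARC files a crawler writes. As a small, hedged illustration of longitudinal characterization, the Python sketch below tallies captured content types by crawl year across a set of WARC files using the warcio library; the file paths and output layout are assumptions for the example, not part of any existing service.

```python
# Illustrative derivative generation: count captured content types per crawl
# year across a set of WARC files, the kind of longitudinal summary a
# researcher-facing API might serve. Paths and output layout are assumptions.
import csv
import glob
from collections import Counter

from warcio.archiveiterator import ArchiveIterator

def mime_counts_by_year(warc_glob: str) -> Counter:
    counts = Counter()
    for path in glob.glob(warc_glob):
        with open(path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                year = (record.rec_headers.get_header("WARC-Date") or "????")[:4]
                ctype = ""
                if record.http_headers:
                    ctype = (record.http_headers.get_header("Content-Type") or "").split(";")[0].strip()
                counts[(year, ctype or "unknown")] += 1
    return counts

def write_derivative(counts: Counter, out_path: str = "mime_by_year.csv") -> None:
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["year", "content_type", "captures"])
        for (year, ctype), n in sorted(counts.items()):
            writer.writerow([year, ctype, n])

if __name__ == "__main__":
    write_derivative(mime_counts_by_year("collection/*.warc.gz"))
```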

APPENDIX B - FEATURE COMPARISON

Feature | Archive-It | WAS
Solr index for public interface (includes faceting features) | yes (Solr for metadata only) | no
Impose access restrictions on captured content selectively | yes | yes
Ignore robots.txt | yes | yes
Set crawl frequency for individual seeds | yes | yes
View status before, during, and after crawls | yes | yes
Show when a crawl was redirected | yes | yes
View harvested files by file type | yes | yes
Multiple URLs for a single site (aka "include" function) | yes | yes
Single-page scope | yes | yes
Stop a running harvest | yes | yes
Compare harvests | no (coming in AIT 5.0) | yes
De-duplication | yes | no
Exclude function | yes | no
Delete harvests | no (coming in AIT 5.0) | yes
Byte volume per seed for a harvest | no | yes
File count per seed for a harvest | yes | yes
Harvest password-protected sites | yes | no
Transfer existing harvests | yes | yes
Harvests have their own URNs | yes | yes
Start crawl immediately | yes | yes
Capture Facebook, Twitter, YouTube | yes | no
Display YouTube in page | yes | no
Google Analytics | yes | no
Run patch crawls to ensure capture of all content | yes | no
QA module in proxy mode to review captures | yes | no
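The De-duplication row above, together with the 20-45% savings estimate in the pricing discussion, can be checked against a collection's own crawls: in WARC terms, de-duplication is typically recorded as "revisit" records that point back to an earlier identical capture instead of storing the payload again. The sketch below estimates that saving from a set of WARC files with warcio; it assumes the crawler wrote revisit records and treats record Content-Length as a rough proxy for stored bytes.

```python
# Rough estimate of de-duplication savings from a set of WARC files.
# Assumes the crawler recorded duplicate captures as WARC "revisit" records
# (the usual convention) and uses Content-Length as a proxy for stored bytes.
import glob

from warcio.archiveiterator import ArchiveIterator

def dedup_summary(warc_glob: str) -> None:
    responses = revisits = 0
    stored_bytes = 0
    for path in glob.glob(warc_glob):
        with open(path, "rb") as stream:
            for record in ArchiveIterator(stream):
                length = int(record.rec_headers.get_header("Content-Length") or 0)
                if record.rec_type == "response":
                    responses += 1
                    stored_bytes += length
                elif record.rec_type == "revisit":
                    revisits += 1  # payload not stored again; only a small pointer record
    total = responses + revisits
    if total:
        print(f"captures: {total}, stored in full: {responses}, deduplicated: {revisits}")
        print(f"share of captures avoided by de-duplication: {revisits / total:.1%}")
        print(f"bytes stored for full captures: {stored_bytes:,}")

if __name__ == "__main__":
    dedup_summary("collection/*.warc.gz")
```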