A VISION OF THE ROLE AND FUTURE OF WEB ARCHIVES

Kalev H. Leetaru
Graduate School of Library and Information Science
University of Illinois

Imagine a world in which libraries and archives had never existed. No institutions had ever systematically collected or preserved our collective cultural past: every book, letter, or document was created, read and then immediately thrown away. What would we know about our past? Yet, that is precisely what is happening with the web: more and more of our daily lives occur within the digital world, yet more than two decades after the birth of the modern web, the libraries and archives of this world are still just being formed.

We've reached an incredible point in society. Every single day a quarter billion photographs are uploaded to Facebook, 300 billion emails are sent and 340 million tweets are posted to Twitter. There are more than 644 million websites with 150,000 new ones added each day, and upwards of 156 million blogs. Even more incredibly, the growth rate of content creation in the digital world is exploding. The entire New York Times over the last 60 years contained around 3 billion words. More than 8 billion words are posted to Twitter every single day. That's right, every 24 hours there are 2.5 times as many words posted to Twitter as there were in every article of every issue of the paper of record of the United States over the last half century. By some estimates there have been 50 trillion words in all of the books published over the last half millennium. At its current growth rate, Twitter will reach that milestone less than three years from now. Nearly a third of the planet's population is now connected to the internet and there are as many cell phones as there are people on earth. Yet, for the most part we consume all of this information as it arrives and discard it just as quickly, giving little thought to posterity.
That's where web archives come in: to make sure that a few years, decades, centuries, and millennia from now we will still have at least a partial written record of human society at the dawn of the twenty-first century.

THE WEB ARCHIVE IN TODAY'S WORLD

The loss of the Library of Alexandria, once the greatest library on earth, created an enormous hole in our understanding of the ancient world. Imagine if that library had not only persisted to the present day, but had continued to collect materials through the millennia. Yet, in the web era, we are repeating this cycle of loss, not through a fire or other sudden event like the one that destroyed the Library of Alexandria, but rather through inaction: we are simply not collecting it. The dawn of the digital world exists in the archives of just a few organizations. Many mailing lists and early services like Gopher have largely been lost, while organizations such as Google have invested considerable resources in resurrecting others like USENET. The earliest years of the web are gone forever, but beginning in 1996 the Internet Archive began capturing snapshots, giving us one of the few records of the early iterations of this world. Organizations like the International Internet Preservation Consortium (IIPC) are helping to bring web archivists from across the world and across disciplines together to share experiences and best practices and forge collaborations to help advance these critical efforts.
UNINTENDED USES

Archives exist to preserve a sample of the world for future generations. They accept that they cannot archive everything and don't try to: they operate as opportunistic collectors. Traditional humanities and social sciences scholarship was designed around these limitations: the tradition of deep reading of a small number of works in the humanities was born out of this model. Yet, a new generation of researchers is increasingly using archives in ways they weren't intended for, and these researchers need a greater array of information on how those archives are created in order to anticipate biases and impacts on their findings.

The Library of Congress's Chronicling America site, while technically a web-delivered digital library, not a web archive, offers an example of why greater insight into the archiving process is critical for research. Using the site recently for a project, my search returned ten times as many hits for my topic in El Paso, Texas newspapers as it did for New York City. Further inspection showed this was actually because the Chronicling America site had more content from El Paso newspapers during this time period than it did from New York City papers, rather than this being a reflection of El Paso papers covering my topic in more detail. Part of this issue stems from the acquisition model of Chronicling America: each individual state determines the order in which it digitizes newspapers printed within its borders. One state might begin with smaller papers while another begins with larger papers; one state might digitize a particular year from every paper, while another might digitize the entirety of each paper in turn. Chronicling America also excludes papers that have been digitized already by commercial vendors: thus New York City's largest paper, the New York Times, is not present in the archive. This landscape introduces significant artifacts into searches, but normalization procedures can help address them.
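As a minimal sketch of one such normalization, assume a per-city count of all digitized pages were available. Raw hit counts can then be converted into a share of the pages held for each city. The function name and all numbers below are invented for illustration and are not Chronicling America figures.

```python
def normalize_hits(raw_hits, total_pages):
    """Convert raw matching-page counts into a fraction of all pages held per city.

    raw_hits:    {city: number of pages matching the search}
    total_pages: {city: number of pages the archive holds for that city}
    Cities with no known total are mapped to None rather than guessed.
    """
    shares = {}
    for city, hits in raw_hits.items():
        total = total_pages.get(city, 0)
        shares[city] = hits / total if total else None
    return shares


# Hypothetical counts for illustration only.
raw_hits = {"El Paso": 1000, "New York City": 100}
total_pages = {"El Paso": 200_000, "New York City": 10_000}

shares = normalize_hits(raw_hits, total_pages)
for city, share in shares.items():
    print(city, share)
```

With these made-up totals, El Paso's tenfold lead in raw hits turns into a smaller share of its pages than New York City's, which is the kind of correction normalization is meant to make.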
In order to do so, however, a bibliography is needed that lists every page from every paper that has been included in the archive. This would have allowed me to switch my search results from a raw count of matching newspaper pages into a percent of all pages from each city, which would have accounted for there being more content in El Paso than New York City.

This is even clearer when conducting searches of the historic New York Times. A search of the Times for any keyword over the period from 1945 to the present will show its use declining by 50% over that period. This is not a reflection of that term declining in use, but rather reflects the fact that the Times itself shrank by more than half over this period. Similarly, searches covering the year 1978 will show an 88-day period where the term was never used. This is not because the term dropped out of favor during that period, but rather because a machinists' strike halted the paper's publication entirely. Having an index of the total number of articles published each day (and thus the possible universe of articles the term could have been used in) allows the raw counts to be normalized to yield the true picture of the term's usage. However, no web archive today offers such a master index of its holdings.

One of the core optimizations used by web crawlers can have a significant impact on certain classes of research. Nearly every web archive uses crawlers designed to measure the rate of change of a site (i.e., how often on average pages on that site change) in order to recrawl sites that change often more frequently than those that rarely change. This allows bandwidth and disk storage to be prioritized towards sites that change often, rather than storing a large number of identical snapshots of a site that never changes. However, sometimes it is precisely that rare change that is most interesting.
For example, when studying how White House press releases had changed, I was examining pages that should never show any change whatsoever, and when there was a change, I needed to know the specific day on which the change occurred to reconcile it with political winds at the time. However, the low rate of
change on that portion of the site meant that snapshots were often months or sometimes years apart, making it impossible to narrow some changes down below the level of several years.

In other analyses, the dynamic alteration of the recrawl rate itself is a problem. For example, when studying the inner workings of the Drudge Report over the last half decade, a key research question revolved around the rate at which various elements of that site changed. If the rate of snapshotting was being varied by a software algorithm based on the very phenomena I was measuring, that would strongly bias my findings. In that particular case I was lucky enough to find a specialty archive that existed solely to archive the Drudge Report, and which had collected snapshots every 2 minutes nonstop for more than 6 years.

This is not an easy problem, as archives must balance their very limited resources between crawling for new pages and recrawling existing pages looking for changes. Within recrawling, they must balance the need to pinpoint changes to the narrowest timeframe possible with ensuring they capture as many changes as possible from high-velocity sites.

Finally, the very notion of what constitutes change varies dramatically among research projects. Has a page changed if it still looks the same, but an HTML tag was changed? What about if the title changes, or the background color? Does a change in the navigation bar at the top count the same as a change to the body text? There are as many answers to these questions as there are research projects, and no single solution satisfies them all. When looking at changes to White House press releases, only a change to a page title or body text counted as change, while the Internet Archive counted all of the myriad edits and additions to the White House navigation bar as changes. This required downloading every single snapshot of each page and applying our own filters to extract and compare the body text ourselves.
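A minimal sketch of that kind of filtering follows. It assumes, purely for illustration, that the content of interest is the page title plus any text inside `<p>` elements; real press release pages may be structured differently, and this is not the filter used in the study. Only those parts are extracted from each snapshot and fingerprinted, and two snapshots count as different only when their fingerprints differ. The class and function names are hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collects text found inside <title> and <p> elements only (an assumed filter)."""

    KEEP = {"title", "p"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a kept element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.KEEP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def content_fingerprint(html):
    """Hash the whitespace-normalized title and paragraph text of one snapshot."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def content_changed(old_html, new_html):
    """True only if the kept content differs; navigation and markup edits are ignored."""
    return content_fingerprint(old_html) != content_fingerprint(new_html)


# Toy snapshots: one edits only the navigation list, the other edits the body text.
old = ("<html><head><title>Statement</title></head><body>"
       "<ul><li>Home</li></ul><p>Text of the release.</p></body></html>")
nav_edit = old.replace("<li>Home</li>", "<li>Home</li><li>Blog</li>")
body_edit = old.replace("Text of the release.", "Revised text of the release.")

print(content_changed(old, nav_edit))
print(content_changed(old, body_edit))
```

This sketch relies on `<p>` tags being explicitly closed, which real pages often omit, so a production filter would need a more forgiving parser and a site-specific definition of the body.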
One possible solution to this might be the incorporation of hybrid hierarchical structural and semantic document models that allow a user to indicate which areas of the document he or she cares about and to return only those snapshots in which that section has changed.

WHAT TO KEEP?

As noted in the introduction to this blog post, the digital world is experiencing explosive growth, producing more content in a few hours than was produced in the greater part of a century in the print era. This growth is giving us an incredible view of global society and enabling communication, collaboration, and social research at scales unimaginable even a decade ago, yet the richer this record becomes, the harder it is to archive. The very volume of material that makes the web so exciting as a communications platform means there is simply too much of it to keep. Even in the era of books, there were simply too many of them for any library to keep, but at least we could assume that some library somewhere was probably collecting the books that we weren't: an assumption that isn't necessarily true in the digital world yet.

An age-old mechanism for dealing with overflow is to determine which works are the most important and which can be discarded. Yet, how do we decide what constitutes noise and what should be kept? Talk to a historian writing a biography of a historic figure and he or she will likely point to routine day-to-day letters and diary entries as a critical source of information on that person's mood, feelings, and beliefs. Emerging research on using Twitter to forecast the stock market or measure public sentiment is finding that only when one considers the entirety of all 340 million tweets each day do the key patterns emerge. A tweet of "I'm outside hanging the laundry, such a beautiful day" might at first seem a prime candidate for discarding, but by its very nature, it reflects an author feeling calm and secure and
relaxed: critical population-level dynamics of great interest to social scientists.

Another mechanism is to discard highly similar works, such as multiple editions of the same work. Yet, an emerging area of research on the web is the tracing of memes, which are variations of a quote or story that evolve as they are forwarded across users and communities, much like a real-time version of the telephone game. It is critical for such research to be able to access every version of a story, not just the most recent.

The rise of dual electronic and print publishing pipelines has led to the need to collect two copies of a work, instead of just a single authoritative print edition. Digital editions of books released as websites may include videos, photographs, multimedia and interactive features that provide a very different experience from the print copy. Even in subject domains where print is still the official record, digital has become the de facto record through its ease of access. How many citizens travel to their nearest Federal Depository Library and browse the latest edition of the Public Papers of the President to find press releases and statements by their government? Most likely turn instead to the White House's website, yet a study I coauthored in 2008 found that official US government press releases on the White House website were being continually edited, with key information added and removed and dates changed over time to reflect changing political realities. In a world in which information is so easily changed and even supposedly immutable government material changes with a click of a mouse, how do we as web archivists capture this world and make it available?

This brings up one very critical distinction between the print and digital eras: the concept of change. In the print era, an archive simply needed to collect an item as it was published.
If a book was subsequently changed, the publisher would issue a new edition and notify the library of its availability. A book sitting on a shelf was static: if 20 libraries each held a copy of that book, they could be reasonably certain that all 20 copies were identical to each other. In the digital era, we must constantly scour for new pages to archive, but we also have a new role: checking our existing archive for change. Every single page ever saved by the archive must be rechecked on a regular basis to see if it has changed.

Websites don't make this easy. A study of the Chicago Tribune I conducted for the Center for Research Libraries in 2011 found there was no single master list of articles published on the Tribune's site each day and the RSS feeds were sorted by popularity, not date. To ensure one archived every new article posted to the site, an archivist would have to monitor all 105 main topic pages on the Tribune's site every few hours or risk losing new articles on a news-heavy day. At the level of the web as a whole, one can monitor the DNS domain registry to get a continually updated list of every domain name in existence. However, even this provides only a list of websites like cnn.com, not a list of all of the pages on that site.

In the era of books, a library needn't purchase a work the day it was released, as most books continued to be printed and available for at least months, if not years, afterwards. A library could wait a year or two until it had sufficient budget or space to collect it. Web pages, on the other hand, may have half-lives measured in seconds to minutes. They can change constantly, with no notice, and the velocity of change can be extreme. In addition, more content is arriving in streaming format on the web. Archiving Twitter requires being able to collect and save over 4,000 messages per second in real time, with no ability to go back for missed ones.
A network outage of 10 minutes means 2.5 million tweets are lost forever. In the web world, content producers set the schedule for collection and archivists must adhere to those schedules.

Myron Gutmann, Assistant Director of the National Science Foundation's Directorate for Social, Behavioral, & Economic Sciences, gave a talk earlier this year in which he argued that in the print
era the high cost of producing information meant that whatever was published was worth keeping because there were so many layers of review. In contrast, the tremendously low cost of publication in the digital era means anyone can publish anything without any form of review. This raises the question, even in scholarly disciplines, of what is worth keeping. If an archive becomes too full and a massive community of researchers is served by one set of content and just 10 users are served by another collection of material, whose voice matters the most in what is deleted? How do we make decisions about what to keep? Historically those decisions were made by librarians or archivists by themselves, but as users and data miners become increasingly heavy users of archives, this raises the question of how to engage those communities in these critical decisions.

THE RISE OF THE PARALLEL WEB

When we speak of archiving the web we often think of the web as a single monolithic entity in which all content that is produced or consumed via a web browser is accessible for archiving. The original vision of the web was based on this ideal: an open unified platform in which all material was available to all users. For the most part this vision survived the early years of the web, as users strove to reach the greatest possible audience. Yet, a new trend has begun over the past half decade, corresponding with the rise of social media: the creation of parallel versions of the web. Every one of those quarter billion photographs uploaded to Facebook each day is posted and consumed via the web, whether through a browser on a desktop or a mobile app on a smartphone. Yet, despite transiting the same physical telecommunications infrastructure as the rest of the web, those photos are stored in a parallel web, owned and controlled entirely by a commercial entity. They are not part of the public web and thus not available to web archives.
In many ways this is no different than the libraries and archives of the print era. Libraries focused on collecting books and pamphlets, while a good deal of communication and culture occurred in letters, diaries, drawings, and artwork that have largely been lost. The difference in the digital era is that instead of being scattered across individual households, all of this material is already being centralized into commercially owned archives and libraries. Not everyone desires every conversation of theirs to be preserved for posterity, but in the print era one had a choice: a letter or diary or photograph was a physical object, held by its owner, and could be passed down to later generations. How many of us have come across a shoebox of old photographs or letters from a grandparent? In the digital era, a company holds that material on our behalf, and while most have terms of service that agree we own our material, only one major social media platform today offers an export button that allows us to download a copy of the material we have given it over the years: Google Plus, through Google Takeout.

Twitter has recognized the importance of the communications that occur via its service and has made a feed of its content available to the Library of Congress for archiving for posterity. Most others, like Facebook and international platforms like Weibo or VK (formerly VKontakte), have not. Facebook in effect has become a parallel version of the web, hosted on the web, but walled off from it, with no means for users to archive their material for the future. Twitter offers a shining example of how such platforms can interact with the web archiving community and ensure that their material is archived for future generations. Self-archiving services like Google Takeout offer an intermediate step in which users at least retain the ability to make their own archival copy of their contributions to the web for future generations.
As more of the web moves behind paywalls, password protection, and other mechanisms, creating more and more parallel versions of the web, there must be greater discussion within the web archiving community about how we reach out to
these services to find ways of ensuring users of these communities may archive their material for the future.

DATA MINING

For millennia, scholarship in archives and libraries has meant intensive reading of a small number of works. In the past decade the digital humanities and computational social sciences have led to the growing use of computerized analysis of archives, in which software algorithms are used to identify patterns and point to areas of interest in the data. Digital archives have largely been built around this earlier access modality of deep reading, while computational techniques need rapid access to vast volumes of content, often encompassing the entire archive. New programming interfaces and access policies are needed to enable this new generation of scholarship using web archives.

Informal discussions with web archivists suggest a chicken-or-the-egg dilemma in this regard: data miners want to analyze archives, but can't without the necessary programmatic interfaces, and archives for the most part want to encourage use of their archives, but don't know what interfaces to support without working with data miners. Few archives today support the necessary programmatic interfaces for automated access to their collections, and those that do tend to be aimed at metadata, rather than fulltext content, and use library-centric protocols and mindsets. Some have fairly complex interfaces, with very fine-grained toolkits for each possible use scenario. The few that offer data exports offer an either-or proposition: you either download a ZIP file of the entire contents of everything in the archive or you get nothing; there is no in between. There are some bright spots, though: the National Endowment for the Humanities has made initial steps towards helping archivists and data miners work together through grand challenge programs like its Digging into Data initiative, where a selection of archives made their content available to awardees for large-scale data mining.
Yet, one only has to look at Twitter for a model of what archives could do. Twitter provides only a single small programming interface with a few very basic options, but through that interface it has been able to support an ecosystem of nearly every imaginable research question and tool. It even offers a tiered cost recovery model: users needing only small quantities of data (a "sip") can access the feed for free, while the rest are charged at a tiered pricing model based on the quantity of data they need, up to the entirety of all 340 million tweets at the highest level. Finally, the interfaces provided by Twitter are compatible with the huge numbers of analytical, visualization, and filtering tools provided by the Googles and Yahoos of the world with their open cloud toolkits. If archives took the same approach with a standardized interface like Twitter's, researchers could leverage these huge ecosystems for the study of the web itself.

For some archives, the bottleneck has become the size of the data, which has become too large to share via the network. Through a partnership with Google, data miners can request from the HathiTrust a copy of the Google Books archive, consisting of around 750 million pages of material. Instead of receiving a download link, users must pay the cost of purchasing and shipping a box full of USB drives, because networks, even between research universities, simply cannot keep up with the size of datasets used today. In the sciences, some of the largest projects, such as the Large Synoptic Survey Telescope, are going so far as to purchase and house an entire computing cluster in the same machine room as the data archive and to allow researchers to submit proposals to run their programs on the cluster, because even with USB drives the data is simply too large to copy.
Not all of the barriers to offering bulk data mining access to archives are technical: copyright and other legal restrictions can present significant complications. Even here, though, technology can provide a possible alternative: nonconsumptive analysis, in which software algorithms perform surface-level analyses rather than deep reading of text, may satisfy the requirements of copyright. In other cases, transformations of copyrighted material to another form, such as to a wordlist, as was done with the Google Books Ngrams dataset, may provide possible solutions.

Not everyone appreciates or understands the value web archives provide society, and they are constantly under pressure just to find enough funds to keep the power running. This is an area where partnering with researchers may help: there are only a few sources of funding for the creation and operation of web archives compared with the myriad funding opportunities for research. The increased bandwidth, hardware load, and other resource requirements of large data mining projects come at a real cost, but at the same time, they directly demonstrate the value of those archives to new audiences and disciplines that may be able to partner with those archives on proposals, potentially offering new funding opportunities.

USER INSIGHT

While some archives cannot offer access to their holdings for legal reasons and instead serve only as an archive of last resort, most archives would hold little value to their constituents if they were not able to provide some level of access to the content they archived. User interfaces as a whole today are designed for casual browsing by non-expert users, with simplicity and ease of use as their core principles. As archives become a growing source for scholarly research, archives must address several key areas of need in supporting these more advanced users:

Inventory. There is a critical need for better visibility into the precise holdings of each archive.
With most digital libraries of digitized materials a visitor can browse through the collection from start to end, though even there one usually can't export a CSV file containing a master list of everything in that collection. Most web archives, on the other hand, are accessible only through a direct lookup mechanism where the user types in a URL and gets back any matching snapshots. Archives only store copies of material; they don't provide an index to it or even a listing of what they hold: it is assumed that this role is provided elsewhere. For domains that have been deleted or now house unrelated content, this is not always the case. This would be akin to libraries dropping their reading rooms, stacks, and card catalogs, and storing all of their books in a robotic warehouse. Instead of browsing or requesting a book by title or category, one could only request a book by its ISBN code, which had to be known beforehand, and it was someone else's responsibility to store those codes. A tremendous step forward would be a list from each archive of all of the root domains that it has one or more pages from, but ultimately having a list of all URLs, along with the number of snapshots and a list of the dates of those snapshots, would enable an entirely new form of access to these archives. This data could be used by researchers and others to come up with new ways of accessing and interacting with the data held by these archives.

Meta Search. With better inventory data, we could build metasearch tools that act as the digital equivalent of WorldCat for web archives. Web archives today operate more like document archives than libraries: they hold content, but they themselves often have no idea of the full extent of what they hold.
A scholar looking for a particular print document might have to spend months or even years scouring archives all over the world looking for one that holds a copy of that document, whereas if she was looking for a book, a simple search on WorldCat would turn
Preservation. First and foremost, web archives preserve the web. They act as the web equivalent of the archive or library, constantly monitoring for new content, requesting a copy of that content, and keeping a copy of it for posterity. In this role, their mission is to acquire and preserve the web for future generations, with access being primarily through basic browsing and retrieval. Some archives, for legal purposes, may not even be able to provide access to their holdings during the lifetime of the organizations providing them content, instead holding that material under embargo for a certain number of years, but ensuring its continued survival for future generations.

Research. A unique and emerging use of archives is as a research service for scholars. Very few academics, especially in the social sciences and humanities, have the computational expertise or resources to crawl and download large portions of the web for research. Commercial web crawling companies like Google do not provide their data for research, and thus web archives provide a fundamentally unique and enabling resource for the study of the web that scholars can turn to. Even more critically, many key humanities and social science questions revolve around how ideas and communication change over time, and web archives capture the only view of change on the web. In this role, the secondary mission of archives is to provide access to their holdings that goes beyond the basic browsing needed for casual use or deep scholarly reading of a small number of works, towards providing programmatic tools and access policies that support computational data mining of large portions of their holdings.

Authentication. A final emerging use of archives is as an authentication service. Web data is highly mutable, changing constantly, and there is no way to authenticate whether the page I see today is the same as what I saw yesterday, especially if the change is a small one.
It took more than five years for changes to White House press releases to be spotted via copies held in the Internet Archive, and even then the discovery was entirely by accident. Third-party archives allow authentication of what a page looked like at a given moment. One could even imagine someday a browser plugin that, as a user browsed certain sites on the web (such as government pages, perhaps medical or other pages), would compare each page with the most recent copy stored by a network of web archives, and display an indicator to the user as to whether the page has changed since it was last archived, as well as highlight those changes. In this role, the third peripheral mission of the web archive is to act as a disinterested third party that can authenticate and verify the contents of a given web page at a given moment in time.

Wikipedia offers an intriguing vision of what the ultimate web archive might look like. Every edit to every page since the inception of the site has been archived and is available at a mouse click, allowing a visitor or scholar to trace the entire history of every word. Every operation taken on the site and the complete source code to every algorithm used for various automated processes are fully documented and made available, offering complete technical transparency. Finally, a dedicated bulk download page is maintained from which researchers may download a ZIP file containing the entirety of the site and every edit ever performed, which has made Wikipedia a mainstay of considerable social and computer science research.

As our digital world continues to grow at a breathtaking pace and more and more of our daily lives occur within its digital boundaries, we must ensure that web archives are there to preserve our collective global consciousness for future generations.