A VISION OF THE ROLE AND FUTURE OF WEB ARCHIVES

Kalev H. Leetaru, Graduate School of Library and Information Science, University of Illinois

Imagine a world in which libraries and archives had never existed. No institution had ever systematically collected or preserved our collective cultural past: every book, letter, or document was created, read, and then immediately thrown away. What would we know about our past? Yet that is precisely what is happening with the web: more and more of our daily lives occur within the digital world, yet more than two decades after the birth of the modern web, the libraries and archives of this world are still just being formed.

We've reached an incredible point in society. Every single day a quarter billion photographs are uploaded to Facebook, 300 billion emails are sent, and 340 million tweets are posted to Twitter. There are more than 644 million websites, with 150,000 new ones added each day, and upwards of 156 million blogs. Even more incredibly, the growth rate of content creation in the digital world is exploding. The entire New York Times over the last 60 years contained around 3 billion words; more than 8 billion words are posted to Twitter every single day. That's right: every 24 hours there are 2.5 times as many words posted to Twitter as there were in every article of every issue of the paper of record of the United States over the last half century. By some estimates there have been 50 trillion words in all of the books published over the last half-millennium. At its current growth rate, Twitter will reach that milestone less than three years from now. Nearly a third of the planet's population is now connected to the internet, and there are as many cell phones as there are people on earth. Yet for the most part we consume all of this information as it arrives and discard it just as quickly, giving little thought to posterity. That's where web archives come in: to make sure that a few years, decades, centuries, and millennia from now we will still have at least a partial written record of human society at the dawn of the twenty-first century.

THE WEB ARCHIVE IN TODAY'S WORLD

The loss of the Library of Alexandria, once the greatest library on earth, created an enormous hole in our understanding of the ancient world. Imagine if that library had not only survived to the present day, but had continued to collect materials through the millennia. Yet in the web era we are repeating this cycle of loss, not through a fire or other sudden event like the one that destroyed the Library of Alexandria, but through inaction: we are simply not collecting it. The dawn of the digital world exists in the archives of just a few organizations. Many mailing lists and early services like Gopher have largely been lost, while organizations such as Google have invested considerable resources in resurrecting others, like USENET. The earliest years of the web are gone forever, but beginning in 1996 the Internet Archive began capturing snapshots, giving us one of the few records of the early iterations of this world. Organizations like the International Internet Preservation Consortium (IIPC) are helping to bring web archivists from across the world and across disciplines together to share experiences and best practices and to forge collaborations that advance these critical efforts.

UNINTENDED USES

Archives exist to preserve a sample of the world for future generations. They accept that they cannot archive everything and don't try to: they operate as opportunistic collectors. Traditional humanities and social sciences scholarship was designed around these limitations; the tradition of deep reading of a small number of works in the humanities was born of this model. Yet a new generation of researchers is increasingly using archives in ways they weren't intended for, and these researchers need far more information about how those archives are created in order to anticipate biases and their impact on findings.

The Library of Congress Chronicling America site, while technically a web-delivered digital library rather than a web archive, offers an example of why greater insight into the archiving process is critical for research. Using the site recently for a project, my search returned ten times as many hits for my topic in El Paso, Texas, newspapers as it did in New York City papers. Further inspection showed that this was because Chronicling America simply held more content from El Paso newspapers for this time period than from New York City papers, not because El Paso papers covered my topic in more detail. Part of this stems from the acquisition model of Chronicling America: each individual state determines the order in which it digitizes the newspapers printed within its borders. One state might begin with smaller papers while another begins with larger ones; one state might digitize a particular year from every paper, while another digitizes the entirety of each paper in turn. Chronicling America also excludes papers that have already been digitized by commercial vendors: thus New York City's largest paper, the New York Times, is not present in the archive at all. This landscape introduces significant artifacts into searches, but normalization procedures can help address them. To normalize, however, one needs a bibliography listing every page from every paper included in the archive. That would have allowed me to convert my search results from a raw count of matching newspaper pages into a percentage of all pages from each city, which would have accounted for there being more content from El Paso than from New York City.

This is even clearer when searching the historic New York Times. A search of the Times for almost any keyword over the period from 1945 to the present will show its use declining by 50% over that period. This is not a reflection of the term falling out of use, but of the fact that the Times itself shrank by more than half over this period. Similarly, searches covering the year 1978 will show an 88-day period in which the term was never used: not because the term dropped out of favor, but because a machinists' strike halted the paper's publication entirely. Having an index of the total number of articles published each day (and thus the universe of articles the term could have appeared in) allows the raw counts to be normalized to yield the true picture of the term's usage. However, no web archive today offers such a master index of its holdings.

One of the core optimizations used by web crawlers can have a significant impact on certain classes of research. Nearly every web archive uses crawlers designed to measure the rate of change of a site (i.e., how often, on average, pages on that site change) in order to crawl sites that change often more frequently than those that rarely change. This allows bandwidth and disk storage to be prioritized towards sites that change often, rather than being spent storing a large number of identical snapshots of a site that never changes. However, sometimes it is precisely that rare change that is most interesting. For example, when studying how White House press releases had changed, I was examining pages that should never show any change whatsoever, and when there was a change, I needed to know the specific day on which it occurred in order to reconcile it with the political winds of the time. The slow rate of change on that portion of the site, however, meant that snapshots were often months or even years apart, making it impossible to narrow some changes down to anything finer than a window of several years.
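To make that trade-off concrete, the sketch below shows the kind of change-rate-driven scheduling logic most crawlers use; it is an illustrative simplification, not any particular archive's actual policy. Pages that keep changing are revisited more often, while pages that never change drift toward very long revisit intervals, which is precisely why a rare edit to a static page can only be dated to a months- or years-wide window.

```python
from dataclasses import dataclass

# Illustrative bounds only; a real crawler would tune these per collection.
MIN_INTERVAL_DAYS = 1
MAX_INTERVAL_DAYS = 365

@dataclass
class PageState:
    url: str
    interval_days: float = 7.0  # how long to wait before the next recrawl

    def record_fetch(self, content_changed: bool) -> None:
        """Shrink the revisit interval when a change is seen, grow it when none is.

        This is the classic adaptive-revisit heuristic: bandwidth flows toward
        fast-changing pages, while a page that never changes is eventually
        visited only a few times a year, so a rare edit can fall anywhere
        inside a months-long gap between snapshots.
        """
        if content_changed:
            self.interval_days = max(MIN_INTERVAL_DAYS, self.interval_days / 2)
        else:
            self.interval_days = min(MAX_INTERVAL_DAYS, self.interval_days * 1.5)

if __name__ == "__main__":
    # Hypothetical URL, standing in for a press release that "never" changes.
    press_release = PageState("https://example.gov/press/2001-04-04.html")
    for _ in range(12):  # a year of recrawls that find nothing new
        press_release.record_fetch(content_changed=False)
    print(f"revisit interval is now {press_release.interval_days:.0f} days")
    # A single edit made during that final gap can only be dated to the
    # window between the two surrounding snapshots.
```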

In other analyses, the dynamic alteration of the recrawl rate is itself the problem. For example, when studying the inner workings of the Drudge Report over the last half decade, a key research question revolved around the rate at which various elements of the site changed. If the rate of snapshotting was being varied by a software algorithm based on the very phenomenon I was measuring, that would strongly bias my findings. In that particular case I was lucky enough to find a specialty archive that existed solely to archive the Drudge Report and had collected snapshots every 2 minutes, nonstop, for more than 6 years. This is not an easy problem: archives must balance their very limited resources between crawling for new pages and recrawling existing pages looking for changes, and within recrawling they must balance the need to pinpoint changes to the narrowest timeframe possible against the need to capture as many changes as possible from high-velocity sites.

Finally, the very notion of what constitutes change varies dramatically among research projects. Has a page changed if it still looks the same but an HTML tag was altered? What if the title changes, or the background color? Does a change in the navigation bar at the top count the same as a change to the body text? There are as many answers to these questions as there are research projects, and no single solution satisfies them all. When looking at changes to White House press releases, only a change to a page title or body text counted as a change, while the Internet Archive counted all of the myriad edits and additions to the White House navigation bar as changes. This required downloading every single snapshot of each page and applying our own filters to extract and compare the body text ourselves. One possible solution might be the incorporation of hybrid hierarchical structural and semantic document models that allow users to indicate which areas of the document they care about and to return only those snapshots in which that section has changed.
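The sketch below illustrates the kind of filtering we had to apply by hand, and what such a structurally aware interface might offer as a built-in option: strip out the regions a project does not care about, then compare only the remaining body text across snapshots. The set of tags treated as boilerplate here is an assumption made for illustration, not a general-purpose extractor.

```python
import hashlib
from html.parser import HTMLParser

# Regions a researcher might declare "not part of the document" for change
# detection; which regions matter is a per-project decision, as noted above.
IGNORED_TAGS = {"script", "style", "nav", "header", "footer"}

class BodyTextExtractor(HTMLParser):
    """Collect visible text while skipping ignored regions."""

    def __init__(self) -> None:
        super().__init__()
        self.depth_ignored = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            self.depth_ignored += 1

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS and self.depth_ignored > 0:
            self.depth_ignored -= 1

    def handle_data(self, data):
        if self.depth_ignored == 0 and data.strip():
            self.chunks.append(data.strip())

def body_fingerprint(html: str) -> str:
    """Hash of the body text only, so markup and navigation edits don't register."""
    parser = BodyTextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

if __name__ == "__main__":
    snapshot_a = "<html><nav>old menu</nav><p>The press release text.</p></html>"
    snapshot_b = "<html><nav>new menu</nav><p>The press release text.</p></html>"
    snapshot_c = "<html><nav>new menu</nav><p>Revised press release text.</p></html>"
    print(body_fingerprint(snapshot_a) == body_fingerprint(snapshot_b))  # True: only the nav changed
    print(body_fingerprint(snapshot_b) == body_fingerprint(snapshot_c))  # False: the body changed
```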
WHAT TO KEEP?

As noted in the introduction to this blog post, the digital world is experiencing explosive growth, producing more content in a few hours than was produced in the greater part of a century in the print era. This growth is giving us an incredible view of global society and enabling communication, collaboration, and social research at scales unimaginable even a decade ago, yet the richer this archive becomes, the harder it is to archive. The very volume of material that makes the web so exciting as a communications platform means there is simply too much of it to keep. Even in the era of books there were too many for any one library to keep, but at least we could assume that some library somewhere was probably collecting the books that we weren't: an assumption that isn't necessarily true in the digital world yet.

An age-old mechanism for dealing with overflow is to determine which works are the most important and which can be discarded. Yet how do we decide what constitutes noise and what should be kept? Talk to a historian writing a biography of a historic figure and he or she will likely point to routine day-to-day letters and diary entries as a critical source of information on that person's mood, feelings, and beliefs. Emerging research on using Twitter to forecast the stock market or measure public sentiment is finding that the key patterns emerge only when one considers the entirety of all 340 million tweets each day. A tweet of "I'm outside hanging the laundry, such a beautiful day" might at first seem a prime candidate for discarding, but by its very nature it reflects an author feeling calm, secure, and relaxed: exactly the kind of population-level dynamics of great interest to social scientists.

Another mechanism is to discard highly similar works, such as multiple editions of the same work. Yet an emerging area of research on the web is the tracing of memes: variations of a quote or story that evolve as they are forwarded across users and communities, much like a real-time version of the telephone game. For such research it is critical to be able to access every version of a story, not just the most recent. The rise of dual electronic-plus-print publishing pipelines has led to the need to collect two copies of a work instead of a single authoritative print edition. Digital editions of books released as websites may include videos, photographs, multimedia, and interactive features that provide a very different experience from the print copy. Even in subject domains where print is still the official record, digital has become the de facto record through its ease of access. How many citizens travel to their nearest Federal Depository Library and browse the latest edition of the Public Papers of the President to find press releases and statements by their government? Most instead turn to the White House's website, yet a study I co-authored in 2008 found that official US government press releases on the White House website were being continually edited, with key information added and removed and dates changed over time to reflect changing political realities. In a world in which information is so easily changed, and even supposedly immutable government material changes with the click of a mouse, how do we as web archivists capture this world and make it available?

This brings up one very critical distinction between the print and digital eras: the concept of change. In the print era, an archive simply needed to collect an item as it was published. If a book was subsequently changed, the publisher would issue a new edition and notify the library of its availability. A book sitting on a shelf was static: if 20 libraries each held a copy of that book, they could be reasonably certain that all 20 copies were identical. In the digital era, we must constantly scour for new pages to archive, but we also have a new role: checking our existing archive for change. Every single page ever saved by the archive must be rechecked on a regular basis to see if it has changed.

Websites don't make this easy. A study of the Chicago Tribune I conducted for the Center for Research Libraries in 2011 found that there was no single master list of articles published on the Tribune's site each day, and that the RSS feeds were sorted by popularity, not date. To ensure that every new article posted to the site was archived, an archivist would have to monitor all 105 main topic pages on the Tribune's site every few hours or risk losing new articles on a news-heavy day. At the level of the web as a whole, one can monitor the DNS domain registry to get a continually updated list of every domain name in existence, but even this provides only a list of websites like cnn.com, not a list of the pages on each site. In the era of books, a library needn't purchase a work the day it was released, as most books continued to be printed and available for at least months, if not years, afterwards; a library could wait a year or two until it had sufficient budget or space to collect it. Web pages, on the other hand, may have half-lives measured in seconds to minutes. They can change constantly, with no notice, and the velocity of change can be extreme. In addition, more content is arriving on the web in streaming form. Archiving Twitter requires being able to collect and save over 4,000 messages per second in real time, with no ability to go back for missed ones: a network outage of 10 minutes means 2.5 million tweets lost forever. In the web world, content producers set the schedule for collection, and archivists must adhere to those schedules.
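The sketch below illustrates this collect-it-now-or-lose-it-forever constraint: a minimal streaming collector that writes messages to disk in batches as they arrive. The message stream is a stand-in generator rather than any platform's actual API; with a real firehose the structure is the same, and any gap in the loop is simply a permanent gap in the archive.

```python
import json
import time
from pathlib import Path
from typing import Iterable, Iterator

def simulated_stream(n: int) -> Iterator[dict]:
    """Stand-in for a real-time feed; a real one cannot be rewound or replayed."""
    for i in range(n):
        yield {"id": i, "posted_at": time.time(), "text": f"message {i}"}

def archive_stream(stream: Iterable[dict], out_dir: Path, batch_size: int = 1000) -> None:
    """Append messages to newline-delimited JSON files, one file per batch.

    Batching keeps the writer ahead of the feed: if collection falls behind or
    the network drops, the missed messages are gone for good, so the only
    tunable is how much to buffer before flushing to disk.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    batch: list[dict] = []
    batch_number = 0

    def flush() -> None:
        path = out_dir / f"batch-{batch_number:06d}.jsonl"
        with path.open("w", encoding="utf-8") as f:
            f.write("\n".join(json.dumps(m) for m in batch) + "\n")

    for message in stream:
        batch.append(message)
        if len(batch) >= batch_size:
            flush()
            batch_number += 1
            batch = []
    if batch:  # flush whatever remains when the stream ends
        flush()

if __name__ == "__main__":
    archive_stream(simulated_stream(2500), Path("stream_archive"), batch_size=1000)
    print(sorted(p.name for p in Path("stream_archive").glob("*.jsonl")))
```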

Myron Gutmann, Assistant Director of the National Science Foundation's Directorate for Social, Behavioral, and Economic Sciences, argued in a talk earlier this year that in the print era the high cost of producing information meant that whatever was published was worth keeping, because it had passed through so many layers of review. In contrast, the tremendously low cost of publication in the digital era means anyone can publish anything without any form of review. This raises the question, even in scholarly disciplines, of what is worth keeping. If an archive becomes too full, and a massive community of researchers is served by one set of content while just 10 users are served by another, whose voice matters most in deciding what is deleted? How do we make decisions about what to keep? Historically those decisions were made by librarians or archivists on their own, but as both end users and data miners make increasing use of archives, the question becomes how to engage those communities in these critical decisions.

THE RISE OF THE PARALLEL WEB

When we speak of archiving the web we often think of the web as a single monolithic entity in which all content produced or consumed via a web browser is accessible for archiving. The original vision of the web was based on this ideal: an open, unified platform in which all material was available to all users. For the most part this vision survived the early years of the web, as users strove to reach the greatest possible audience. Yet a new trend has emerged over the past half decade, corresponding with the rise of social media: the creation of parallel versions of the web. Every one of those quarter billion photographs uploaded to Facebook each day is posted and consumed via the web, whether through a browser on a desktop or a mobile app on a smartphone. Yet despite transiting the same physical telecommunications infrastructure as the rest of the web, those photos are stored in a parallel web, owned and controlled entirely by a commercial entity. They are not part of the public web and thus not available to web archives.

In many ways this is no different from the libraries and archives of the print era. Libraries focused on collecting books and pamphlets, while a good deal of communication and culture occurred in letters, diaries, drawings, and artwork that have largely been lost. The difference in the digital era is that instead of being scattered across individual households, all of this material is already being centralized into commercially owned archives and libraries. Not everyone desires every conversation of theirs to be preserved for posterity, but in the print era one had a choice: a letter or diary or photograph was a physical object, held by its owner, that could be passed down to later generations. How many of us have come across a shoebox of old photographs or letters from a grandparent? In the digital era, a company holds that material on our behalf, and while most have terms of service agreeing that we own our material, only one major social media platform today offers an export button that allows us to download a copy of the material we have given it over the years: Google Plus, through Google Takeout. Twitter has recognized the importance of the communications that occur via its service and has made a feed of its content available to the Library of Congress for archiving for posterity. Most others, like Facebook and international platforms such as Weibo or VK (formerly VKontakte), have not. Facebook has in effect become a parallel version of the web, hosted on the web but walled off from it, with no means for users to archive their material for the future.

Twitter offers a shining example of how such platforms can interact with the web archiving community and ensure that their material is archived for future generations. Self-archiving services like Google Takeout offer an intermediate step, in which users at least retain the ability to make their own archival copy of their contributions to the web. As more of the web moves behind paywalls, password protection, and other mechanisms, creating more and more parallel versions of the web, there must be greater discussion within the web archiving community about how we reach out to these services to find ways of ensuring that their users can archive their material for the future.

DATA MINING

For millennia, scholarship in archives and libraries has meant intensive reading of a small number of works. In the past decade the digital humanities and computational social sciences have led to the growing use of computerized analysis of archives, in which software algorithms identify patterns and point to areas of interest in the data. Digital archives have largely been built around the earlier access modality of deep reading, while computational techniques need rapid access to vast volumes of content, often encompassing the entire archive. New programming interfaces and access policies are needed to enable this new generation of scholarship using web archives. Informal discussions with web archivists suggest a chicken-or-egg dilemma in this regard: data miners want to analyze archives but can't without the necessary programmatic interfaces, while archives for the most part want to encourage use of their collections but don't know what interfaces to support without working with data miners. Few archives today support the programmatic interfaces necessary for automated access to their collections, and those that do tend to be aimed at metadata rather than full-text content and to use library-centric protocols and mindsets. Some have fairly complex interfaces, with very fine-grained toolkits for each possible use scenario. The few that offer data exports present an either/or proposition: you either download a ZIP file of the entire contents of the archive or you get nothing; there is no in-between. There are some bright spots, though: the National Endowment for the Humanities has made initial steps towards helping archivists and data miners work together through grand challenge programs like its Digging into Data initiative, in which a selection of archives made their content available to awardees for large-scale data mining.

Yet one only has to look at Twitter for a model of what archives could do. Twitter provides only a single small programming interface with a few very basic options, but through that interface it has been able to support an ecosystem covering nearly every imaginable research question and tool. It even offers a tiered cost-recovery model: users needing only small quantities of data (a "sip") can access the feed for free, while the rest are charged on a tiered pricing model based on the quantity of data they need, up to the entirety of all 340 million daily tweets at the highest level. Finally, the interfaces provided by Twitter are compatible with the huge number of analytical, visualization, and filtering tools provided by the Googles and Yahoos of the world through their open cloud toolkits. If archives took the same approach with a standardized interface like Twitter's, researchers could leverage these huge ecosystems for the study of the web itself.

For some archives the bottleneck has become the size of the data, which has grown too large to share via the network. Through a partnership with Google, data miners can request from the HathiTrust a copy of the Google Books archive, consisting of around 750 million pages of material. Instead of receiving a download link, users must pay the cost of purchasing and shipping a box full of USB drives, because networks, even between research universities, simply cannot keep up with the size of datasets used today. In the sciences, some of the largest projects, such as the Large Synoptic Survey Telescope, are going as far as to purchase and house an entire computing cluster in the same machine room as the data archive and to allow researchers to submit proposals to run their programs on that cluster, because even with USB drives the data is simply too large to copy.
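Some back-of-envelope arithmetic shows why shipping drives wins. The 750 million page figure comes from the text above; the average size per page and the link speeds are assumptions chosen only to illustrate the order of magnitude.

```python
# Back-of-envelope: how long would it take to move a Google Books-scale
# collection over the network? The 750 million page count is from the text;
# the ~300 KB average per page (compressed scan plus OCR text) and the link
# speeds below are illustrative assumptions, not measured values.
PAGES = 750_000_000
BYTES_PER_PAGE = 300 * 1024           # assumed average
total_bytes = PAGES * BYTES_PER_PAGE  # a couple hundred terabytes under these assumptions

for label, bits_per_second in [("100 Mbps campus link", 100e6),
                               ("1 Gbps research link", 1e9),
                               ("10 Gbps backbone", 10e9)]:
    seconds = total_bytes * 8 / bits_per_second
    print(f"{label}: about {seconds / 86400:.0f} days of sustained transfer")
```

Even the most optimistic row assumes the link runs flat out for days with nothing else on it, which is rarely true in practice; a box of drives in a shipping crate needs no such assumption.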

Not all of the barriers to offering bulk data mining access to archives are technical: copyright and other legal restrictions can present significant complications. Even here, though, technology can provide a possible alternative: non-consumptive analysis, in which software algorithms perform surface-level analyses rather than deep reading of the text, may satisfy the requirements of copyright. In other cases, transformations of copyrighted material into another form, such as into a word list, as was done with the Google Books Ngrams dataset, may provide possible solutions.

Not everyone appreciates or understands the value web archives provide to society, and the archives themselves are constantly under pressure just to find enough funds to keep the power running. This is an area where partnering with researchers may help: there are only a few sources of funding for the creation and operation of web archives, compared with the myriad funding opportunities for research. The increased bandwidth, hardware load, and other resource requirements of large data mining projects come at a real cost, but at the same time they directly demonstrate the value of those archives to new audiences and disciplines that may be able to partner with them on proposals, potentially opening new funding opportunities.

USER INSIGHT

While some archives cannot offer access to their holdings for legal reasons and instead serve only as an archive of last resort, most archives would hold little value to their constituents if they could not provide some level of access to the content they archived. User interfaces today are, as a whole, designed for casual browsing by non-expert users, with simplicity and ease of use as their core principles. As archives become a growing source for scholarly research, they must address several key areas of need in supporting more advanced users:

Inventory. There is a critical need for better visibility into the precise holdings of each archive. With most digital libraries of digitized materials, a visitor can browse through the collection from start to end, though even there one usually can't export a CSV file containing a master list of everything in the collection. Most web archives, on the other hand, are accessible only through a direct lookup mechanism in which the user types in a URL and gets back any matching snapshots. Archives only store copies of material; they don't provide an index to it or even a listing of what they hold, on the assumption that this role is filled elsewhere. For domains that have since been deleted or that now house unrelated content, that assumption no longer holds. This is akin to libraries dropping their reading rooms, stacks, and card catalogs and storing all of their books in a robotic warehouse: instead of browsing or requesting a book by title or category, one could only request a book by its ISBN code, which had to be known beforehand, and it was someone else's responsibility to store those codes. A tremendous step forward would be a list from each archive of all of the root domains from which it holds one or more pages, but ultimately a list of all URLs, along with the number of snapshots and the dates of those snapshots, would enable an entirely new form of access to these archives. Such data could be used by researchers and others to come up with new ways of accessing and interacting with the material these archives hold.

Meta Search. With better inventory data, we could build metasearch tools that act as the digital equivalent of WorldCat for web archives. Web archives today operate more like document archives than libraries: they hold content, but they themselves often have no idea of the full extent of what they hold. A scholar looking for a particular print document might have to spend months or even years scouring archives all over the world for one that holds a copy, whereas if she were looking for a book, a simple search on WorldCat would turn up a list of every participating library holding a copy in its electronic catalog. This is possible because libraries have invested in maintaining inventories of their holdings and in standardizing how those inventories are stored, so that third parties can aggregate them and develop services that allow users to search across them. Imagine being able to type in a URL and see every copy from every web archive in the world, rather than just the copies held by any one archive.
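The sketch below shows the aggregation layer such inventories would enable. Everything in it is hypothetical: the per-archive lookup functions stand in for whatever inventory or lookup interface each archive chose to expose. The point is simply that once a common record shape exists, a WorldCat-style federated lookup is little more than a few dozen lines of glue.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class SnapshotRecord:
    """A minimal common record shape an inventory standard might settle on."""
    archive: str
    url: str
    timestamp: str  # e.g. "20010404120000"

# Hypothetical per-archive lookup functions. In practice each would query that
# archive's own inventory or lookup service and translate into SnapshotRecord.
def lookup_archive_a(url: str) -> Iterable[SnapshotRecord]:
    return [SnapshotRecord("Archive A", url, "20010404120000"),
            SnapshotRecord("Archive A", url, "20031130080000")]

def lookup_archive_b(url: str) -> Iterable[SnapshotRecord]:
    return [SnapshotRecord("Archive B", url, "20020101000000")]

def metasearch(url: str,
               lookups: list[Callable[[str], Iterable[SnapshotRecord]]]) -> list[SnapshotRecord]:
    """Fan the same URL out to every participating archive and merge the results."""
    results: list[SnapshotRecord] = []
    for lookup in lookups:
        results.extend(lookup(url))
    return sorted(results, key=lambda r: r.timestamp)

if __name__ == "__main__":
    for record in metasearch("http://example.gov/press/release.html",
                             [lookup_archive_a, lookup_archive_b]):
        print(record.archive, record.timestamp)
```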

Specialty Archives. Metasearch would allow federated search across all archives, but it also raises a concern about backups of smaller specialty archives. Larger whole-web archives like the Internet Archive still can't possibly archive everything that exists. Specialty archives fill this niche, often with institutional focuses or through a researcher creating an archive of material on a particular niche topic for her own use. Often these archives are created for a particular research project and then discarded when the paper is published. How do we bring these into the fold? Perhaps some mechanism is needed for allowing those archives to be submitted to a network of web archives, essentially saying, "if you're interested, here you go." They would need to be marked separately, since their content was produced outside of the main archive's processes, but as web crawlers become easier to use and more researchers create their own curated specialty collections, should we have mechanisms to allow them to be archived, leveraging their resources to penetrate areas of the web we might not otherwise reach?

Citability. For archives to be useful in scholarly research, a particular snapshot of a page must have a permanent identifier that can be cited in the reference list of a publication. The Internet Archive provides an ideal example of this: each snapshot has its own permanent URL that includes both the page URL and the exact timestamp of that snapshot, and this URL can be cited in a publication in the same format as any other webpage. Yet not every archive provides this type of access; some make use of AJAX (interactive JavaScript applications) to provide a more desktop-like browsing experience, but in doing so mask the URL for each snapshot, making it impossible to point others to that copy.

TECHNICAL INSIGHT

In the modern era, libraries and archives have existed decoupled from their researchers: a professional class collected and curated the collections, and scholars traveled to whichever institutions held the materials they needed. Few records exist as to why a given library collected this work rather than that one, and as scholars we simply accept this. Yet perhaps in the digital era we can do better, as most of these decisions are recorded in emails, memos, and other materials, all of them searchable and indexable. Web crawlers are seeded with starting URLs and crawl based on deterministic software algorithms, both of which can be documented for scholars. Most web archives, however, operate as black boxes designed for casual browsing and retrieval of individual objects, without asking too many questions about how a given object got there. This is in stark contrast to digitized archives, in which every conceivable piece of metadata is collected. A visitor to the Internet Archive today encounters an odd experience: retrieving a digitized book yields a wealth of information on how that digital copy came to be, from the specific library it came from to the name of the person who operated the scanner that photographed it, while retrieving a web page yields only a list of available snapshot dates.

Snapshot Timestamps. All archives store an internal timestamp recording the precise moment when a page snapshot was downloaded, but their user interfaces often mask this information. For example, when examining changes in White House press releases, we found that clicking on a snapshot for April 4, 2001 in the Internet Archive would always take us to a snapshot of the page we requested, but if we looked in the URL bar (the Internet Archive includes the timestamp of the snapshot in the URL), we noticed that occasionally the snapshot we were ultimately given was from days or weeks before or after our requested date.

Upon further research, we found that some archives automatically redirect a user to the nearest date when a given snapshot becomes unavailable due to hardware failure or other reasons. This is ideal behavior for a casual user, but for an expert user tracing how changes in a page correspond to political events occurring on particular days, it is problematic. Archives should provide a notice when a requested snapshot is not available, allowing the user to decide whether to proceed to the closest available one or select another date.

Page Versus Site Timestamps. Some archives display only a single timestamp for all pages collected from a given site during a particular crawl: usually the time at which the crawlers started archiving that site. Even a medium-sized site may take hours or days to crawl fully once rate limiting and other factors are taken into account, and for some users it is imperative to know the precise moment when each page was requested, not when the crawlers first entered the site. Most archives store this information, so it is simply a matter of exposing it in the user interface for those users who need it.

Crawl Algorithms. Not every site can be crawled in its entirety: some sites may simply be too large, have complex linking structures that make it difficult to find every page, or be dynamically generated. Some research questions may be affected by the algorithm used to crawl the site (depth-first vs. breadth-first), the seed URLs used to enter the site (the front page, table-of-contents pages, content pages, etc.), where the crawl was aborted (if it was), which pages errored during the crawl (and thus whose links were not followed), and so on. If, for example, one wishes to estimate the size of a dynamic, database-driven website, such factors can be used to draw estimates of its total size and composition, but only if users can access these technical characteristics of the crawl.

Raw Source Access. Current archives are designed to provide a transparent "time machine" view of the web, in which clicking on a snapshot attempts to render the page in a modern browser as faithfully as possible to what it looked like when it was captured. However, a page might contain embedded HTML instructions such as a <META REFRESH> tag or JavaScript code that automatically forwards the browser to a new URL, and this may happen without the user noticing. In our study of White House press releases, we were especially interested in pages that had been blanked out, where a press release had been replaced with a <META REFRESH> tag and an editorial note left in an HTML comment in the page. Clicking on these pages in the Internet Archive interface simply forwarded us to the new URL indicated by the refresh command, so we had to download the pages with a separate downloading tool in order to review the source code of each page without being redirected. This is a relatively rare scenario, but it would be helpful for archives to provide a "view source" mode, where clicking on a snapshot takes the user straight to the source code of the page instead of trying to display it.

Crawler Physical Location. Several major foreign news outlets embargo content or present different selections or versions of their content depending on where the visitor's computer is physically located. A visitor accessing such a site will see a very different picture depending on whether she is in the United States, the United Kingdom, China, or Russia. This is a growing issue, as more sites adopt content management systems that dynamically adjust the structure and layout of the site for each individual visitor based on their actions as they click through the site. Analyses of such sites require information on where the crawlers were physically located and the exact order of pages they requested from the site. As with the other recommendations listed above, this information is already held by most archives; it is simply a matter of making it more available to users.
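The sketch below illustrates the kind of raw-source inspection we had to fall back on for the "Raw Source Access" problem described above: examine the stored HTML itself for a meta refresh directive and report its target, rather than letting a browser silently follow it. The regular expression is deliberately simple and purely illustrative.

```python
import re
from typing import Optional

# Matches <meta http-equiv="refresh" content="0; url=...">, case-insensitively.
# Deliberately simple: enough for illustration, not a general HTML parser, and
# it assumes the http-equiv attribute appears before content.
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv\s*=\s*["\']refresh["\'][^>]*content\s*=\s*["\']([^"\']*)["\']',
    re.IGNORECASE)

def refresh_target(html: str) -> Optional[str]:
    """Return the redirect target of a meta refresh tag, if the page has one."""
    match = META_REFRESH.search(html)
    if not match:
        return None
    content = match.group(1)  # e.g. "0; url=/new/location.html"
    _, _, url_part = content.partition("url=")
    return url_part.strip() or None

if __name__ == "__main__":
    blanked_page = (
        '<html><head>'
        '<meta http-equiv="refresh" content="0; url=http://example.gov/new.html">'
        '</head><!-- release withdrawn --><body></body></html>'
    )
    print(refresh_target(blanked_page))  # the page silently forwards here
    print(refresh_target("<html><body>A normal page.</body></html>"))  # None
```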

FIDELITY AND LINKAGE

Fidelity. Modern web archiving platforms capture not only the HTML code of a page, but also interpret the HTML and associated CSS files to compile a list of the images, CSS files, JavaScript code, and other files necessary to properly display the page, and archive these as well. The rise of interactive and highly multimedia web pages is challenging this approach, as pages may contain embedded Flash or AJAX/JavaScript applications, streaming video, and embedded widgets displaying information from other sites. No longer limited to high-design or highly technical sites, these features are making their way into more traditional websites, such as the news media. For example, the BBC's website includes both Flash and JavaScript animations on its front page, while the Chicago Tribune's site includes Flash animations on its front page that respond to mouseovers and animate or perform other actions. The BBC also includes an embedded JavaScript widget that displays advertisements, and both sites include extensive embedded streaming Flash-based video. Many of these tools reference data or JavaScript code on other sites: many sites now make use of Google's Visualization API toolkit for interactive graphs and displays, for example, and simply link to the code housed on Google's site. On the one hand, we might dismiss advertisements and embedded content as not worth archiving, yet a rich literature in the advertising discipline addresses the psychological impact of advertisements and other sidebar material on the processing of information in the web era. Even digitized historical newspaper archives have been very careful to offer access to the entire scanned page image, so that scholars can examine advertisements and layout information rather than just the article text. Excluding dynamic content will make it impossible for scholars of the future to understand how advertisements were used on the web. Yet simply saving a copy of a Flash or AJAX widget may not be sufficient, as technical dependencies may render it unexecutable 20 years from now. One possibility might be creating a screen capture of each page as it is archived, to provide at least a coarse snapshot of what that page looked like to a visitor of the time period.

Web/Social Linkage. Many sites use social media platforms like Twitter and Facebook as part of their overall information ecosystem. For example, the front page of the Chicago Tribune prominently links to its Facebook page, where editors post a curated assortment of links to Tribune content over the course of each day. Visitors "like" stories and post comments on the Facebook summary of each story, creating a rich environment of commentary that exists in parallel to the original webpage on the Tribune site. Other sites allow commentary directly on their webpages through a user comments section; some may only allow comments for a few days after a page is posted, while others accept comments years later. This social narrative is an integral part of the content seen by visitors of the time, yet how do we properly preserve this material, especially from linked social media platform profiles?

CONCLUSIONS AND THE ROLE OF ARCHIVES

As web archives mature and expand, a growing question revolves around the role of these archives in society. What should their primary mission(s) be, and how can they best fulfill those roles? At their most basic level, I believe web archives fulfill three primary roles: Preservation, Research, and Authentication, in that order.

Preservation. First and foremost, web archives preserve the web. They act as the web equivalent of the archive or library, constantly monitoring for new content, requesting a copy of that content, and keeping it for posterity. In this role, their mission is to acquire and preserve the web for future generations, with access provided primarily through basic browsing and retrieval. Some archives, for legal reasons, may not even be able to provide access to their holdings during the lifetime of the organizations providing them content, instead holding that material under embargo for a certain number of years while ensuring its continued survival for future generations.

Research. A unique and emerging use of archives is as a research service for scholars. Very few academics, especially in the social sciences and humanities, have the computational expertise or resources to crawl and download large portions of the web for research. Commercial web-crawling companies like Google do not provide their data for research, and thus web archives are a fundamentally unique and enabling resource for the study of the web that scholars can turn to. Even more critically, many key humanities and social science questions revolve around how ideas and communication change over time, and web archives capture the only view of change on the web. In this role, the secondary mission of archives is to provide access to their holdings that goes beyond the basic browsing needed for casual use or for deep scholarly reading of a small number of works, towards programmatic tools and access policies that support computational data mining of large portions of their holdings.

Authentication. A final emerging use of archives is as an authentication service. Web data is highly mutable, changing constantly, and there is no way to authenticate whether the page I see today is the same as the one I saw yesterday, especially if the change is a small one. It took more than five years for the changes to White House press releases to be spotted via copies held in the Internet Archive, and even then the discovery was entirely by accident. Third-party archives allow authentication of what a page looked like at a given moment. One could even imagine a browser plugin that, as a user browsed certain sites (government pages, perhaps medical or other pages), would compare each page with the most recent copy stored by a network of web archives and display an indicator as to whether the page has changed since it was last archived, as well as highlight those changes. In this role, the third, peripheral mission of the web archive is to act as a disinterested third party that can authenticate and verify the contents of a given web page at a given moment in time.

Wikipedia offers an intriguing vision of what the ultimate web archive might look like. Every edit to every page since the inception of the site has been archived and is available at a mouse click, allowing a visitor or scholar to trace the entire history of every word. Every operation taken on the site, and the complete source code of every algorithm used for its automated processes, are fully documented and made available, offering complete technical transparency. Finally, a dedicated bulk download page is maintained from which researchers may download a ZIP file containing the entirety of the site and every edit ever performed, which has made Wikipedia a mainstay of considerable social and computer science research.
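As a rough illustration of the comparison such a verification plugin might perform, the sketch below diffs a live page against an archived copy. The archived text would in practice come from a lookup against a network of web archives (hypothetical here), and the hard part is normalizing pages so that cosmetic differences are not flagged as substantive change.

```python
import difflib
import hashlib

def normalize(text: str) -> list[str]:
    """Collapse whitespace so trivial formatting differences don't trigger alarms."""
    return [" ".join(line.split()) for line in text.splitlines() if line.strip()]

def verify_against_archive(live_text: str, archived_text: str) -> dict:
    """Compare the live page text with an archived copy and summarize the result.

    In a real plugin, archived_text would be retrieved from a web archive
    lookup (hypothetical here) and the diff would be rendered as in-page
    highlights rather than printed.
    """
    live_lines = normalize(live_text)
    archived_lines = normalize(archived_text)
    diff = list(difflib.unified_diff(archived_lines, live_lines,
                                     fromfile="archived", tofile="live", lineterm=""))
    return {
        "changed": live_lines != archived_lines,
        "live_digest": hashlib.sha256("\n".join(live_lines).encode("utf-8")).hexdigest(),
        "diff": diff,
    }

if __name__ == "__main__":
    archived = "Press Release\nThe program will cost $10 million.\n"
    live = "Press Release\nThe program will cost $14 million.\n"
    report = verify_against_archive(live, archived)
    print("page changed since last archived:", report["changed"])
    print("\n".join(report["diff"]))
```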
As our digital world continues to grow at a breathtaking pace, and more and more of our daily lives occur within its boundaries, we must ensure that web archives are there to preserve our collective global consciousness for future generations.
