Archiving the Web and Beyond: A Look at Twitter and Facebook (and some other things too). July 23, 2014. Benn Joseph, Manuscript Librarian, Northwestern University Library
Outline of discussion: Digital preservation management (in general). How (and why) we archive websites at Northwestern University (and how to make the case). Social media archiving and the story of the Twitter archive at LC. Behind the scenes: how to get your data out. Technology and tools: how do we do it? Sustaining the collection: ensuring preservation and understanding the costs.
http://xkcd.com/1360/ Digital preservation management
Backups vs. Archival Storage Backups keep your computer working and files safe Archival Storage keeps the content accessible for future computers and users
What would archival storage comprise? Objects = (files + metadata). The storage system may include a variety of formats. Metadata: identification and description; it makes the object understandable. A unique identifier makes the object traceable in time.
Metadata uniquely identifies digital objects. Unique identifiers can also store descriptive data. EX: MFM121587
Arranging digital objects Digital photos tossed in a folder for future archiving, credit: Keri Myers (a volunteer archivist with NDIIPP).
Lots of Copies "let us save what remains not by vaults and locks which fence them from the public eye and use in consigning them to the waste of time, but by such a multiplication of copies, as shall place them beyond the reach of accident." - Thomas Jefferson, 1791
How many? 2-6 COPIES! At least 2 copies in 2 places. Consider file size and Personally Identifiable Information (PII).
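The "at least 2 copies in 2 places" rule can be sketched at the command line: copy the object to a second location, then confirm the copies actually match before trusting them. The paths and the filename below are illustrative, not prescribed by any tool.

```shell
# Minimal sketch: two copies in two places, verified by checksum.
# Paths and "photo.jpg" are stand-ins for real storage locations and objects.
mkdir -p /tmp/archive_primary /tmp/archive_secondary

printf 'example object\n' > /tmp/archive_primary/photo.jpg
cp /tmp/archive_primary/photo.jpg /tmp/archive_secondary/photo.jpg

# Matching checksums mean the copies agree; re-run periodically to catch rot.
sha256sum /tmp/archive_primary/photo.jpg /tmp/archive_secondary/photo.jpg
```

Comparing checksums, rather than file sizes or dates, is what catches silent corruption in one of the copies.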
Offline Storage
Online Storage Cloud and collaborative storage
Dangers! to digital objects Change and loss, accidental and intentional. DPOE Baseline Modules: Identify, version 3.0
Dangers! to digital objects Obsolescence as technology evolves DPOE Baseline Modules: Identify, version 3.0
Dangers! to digital objects Inappropriate access, e.g., confidential data. DPOE Baseline Modules: Identify, version 2.0, Nov 2011
Dangers! to digital objects Non-compliance with standards and requirements. DPOE Baseline Modules: Identify, version 2.0, Nov 2011
Dangers! to digital objects Disasters emergencies of all kinds DPOE Baseline Modules: Identify, version 2.0, Nov 2011
Everyday Protection Know where your content is located: onsite and offsite; online and offline (administrative metadata). Know who can have access to it. Know who accesses your secure information: for staff, depositors, users (authentication). DPOE Baseline Modules: Identify, version 2.0, Nov 2011
Web Archiving and how we archive websites at NU. Web Archiving Service: product of the California Digital Library. Subscription-based service. Offers support. Mildly customizable.
WASsup? A little background on our Web Archiving Service implementation. 377 GB of data used (1 TB available). 231 crawled sites. Planning began in 2010; we began using it in June 2011. UA is institutional admin; I am the only project admin, and we really only have two projects (one that takes in 99% of what we crawl; the other is dark).
Program of African Studies Newsletter: only being made available online (no longer printed)
Web Archiving Task Force (WATF)
Questions remained
New library website launched August 2010 Old (top) and new (right) sites
Legacy student organization websites
AdreNUline Circus
Feinberg School of Medicine
Setting up a crawl in the WAS panel
PRINCIPLE: It is fair use to create topically based collections of websites and other material from the Internet and to make them available for scholarly use. LIMITATIONS: Captured material should be represented as it was captured, with appropriate information on mode of harvesting and date. To the extent reasonably possible, the legal proprietors of the sites in question should be identified according to the prevailing conventions of attribution. Libraries should provide copyright owners with a simple tool for registering objections to making items from such a collection available online, and respond to such objections promptly.
Social media Terms of Service. Twitter: API available for just about all of it. Does not expressly prohibit crawlers (although robots.txt does). Facebook: "You will not collect users' content or information, or otherwise access Facebook, using automated means (such as harvesting bots, robots, spiders, or scrapers) without our prior permission." (API available).
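Since robots.txt is what a site actually asks crawlers to avoid, it is worth inspecting before any crawl. A minimal sketch, using a sample robots.txt written locally so the steps run offline; against a live site you would fetch the real file instead (e.g., with curl).

```shell
# Sketch: see what a site asks crawlers not to touch.
# Live version would be:  curl -s https://twitter.com/robots.txt
# Here a sample file stands in for the fetched copy.
cat > /tmp/robots.txt <<'EOF'
User-agent: *
Disallow: /search
Disallow: /i/
EOF

# List every path the site disallows for crawlers.
grep '^Disallow:' /tmp/robots.txt
```

Any URL path matching a Disallow rule is one the site's operators have asked automated agents to skip, which is exactly the tension the Twitter/Facebook terms above describe.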
SOCIAL MEDIA archiving and the story of the Twitter archive at the Library of Congress (with images of what social media may look like).
The Library of Congress Agreement with Twitter. The Library's agreement with Twitter, announced April 14, 2010, provided that: Twitter would donate a collection consisting of all public tweets from the Twitter service from its inception to the date of the agreement, an archive of 21 billion tweets that occurred between 2006 and 2010. Any additional materials Twitter provides to the Library would be governed by the terms of the agreement unless both parties agree to different terms in advance of receiving such additional materials. The Library could make available any portion of the collection six months after it was originally posted on Twitter to bona fide researchers. A researcher must sign a notification prohibiting commercial use and redistribution of the collection. The Library cannot provide a substantial portion of the collection on its web site in a form that can be easily downloaded.
Transfer of Data to the Library of Congress. In December 2010, Twitter named a Colorado-based company, Gnip, as the delivery agent for moving data to the Library. Shortly thereafter, the Library and Gnip began to agree on specifications and processes for the transfer of files (current tweets) on an ongoing basis. In February 2011, transfer of current tweets was initiated, beginning with tweets from December 2010. On February 28, 2012, the Library received the 2006-2010 archive through Gnip in three compressed files totaling 2.3 terabytes. When uncompressed, the files total 20 terabytes. The files contained approximately 21 billion tweets, each with more than 50 accompanying metadata fields, such as place and description. As of December 1, 2012, the Library has received more than 150 billion additional tweets and corresponding metadata, for a total, including the 2006-2010 archive, of approximately 170 billion tweets totaling 133.2 terabytes for two compressed copies.
Behind the scenes: how to get your data out!
Perez, Sarah. This is What a Tweet Looks Like. 2009.
Twitter archive download. Download Twitter
Facebook archive download. Download Facebook information
Facebook archive download
Download Google data. Download and migrate from Google products, blogs, Instagram, etc.
Download Google data. What's included?
Save Page As: using the web browser's built-in Save As
Technology and tools: how do we do it?
Tools "the Library of Congress is documenting ALL of Twitter. I work with SF/F authors and encourage them to either do periodic screencaps, printing to PDF, or imaging of some sort of their websites and blogs using what capture tools I can find. I also encourage them to submit their URLs to The Internet Archive in the hopes of additional snapshots." - Lynne Thomas, Northern Illinois University. The Wayback Machine works for many websites, but not for social media.
Tools
Facebook robots.txt
Tools All-inclusive tools commonly used by academic institutions: WAS and Archive-It. WAS works well for simple websites, but not social media. Archive-It might work (Umbra). Neither allows for much customization of crawl options. What else is there?
Tools Command-line interfaces (wget, curl, and others) are more customizable: (STEP 1) mkdir /websitedl/ && cd /websitedl/ (STEP 2) wget --limit-rate=200k --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U mozilla http://www.archivists.org/
Tools Facebook and Twitter both have APIs (use curl). Facebook's Graph API is named after the idea of a social graph, a representation of the information on Facebook, composed of: nodes (basically "things" such as a User, a Photo, a Page, a Comment); edges (the connections between those "things", such as a Page's Photos, or a Photo's Comments); fields (info about those "things", such as the birthday of a User, or the name of a Page).
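A hedged sketch of what pulling a Graph API node with curl might look like. The curl call is shown commented out because it requires a real access token; PAGE_ID, the token, and the sample JSON fields below are all placeholders standing in for a live response.

```shell
# Hypothetical Graph API call (needs a valid token from Facebook):
#   curl -s "https://graph.facebook.com/PAGE_ID?access_token=TOKEN" -o page.json
# A node comes back as JSON carrying its fields. A sample response might be:
cat > /tmp/page.json <<'EOF'
{"id": "123", "name": "Example Page", "category": "Library"}
EOF

# Pull one field ("name") out of the saved node.
python3 -c "import json; print(json.load(open('/tmp/page.json'))['name'])"
```

Edges work the same way: appending a connection name to the node URL (for example a Page's photos) returns another JSON document listing the connected nodes.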
Tools HTTrack (Windows; allows for similar commands as wget). SiteCrawler (Mac). Social Feed Manager (developed at George Washington University; a Django application for managing multiple feeds of social media data)
Experimental tools WARCreate, a Chrome plugin. WAIL (Web Archiving Integration Layer). Archive Facebook (Firefox plugin)
Tools: subscription services. Using subscription services can be much easier but also much more expensive!
So there are many challenges. Internet Archive. Challenges of Collecting and Preserving the Social Web. 2013. http://blogs.loc.gov/digitalpreservation/files/2014/01/NDSA_CWG_120413_Carpenter.pdf
What methods will you use for collecting your data? How often will you collect it? Where will you put it? How will you provide access to it?
Sustaining the collection Finding aids for web archives (use SAA presentation and UT-San Antonio examples)
Sustaining the collection BagIt specification. Using checksums. Creating durable digital objects.
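The BagIt idea can be sketched in a few shell commands: a bag is a base directory holding a bagit.txt declaration, a data/ payload directory, and a checksum manifest that later verification runs against. The directory name, payload filename, and BagIt version shown are illustrative, not requirements of any particular institution's workflow.

```shell
# Minimal BagIt-style bag (illustrative paths and filenames).
BAG=/tmp/mybag
mkdir -p "$BAG/data"
printf 'crawled content\n' > "$BAG/data/site.warc"

# Bag declaration: version and tag-file encoding, per the BagIt spec.
printf 'BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n' > "$BAG/bagit.txt"

# Manifest: one checksum line per payload file, paths relative to the bag.
( cd "$BAG" && sha256sum data/site.warc > manifest-sha256.txt )

# Verify the payload against the manifest; "OK" means the object is intact.
( cd "$BAG" && sha256sum -c manifest-sha256.txt )
```

Re-running the verification step on a schedule is what turns the bag from a folder layout into a durable digital object: any bit-level change in the payload makes the manifest check fail loudly.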
Sustaining the collection How would you talk about the costs of a web archiving or social media archiving project?
Thank you! Benn Joseph b- joseph@northwestern.edu
How does the federal government collect social media? Methods to capture social media records include: using web crawling or other software to create local versions of sites; using web capture tools to capture social media; using platform-specific application programming interfaces (APIs) to pull content; using RSS feeds, aggregators, or manual methods to capture content; and using tools built into some social media platforms to export content.
LC NDSA also gives personal archiving guidance for websites, blogs, and social media. Identify where you have websites. Decide which information has long-term value. Export the selected information. Organize the information. Make copies and manage them in different places.