Advanced Archive-It Application Training: Archiving Social Networking and Social Media Sites
Agenda
  Overview of Social Networking/Media sites
  Why archive these sites?
  Typical Challenges
  Best Practices: Twitter, Facebook, YouTube, Flickr
  Looking toward the future
  Questions/Discussion
Why Archive These Sites?
  State Agencies: An increasing number have decided that the content on these sites is a record and needs to be archived. "A tweet is a record"
  University libraries: Used to share information with students and alumni; these sites contain important records about a school's culture, student body and campus events.
  Non-Government, Non-Profit Organizations: Used to record online presence and impact
  Researchers: Used to preserve valuable social reactions and change on topics of interest
Archive-It and Social Media Overview
  Capturing social media sites is becoming more necessary for Archive-It partners
  Still focused on: Flickr, Facebook, Twitter, and YouTube
  On our radar: Vimeo, LinkedIn, Others?
  Join the Archive-It social media listserv to hear breaking news, including fixes and adjustments within Archive-It
Social Media Crawling Notes
  Content behind log-ins cannot currently be archived
    Feature in 4.8 Release, April 2013
  Some parts of sites are not archive-friendly (e.g., complicated JavaScript)
  These sites tend to change both their technical structure and policy quickly and often.
Scoping Social Media Sites
  Because of the way many of these sites are structured, scoping crawls correctly is very important when archiving them.
  Each site has its own unique structure
  Not scoping correctly can result in crawling much more than you intend, or in not capturing the content you want to archive.
Scoping - Overall Approaches
  Trial and Error: Try to harvest with a variety of settings and a variety of seeds
  Quality Review: Review archived content thoroughly
  Collaborate: Compare approaches and results with other Archive-It users
  Document detailed instructions, lessons learned, and best practices for other partners
Best Practices
  Best practices for various social networking and social media sites are documented on the Archive-It Help Wiki:
    https://webarchive.jira.com/wiki/display/arih/Archiving+Social+Networking+Sites+with+Archive-It
Best Practices
  Be specific with your seed URLs: list only the page you would like to archive as a seed.
  Do NOT use the larger site as a seed (for example, do NOT use www.facebook.com or www.twitter.com as seeds. DO use: http://twitter.com/internetarchive/).
  Double check your seed: Do you need an ending slash /?
  Ignore robots.txt as needed: Some sites block content using robots.txt
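The seed checks above can be sketched as a small pre-flight script. This is only an illustration in Python: `seed_warnings` is a hypothetical helper, not an Archive-It feature, and the rules it applies (a bare host means "whole site", a missing trailing slash is worth a second look) are simplifying assumptions.

```python
from urllib.parse import urlsplit

def seed_warnings(seed):
    """Return a list of warnings for a candidate seed URL (hypothetical helper)."""
    warnings = []
    if "://" not in seed:
        # Assumption: treat a scheme-less seed as http:// so it can be parsed.
        warnings.append("no scheme given; assuming http://")
        seed = "http://" + seed
    parts = urlsplit(seed)
    if parts.path in ("", "/") and not parts.query:
        # e.g. www.facebook.com: the larger site, not a specific page
        warnings.append("seed is the whole site; list a specific page instead")
    elif not parts.query and not parts.path.endswith("/"):
        warnings.append("no ending slash; check whether this seed needs one")
    return warnings

for seed in ("http://twitter.com/internetarchive/", "www.facebook.com"):
    print(seed, seed_warnings(seed))
```

A seed that passes these checks can still be scoped incorrectly, so a test crawl remains the real check.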
Best Practices
  ALWAYS run a test crawl when first setting up these seeds to avoid using more of your document budget than expected.
  You may need to run more than one until you get it right.
Best Practices
  After your first crawl
    Review post-crawl reports (did you crawl too much?)
    Review archived content in Wayback
      Did you capture all the areas you expected?
      Are there any display issues?
Reviewing Scoping Rules
  To the web app!
Twitter Sample URLs
  Individual user feeds
    https://twitter.com/archiveitorg/
  Searches
    https://twitter.com/search?q=web%20archiving&src=typd
  Lists
    https://twitter.com/smithsonian/smithsonian/
  A specific tweet
    https://twitter.com/archiveitorg/status/294819565320413184
Twitter - Scoping
  Expand Scope (using SURTs) to capture dynamically loading content:
    Individual Twitter feed:
      +http://(com,twitter,)/i/profiles/show/BrowardCollege/
    Multiple Twitter feeds:
      +http://(com,twitter,)/i/profiles/show/
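The SURT rules above can be derived from an ordinary URL by reversing the host labels and keeping the path. A minimal sketch in Python follows; `to_surt_prefix` is a hypothetical helper, not part of Archive-It or Heritrix, and it skips the fuller canonicalization a crawler performs. Writing the scheme as http:// regardless of the input is an assumption made to match the rules shown on the slide.

```python
from urllib.parse import urlsplit

def to_surt_prefix(url):
    """Simplified URL-to-SURT conversion (hypothetical helper, illustration only)."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    # twitter.com -> com,twitter, (host labels reversed, each followed by a comma)
    reversed_host = ",".join(reversed(labels)) + ","
    # Assumption: always emit http://, as in the slide's rules.
    return "http://(" + reversed_host + ")" + parts.path

print("+" + to_surt_prefix("https://twitter.com/i/profiles/show/BrowardCollege/"))
```

The leading + is not part of the SURT itself; it is printed here only to mirror how the rules are written above.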
Links in Tweets
  Can I archive a URL linked to using a URL shortener? Yes!
  Use an Expand Scope rule for http://t.co/ (all URLs posted on Twitter redirect through that domain)
  Note: only the one page that the shortened link points to will be archived (plus embedded content)
Twitter
  Examples of Archived Pages
Facebook Sample URLs
  Individual User Profiles - Timeline view
    http://www.facebook.com/tonyforsenate/
  Pages - Timeline view
    http://www.facebook.com/archiveit/
  Events
    http://www.facebook.com/events/265897963430841/
  Albums
    https://www.facebook.com/media/set/?set=a.13499334573.18616.6193904573&type=3
Facebook - Scoping
  Ignoring robots.txt:
    www.facebook.com
    fbcdn.net
    akamaihd.net
  Document limit on www.facebook.com (recommended 2000 for each seed)
    Note: you cannot limit a crawl to capture content from *just* one Facebook account
  Expand Scope:
    SURT +http://(net,fbcdn,
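A SURT prefix left open after the domain, like the Expand Scope rule above, is meant to match every host under that domain. The Python sketch below illustrates the idea with a plain string-prefix test. `url_to_surt` and `in_scope` are hypothetical helpers, the conversion is simplified compared with what a crawler does, and the host name static.fbcdn.net is invented for the example.

```python
from urllib.parse import urlsplit

def url_to_surt(url):
    """Simplified SURT form of a URL (hypothetical helper, illustration only)."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    # Assumption: scheme is always written as http://, matching the slide's rule.
    return "http://(" + ",".join(reversed(labels)) + ",)" + (parts.path or "/")

def in_scope(url, surt_prefix):
    """True if the URL's SURT form begins with the given prefix."""
    return url_to_surt(url).startswith(surt_prefix)

rule = "http://(net,fbcdn,"  # the rule above, without the leading +
print(in_scope("https://static.fbcdn.net/photo.jpg", rule))   # host under fbcdn.net
print(in_scope("http://www.facebook.com/archiveit/", rule))   # different domain
```

Because the host labels are reversed, every subdomain of the domain shares the same leading text, which is why a single open-ended prefix can cover all of them.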
Facebook
  Currently we can capture the initial content on a Facebook timeline. However, the dynamically loading content can be difficult to capture because Facebook frequently changes the way that content is served.
  Our engineers are working on keeping up to date with these changes, and we are also investigating alternate methods for capturing Facebook pages.
Facebook
  Examples of Archived Pages
YouTube - Sample URLs
  Channel/User pages
    http://www.youtube.com/whitehouse
  Watch pages - individual videos
    http://www.youtube.com/watch?v=5lviuw8vj_e
  Uploaded Document RSS Feed
    http://gdata.youtube.com/feeds/api/users/whitehouse/uploads/
  Embedded YouTube videos on other sites:
    http://www.whitehouse.gov/photos-and-video/video/2013/01/29/president-obama-speaks-comprehensive-immigration-reform
YouTube - Scoping
  For all YouTube content, ignore robots.txt for:
    youtube.com
    ytimg.com
  For Watch pages - individual videos
    Use One Page Only Seed Type
  For Channel/User pages
    Crawl with a document limit or using RSS/News Feed seed type
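The scoping guidance above amounts to a simple decision on the shape of the seed URL. The Python sketch below is one way to express it. `suggest_youtube_settings` is a hypothetical helper, not an Archive-It function, and treating every YouTube URL that is not a watch page as a Channel/User page is a simplifying assumption.

```python
from urllib.parse import urlsplit, parse_qs

def suggest_youtube_settings(seed):
    """Map a YouTube seed URL to the guidance on this slide (hypothetical helper)."""
    parts = urlsplit(seed)
    host = (parts.hostname or "").lower()
    if host != "youtube.com" and not host.endswith(".youtube.com"):
        return None  # not a YouTube URL; this slide's guidance does not apply
    if parts.path == "/watch" and "v" in parse_qs(parts.query):
        # Watch page for an individual video
        return "One Page Only"
    # Assumption: anything else on YouTube is handled like a Channel/User page
    return "document limit or RSS/News Feed"

print(suggest_youtube_settings("http://www.youtube.com/watch?v=5lviuw8vj_e"))
print(suggest_youtube_settings("http://www.youtube.com/whitehouse"))
```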
YouTube
  Viewing YouTube videos:
    Videos on Watch pages and most embedded YouTube videos will play back normally in Wayback
    For Channel/User pages or other pages where videos do not play back within the page, view videos from the video report or the public video page for that seed.
YouTube
  Examples of Archived Pages
Flickr
  What types of pages can be archived?
    Photo streams
      Ex: http://www.flickr.com/photos/whitehouse/
    Individual photos
      Ex: http://www.flickr.com/photos/whitehouse/8390033709/in/photostream
Flickr
  Examples of Archived Pages
Other Sites
  Can sites other than those already mentioned be archived?
    Yes! Many more sites can be archived.
    Please send us sites you are interested in archiving.
    Other sites partners have mentioned so far are Google+, LinkedIn, Vimeo, and SlideShare.
Moving Forward
  These best practices will change as the sites themselves make changes. Please be sure to check the Help Wiki page for updates
  We continue to focus on working with our partners to improve the capture and display of archived social networking sites
  The Archive-It team is exploring other capture mechanisms besides using a traditional crawler resource (Heritrix)
    Headless browsers
    Hybrid architecture
    API
    Partnering with third party software
  Enhance the display and search capabilities
Thank you!
  Questions? Discussion?
  Please take our quick survey: http://www.surveymonkey.com/s/gz8cwc8