
This afternoon, Craig, Rebecca, and I sat down for a webinar from ArchiveIt about archiving social media sites. The advanced training session covered the reasons for archiving social media (“a tweet is a record”) and then explained how to add specific seed URLs to one’s ArchiveIt web archive in order to capture the content created on these sites.

Some takeaways from the webinar include:

  • Always run a test crawl to see how many documents, and how much storage space, the crawl will capture
  • Review test crawl results when adding new seed URLs
  • Be specific with social media site URLs

For Twitter, our instructor noted that we should always remove the hashtag (#) from Twitter URLs and that searches cannot be saved easily. She also mentioned that one should always turn off JavaScript in the browser before copying a Twitter URL to add as a seed; otherwise, the hashtag that prevents proper crawling is automatically included in the URL.
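The URL cleanup described for Twitter could also be done with a small script before pasting a seed into ArchiveIt. The following is only a sketch: it assumes the problematic hash is the "#!" segment that Twitter placed in profile URLs (for example, twitter.com/#!/username), and the function name `clean_twitter_seed` and the sample account are hypothetical, not part of ArchiveIt.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_twitter_seed(url: str) -> str:
    """Return a Twitter URL with the hash portion removed.

    Assumption: a fragment starting with "!" is a hashbang whose
    remainder is the real page path (e.g. "#!/username").
    Any other fragment is simply dropped.
    """
    parts = urlsplit(url)
    path = parts.path
    if parts.fragment.startswith("!"):
        # Move the path hidden in the fragment into the real path.
        path = "/" + parts.fragment[1:].lstrip("/")
    # Rebuild the URL with an empty fragment.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


if __name__ == "__main__":
    # Hypothetical account name, used only for illustration.
    print(clean_twitter_seed("http://twitter.com/#!/example_user"))
```

A URL that has no hash at all passes through unchanged, so the function is safe to run over a whole list of candidate seeds.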

For Facebook, we were informed that (not surprisingly) ArchiveIt can crawl only public pages and cannot crawl behind a login page. By following specific directions in ArchiveIt, we can ignore robots.txt, which tends to block social media sites from being crawled. The instructors recommend ignoring robots.txt for both facebook.com and fbcdn.net, and also setting a document limit of 2000 for each Facebook seed. Again, they recommend turning off JavaScript to allow for proper crawling of the information on the page.
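When planning several Facebook seeds at once, it may help to write the recommended settings down in one place before entering them in ArchiveIt. The sketch below is just a local planning aid, not the ArchiveIt interface or API: the `SeedPlan` class, the `facebook_seed_plan` function, and the sample page URL are all hypothetical, and only the two hosts and the 2000-document limit come from the webinar.

```python
from dataclasses import dataclass
from typing import Tuple

# Values recommended in the webinar for Facebook seeds.
FACEBOOK_ROBOTS_IGNORE = ("facebook.com", "fbcdn.net")
FACEBOOK_DOCUMENT_LIMIT = 2000


@dataclass
class SeedPlan:
    """Hypothetical record of the settings to enter by hand for one seed."""
    url: str
    document_limit: int
    ignore_robots_hosts: Tuple[str, ...] = ()


def facebook_seed_plan(url: str) -> SeedPlan:
    """Build a plan for a public Facebook page using the webinar's values."""
    return SeedPlan(
        url=url,
        document_limit=FACEBOOK_DOCUMENT_LIMIT,
        ignore_robots_hosts=FACEBOOK_ROBOTS_IGNORE,
    )


if __name__ == "__main__":
    # Hypothetical public page, used only for illustration.
    print(facebook_seed_plan("https://www.facebook.com/examplepage"))
```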

Additional information about how to set scope and document limits when adding social media seed URLs to ArchiveIt can be found on the “Archiving Social Networking Sites with ArchiveIt” page on the ArchiveIt wiki. We’ll be organizing and adding new social media seeds to the ZSR ArchiveIt account!