Web Archives - Notes | GW SCRC

Helpful Resources

Known Platform Issues
- A selection of platforms that the Archive-It team monitors for changes in capture and replay.
Scoping Recommendations for Specific Sites
- Recommendations most relevant to the GW Web Archives program include:
  - Archiving Vimeo Videos
  - Archiving Wix Sites
  - Archiving WordPress and Squarespace sites
  - Archiving YouTube Videos
  - Archiving sites protected by Cloudflare
  - Archiving Tableau
Archive-It blog post on the use of youtube-dl in the Archive-It stack

Brozzler crawls cannot be scheduled. This is a problem because a growing number of our regularly crawled sites require Brozzler.
Expanding Crawl to Accept Vimeo Videos
- Add the following seed scope rules:
  - Ignore Robots.txt
  - Expand Scope to include URL if it matches the SURT: http://(com,vimeocdn
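The SURT rule above can be read as a prefix match on the reversed hostname. The following is a minimal sketch of that idea, assuming a simplified SURT form (lowercased host, labels reversed and comma-separated, scheme normalized to http). The function names `to_surt` and `in_scope` and the sample URLs are hypothetical, and this is not the canonicalization code the crawler itself uses.

```python
from urllib.parse import urlsplit


def to_surt(url: str) -> str:
    """Return a simplified SURT form of a URL (illustrative only)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Reverse the host labels: i.vimeocdn.com -> com,vimeocdn,i
    reversed_host = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    # Assumption: the scheme is normalized to http for matching purposes.
    return f"http://({reversed_host},){path}"


def in_scope(url: str, surt_prefix: str) -> bool:
    """True if the URL's simplified SURT starts with the given prefix."""
    return to_surt(url).startswith(surt_prefix)


# The prefix as written in the scope rule above.
VIMEO_CDN_PREFIX = "http://(com,vimeocdn"

# Hypothetical sample URLs, not taken from a real crawl report.
print(in_scope("https://i.vimeocdn.com/video/123.jpg", VIMEO_CDN_PREFIX))
print(in_scope("https://vimeo.com/12345", VIMEO_CDN_PREFIX))
```

In this sketch any subdomain of vimeocdn.com falls under the prefix, while vimeo.com itself does not, which is why the rule is added as a scope expansion on top of the seed.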

President Granberg Inauguration (no longer active)
- Embedded media not compliant with youtube-dl
GW Law Course Catalog
- https://courses.law.gwu.edu/
  - This website runs on the Blazor web framework.
  - Known issue with replay of crawls. Correspondence with Archive-It indicates that the problem lies with replay in the Wayback Machine, not with the crawls themselves.
    - “The issue in this case boils down to the way that the original site appends unique ID strings to each request for the course information shown. The ID generated by the archived site does not match the original from crawl time, so we end up with pieces missing…I think that it will be necessary to teach our Wayback replay software to ignore these unique IDs and just load the file with the closest matching URL.”
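To make the replay problem concrete, here is a rough sketch of what "ignore the unique IDs and load the closest matching URL" could look like. It is only an illustration of the idea in the quote, not how Wayback replay is implemented. The query parameter name `id`, the example.org URLs, and the function names are all hypothetical; the notes do not record how the site actually attaches its IDs.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: the per-request ID travels in a query parameter named "id".
VOLATILE_PARAMS = {"id"}


def normalize(url: str) -> str:
    """Drop the volatile parameters so two requests can be compared."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in VOLATILE_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def find_capture(requested_url, captured_urls):
    """Return the first captured URL that matches once IDs are ignored."""
    target = normalize(requested_url)
    for captured in captured_urls:
        if normalize(captured) == target:
            return captured
    return None


# Hypothetical capture index and replay-time request.
captured = ["https://example.org/data?course=6202&id=abc123"]
print(find_capture("https://example.org/data?course=6202&id=zzz999", captured))
```

An exact-match lookup would miss here because the ID generated at replay time differs from the one recorded at crawl time; comparing the normalized forms finds the capture.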
NEA Websites
- Resource Page
  - Need time to better scope this crawl. Issues with pagination on the resource library pages.
Foggy Bottom Association
- Use Brozzler!
- Wix site. Seed scoped w/ Archive-It’s recommendations for Wix: https://support.archive-it.org/hc/en-us/articles/208824546-Archiving-Wix-sites
- False 404 crawl trap results; unresolved.
Beltwaypoetry.com
- Crawl trap w/ social media and contact form embeds (?)
  - Scope rules:
    - block URL if it contains “?share=”
    - block URL if it contains “/contact-form-7/”
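The two block rules above are plain substring tests. A small sketch like the following can be used to check a list of queued URLs against them before committing the rules to a crawl. The function name `is_blocked` and the sample URLs are hypothetical; only the two substrings come from the scope rules.

```python
# Substrings taken from the scope rules above.
BLOCK_SUBSTRINGS = ("?share=", "/contact-form-7/")


def is_blocked(url: str) -> bool:
    """True if the URL contains any of the blocked substrings."""
    return any(fragment in url for fragment in BLOCK_SUBSTRINGS)


# Hypothetical URLs of the kind a crawl report might list.
sample_urls = [
    "https://www.beltwaypoetry.com/some-poem/?share=facebook",
    "https://www.beltwaypoetry.com/wp-json/contact-form-7/v1/feedback",
    "https://www.beltwaypoetry.com/some-poem/",
]

for url in sample_urls:
    print("blocked" if is_blocked(url) else "kept", url)
```

Running a host report export through a check like this gives a quick sense of how many queued URLs the rules would remove and whether any wanted pages are caught by mistake.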