Helpful Resources

Helpful Resources for Web Admins/Website Owners

General Notes:

  • Brozzler crawls cannot be scheduled. This is a problem because a growing number of our regularly crawled sites require Brozzler.
  • Expanding Crawl to Accept Vimeo Videos
    • Add the following seed scope rules:
      • Ignore Robots.txt
      • Expand Scope to include URL if it matches the SURT: http://(com,vimeocdn
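To illustrate how a SURT prefix rule like the one above selects URLs, here is a minimal Python sketch. It is a simplified approximation, not Archive-It's or Heritrix's actual implementation: the helper names `to_surt` and `in_scope` are hypothetical, the example URLs are made up, and the sketch assumes every scheme is written as `http://` in SURT form and ignores ports and query strings.

```python
from urllib.parse import urlsplit


def to_surt(url: str) -> str:
    """Return a simplified SURT form of a URL: host labels reversed, comma-separated."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # "f.vimeocdn.com" becomes "com,vimeocdn,f"
    reversed_labels = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    # Assumption: the scheme is always written as http:// in the SURT form.
    return f"http://({reversed_labels},){path}"


def in_scope(url: str, surt_prefix: str = "http://(com,vimeocdn") -> bool:
    """True if the URL's simplified SURT form starts with the given prefix."""
    return to_surt(url).startswith(surt_prefix)


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    print(to_surt("https://f.vimeocdn.com/p/player.js"))
    print(in_scope("https://f.vimeocdn.com/p/player.js"))
    print(in_scope("https://vimeo.com/123456"))
```

Because the host labels are reversed, a single prefix covers the domain and all of its subdomains, which is why one rule is enough to pull in media served from any vimeocdn.com host.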

Seed/Crawl Notes

  • President Granberg Inauguration (no longer active)
    • Embedded media not compliant with youtube-dl
  • GW Law Course Catalog
    • https://courses.law.gwu.edu/
      • This website runs on the Blazor web framework.
      • Known issue with replay of crawls. Correspondence with Archive-It indicates that this is an issue with replay in the Wayback Machine, not with the crawls themselves.
        • “The issue in this case boils down to the way that the original site appends unique ID strings to each request for the course information shown. The ID generated by the archived site does not match the original from crawl time, so we end up with pieces missing…I think that it will be necessary to teach our Wayback replay software to ignore these unique IDs and just load the file with the closest matching URL.”
  • NEA Websites
    • Resource Page
      • Need time to better scope this crawl. Issues with pagination on the resource library pages.
  • Foggy Bottom Association
    • Use Brozzler!
    • Wix site. Seed scoped per Archive-It’s recommendations for Wix: https://support.archive-it.org/hc/en-us/articles/208824546-Archiving-Wix-sites
    • Crawl trap producing false 404 results; unresolved.
  • Beltwaypoetry.com
    • Crawl trap w/ social media and contact form embeds (?)
      • Scope rules:
        • block URL if it contains “?share=”
        • block URL if it contains “/contact-form-7/”
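
As a rough check of what the two block rules above would exclude, the following Python sketch applies the same "contains" test to a list of URLs. It only mimics plain substring matching and is not how Archive-It evaluates scope rules internally; the function name `is_blocked` and the sample URLs are hypothetical.

```python
# Substrings taken from the scope rules above.
BLOCK_SUBSTRINGS = ("?share=", "/contact-form-7/")


def is_blocked(url: str, patterns: tuple = BLOCK_SUBSTRINGS) -> bool:
    """True if the URL contains any of the blocked substrings."""
    return any(pattern in url for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    samples = [
        "https://www.beltwaypoetry.com/example-poem/?share=facebook",
        "https://www.beltwaypoetry.com/wp-json/contact-form-7/v1/example",
        "https://www.beltwaypoetry.com/about/",
    ]
    for url in samples:
        print("BLOCK" if is_blocked(url) else "KEEP ", url)
```

Running a sample of URLs from a crawl report through a check like this before saving the rules can help confirm that they catch the trap URLs without excluding ordinary pages.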