Another Try at Harvesting the Twitter Streaming API to WARC files

In “Harvesting the Twitter Streaming API to WARC files”, I described an approach for recording the Twitter Streaming API in WARC files using record segmentation. The motivation for using record segmentation was that it allowed splitting up a single call to the API — a call that might have a very long duration and involve a large amount of data — into multiple WARC records spread across multiple WARC files.

We just abandoned that approach. Here’s why:

  • There is no support for record segmentation in the web archiving toolset, so we had to customize warcprox (for capture) and the warc Python library (for reading the WARCs). This conflicted with our goal of writing less code by borrowing existing tools from web archiving.
  • Playback was bothering me. Part of our technical approach for aligning social media archiving with web archiving is to load WARC files containing social media data into a wayback machine for the purpose of playback. However, a monster HTTP response seemed an ill fit and likely to require extensive customization of some wayback implementation.
  • Exports seemed potentially problematic as well. Exporting required reconstructing and reading through monster HTTP responses. This was particularly expensive for exports that were limited to the tweets within a time period.
  • It has become increasingly clear that data collected from the Twitter Streaming API MUST be considered a sample. Some of the reasons for this are rate limits in the Twitter Streaming API, inevitable network hiccups or similar operational ailments that will interrupt the stream, and the simple fact that the Twitter Streaming API is a “black box” whose exact operation is unknown (well, to us anyway). If the data collected must be considered a sample, then small interruptions in the harvest should be acceptable as long as they don’t introduce any sort of sampling bias. Researchers requiring a complete dataset will probably want to purchase it from a data reseller like Gnip.

Given this, we’re trying a new approach: Harvest from the Twitter Streaming API for 30 minutes at a time. At the end of the 30 minutes, close the stream and start a new one. Each 30-minute segment is recorded in a single WARC response record in a single WARC file. The interruption in collecting is only a handful of seconds.
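
To make the rotation concrete, here is a rough Python sketch of the loop. It is not our actual harvester: harvest_in_segments, open_stream, and clock are hypothetical names made up for illustration, and open_stream stands in for whatever opens a fresh connection to the Streaming API (in our setup that call would go through warcprox, which does the WARC writing). The sketch also buffers each segment in memory and only checks the deadline when a line arrives, so a very quiet stream would overrun the window; a real harvester would want a read timeout.

```python
import time
from typing import Callable, Iterable, Iterator, List


def harvest_in_segments(
    open_stream: Callable[[], Iterable[bytes]],  # hypothetical: opens a new API connection
    segment_seconds: float = 30 * 60,            # 30 minutes per segment
    max_segments: int = 1,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[List[bytes]]:
    """Yield one list of raw lines per segment, reconnecting between segments."""
    for _ in range(max_segments):
        deadline = clock() + segment_seconds
        segment: List[bytes] = []
        stream = open_stream()
        try:
            for line in stream:
                segment.append(line)
                # The deadline is only checked after a line has been read.
                if clock() >= deadline:
                    break
        finally:
            # Close the connection if the stream object supports it.
            close = getattr(stream, "close", None)
            if callable(close):
                close()
        yield segment
```

Each yielded segment corresponds to what would end up as one WARC response record. Passing in the clock and the stream opener is just a convenience so the loop can be exercised without a live connection.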

Twitter warns against connection churn: “Clients which break a connection and then reconnect frequently (to change query parameters, for example) run the risk of being rate limited.” However, we’re hoping that 30 minutes between reconnects is reasonable. We’re running tests now to verify.

The upside of this new approach is that each WARC response record is a more manageable size that should play well with existing web archiving tools and be more export friendly. Oh yeah – and I get to throw away a ton of code.