A Peek at 251,077,140 #election2016 tweets

As announced in this earlier post, we collected 280 million tweets with Social Feed Manager related to the 2016 U.S. presidential election and shared datasets of the tweet ids. This week, we loaded those tweets into TweetSets, a proof-of-concept tool that provides Twitter datasets for research and archiving. With TweetSets, users can filter/query existing datasets (such as the election datasets) to create custom datasets. Derivatives, such as a list of tweet ids, can then be generated for the custom dataset and downloaded.

A reminder: To conform with Twitter’s Developer Policies, non-GW researchers will only be able to download tweet ids, not the complete tweet. The complete tweet for a list of tweet ids can be retrieved using a tool like DocNow’s Hydrator.

One of the features of TweetSets is that it generates some basic statistics for a dataset to help with filtering. Here I will share the basic statistics for the election filter dataset, one of the datasets that are part of the election collection. The election filter dataset was collected between July 13, 2016 and November 10, 2016 from the Twitter filter stream tracking election2016, election, clinton, kaine, trump, pence and following @realDonaldTrump and @HillaryClinton.

Here’s a screenshot of all of the stats:

Election stats

Here’s a breakdown of tweets in the dataset by tweet type:

retweet	166,427,994	66%
original	48,430,955	19%
reply	18,771,725	7%
quote	17,446,466	7%

Here are the most prolific tweeters in the dataset:

@HydroElections	110,785
@paparcura	51,061
@TrumpVolume	49,539
@wft2016	47,566
@BLUETROMOS	45,573
@amrightnow	45,244
@StatesPoll	43,117
@MANTRAVADI123	42,938
@Davewellwisher	41,749
@Fb1Marissa	41,124

What? The most prolific tweeter has 561K tweets but only 16 followers?

@HydroElections

The most mentioned accounts are a bit less surprising (although @timkaine didn’t even crack the top 10):

@realDonaldTrump	23,705,518
@HillaryClinton	13,796,034
@wikileaks	2,677,908
@CNN	2,578,418
@FoxNews	1,752,105
@mike_pence	1,578,561
@YouTube	1,468,341
@mitchellvii	1,458,946
@KellyannePolls	1,322,557
@nytimes	1,239,012

It’s a landslide for Trump in the top hashtags:

#trump	8,543,057
#maga	4,383,141
#trump2016	1,927,508
#election2016	1,902,744
#trumptrain	1,901,758
#hillary	1,612,061
#trumppence16	1,560,618
#imwithher	1,535,608
#clinton	1,524,498
#debatenight	1,471,935

Here’s the top URLs contained in tweets:

http://dld.bz/eczmp	467,063
http://dld.bz/ec2cw	434,875
http://iwillvote.com	234,168
http://hillaryclinton.com/locate	144,496
http://trumpvoters.org	112,308
http://vote.gop	107,583
http://theclub.ml	101,048
http://hillaryclinton.com/makeaplan	99,541
http://vine.co/v/i1jqfxvophf	91,103
http://www.donaldjtrump.com	90,423

The top 2 are just link spam.

The vine (http://vine.co/v/i1jqfxvophf) is gone from the live web, but is captured by the Internet Archive. http://theclub.ml now requires authentication but in captures by the Internet Archive it redirects to http://trumprally.club. Ditto with the two http://hillaryclinton.com URLS: gone from the live web but preserved by the Internet Archive. These demonstrate how critical the Internet Archive is for studying this sort of data and suggest that a useful approach may be to use Twitter data to seed web archiving.

Suffice to say, this dataset begs further study. You’re welcome to give it a try by filtering it with TweetSets or downloading it in its entirety.

Note about TweetSets: The servers that it is running on are not provisioned for heavy use, so be gentle. Please let me know about any problems that you encounter, as well as any comments or suggestions.

Thanks to Tables Generator for reducing the pain creating these markdown tables.