Finding Twitter handles with Scrapy

As I’ve built collections using Social Feed Manager (SFM), the most time-consuming aspect has always been collecting lists of Twitter handles. For example, collecting the Twitter handles of every member of Congress required manually searching each member’s website for a Twitter handle and writing it down.

Now, however, I’ve built a tool using Scrapy to collect Twitter handles from a website. You can read more about Scrapy in its documentation.

Scrapy is a Python toolkit for building web scrapers. Here I’ll give a brief summary of the Scrapy methods I used and a quick walkthrough of how to use the tool. I’ve written this presuming minimal knowledge of coding, basic familiarity with the shell, and that you already have Python installed. You might also want to use virtualenv, since you’ll be installing Python libraries.

Installing Scrapy and Getting Started

Begin by navigating in your shell to the directory you want to build the tool in. Then, install Scrapy by entering: pip install scrapy

Once Scrapy is installed, you can begin by starting a project. Navigate in your shell to the directory you want to write the code in, then enter scrapy startproject example, where example is the name of the project (it can be anything you want). This will create an entire project directory including everything you need for this scraper.

Alternatively, you can download Handler, my spider, from GitHub, which comes with the same directories.

Notice that the project directory contains several files and directories, including a directory with the same name as the project. Inside that inner directory are the directory (spiders) and the file (items.py) that you will need.
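For reference, here is roughly the layout that scrapy startproject example generates (file names can vary a little between Scrapy versions):

    example/
        scrapy.cfg        # project configuration file
        example/          # the project’s Python module
            __init__.py
            items.py      # item definitions (used below)
            pipelines.py
            settings.py
            spiders/      # your spider code goes here
                __init__.py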

Scrapy Basics

Scrapy works like other web scrapers: it reads web pages and returns specific elements from the pages’ HTML using XPath. Scrapy also supports CSS selectors, so if you are more familiar with selecting elements that way, you may want to try it instead. I used XPath because I found it easier to formulate the XPath expressions. (You can see the HTML of any web page in Chrome by right-clicking and selecting Inspect.)

To get a feel for this basic functionality, Scrapy includes a shell which allows you to experiment with commands. Once Scrapy is installed, try entering the command: scrapy shell wikipedia.org

This will initiate a shell after fetching the Wikipedia front page. You can try this with any other web page. The page is stored as response, which can then be broken down with other commands.

Let’s extract all of the links in the response. First, enter response.xpath('//a'). //a is the XPath for the anchor tags (<a>) on the page. Notice that most of the results have href attributes (href= followed by a URL), which are links. To get just the href attributes of the anchor tags, enter: response.xpath('//a/@href'). For an even cleaner response, add .extract() to return just the URLs. You can also use .extract_first() to get only the first match.
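To make this concrete, here is roughly what that shell session looks like (output trimmed; the exact links will depend on the page you fetched):

    >>> response.xpath('//a')
    [<Selector xpath='//a' data='<a href="//en.wikipedia.org/" ...'>, ...]
    >>> response.xpath('//a/@href').extract()
    ['//en.wikipedia.org/', '//es.wikipedia.org/', ...]
    >>> response.xpath('//a/@href').extract_first()
    '//en.wikipedia.org/'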

This covers the basics of using Scrapy, but we want a tool that not only scrapes data from a single site, but that can follow those links throughout the website and scrape from other pages. To do this, we’ll need to build a spider.

Building the Handler spider

I will go through the code of Handler as a way of showing how a spider works in general, with a caveat: this is mostly finished code. There are still some bugs that need to be worked out (and I would appreciate any feedback). I copied out the code here for illustrative purposes, so that you can understand how each piece works. Getting them all to work together in the exact way I want has been a little more challenging. While Handler is returning all the data I want, it’s returning it in a big mess, so cleaning it up is a future endeavour. You can find the most updated version of the code on GitHub.

The first file that you’ll need to pay attention to is items.py, which is in the third-level handler directory.

Our spider will use seven different item fields, each holding a list, which you can see in the code for items.py below.

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class HandlerItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        link = scrapy.Field() # urls from the start page
        link2 = scrapy.Field() # urls from secondary pages
        domainlink = scrapy.Field() # domain links from the first page, for scraping
        twitterlink = scrapy.Field() # twitter links from the first page
        twitterlink2 = scrapy.Field() # twitter links from the secondary pages
        twitterlinku = scrapy.Field() # cumulative list of twitter links
        handles = scrapy.Field() # twitter links truncated into handles
        pass

Everything but the field names and their definitions is already populated by Scrapy, since items.py is created by startproject.

Once the items file is set up, build your spider in the spiders directory. There, create a Python file (like handler.py). After the code, I’ll explain a little of how it works, in addition to the code comments:

    # -*- coding: utf-8 -*-
    import scrapy
    from handler.items import HandlerItem # from items.py
    
    domain = 'usnpl.com' # domain where this spider will operate
    start = 'http://usnpl.com' # page where this spider will start
    
    class HouseSpider(scrapy.Spider):
        name = 'handler'
        allowed_domains = [domain]
        start_urls = [start]
    
        def parse(self, response):
            item = HandlerItem() # introduces items into the spider
            item['link'] = [] # creates empty list for each item
            item['link2'] = []
            item['domainlink'] = []
            item['twitterlink'] = []
            item['twitterlink2'] = []
            item['twitterlinku'] = []
            item['handles'] = []
            # extracts all urls from first page
            item['link'] = response.xpath('//a/@href').extract()
            # filters all urls for twitter links and domain links
            item['twitterlink'] = [w for w in item['link'] if "twitter.com" in w]
            item['domainlink'] = [w for w in item['link'] if domain in w]
            # adds collected twitter links to the cumulative list
            for url in item['twitterlink']:
                item['twitterlinku'].append(url)
            # builds second function of spider # see parse2
            for url in item['domainlink']:
                request = scrapy.Request(url, callback=self.parse2)
                request.meta['item'] = item # passes items through to the second function
                yield request
            # truncates twitter links in the cumulative list to handles
            for link in item['twitterlinku']:
                handle = link.split(".com/")[1]
                if "/" in handle:
                    handle = handle.split("/")[0]
                item['handles'].append(handle)
    
            yield item
            print (item['handles'])
    
        # scrapes all domain links from the first page for twitter handles
        def parse2(self, response):
            item = response.meta['item'] # calls items from the first function
            item['link2'] = response.xpath('//a/@href').extract() # scrapes all urls
            # filters urls for twitter links
            item['twitterlink2'] = [w for w in item['link2'] if "twitter.com" in w]
            # adds collected twitter links to the cumulative list
            for url in item['twitterlink2']:
                item['twitterlinku'].append(url)
                print (url)
            yield item # yields the updated item (a spider can't yield a bare list)

I used the domain and start variables as a way of making this tool generalizable. You should be able to input any domain and starting URL there.

Spiders are classes with a few set attributes. The name, allowed_domains, and start_urls are all changeable: name is what you’ll use when running the spider, allowed_domains keeps the crawl from leaving the site, and start_urls is where the crawl begins.

The first parse() function bears the brunt of the task. First, the item’s fields are initialized as empty lists, so that they can be used in the function.

Then, using the response.xpath() method from before, we collect every link on the page.

We then filter those links into twitterlink (links that include “twitter.com,” which can be truncated into handles) and domainlink (links within the website itself, which we use to crawl the website for further handles).

Then, as a way of keeping a running list throughout the spider, we add twitterlink to twitterlinku, a cumulative list of Twitter links.

At this point, we change gears from scraping to crawling. Since in many cases scraping an entire website will take too much time and memory, we’ll only do this to a secondary level: only the first page and any page that the first page links to will be scraped. This is useful for most instances where there is a list of sites that include Twitter links.

The scrapy.Request() method allows us to call parse2(). However, before we yield each request, we use .meta to pass the item along, so that the same lists are maintained throughout the process.

The parse2() method works as a truncated version of parse(). After retrieving the item through .meta, we scrape all links from the new page. Here we only filter for Twitter links, since we won’t do a second layer of scraping on links within the domain. We then append those links to the cumulative twitterlinku list. parse2() runs once for every domain link that was collected on the first page.

Once a cumulative list of Twitter links has been collected, we want to turn those links into Twitter handles, so that they can be added to Social Feed Manager as seeds in a Twitter user timeline collection.

Finally, we keep everything that comes after the .com/ in each link, and then drop anything after a further slash. In a Twitter link, what remains can only be a handle or certain other top-level terms like “search.” Future versions of Handler should be able to filter those out as well.
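As a quick illustration of that truncation logic, here is what happens to a single (made-up) Twitter link:

    # illustrative only: a hypothetical Twitter link collected by the spider
    link = "https://twitter.com/RepExample/status/12345"
    handle = link.split(".com/")[1]    # "RepExample/status/12345"
    if "/" in handle:
        handle = handle.split("/")[0]  # "RepExample"
    print(handle)                      # RepExample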

Executing your Spider

Once the spider is built, executing a crawl is fairly easy. Navigate to the top-level directory of the project (in this case, the outermost handler directory). Execute the command scrapy crawl handler -o filename.csv, where filename is whatever you want to call your output. You can also export your data as JSON or a few other file types.

You’ll now have a list of handles for use with Social Feed Manager.

A major caveat: The completeness of this scrape is limited by the depth of the scrape, the structure of the website you are using, and the reliability of the web administrators, who may not be consistent in how they post Twitter handles. For example, for the House of Representatives, not every member of Congress included a Twitter link on the front page (although most did). Thus, when you are building a collection, keep in mind that the list you scraped may not be a complete list of every Twitter handle that you need – instead, it’s a good start. You’ll need to judge how reliable each scrape is individually.