As I’ve built collections using Social Feed Manager (SFM), the most time-consuming aspect has always been collecting lists of Twitter handles. For example, collecting the Twitter handles of every member of Congress required manually searching each member’s website for a Twitter handle and writing it down.
Scrapy is a Python toolkit for building web scrapers. Here I’ll give a brief summary of the methods I used from Scrapy, and a quick walkthrough of how to use the tool. I’ve written this presuming a minimum knowledge of coding and the basics of how to use the shell, and already have Python installed. You might also use virtualenv since you’ll be installing Python libraries.
Installing Scrapy and Getting Started
Begin by navigating in your shell to the directory you want to build the tool in. Then, install Scrapy by entering:
pip install scrapy
Once Scrapy is installed, you can begin by starting a project. Navigate in shell to the directory you want to write the code in, then enter
scrapy startproject example where example is the name of the project, which can be anything you want. This will create an entire project directory including everything you need for this scraper.
Alternatively, you can download Handler, my spider, from GitHub, which comes with the same directories.
Notice that there are several different directories inside the project directory, including one with the same name. Inside the directory with the same name are the directory (spiders) and file (items.py) you will need.
Scrapy works, like other web scrapers, by reading web pages and returning specific elements from the pages’ HTML using XPath. Scrapy also supports CSS, so if you are more familiar with selecting elements using CSS selectors, you may want to try that instead. I, however, used XPath code because I found it easier to formulate the XPath requests. (You can see this code on any web page in Chrome by right clicking and selecting Inspect.)
To get a feel for this basic functionality, Scrapy includes a shell which allows you to experiment with commands. Once Scrapy is installed, try entering the command:
scrapy shell wikipedia.org
This will initiate a shell after scraping the Wikipedia front page. You could try this with any other web page. The web page will then be stored as response, which is then broken down with other commands.
Let’s extract all of the links in the response. First, enter
response.xpath('//a'). //a is the XPath for the anchor tags (<a>) on the page. Notice that most of the results have href attributes (href= followed by a URL), which are links. To get just href attributes of anchor tags, enter:
response.xpath('//a/@href'). To get an even cleaner response, adding .extract() will return just the URLs. You can also use .extract_first() to get only the first item provided.
This covers the basics of using Scrapy, but we want a tool that not only scrapes data from a single site, but that can follow those links throughout the website and scrape from other pages. To do this, we’ll need to build a spider.
Building the Handler spider
I will go through the code of Handler as a way of showing how a spider works in general, with a caveat: this is mostly finished code. There are still some bugs that need to be worked out (and I would appreciate any feedback). I copied out the code here for illustrative purposes, so that you can understand how each piece works. Getting them all to work together in the exact way I want has been a little more challenging. While Handler is returning all the data I want, it’s returning it in a big mess, so cleaning it up is a future endeavour. You can find the most updated version of the code on GitHub.
The first file that you’ll need to pay attention to is items.py, which is in the third-level handler directory.
Our spider will use seven different list items, which you can see in the code for items.py below.
# -*- coding: utf-8 -*- # Define here the models for your scraped items # # See documentation in: # http://doc.scrapy.org/en/latest/topics/items.html import scrapy class HandlerItem(scrapy.Item): # define the fields for your item here like: # name = scrapy.Field() link = scrapy.Field() # urls from the start page link2 = scrapy.Field() # urls from secondary pages domainlink = scrapy.Field() # domain links from the first page, for scraping twitterlink = scrapy.Field() # twitter links from the first page twitterlink2 = scrapy.Field() # twitter links from the secondary pages twitterlinku = scrapy.Field() # cumulative list of twitter links handles = scrapy.Field() # twitter links truncated into handles pass
All but the actual names with their definitions are already populated from the Scrapy interface, since items.py is created with startproject.
Once the items file is set up, your spider should be built in the spider directory. There, create a python file (like handler.py). After the code, I’ll do a little explanation of how it works, in addition to the code comments:
# -*- coding: utf-8 -*- import scrapy from handler.items import HandlerItem # from items.py domain = 'usnpl.com' # domain where this spider will operate start = 'http://usnpl.com' # page where this spider will start class HouseSpider(scrapy.Spider): name = 'handler' allowed_domains = [domain] start_urls = [start] def parse(self, response): item = HandlerItem() # introduces items into the spider item['link'] =  # creates empty list for each item item['link2'] =  item['domainlink'] =  item['twitterlink'] =  item['twitterlink2'] =  item['twitterlinku'] =  item['handles'] =  # extracts all urls from first page item['link'] = response.xpath('//a/@href').extract() # filters all urls for twitter links and domain links item['twitterlink'] = [w for w in item['link'] if "twitter.com" in w] item['domainlink'] = [w for w in item['link'] if domain in w] # adds collected twitter links to the cumulative list for url in item['twitterlink']: item['twitterlinku'].append(url) # builds second function of spider # see parse2 for url in item['domainlink']: request = scrapy.Request(url, callback=self.parse2) request.meta['item'] = item # passes items through to the second function yield request # truncates twitter links in the cumulative list to handles for link in item['twitterlinku']: handle = link.split(".com/") if "/" in handle: handle = handle.split("/") item['handles'].append(handle) yield item print (item['handles']) # scrapes all domain links from the first page for twitter handles def parse2(self, response): item = response.meta['item'] # calls items from the first function item['link2'] = response.xpath('//a/@href').extract() # scrapes all urls # filters urls for twitter links item['twitterlink2'] = [w for w in item['link2'] if "twitter.com" in w] # adds collected twitter links to the cumulative list for url in item['twitterlink2']: item['twitterlinku'].append(url) print (url) yield item['twitterlinku']
I used the domain and start variables as a way of making this tool generalizable. You should be able to input any domain and starting URL there.
Spiders are classes with set conditions. The name, allowed_domains, and start_urls are all changeable.
The first parse() function bears the brunt of the task. First, items are introduced as lists, so that they can be used in the function.
Then, using the response.xpath() method from before to collect links, we collect every link on the page.
We then filter those links into twitterlink, which collects links that include “twitter.com,” which can be truncated into handles, and domainlink, or links from that website, which we use to crawl the website for further handles.
Then, as a way of keeping a running list throughout the spider, we add twitterlink to twitterlinku, a cumulative list of Twitter links.
At this point, we change gears from scraping to crawling. Since in many cases scraping an entire website will take too much time and memory, we’ll only do this to a secondary level: only the first page and any page that the first page links to will be scraped. This is useful for most instances where there is a list of sites that include Twitter links.
The method scrapy.Request() allows us to call parse2(). However, before we yield this request, we use .meta to ensure that the lists of items are maintained as the same throughout the process.
The parse2() method works as a truncated version of parse(). After calling the items through .meta, we scrape all links from the new page. Here we only filter for Twitter links, since we won’t do a second layer of scraping on any links within the domain. We then append those links to the cumulative twitterlinku list. This method will loop over every domain link that was collected on the first page.
Once a cumulative list of Twitter links has been collected, we want to turn those links into Twitter handles, so that they can be added to Social Feed Manager as seeds in a Twitter user timeline collection.
Finally, we split off everything that comes after the .com/ in the link, and then anything that comes before a second slash. In Twitter links, this can only be a handle or certain other top-level terms like “search.” Future versions of Handler should be able to filter those out as well.
Executing your Spider
Once the spider is built, executing a crawl is fairly easy. Navigate to the top-level directory of the project (in this case, the highest handler). Execute the command
scrapy crawl handler -o filename.csv where filename is whatever you want to call your output. You can also use JSON or a few other file types for exporting your data.
You’ll now have a list of handles for use with Social Feed Manager.
A major caveat: The completeness of this scrape is limited by the depth of the scrape, the structure of the website you are using, and the reliability of the web administrators, who may not be consistent in how they post Twitter handles. For example, for the House of Representatives, not every member of Congress included a Twitter link on the front page (although most did). Thus, when you are building a collection, keep in mind that the list you scraped may not be a complete list of every Twitter handle that you need – instead, it’s a good start. You’ll need to judge how reliable each scrape is individually.