Running a Scrapy spider programmatically
I want to share something I have been working on for the past few months: running scrapers with the Scrapy framework. Scrapy has existed for many years, but it is still relevant and useful for me and my team. We got hooked on it and started reading the docs daily to get our setup right. There are two ways of running a Scrapy spider: from the command line, or from a Python program.
Today, I am going to illustrate how to run a spider from a program by using Scrapy's Core API. If you are not familiar with how web scraping works and would like to use Scrapy to get started, then you should definitely look into this tutorial.
Before we start, here is what you need to know:
- The Scrapy Spider: It is a Python class in the Scrapy framework that is responsible for fetching URLs and parsing the information in the page response.
- Your Custom Spider: It extends the Scrapy spider class. We implement the method parse to be able to parse the page response. In the example below, DmozSpider is the custom spider.
- The Scrapy Item: It is an object that will act as a dictionary to store all the information you want to parse.
- The Scrapy Selector: To select elements on the page with an XPath selector or a CSS selector. In older versions of Scrapy you had to import the Selector class, but now you can use the selectors on the response object directly.
I am going to use the example from the Scrapy tutorial to make it easy to understand.
This is what the spider file DmozSpider.py looks like:
To run this spider from a standalone script using the Scrapy core API:
You can also add a pipeline that inserts each item into your database by enabling it through the ITEM_PIPELINES setting in the Scrapy settings. I will illustrate that in my next blog post, and you can see how to run two spiders in parallel here.
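As a rough preview of the pipeline idea, here is a minimal sketch that writes items to SQLite. Everything specific in it is an assumption: the class name SQLitePipeline, the items.db file, the table layout, the dotted path myproject.pipelines.SQLitePipeline, and the priority 300 are placeholders, and the code expects title and link to be plain strings.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline that stores each item's title and link."""

    def __init__(self, db_path="items.db"):
        # The default lets Scrapy build the pipeline without arguments
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (title TEXT, link TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields
        self.conn.execute(
            "INSERT INTO items (title, link) VALUES (?, ?)",
            (item.get("title"), item.get("link")),
        )
        self.conn.commit()
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.conn.close()


# Settings fragment that enables the pipeline; lower numbers run earlier
PIPELINE_SETTINGS = {
    "ITEM_PIPELINES": {"myproject.pipelines.SQLitePipeline": 300},
}
```

The PIPELINE_SETTINGS dictionary could be merged into the settings passed to the crawler, or the ITEM_PIPELINES entry could live in the project's settings.py.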