Running multiple Scrapy spiders programmatically
This is a continuation of my last post about how to run scrapers from a Python script. In this post I will be writing about how to manage 2 spiders, and you can scale the same script to run 30 or more spiders concurrently.
Right now, this is how I have organised my project.
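If you followed along with the previous post, your layout probably looks roughly like this. The project and spider names below are just placeholders; use whatever your own project is called.

```
my_scraper/
├── scrapy.cfg
├── run_spiders.py          # the standalone script we will write below
└── my_scraper/
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        ├── spider_one.py
        └── spider_two.py
```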
As you can see, I have an additional spider that will be part of this program. Remember, the same logic lets you add 30 or more spiders through this script later; for now we will focus on getting 2 spiders running.
My little hack here is to keep a list called `RUNNING_CRAWLERS`. When a spider finishes, it sends the `spider_closed` signal; in the handler for that signal I remove the spider from `RUNNING_CRAWLERS`. Finally, when there are no more spiders left in `RUNNING_CRAWLERS`, we stop the script.
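Here is a minimal sketch of what that looks like, using Scrapy's `CrawlerRunner` together with the Twisted reactor. The spider classes and module paths (`SpiderOne`, `SpiderTwo`, `my_scraper.spiders.*`) are placeholders for your own spiders, and the exact API may vary a little depending on your Scrapy version.

```python
from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

# Placeholder imports -- point these at your own project's spiders.
from my_scraper.spiders.spider_one import SpiderOne
from my_scraper.spiders.spider_two import SpiderTwo

RUNNING_CRAWLERS = []


def spider_closing(spider):
    """Handler for the spider_closed signal.

    Each spider that finishes is removed from RUNNING_CRAWLERS; once the
    list is empty there is nothing left to wait for, so stop the reactor.
    """
    RUNNING_CRAWLERS.remove(type(spider))
    if not RUNNING_CRAWLERS:
        reactor.stop()


configure_logging()
runner = CrawlerRunner(get_project_settings())

for spider_cls in (SpiderOne, SpiderTwo):
    crawler = runner.create_crawler(spider_cls)
    # Get notified when this particular spider closes.
    crawler.signals.connect(spider_closing, signal=signals.spider_closed)
    RUNNING_CRAWLERS.append(spider_cls)
    runner.crawl(crawler)

# Blocks here; spider_closing() calls reactor.stop() once every spider is done.
reactor.run()
```

On recent Scrapy releases you could also skip the bookkeeping: `CrawlerRunner.join()` returns a Deferred that fires once every crawl has finished, and you can stop the reactor from its callback. The list-based approach above just makes the mechanics explicit.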
You can view my last post in this series, about how to implement a Scrapy pipeline to insert items into your database, here.
I am writing a book!
While I do appreciate you reading my blog posts, I would like to draw your attention to another project of mine. I have slowly begun writing a book on how to build web scrapers with Python. It covers everything from getting started with Scrapy to building large-scale automated scraping systems.
If you are looking to build web scrapers at scale, or would just like to receive more anecdotes about Python, then please sign up to the email list below.