Running multiple scrapy spiders programmatically
This is a continuation of my last post about how to run scrapers from a Python script. In this post I will show how to manage 2 spiders, and the same approach scales further: you can run over 30 spiders concurrently using this script.
Right now, this is how I have organised my project.
├── core.py
└── spiders            # the spiders directory with all the spiders
    ├── CraigslistSpider.py
    ├── DmozSpider.py
    └── __init__.py
As you can see, I have an additional spider that will be part of this program. Again, the script can grow well beyond this: the logic stays the same whether you add 2 spiders or 30. For now we will focus on getting 2 spiders running.
# import the spiders you want to run
from spiders.DmozSpider import DmozSpider
from spiders.CraigslistSpider import CraigslistSpider

# scrapy api imports
from scrapy import signals, log
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings

# list of crawlers
TO_CRAWL = [DmozSpider, CraigslistSpider]

# crawlers that are running
RUNNING_CRAWLERS = []


def spider_closing(spider):
    """
    Activates on spider closed signal
    """
    log.msg("Spider closed: %s" % spider, level=log.INFO)
    RUNNING_CRAWLERS.remove(spider)
    if not RUNNING_CRAWLERS:
        reactor.stop()


# start logger
log.start(loglevel=log.DEBUG)

# set up the crawler and start to crawl one spider at a time
for spider in TO_CRAWL:
    settings = Settings()

    # crawl responsibly
    settings.set("USER_AGENT", "Kiran Koduru (+http://kirankoduru.github.io)")
    crawler = Crawler(settings)
    crawler_obj = spider()
    RUNNING_CRAWLERS.append(crawler_obj)

    # stop reactor when spider closes
    crawler.signals.connect(spider_closing, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(crawler_obj)
    crawler.start()

# blocks process; so always keep as the last statement
reactor.run()
My little hack here is the RUNNING_CRAWLERS list. On the spider_closed signal, which each spider sends when it stops, I remove that spider from RUNNING_CRAWLERS. Finally, when there are no more spiders left in RUNNING_CRAWLERS, we stop the reactor and the script exits.
You can view my last post in this series, about how to implement a Scrapy pipeline to insert items into your database, here.
I am writing a book!
While I do appreciate you reading my blog posts, I would like to draw your attention to another project of mine. I have slowly begun to write a book on how to build web scrapers with Python. It covers everything from getting started with Scrapy to building large-scale automated scraping systems.
If you are looking to build web scrapers at scale, or just want more anecdotes about Python, please sign up for the email list below.