Debugging scrapy memory leaks
Jul 20, 2015 • 1 minute to read • Last Updated: Oct 25, 2017

The only reason I can imagine you are reading this is that you have a Scrapy spider leaking memory. If that is not you yet, bookmark this for when your Scrapy spider does start leaking memory.
Did you know you could telnet into Scrapy spiders?
If you have been watching your Scrapy console closely, you will have seen a line telling you the port on which the spider's telnet console is listening:
2015-07-20 20:32:11-0400 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
Once you know which port your spider is listening on, you can telnet into it from the command line. The port I will telnet into is 6023.
telnet localhost 6023  # this connects to the port 6023

Use the command prefs() to list all live-object details for your spider:

>>> prefs()
Live References

scrapy class          Memory   Time ago
HtmlResponse               3   oldest: 5s ago
CraigslistItem           100   oldest: 5s ago
DmozItem                   1   oldest: 0s ago
DmozSpider                 1   oldest: 6s ago
CraigslistSpider           1   oldest: 5s ago
Request                 3000   oldest: 705s ago
Selector                  14   oldest: 5s ago
Clearly, the issue was with my Request objects: 3000 of them were alive, the oldest for over 700 seconds. Going back to my code, I quickly saw that I used yield Request heavily, which caused many of those objects to pile up in memory.
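In my case the fix was rethinking how I generated requests, but a related knob is worth knowing: by default Scrapy's scheduler keeps every pending Request in an in-memory queue. Setting JOBDIR in your settings.py switches the scheduler to disk-based queues (and as a bonus lets you pause and resume crawls). The directory name below is just an example:

```python
# settings.py -- sketch: persist the scheduler's request queue to disk
# instead of holding all pending Requests in memory.
JOBDIR = 'crawls/my-spider-state'
```

With thousands of queued requests, moving the queue to disk can shrink the resident memory footprint considerably at the cost of some I/O.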
Tell me how you solved your memory leaks with Scrapy in the comments below, or post a question if you would like help debugging your spiders.
I am writing a book!
While I do appreciate you reading my blog posts, I would like to draw your attention to another project of mine: I have slowly begun writing a book on how to build web scrapers with Python. It covers everything from getting started with Scrapy to building large-scale automated scraping systems.
If you are looking to build web scrapers at scale, or just want more anecdotes about Python, then please sign up to the email list below.