SQLAlchemy pipeline to add items to the DB
In my last post, I talked about how to run 2 spiders concurrently. This post is a brief introduction to adding Scrapy items to a database through an item pipeline.
The Scrapy framework is magnificent when it comes to data processing. It ships with tons of features and lets developers configure them to fit their needs. Since we are using the core API to run our scrapers, we can set the pipelines through the ITEM_PIPELINES setting, which is a Python dictionary whose keys are the import paths of the pipeline classes and whose values determine the order in which those pipelines run. By convention the values are integers between 0 and 1000, with lower numbers running first.
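Here is a minimal sketch of how that wiring might look when running through the core API. The spider names and the pipeline import path are assumptions, not the exact ones from this project:

```python
from scrapy.crawler import CrawlerProcess

from spiders import PostsSpider, NewsSpider  # hypothetical spider classes

process = CrawlerProcess(settings={
    "ITEM_PIPELINES": {
        # key: import path of the pipeline class, value: run order (0-1000)
        "pipelines.SQLAlchemyPipeline": 300,
    },
})

# Run both spiders concurrently, as covered in the last post.
process.crawl(PostsSpider)
process.crawl(NewsSpider)
process.start()
```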
In this project I am creating a database called scrapyspiders and setting the connection settings for it. I create the connection and define the SQLAlchemy ORM model with the fields id, title, url and date.
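A sketch of what the model and connection setup might look like; the model name Article, the table name, and the SQLite connection string are my assumptions, since the post only names the database scrapyspiders:

```python
from sqlalchemy import Column, Date, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Article(Base):
    # ORM model with the fields from the post: id, title, url and date.
    # "Article" and the table name are assumed; the post does not name them.
    __tablename__ = "articles"

    id = Column(Integer, primary_key=True)
    title = Column(String)
    url = Column(String)
    date = Column(Date)

# The connection string is an assumption; swap in your own engine URL.
engine = create_engine("sqlite:///scrapyspiders.db")
Session = sessionmaker(bind=engine)

# Create the table on first run.
Base.metadata.create_all(engine)
```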
In the item pipeline I use the SQLAlchemy session to create a record for each item and insert it into the database with the corresponding values from the item.
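A sketch of such a pipeline, assuming the Article model and Session factory from the previous snippet live in a module called models (a hypothetical name):

```python
from models import Article, Session  # hypothetical module holding the snippet above

class SQLAlchemyPipeline:
    # Scrapy calls these hooks automatically once the pipeline is
    # enabled via ITEM_PIPELINES.

    def open_spider(self, spider):
        # One session per spider run.
        self.session = Session()

    def close_spider(self, spider):
        self.session.close()

    def process_item(self, item, spider):
        # Map the scraped item's fields onto a new ORM record and insert it.
        record = Article(
            title=item.get("title"),
            url=item.get("url"),
            date=item.get("date"),
        )
        self.session.add(record)
        self.session.commit()
        return item
```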
And finally, run everything with a single command:
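Since the spiders are launched through the core API rather than scrapy crawl, the command is just the Python interpreter on the launcher script; the filename run_spiders.py is a hypothetical name:

```
python run_spiders.py
```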
I am writing a book!
While I do appreciate you reading my blog posts, I would like to draw your attention to another project of mine. I have slowly begun writing a book on how to build web scrapers with Python. I go over everything from getting started with Scrapy to building large-scale automated scraping systems.
If you are looking to build web scrapers at scale, or just want more anecdotes about Python, then please sign up to the email list below.