SQLAlchemy pipeline to add items to the DB
In my last post, I talked about how to run two spiders concurrently. This post is a brief introduction to inserting Scrapy items into a database through an item pipeline.
The Scrapy framework is magnificent when it comes to data processing. It ships with tons of features and lets developers configure them. Since we are using the core API to run our scrapers, we can set the pipeline using
settings.set("ITEM_PIPELINES", {'pipelines.AddTablePipeline': 100})
The ITEM_PIPELINES
setting is a Python dictionary whose keys are the import paths of pipeline classes and whose values determine the order in which those pipelines run. By convention the values are integers in the 0-1000 range, and pipelines with lower values run first. In this project I am creating a database called scrapyspiders, and I configure its settings here. I create the connection and define the SQLAlchemy ORM model with the fields id, title, url and date.
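To illustrate the ordering rule, here is a small sketch. The second pipeline, DuplicatesPipeline, is hypothetical and only there to show how Scrapy decides which pipeline runs first:

```python
# Pipelines are ordered by their integer value (0-1000); lower runs first.
ITEM_PIPELINES = {
    'pipelines.AddTablePipeline': 100,
    'pipelines.DuplicatesPipeline': 50,   # hypothetical second pipeline
}

# Scrapy sorts the pipelines by value before passing items through them:
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(execution_order)
# ['pipelines.DuplicatesPipeline', 'pipelines.AddTablePipeline']
```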
from sqlalchemy import Column, String, Integer, DateTime
from database.connection import Base

class AllData(Base):
    __tablename__ = 'alldata'

    id = Column(Integer, primary_key=True)
    title = Column(String(1000))
    url = Column(String(1000))
    date = Column(DateTime)

    def __init__(self, id=None, title=None, url=None, date=None):
        self.id = id
        self.title = title
        self.url = url
        self.date = date

    def __repr__(self):
        return "<AllData: id='%s', title='%s', url='%s', date='%s'>" % (self.id, self.title, self.url, self.date)
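The model above imports Base from database.connection, a module the post references but does not show. A minimal sketch of what it might look like, assuming SQLite for illustration (the actual database URL and driver may differ):

```python
# Hypothetical database/connection.py
from sqlalchemy import create_engine
from sqlalchemy.orm import declarative_base, scoped_session, sessionmaker

engine = create_engine('sqlite:///scrapyspiders.db')

# db is the session object the pipeline imports and calls add()/commit() on
db = scoped_session(sessionmaker(bind=engine))

# Base is the declarative base class the AllData model inherits from
Base = declarative_base()
```

Remember to create the tables once (e.g. Base.metadata.create_all(engine)) before the first item is inserted.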
I import the SQLAlchemy session and the model in the item pipeline, create a record, and insert the new item into the database with the corresponding values from the item.
from database.connection import db
from database.models import AllData

class AddTablePipeline(object):

    def process_item(self, item, spider):
        if item['title'] and item['url']:
            if 'date' not in item or not item['date']:
                date = None
            else:
                date = item['date'][0]
            # create a new SQLAlchemy object and add it to the db session
            record = AllData(title=item['title'][0].decode('unicode_escape'),
                             url=item['url'][0],
                             date=date)
            db.add(record)
            db.commit()
        return item
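To see the pipeline logic end-to-end without a running spider, here is a self-contained sketch using an in-memory SQLite database and a hypothetical scraped item. The Python 2 .decode('unicode_escape') step is omitted since the values here are already plain strings:

```python
from sqlalchemy import Column, DateTime, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class AllData(Base):
    __tablename__ = 'alldata'
    id = Column(Integer, primary_key=True)
    title = Column(String(1000))
    url = Column(String(1000))
    date = Column(DateTime)

# In-memory database stands in for the real scrapyspiders database
engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
db = sessionmaker(bind=engine)()

# A hypothetical scraped item: Scrapy field values arrive as lists
item = {'title': ['Example title'], 'url': ['http://example.com'], 'date': []}

# Same branching as process_item: missing or empty date becomes None
if item['title'] and item['url']:
    date = item['date'][0] if item.get('date') else None
    record = AllData(title=item['title'][0], url=item['url'][0], date=date)
    db.add(record)
    db.commit()

stored = db.query(AllData).first()
print(stored.title)
# Example title
```

The empty 'date' list falls through to None, so the row is stored with a NULL date rather than raising an IndexError.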
And finally run using the command:
python core.py
I am writing a book!
While I do appreciate you reading my blog posts, I would like to draw your attention to another project of mine. I have slowly begun to write a book on how to build web scrapers with Python. I go over topics from getting started with Scrapy all the way to building large-scale automated scraping systems.
If you are looking to build web scrapers at scale, or just want more anecdotes on Python, then please sign up to the email list below.