SQLAlchemy pipeline to add items to a DB

This post uses Scrapy version 0.24.4; if you are using a different version of Scrapy, refer to the Scrapy docs for more info.
Feb 4, 2015 • 3 minutes to read • Last Updated: Oct 25, 2017

In my last post, I talked about how to run 2 spiders concurrently. This post is a brief introduction to adding Scrapy items to a database through an item pipeline.

The Scrapy framework is magnificent when it comes to data processing. It ships with tons of features and lets developers configure most of them. Since we are running our scrapers through the core API, we can set the pipeline with:

settings.set("ITEM_PIPELINES", {'pipelines.AddTablePipeline': 100})

ITEM_PIPELINES is a Python dictionary whose keys are the paths to the pipeline classes and whose values determine the order in which they run: items pass through lower-valued pipelines first, and the values are customarily kept in the 0-1000 range. In this project I create a database called scrapyspiders and point the connection settings at it. I then create the connection and the SQLAlchemy ORM model with the fields id, title, url and date:

from sqlalchemy import Column, String, Integer, DateTime
from database.connection import Base

class AllData(Base):
    __tablename__ = 'alldata'

    id = Column(Integer, primary_key=True)
    title = Column(String(1000))
    url = Column(String(1000))
    date = Column(DateTime)

    def __init__(self, id=None, title=None, url=None, date=None):
        self.id = id
        self.title = title
        self.url = url
        self.date = date

    def __repr__(self):
        return "<AllData: id='%d', title='%s', url='%s', date='%s'>" % (self.id, self.title, self.url, self.date)
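Both the model above and the pipeline below import Base and db from database.connection, which is not shown in the post. A minimal sketch of that module, assuming a declarative base plus a scoped session (the connection URL below is just a placeholder, not the project's real one), could look like this:

from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
from sqlalchemy.ext.declarative import declarative_base

# placeholder URL: point this at your actual scrapyspiders database,
# e.g. 'mysql://user:password@localhost/scrapyspiders'
engine = create_engine('sqlite:///scrapyspiders.db')

# db is the session object the pipeline imports; Base is the declarative
# base that the AllData model inherits from
db = scoped_session(sessionmaker(bind=engine))
Base = declarative_base()

def init_db():
    # import the models so their tables are registered on Base, then create them
    import database.models
    Base.metadata.create_all(bind=engine)

Using a scoped session keeps the pipeline simple, since db can be imported and used directly without passing a session around; just make sure init_db() (or an equivalent CREATE TABLE) has run once before the spider starts.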

In the item pipeline I use the SQLAlchemy model to create a record and insert the new item into the database with the corresponding values from the item:

from scrapy.exceptions import DropItem

from database.connection import db
from database.models import AllData

class AddTablePipeline(object):

    def process_item(self, item, spider):
        if item['title'] and item['url']:
            # the date field is optional, so fall back to None when it is missing
            if 'date' not in item or not item['date']:
                date = None
            else:
                date = item['date'][0]

            # create a new SQLAlchemy object and add it to the db session
            record = AllData(title=item['title'][0].decode('unicode_escape'),
                             url=item['url'][0],
                             date=date)
            db.add(record)
            db.commit()
            return item

        # a pipeline must return an item or raise DropItem, so drop items
        # that are missing the required fields
        raise DropItem("Missing title or url in %s" % item)
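Note that the pipeline reads item['title'][0] and item['url'][0], which implies the item fields arrive as lists (as they would when populated by an ItemLoader). The post doesn't show the item definition, but something along these lines (the class name is hypothetical) would match:

from scrapy.item import Item, Field

class AllDataItem(Item):
    # each field is populated as a list by the spider / item loader,
    # which is why the pipeline indexes item['title'][0] and so on
    title = Field()
    url = Field()
    date = Field()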

And finally, run it using the command:

python core.py
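core.py itself comes from the previous post on running spiders through the core API and is not repeated here, but for Scrapy 0.24 it would look roughly like the sketch below (the spider module and class name are assumptions; the settings line is the one shown at the top of this post):

from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings

from spiders.sample_spider import SampleSpider  # hypothetical spider module
from database.connection import init_db

settings = Settings()
settings.set("ITEM_PIPELINES", {'pipelines.AddTablePipeline': 100})

# make sure the alldata table exists before any items are processed
# (init_db is the helper sketched earlier)
init_db()

crawler = Crawler(settings)
# stop the reactor once the spider finishes so the script exits cleanly
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(SampleSpider())
crawler.start()

log.start()
reactor.run()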

Download Project or View on GitHub