Where (in which file) should parsed data be processed in Scrapy?

Python

weranda, 2019-01-30 11:27:21

Greetings
As I understand it, there are several main places in Scrapy where you can process the downloaded data:
- in the spider itself;
- in the items file;
- in the pipelines file.
Since yesterday I have been thinking about where the received data should properly be parsed, and so far I have reached only one conclusion: the spider file should process only the data the spider needs for its own operation, plus small, acceptable modifications of the received data. That leaves two places: spider middleware/items and pipelines.
Logically, you can write the same method/function so that it works the same way through a spider middleware/item processor and through a pipeline. But as I understand it, if the data received by the spider needs several handlers, this gets somewhat awkward with item handlers/processors, since you can specify only one handler/processor per field. With pipelines, you can specify several handlers.
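To make the "several handlers through pipelines" idea concrete, here is a minimal sketch. It assumes items are plain dicts; the class names, the `myproject.pipelines` module path and the `run_pipelines` helper are made up for illustration. As far as I know, the only thing Scrapy requires of a pipeline is a `process_item(item, spider)` method, and the numbers in `ITEM_PIPELINES` decide the order (lower runs first).

```python
# Two hypothetical pipelines; each one is an independent handler.
class StripTitlePipeline:
    def process_item(self, item, spider):
        item["title"] = item["title"].strip()
        return item  # returning the item hands it to the next pipeline


class NormalizeTagsPipeline:
    def process_item(self, item, spider):
        # Lower-case and trim tags, dropping empties and duplicates in order.
        seen = []
        for tag in item["tags"]:
            tag = tag.strip().lower()
            if tag and tag not in seen:
                seen.append(tag)
        item["tags"] = seen
        return item


# What would go into settings.py (the module path is an assumption).
ITEM_PIPELINES = {
    "myproject.pipelines.StripTitlePipeline": 100,
    "myproject.pipelines.NormalizeTagsPipeline": 200,
}


def run_pipelines(item, pipelines, spider=None):
    """Stand-in for Scrapy's own machinery: call each pipeline in order."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item


cleaned = run_pipelines(
    {"title": "  Page title \n", "tags": ["Python", " scrapy ", "python", ""]},
    [StripTitlePipeline(), NormalizeTagsPipeline()],
)
```

Each handler stays small and can be switched on or off in the settings without touching the spider.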
But here one concern arises: pipelines seem to carry more load than an item handler/processor. When a handler is called in a pipeline, it receives the entire item with all of its stored fields, even if it needs just one of them. When data is processed through a field handler/processor, only that one field is processed, without additional calls over all the other fields of the item...
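A small sketch of what that concern looks like in practice, assuming a plain dict item and a hypothetical `CleanHtmlPipeline`. The pipeline is handed a reference to the whole item, but nothing forces it to read or copy the other fields, so the extra cost should be roughly one method call per item.

```python
class CleanHtmlPipeline:
    """Hypothetical pipeline that touches only the 'html' field."""

    def process_item(self, item, spider):
        html = item.get("html")
        if html is not None:
            # Only collapse whitespace here; the other fields are never read.
            item["html"] = " ".join(html.split())
        return item


item = {
    "title": "Page title",
    "image_url": "https://example.com/a.png",
    "html": "<p>\n  text </p>",
    "tags": ["a", "b"],
}
tags_before = item["tags"]

result = CleanHtmlPipeline().process_item(item, spider=None)

same_object = result is item                     # the item was not copied
tags_untouched = result["tags"] is tags_before   # other fields were left alone
```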
In general, if you know Scrapy, please explain which data should be processed in which place.
P.S.
I don't know whether examples of the parsed data are needed, so here is a simple example with several fields:
- page title
- image url
- a large piece of HTML code that needs to be processed later (stripped of unnecessary markup)
- several tags in a list
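For these four fields, field-level processing with more than one step per field could be sketched as below. The `chain` helper, the field names, `BASE_URL` and the cleaning functions are all assumptions made for the example, and the regex is only a crude illustration of stripping `<script>` blocks. Scrapy's item loaders ship a comparable `MapCompose` processor (in `itemloaders.processors`), which suggests that several functions per field are possible there too.

```python
import re
from urllib.parse import urljoin


def chain(*funcs):
    """Hypothetical helper: apply funcs left to right to a single value."""
    def run(value):
        for func in funcs:
            value = func(value)
        return value
    return run


def drop_scripts(html):
    # Crude illustration only; a real cleaner would use an HTML parser.
    return re.sub(r"<script\b.*?</script>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)


def collapse_spaces(text):
    return " ".join(text.split())


BASE_URL = "https://example.com/"  # assumed site root

# One chain per field; a chain may hold several steps.
FIELD_PROCESSORS = {
    "title": chain(str.strip),
    "image_url": chain(str.strip, lambda url: urljoin(BASE_URL, url)),
    "html": chain(drop_scripts, collapse_spaces),
    "tags": chain(lambda tags: [t.strip().lower() for t in tags if t.strip()]),
}

raw = {
    "title": " Page title ",
    "image_url": "/img/cover.png ",
    "html": "<div>\n<script>track()</script>  <p>Body</p>\n</div>",
    "tags": ["News ", "", "Python"],
}

processed = {field: FIELD_PROCESSORS[field](value) for field, value in raw.items()}
```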

1 answer(s)

Dimonchik, 2019-01-30
@dimonchik2013

In pipelines.
In general, Scrapy output goes straight to NoSQL and nowhere else (well, except for exotic things like looking for 404s); then, from NoSQL (usually MongoDB), you restructure it into whatever you need: CSV, PostgreSQL, etc.
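A rough sketch of the "store first, structure later" approach. A local JSON Lines file stands in for MongoDB here so that the example runs without a database server. The class name and the file path are hypothetical, while `open_spider`, `close_spider` and `process_item` are the hooks Scrapy calls on a pipeline.

```python
import json
import os
import tempfile


class JsonLinesStorePipeline:
    """Hypothetical pipeline: append each item, unprocessed, to a JSON Lines file."""

    def __init__(self, path):
        # In a real project the path would likely come from the settings.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item


# Simulate one crawl by calling the hooks by hand.
path = os.path.join(tempfile.mkdtemp(), "items.jl")
pipeline = JsonLinesStorePipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Page title", "tags": ["a", "b"]}, spider=None)
pipeline.close_spider(spider=None)

# Later, outside Scrapy: read the raw records back and restructure them.
with open(path, encoding="utf-8") as fh:
    stored = [json.loads(line) for line in fh]
```

The restructuring into CSV or PostgreSQL then becomes a separate script over the stored records, independent of the crawl.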