Django
serjio568, 2018-12-04 21:59:01

How to organize the parsing of IM products in python?

I realize there is plenty of information on this topic online, but I have a few questions I could not find answers to.
In short, I need a parser that will collect data from several sources (products, prices, delivery times, etc.) and display it in, say, a Django view. Products are updated quite often, and the same product may come from one supplier but from different warehouses. The user can, in principle, wait 10 seconds. No price analytics is needed, just information on whose offer is cheaper. Given all this, does it make sense to write to a database? And how should such a task be organized?


3 answer(s)
Roman Kitaev, 2018-12-04
@deliro

Parse on demand? Nonsense. Well, unless you use Celery and send the user to a page where they wait for the result. But in any case, save the parsing results to the database.
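The "parse in the background, save to the database, then answer the cheapest-supplier question from stored rows" flow can be sketched as below. To keep the sketch self-contained it uses stdlib sqlite3 instead of a Django model, and the schema and field names are assumptions; in a real project `save_offers` would be the body of a Celery task writing to your Django database.

```python
import sqlite3
import time

def save_offers(conn, offers):
    """Upsert parsed offers so a repeated parse overwrites stale rows.
    In a real setup this function would be wrapped in a Celery task."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS offer (
               product   TEXT,
               supplier  TEXT,
               warehouse TEXT,
               price     REAL,
               parsed_at REAL,
               PRIMARY KEY (product, supplier, warehouse)
           )"""
    )
    now = time.time()
    conn.executemany(
        "INSERT OR REPLACE INTO offer VALUES (?, ?, ?, ?, ?)",
        [(o["product"], o["supplier"], o["warehouse"], o["price"], now)
         for o in offers],
    )
    conn.commit()

def cheapest(conn, product):
    """The 'who is cheaper' query the asker actually needs."""
    return conn.execute(
        "SELECT supplier, MIN(price) FROM offer WHERE product = ?",
        (product,),
    ).fetchone()

conn = sqlite3.connect(":memory:")
save_offers(conn, [
    {"product": "widget", "supplier": "A", "warehouse": "msk", "price": 100.0},
    {"product": "widget", "supplier": "B", "warehouse": "spb", "price": 90.0},
])
print(cheapest(conn, "widget"))  # ('B', 90.0)
```

Because the same product can sit in several warehouses, the composite primary key keeps one row per (product, supplier, warehouse) and the query collapses them to the cheapest supplier.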

sim3x, 2018-12-04
@sim3x

Best to store it in any case.
Depending on the complexity of the parsing: [ Celery ] + ( requests + lxml ), or Celery + Scrapy.
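The requests + lxml extraction step looks roughly like the sketch below. To stay dependency-free it substitutes the stdlib's `xml.etree.ElementTree` for lxml and an inline string for the fetched page; with the real libraries you would do `requests.get(url).text` and parse with `lxml.html`, which also tolerates broken real-world HTML. The markup and class names here are invented for the example.

```python
import xml.etree.ElementTree as ET

# Stand-in for a fetched supplier page (in practice: requests.get(url).text).
HTML = """
<html><body>
  <div class="product">
    <span class="name">Widget</span>
    <span class="price">99.50</span>
  </div>
  <div class="product">
    <span class="name">Gadget</span>
    <span class="price">150.00</span>
  </div>
</body></html>
"""

def parse_products(markup):
    """Extract (name, price) pairs from product blocks."""
    root = ET.fromstring(markup)
    items = []
    for div in root.iter("div"):
        if div.get("class") != "product":
            continue
        fields = {span.get("class"): span.text for span in div.iter("span")}
        items.append({"name": fields["name"], "price": float(fields["price"])})
    return items

print(parse_products(HTML))
```

With lxml the same extraction is usually an XPath expression such as `//div[@class="product"]`, but the overall shape (fetch, select blocks, pull out fields, normalize types) is the same.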

stratosmi, 2018-12-04
@stratosmi

Based on all this, does it make sense to write to the database?

Under heavy load - yes, for caching.
But it is doubtful that a supplier updates their site so frequently that it makes sense to re-parse it the moment a user asks for product information. Usually inventory is not tracked on the site itself but in a separate database, something like 1C, which is synchronized with the site, say, once an hour.
Besides, users often just refresh the page. Why fire off requests to 10 supplier sites each time?
And users frequently jump back and forth between several products while choosing one. Why re-query 10 supplier sites when the same product page is visited every 5 minutes?
So yes, it makes sense to cache in your own database.
When a request comes in, check whether the product information has gone stale. If it is still fresh, serve it from the database immediately.
If a supplier's data can be parsed in a single pass (for example, you can download one file with the current price list instead of crawling the whole site page by page), it makes sense not to wait for a visitor to your site, but to parse in advance and on a schedule, say once an hour. The result, of course, is stored in your database.
Scrapy has already been written about.
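The staleness check described above (serve from the local database while the record is fresh, re-parse suppliers only once it expires) can be sketched as a TTL cache. The `fetch_fn` callback, the one-hour `TTL`, and the plain-dict cache are assumptions for the sketch; in a real project the cache would be a Django model with an `updated_at` field, and `fetch_fn` would kick off the parser.

```python
import time

TTL = 3600  # one hour, matching the "sync once an hour" suggestion

def get_product(cache, product_id, fetch_fn, now=None):
    """Return cached data while fresh; re-fetch once it has gone stale."""
    now = time.time() if now is None else now
    entry = cache.get(product_id)
    if entry and now - entry["parsed_at"] < TTL:
        return entry["data"]          # fresh: answer from the cache at once
    data = fetch_fn(product_id)       # stale or missing: re-parse suppliers
    cache[product_id] = {"data": data, "parsed_at": now}
    return data

# Usage with a fake fetcher and a fake clock to show when re-parsing happens.
calls = []
def fake_fetch(pid):
    calls.append(pid)
    return {"id": pid, "price": 42}

cache = {}
get_product(cache, "w1", fake_fetch, now=0)
get_product(cache, "w1", fake_fetch, now=100)   # within TTL: served from cache
get_product(cache, "w1", fake_fetch, now=5000)  # past TTL: fetched again
print(len(calls))  # 2
```

The same pattern covers the "users refresh the page" and "users bounce between products" cases: repeated hits within the TTL window never touch the supplier sites.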
