24/7 scraper platform and architecture?

Python

Andrey_Dolg, 2020-03-30 22:49:09

Which platform should I choose for a scraper?
There is a spider that uses multiprocessing, with 20-40 processes each waiting for a response from the server. Because of the nature of the data I need, the parsing itself is no longer the problem. The required average rate is currently 9 requests per second (and may grow), so I am interested in other people's experience of choosing a cloud solution for a similar task. Also, an idle period of 10-15 seconds builds up a backlog of requests that then has to be processed. Keeping the data in memory at runtime is enough; what I care about is stability.

More generally, feel free to critique the approach: how well suited is multiprocessing to waiting for server responses, and where might such an architecture have a bottleneck?

Update:
As far as I understand, serverless cloud platforms are not suitable for this task, because they bill by processor time and the scraper is constantly busy. My theory is that a VPS fits better, by analogy with a mail server, which is designed from the outset to handle a large number of small requests.

1 answer(s)

Dr. Bacon, 2020-03-30
@Andrey_Dolg

1. Multiprocessing is a poor fit for IO-bound work; threads are suitable here, and async is better still.
2. Separate fetching data from the network from processing that data.
3. For the data processing itself, though, multiprocessing may well help.
4. And yes, use queues; judging by your description, you don't have them.
As a result, you end up with at least two kinds of jobs: those that fetch data and those that process it.
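
To make the points above concrete, here is a minimal sketch of that layout using asyncio and two queues. It is only an illustration under assumptions: `fetch` is a hypothetical stand-in that simulates the network wait with `asyncio.sleep` (a real scraper would call an async HTTP client there), `parse` is a placeholder for the real processing, and the worker counts and queue size are arbitrary. Retries and error handling are left out.

```python
import asyncio
import random


async def fetch(url: str) -> str:
    """Hypothetical stand-in for a real HTTP request."""
    await asyncio.sleep(random.uniform(0.01, 0.05))  # simulated network wait
    return f"<html>{url}</html>"


def parse(body: str) -> int:
    """Placeholder for CPU-bound processing; here it only measures the body."""
    return len(body)


async def fetcher(urls: asyncio.Queue, pages: asyncio.Queue) -> None:
    # IO-bound job: take a URL, wait on the network, pass the raw page on.
    while True:
        url = await urls.get()
        try:
            await pages.put((url, await fetch(url)))
        finally:
            urls.task_done()


async def processor(pages: asyncio.Queue, results: dict, executor=None) -> None:
    # Processing job: run parse() in an executor so the event loop stays free.
    loop = asyncio.get_running_loop()
    while True:
        url, body = await pages.get()
        try:
            results[url] = await loop.run_in_executor(executor, parse, body)
        finally:
            pages.task_done()


async def run_pipeline(url_list, n_fetchers=20, n_processors=4, executor=None) -> dict:
    urls: asyncio.Queue = asyncio.Queue()
    pages: asyncio.Queue = asyncio.Queue(maxsize=100)  # bounded, so fetchers wait if processing lags
    results: dict = {}
    for url in url_list:
        urls.put_nowait(url)

    workers = [asyncio.create_task(fetcher(urls, pages)) for _ in range(n_fetchers)]
    workers += [
        asyncio.create_task(processor(pages, results, executor))
        for _ in range(n_processors)
    ]

    await urls.join()   # every URL has been fetched and queued for processing
    await pages.join()  # every fetched page has been processed
    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    demo_urls = [f"https://example.com/{i}" for i in range(30)]
    print(len(asyncio.run(run_pipeline(demo_urls))))
```

With `executor=None` the processing runs in the event loop's default thread pool. If the processing turns out to be CPU-heavy, passing a `concurrent.futures.ProcessPoolExecutor` as `executor` is one way to bring multiprocessing back in for that stage only, as point 3 suggests.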