What infrastructure does a 24/7 parser that keeps updating the database need?

Parsing

guruloz, 2021-02-17 15:49:57

Hello, experts.
I have a parser that runs continuously and makes a lot of curl requests while working closely with the database: checking whether an entity already exists, adding data, updating it, and so on.
The problem is that, because of the large number of queries, requests to the API are very slow while the parser is running. How do I build an architecture that keeps both sides happy: a fast API and regular data updates?
Will master-slave replication solve the problem? And where can I read about this kind of infrastructure for large scraping sites and crawlers that handle large amounts of data?

3 answer(s)

Vitaly Karasik, 2021-02-17
@vitaly_il1

The universal answer is "it depends".
You need to build a PoC, look at the queries, and optimize them. Then estimate the production traffic and run a load test. After that, keep optimizing and scaling based on what you see.
In terms of infrastructure: if there are a lot of reads, then slave[s] (read replicas) will help a lot.
From an architectural point of view, pushing tasks through a queue helps to smooth out peaks.
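
To make the read-replica point concrete, here is a minimal Python sketch of read/write routing: writes go to the primary, reads are spread round-robin across replicas. `RoutingDB` and `FakeConn` are hypothetical names invented for this illustration; `FakeConn` only stands in for a real database connection, and actual replication has to be configured in the DBMS itself. Keep in mind that with asynchronous replication a replica can lag slightly behind the primary.

```python
import itertools


class FakeConn:
    """Stand-in for a real DB connection (assumption for this sketch)."""

    def __init__(self, name):
        self.name = name
        self.log = []  # statements this server received

    def execute(self, sql, params=()):
        self.log.append((sql, params))
        return self.name  # report which server handled the statement


class RoutingDB:
    """Send writes to the primary and spread reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin over replicas; fall back to the primary if there are none.
        self._readers = itertools.cycle(replicas or [primary])

    def write(self, sql, params=()):
        return self.primary.execute(sql, params)

    def read(self, sql, params=()):
        return next(self._readers).execute(sql, params)
```

In this layout the parser would call `write()` and the API would call `read()`, so heavy write traffic does not compete with API reads on the same server.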

Gena, 2021-02-17
@brary

Take a look at https://amphp.org/ ; it also has an HTTP client you can use instead of cURL. I built a daemon on it that runs 24/7 and processes new requests asynchronously as they appear. I also implemented a limit on the maximum number of simultaneous outgoing connections, both for the entire server and for individual sources.
My single daemon, on a 100 Mbit/s link, parses 4-8 million pages per hour; no idea whether that is a lot or a little.
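
The connection limiting described above is not tied to amphp. As a rough illustration of the same idea, here is a sketch in Python's asyncio (not the daemon's actual code): one semaphore caps requests for the whole process and one semaphore per host caps each source. `Limiter`, `fake_fetch`, `crawl`, and the limit values are all hypothetical; `fake_fetch` simulates a request with a short sleep instead of touching the network.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit


class Limiter:
    """Caps in-flight requests for the whole process and for each host."""

    def __init__(self, global_limit, per_host_limit):
        self._global = asyncio.Semaphore(global_limit)
        # One semaphore per host, created lazily on first use.
        self._per_host = defaultdict(lambda: asyncio.Semaphore(per_host_limit))
        self.active = 0
        self.peak = 0
        self.active_by_host = defaultdict(int)
        self.peak_by_host = defaultdict(int)

    async def run(self, url, fetch):
        host = urlsplit(url).hostname
        # Take the per-host slot first so tasks queued behind a busy host
        # do not sit on global slots while they wait.
        async with self._per_host[host], self._global:
            self.active += 1
            self.active_by_host[host] += 1
            self.peak = max(self.peak, self.active)
            self.peak_by_host[host] = max(
                self.peak_by_host[host], self.active_by_host[host]
            )
            try:
                return await fetch(url)
            finally:
                self.active -= 1
                self.active_by_host[host] -= 1


async def fake_fetch(url):
    """Hypothetical fetcher: simulates network latency, returns the URL."""
    await asyncio.sleep(0.01)
    return url


async def crawl(urls, global_limit=4, per_host_limit=2):
    # Build the limiter inside the running event loop.
    limiter = Limiter(global_limit, per_host_limit)
    pages = await asyncio.gather(*(limiter.run(u, fake_fetch) for u in urls))
    return pages, limiter
```

Running `asyncio.run(crawl(urls))` schedules every URL at once, but no more than `global_limit` requests overall and `per_host_limit` per source are in flight at any moment.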

Romses Panagiotis, 2021-02-17
@romesses

You can structure the application so that the API works with the DBMS primarily in read mode.
A separate worker process then receives tasks through a message queue and does the intensive writing to the DBMS.
Instead of blocking until the work is done, the API should immediately put the parsing client's task on the message queue. Connections are then not held open for long; they are closed almost immediately after the task is enqueued.
Any request sent to the API that only adds a task to the queue can return a 202 (Accepted) response.
As soon as the worker completes a task, it writes the parsing result to the database. In the meantime, API reads are served from the database without any blocking operations.
In other words, the small upgrade boils down to this scheme:
API (write) -> MQ -> Worker(s) -> DB
API (read) <-> DB
This makes it easy to add more instances of any component under heavy load.
And of course you need to measure the load to know where the bottleneck is.
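
A minimal single-process sketch of the scheme above, in Python. Everything here is an assumption made for illustration: `queue.Queue` stands in for a real message broker, SQLite for the DBMS, and `api_write`, `api_read`, `parse`, and the `pages` table are hypothetical names. A real deployment would run the API and the workers as separate processes.

```python
import os
import queue
import sqlite3
import tempfile
import threading

tasks = queue.Queue()  # stand-in for a real message broker


def init_db(path):
    db = sqlite3.connect(path)
    with db:  # commits on success
        db.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
        )
    db.close()


def parse(url):
    """Hypothetical parser; a real one would download and parse the page."""
    return f"title of {url}"


def api_write(url):
    """Write endpoint: enqueue the task and answer at once with 202."""
    tasks.put(url)
    return 202, {"status": "queued"}


def api_read(db, url):
    """Read endpoint: a plain SELECT, never waits for parsing."""
    row = db.execute("SELECT title FROM pages WHERE url = ?", (url,)).fetchone()
    return (200, {"title": row[0]}) if row else (404, {})


def worker(path):
    """Consume tasks and write results; None is the shutdown signal."""
    db = sqlite3.connect(path)  # each thread opens its own connection
    try:
        while True:
            url = tasks.get()
            try:
                if url is None:
                    break
                with db:
                    db.execute(
                        "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
                        (url, parse(url)),
                    )
            finally:
                tasks.task_done()
    finally:
        db.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "pages.db")
    init_db(path)
    t = threading.Thread(target=worker, args=(path,))
    t.start()
    accepted = api_write("https://example.com/a")
    tasks.put(None)  # ask the worker to stop after the queued task
    t.join()
    reader = sqlite3.connect(path)
    try:
        found = api_read(reader, "https://example.com/a")
        missing = api_read(reader, "https://example.com/missing")
    finally:
        reader.close()
    return accepted, found, missing
```

`demo()` shows the whole round trip: the write call returns 202 before any parsing happens, and the read call sees the result once the worker has stored it.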