Which architecture should you choose for event handling when a queued, not-yet-processed event can become obsolete and no longer needs processing?

Dmitry Labutin, 2015-11-03 13:28:14

Database

Let's think about this problem together.
There is an incoming stream of price lists that need to be updated in the database.
Each price list goes through two processing stages:
1. Preprocessing (normalization)
2. Loading into the database
There are thousands of price lists from suppliers.
On average, each price list is updated once a day.
But sometimes one supplier sends two or three price lists in quick succession, and only the last one is current. Preprocessing each of them cannot be avoided.
But I would like not to load all of them into the database.
For example, imagine this timeline. These are the price lists of one specific supplier.
November 1 8:00 new price list
November 2 8:00 new price list
November 2 8:10 new price list
November 2 8:20 new price list
One price list is several million rows. Processing it and loading it into the database is costly in computing resources and time. Preprocessing plus loading into the database usually takes 30-60 minutes.
KEY question: how do we avoid loading the price lists that arrived on November 2 at 8:00 and 8:10, and load only the one that arrived at 8:20? That is, there will clearly be a moment when all three price lists (8:00, 8:10 and 8:20) have been preprocessed and are waiting (presumably in some kind of queue) for the second stage, loading. At that point I want to skip the wasted work of loading the 8:00 and 8:10 price lists and load only the most recent one, received at 8:20.
At first glance, everything is simple here.
But there are some additional constraints:
- there are many preprocessors; they work in parallel and know nothing about each other, let alone about the database loaders
- there are many database loaders; they work in parallel and know nothing about each other
- IMPORTANT: there should be a minimum of locks and no deadlocks
I won't describe our solution for now.
I would like to hear how you would solve this problem architecturally.
Would you do everything in the database alone? Bring in a queuing system such as RabbitMQ to help? Would you do everything in it, or would you still need a database? Or maybe neither a database nor RabbitMQ at all?
And just in case: I'm not trolling. This is a real task, and I want to understand how well we solved it.

3 answer(s)

Vladimir, 2015-11-03
@rostel

RabbitMQ lets you set a time-to-live for each individual job (message).
That works when jobs arrive at a more or less regular rate.
In your case, clear all the queues before uploading the next price list.
Or route the jobs through an intermediate queue kept in an ordinary database.
Send them from there to RabbitMQ in batches, each batch only after the previous one has been processed.
If the next price list needs to be uploaded, clear the queue in the database.
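
The intermediate queue in a database could look roughly like this. It is only a sketch under assumptions: SQLite stands in for the real database, and the table `pending_loads` and the functions `open_queue`, `enqueue` and `take_batch` are made-up names, not part of RabbitMQ or of the system in the question. A newly preprocessed price list clears the same supplier's older waiting entries in one transaction, and an entry that finishes preprocessing late but is already outdated is simply not queued.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open the (hypothetical) intermediate queue table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending_loads ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " supplier_id INTEGER NOT NULL,"
        " price_list TEXT NOT NULL,"
        " received_at TEXT NOT NULL)"  # ISO timestamps compare correctly as text
    )
    return conn


def enqueue(conn, supplier_id, price_list, received_at):
    """Queue a preprocessed price list, clearing older waiting ones.

    Returns False (and queues nothing) if a newer price list of the
    same supplier is already waiting.
    """
    with conn:  # the delete and the insert commit together
        newer = conn.execute(
            "SELECT 1 FROM pending_loads"
            " WHERE supplier_id = ? AND received_at >= ?",
            (supplier_id, received_at),
        ).fetchone()
        if newer:
            return False
        conn.execute(
            "DELETE FROM pending_loads WHERE supplier_id = ?", (supplier_id,)
        )
        conn.execute(
            "INSERT INTO pending_loads (supplier_id, price_list, received_at)"
            " VALUES (?, ?, ?)",
            (supplier_id, price_list, received_at),
        )
        return True


def take_batch(conn, limit):
    """Remove and return up to `limit` oldest entries, e.g. to publish to RabbitMQ."""
    with conn:
        rows = conn.execute(
            "SELECT id, supplier_id, price_list FROM pending_loads"
            " ORDER BY id LIMIT ?",
            (limit,),
        ).fetchall()
        conn.executemany(
            "DELETE FROM pending_loads WHERE id = ?", [(r[0],) for r in rows]
        )
    return [(r[1], r[2]) for r in rows]
```

A dispatcher would call `take_batch` only after the previous batch has been processed, so anything still sitting in the table can be superseded until the last moment.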

Sergey, 2015-11-03
@begemot_sun

You can write a plugin for RabbitMQ. When a job is added to the queue, the plugin goes through the queue and removes the older jobs from the same supplier. Of course, if that supplier's price list is already being processed, the solution has to be more complicated.
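
To show the behavior such a plugin would have, here is a small in-memory model. It is not a RabbitMQ plugin (those are written in Erlang) and does not use any RabbitMQ API; the class `CoalescingQueue` and its methods are hypothetical names chosen for the illustration. Adding a job drops the waiting job of the same supplier, so each supplier has at most one job in the queue.

```python
import threading
from collections import OrderedDict


class CoalescingQueue:
    """Queue that keeps only the latest waiting job per supplier."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = OrderedDict()  # supplier_id -> latest waiting job

    def put(self, supplier_id, job):
        """Queue a job; return the superseded job of the same supplier, if any."""
        with self._lock:
            superseded = self._pending.pop(supplier_id, None)
            self._pending[supplier_id] = job  # re-added at the end of the queue
            return superseded

    def get(self):
        """Return (supplier_id, job) for the oldest waiting entry, or None if empty."""
        with self._lock:
            if not self._pending:
                return None
            return self._pending.popitem(last=False)
```

A job already handed out by `get` is out of this queue's reach, which is exactly the "already being processed" case that needs extra handling.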

Alexander Melekhovets, 2015-11-04
@Blast

You can store the price list's version in the database (for example, the timestamp of its arrival) and pass it along with the task. When the task reaches the database loading stage, compare its version with the current one and discard the task if it is older. A new price list may also arrive after a load has started; to interrupt the load of an outdated price list, you can periodically query the database during the load and check the version. The queues can go through RabbitMQ if it is already in the project, or live in the database if it is not and you don't want to add more moving parts.
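
A rough sketch of this version check, with assumptions stated up front: SQLite stands in for the real database, the version is an integer such as a Unix timestamp, and the table `price_versions` and all function names are invented for the example. The loader checks the version before every chunk, which covers both the check at the start and the periodic check during the load.

```python
import sqlite3


def open_versions(path=":memory:"):
    """Open the (hypothetical) table holding the latest version per supplier."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS price_versions ("
        " supplier_id INTEGER PRIMARY KEY,"
        " version INTEGER NOT NULL)"
    )
    return conn


def register_version(conn, supplier_id, version):
    """Record an arriving price list; the stored version never goes backwards."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO price_versions (supplier_id, version)"
            " VALUES (?, ?)",
            (supplier_id, version),
        )
        conn.execute(
            "UPDATE price_versions SET version = ?"
            " WHERE supplier_id = ? AND version < ?",
            (version, supplier_id, version),
        )


def is_stale(conn, supplier_id, version):
    """True if a newer price list of this supplier has been registered."""
    row = conn.execute(
        "SELECT version FROM price_versions WHERE supplier_id = ?",
        (supplier_id,),
    ).fetchone()
    return row is not None and row[0] > version


def load_price_list(conn, supplier_id, version, rows, write_chunk,
                    chunk_size=10000):
    """Load rows chunk by chunk; stop as soon as the task becomes stale.

    `write_chunk` is a caller-supplied function that writes one chunk.
    Returns True if everything was loaded, False if the load was abandoned.
    """
    for start in range(0, len(rows), chunk_size):
        if is_stale(conn, supplier_id, version):
            return False
        write_chunk(rows[start:start + chunk_size])
    return True
```

The check is a single-row read, so the loaders take no locks on each other. A load abandoned halfway leaves partial data behind, so the chunks would have to be written in a way that the newer load can roll back or overwrite.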