How to track changes in the DBMS during a concurrent write?

B

belesev2018-04-05 17:09:59

PostgreSQL

belesev, 2018-04-05 17:09:59

I need to implement data synchronization in a distributed system, given that there is a concurrent write to data tables. Export from the main database should only read changed (and inserted) rows.
The usual advice in this case: "use triggers", mark rows in the trigger with a timestamp of data changes or a monotonically increasing numerical sequence (revision). And tell the remote system this timestamp/revision.
But the problem with concurrent writes is when the commit happens. An example is described below:

[время 00:00] Транзакция A начинает запись большой пачки данных; помечает строки timestamp [00:00].
[время 00:02]      Транзакция B начинает запись маленькой пачки данных; помещает строки timestamp [00:02].
[время 00:03]      Транзакция B заканчивает запись и происходит коммит.
[время 00:10]            Начинается экспорт в удалённую систему. Уровень изоляции - ReadCommitted, 
                         поэтому только данные из транзакции B попадают в выборку для экспорта, timestamp 00:02.
[время 00:15] Транзакция A заканчивает запись и происходит коммит. 
              Удалённая система никогда не получит эти данные(!)

I want the lines from Transaction A to be marked in the database as later than the lines in Transaction B.
After all, they were committed later: the proof of this is that at the time of export, in fact, they were not in the database yet (just due to isolation).
I like the Change tracking approach in MSSQL, it can solve this situation.
But I have in mind the imminent transition to PostgreSQL, and I need to come up with a general solution.
How to solve such a problem correctly?
The original of my question on StackOverflow is here .

Reply

Answer the question

In order to leave comments, you need to log in

3 answer(s)

M

Melkij, 2018-04-05
@melkij

Take pgq and write to the queue before the commit how to find the data you need. Or send messages via pgq as a trigger immediately when writing to tables.
When exporting, respectively, read messages from pgq. Everything.
Or write or take some ready-made logical decoding. You will be able to read in general a flow of replication
In the general case in general queue is necessary all the same. From the transaction to the queue tables, write the labels of what data has changed, the reader reads this label and receives all the changes, regardless of the commit time.

A

Alexey Sundukov, 2018-04-05
@alekciy

Look towards NOTIFY/LISTEN: https://postgrespro.ru/docs/postgrespro/10/sql-notify

Y

younghacker, 2018-04-06
@younghacker

Offhand, I would make a small table "replication queue" and trigger data from other tables into it.
ID - records primary key
TimeStamp - time of making changes (maybe redundant, it is not clear whether records already changed will be changed)
TableID - identifier of the change source table
RecordID - record identifier from the source table
md5sum - checksum of the modified record (all fields) optional.
On this plate, triggers generate events.
A change has been made - generate the NEW_WORK_FOR_REPLICATOR_ event! ;)
The event launches a handler (replicator) that removes all data from this table. And replicates all records in a row from the tables that are registered here. Upon successful export, the entry in the table is deleted. If the rate of changes in the database is small, then the table will be small. Most of the time it will be empty.
It turns out that the transaction upon successful completion will create a record in this auxiliary table.
The entry will generate an event. The replicator will run to the event, export the source record, and if everything is OK, it will add the record in this table. The number of records in it will return to normal. For obsolete records in the replicator, you can make a garbage collector that uses a checksum that will allow you to check whether this record exists in the target table during export if at some point before the record was deleted, the source database suddenly stopped or the replicator hung or was killed (in life there are always place oomkill). In this case, the garbage collector will make an entry in the logbook for reporting. :)
You can also do without events. An empty table is selected quickly so you can monitor frequently. And if there are records in the table, then there is work for the replicator. Then you don't have to write logic to ignore events while the replicator is running.
Transactions - ReadCommited .