How to store historical data in the database and remove duplicates?
There are 500 million sensors (the project is under NDA, so I have changed the names of things slightly, but the essence is the same). Each sensor has a parameter called "current reading". Readings change rarely, roughly once every 1-2 months.
Once every 3-4 days, reading data arrives as a huge CSV file that looks like this:
sensor_id, timestamp, value
We need to keep a history of changes to the readings.
If we store everything as-is, there will be many duplicates that differ only in date. I want to store only the rows whose value differs from the previous one. That is, if data like this arrives:
sensor_N1,2021-02-01,100
sensor_N1,2021-02-21,115
sensor_N1,2021-03-01,115
sensor_N1,2021-03-11,100
then I would like to keep only these lines:
sensor_N1,2021-02-01,100
sensor_N1,2021-02-21,115
sensor_N1,2021-03-11,100
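The rule in this example (keep a row only when its value differs from the previously kept value for the same sensor) can be sketched in a few lines of Python; the function name and tuple layout below are mine, not part of the question:

```python
def drop_consecutive_duplicates(rows):
    """Keep only the rows whose value differs from the previously
    kept value for the same sensor."""
    last_value = {}   # sensor_id -> last kept value
    kept = []
    for sensor_id, date, value in rows:
        if last_value.get(sensor_id) != value:
            kept.append((sensor_id, date, value))
            last_value[sensor_id] = value
    return kept

rows = [
    ("sensor_N1", "2021-02-01", 100),
    ("sensor_N1", "2021-02-21", 115),
    ("sensor_N1", "2021-03-01", 115),  # same value as the previous row: dropped
    ("sensor_N1", "2021-03-11", 100),  # value changed back: kept
]
```

Note that the final row is kept even though the value 100 appeared earlier: only consecutive repeats are dropped, so a reading that changes back is still recorded.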
Which way should I look to implement this?
So far I have come up with these options:
1) Look up the previous row on every insert, but that is terribly slow (500 million sensors)
2) Before importing, export the last record for every sensor, use scripts to pick out the data that has changed, and load only the changed data
3) Something else?
We are willing to consider any database that solves this problem.
You can store two dates for each sensor in an operational table: the start date and the end date of its constancy interval, i.e. the interval during which the sensor's value does not change.
That is, we load the current data and, if a sensor's value has not changed, we simply advance the end date of its interval to the date of the current load. If the sensor's value has changed (differs from the one stored in the operational table), the current interval is moved to an archive table, and in the operational table we add a new record (or change the existing one) for that sensor, whose start and end dates both equal the date of the current load, and whose value is the value from the current load.
This way the number of records in the operational table always equals the number of sensors, while the historical table holds the entire range of previous values.
This should protect the operational table from degrading over time as the number of records grows.
If you need to build a report or an export for a period or for past dates, you will have to work with the historical table, which will not be very fast, but such operations usually do not need to be performed often.
Since data is loaded only every 3-4 days, it is probably not critical that the load is not instantaneous. It is probably easiest to do it in several stages: on the first pass, build the list of sensor ids whose value has not changed; then, using this list, update the end date of the interval to the date of the current load; then move the rows for sensors not in the list into the archive table; and finally, for the changed sensors, update the value and the interval start and end dates in the operational table.
All steps can be done in bulk.
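The staged load described above can be modelled in miniature with SQLite; all table and column names here are illustrative assumptions, not something specified in the answer:

```python
import sqlite3

# Toy model of the operational/archive scheme; all names are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE operational (
        sensor_id  TEXT PRIMARY KEY,
        start_date TEXT,   -- start of the constancy interval
        end_date   TEXT,   -- end of the constancy interval
        value      REAL
    );
    CREATE TABLE archive (sensor_id TEXT, start_date TEXT, end_date TEXT, value REAL);
    CREATE TABLE load (sensor_id TEXT, value REAL);  -- staging for one CSV batch
""")

def apply_load(con, load_date, batch):
    con.executemany("INSERT INTO load VALUES (?, ?)", batch)
    # 1) unchanged sensors: just extend the interval's end date
    con.execute("""
        UPDATE operational SET end_date = :d
        WHERE value = (SELECT value FROM load
                       WHERE load.sensor_id = operational.sensor_id)
    """, {"d": load_date})
    # 2) changed sensors: move the finished interval into the archive
    con.execute("""
        INSERT INTO archive
        SELECT o.sensor_id, o.start_date, o.end_date, o.value
        FROM operational o JOIN load l ON l.sensor_id = o.sensor_id
        WHERE l.value <> o.value
    """)
    # 3) changed and brand-new sensors: start a fresh interval
    con.execute("""
        INSERT INTO operational
        SELECT sensor_id, :d, :d, value FROM load WHERE true
        ON CONFLICT(sensor_id) DO UPDATE SET
            start_date = excluded.start_date,
            end_date   = excluded.end_date,
            value      = excluded.value
        WHERE operational.value <> excluded.value
    """, {"d": load_date})
    con.execute("DELETE FROM load")
```

Each statement is a bulk set operation, so a real implementation on a server database could follow the same shape, just with the staging table filled by a bulk CSV import.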
If we store everything as-is, there will be many duplicates that differ only in date. I want to store only the rows whose value differs from the previous one
And what is the difficulty? A huge CSV file arrives: take it, strip out the lines with unchanged readings (keeping only the first of each run), and then insert the result into the database.
Dumping a pile of unnecessary data into the database only to clean it out later is counterproductive. The simplest script that runs through all the lines can handle this.
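Such a cleaning script might look like the minimal Python sketch below. It assumes the last stored value per sensor has been exported from the database into a dict beforehand (the question's option 2); all names are illustrative:

```python
import csv
import io

def changed_rows(csv_file, last_known):
    """Stream the incoming CSV and yield only the rows whose value differs
    from the last known value for that sensor. `last_known` maps
    sensor_id -> last stored value, e.g. exported from the database."""
    for sensor_id, timestamp, value in csv.reader(csv_file):
        if last_known.get(sensor_id) != value:
            last_known[sensor_id] = value
            yield sensor_id, timestamp, value

# A tiny in-memory stand-in for the huge CSV file
data = io.StringIO(
    "sensor_N1,2021-02-01,100\n"
    "sensor_N1,2021-03-11,100\n"   # same value as before: will be dropped
    "sensor_N2,2021-03-11,42\n"
)
```

Because it is a generator reading line by line, memory use stays flat regardless of file size; only the `last_known` dict (one entry per sensor) has to fit in memory.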
1) Look up the previous row on every insert, but that is terribly slow (500 million sensors)
3) Something else?
I will offer my own variant of a handler via a temporary table (PostgreSQL syntax). I have not tested how fast it is, but it can be optimized:
WITH t AS (
    -- rows of the incoming batch, numbered per sensor in time order
    SELECT *, row_number() OVER (PARTITION BY id ORDER BY created_at) AS rn
    FROM temporary_data
),
d AS (
    -- the latest stored row per sensor
    SELECT DISTINCT ON (id) * FROM current_data ORDER BY id, created_at DESC
)
SELECT t.*
FROM t
LEFT JOIN t t2 ON t.id = t2.id AND t.rn = t2.rn + 1  -- previous row within the batch
LEFT JOIN d ON t.rn = 1 AND t.id = d.id              -- stored value for the first batch row
WHERE t.value IS DISTINCT FROM coalesce(t2.value, d.value);  -- NULL-safe, so brand-new sensors are kept too