How to store historical data in the database and remove duplicates?
There are 500 million sensors (the project is under NDA, so I have changed the names of things slightly, but the essence is the same). Each sensor has a parameter called "current reading". Readings change rarely, roughly once every 1-2 months.
Once every 3-4 days, reading data arrives as a huge CSV file that looks like this:
sensor_id, timestamp, value
We need to keep a history of changes to the readings.
If we store everything as-is, there will be many duplicates that differ only in date. I want to store only the rows whose value differs from the previous one. That is, if data like this arrives:
sensor_N1,2021-02-01,100
sensor_N1,2021-02-21,115
sensor_N1,2021-03-01,115
sensor_N1,2021-03-11,100
then I would like to keep only these lines:
sensor_N1,2021-02-01,100
sensor_N1,2021-02-21,115
sensor_N1,2021-03-11,100
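The rule in this example (keep a row only when its value differs from the previously kept value for the same sensor) can be sketched in a few lines of Python; the function name and tuple layout below are mine, not part of the question:

```python
def drop_consecutive_duplicates(rows):
    """Keep only the rows whose value differs from the previously
    kept value for the same sensor."""
    last_value = {}   # sensor_id -> last kept value
    kept = []
    for sensor_id, date, value in rows:
        if last_value.get(sensor_id) != value:
            kept.append((sensor_id, date, value))
            last_value[sensor_id] = value
    return kept

rows = [
    ("sensor_N1", "2021-02-01", 100),
    ("sensor_N1", "2021-02-21", 115),
    ("sensor_N1", "2021-03-01", 115),  # same value as the previous row: dropped
    ("sensor_N1", "2021-03-11", 100),  # value changed back: kept
]
```

Note that the final row is kept even though the value 100 appeared earlier: only consecutive repeats are dropped, so a reading that changes back is still recorded.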
Which way should I look to implement this?
So far I have come up with these options:
1) Look up the previous row on every insert, but that is terribly slow (500 million sensors)
2) Before importing, export the last record for every sensor, use scripts to pick out the data that has changed, and load only the changed data
3) Something else?
We are willing to consider any database that solves this problem.
You can store two dates for each sensor in an operational table: the start date and the end date of its constancy interval, i.e. the interval during which the sensor's value does not change.
That is, we load the current data and, if a sensor's value has not changed, we simply advance the end date of its interval to the date of the current load. If the sensor's value has changed (differs from the one stored in the operational table), the current interval is moved to an archive table, and in the operational table we add a new record (or change the existing one) for that sensor, whose start and end dates both equal the date of the current load, and whose value is the value from the current load.
This way the number of records in the operational table always equals the number of sensors, while the historical table holds the entire range of previous values.
This should protect the operational table from degrading over time as the number of records grows.
If you need to build a report or an export for a period or for past dates, you will have to work with the historical table, which will not be very fast, but such operations usually do not need to be performed often.
Since data is loaded only every 3-4 days, it is probably not critical that the load is not instantaneous. It is probably easiest to do it in several stages: on the first pass, build the list of sensor ids whose value has not changed; then, using this list, update the end date of the interval to the date of the current load; then move the rows for sensors not in the list into the archive table; and finally, for the changed sensors, update the value and the interval start and end dates in the operational table.
All steps can be done in bulk.
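The staged load described above can be modelled in miniature with SQLite; all table and column names here are illustrative assumptions, not something specified in the answer:

```python
import sqlite3

# Toy model of the operational/archive scheme; all names are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE operational (
        sensor_id  TEXT PRIMARY KEY,
        start_date TEXT,   -- start of the constancy interval
        end_date   TEXT,   -- end of the constancy interval
        value      REAL
    );
    CREATE TABLE archive (sensor_id TEXT, start_date TEXT, end_date TEXT, value REAL);
    CREATE TABLE load (sensor_id TEXT, value REAL);  -- staging for one CSV batch
""")

def apply_load(con, load_date, batch):
    con.executemany("INSERT INTO load VALUES (?, ?)", batch)
    # 1) unchanged sensors: just extend the interval's end date
    con.execute("""
        UPDATE operational SET end_date = :d
        WHERE value = (SELECT value FROM load
                       WHERE load.sensor_id = operational.sensor_id)
    """, {"d": load_date})
    # 2) changed sensors: move the finished interval into the archive
    con.execute("""
        INSERT INTO archive
        SELECT o.sensor_id, o.start_date, o.end_date, o.value
        FROM operational o JOIN load l ON l.sensor_id = o.sensor_id
        WHERE l.value <> o.value
    """)
    # 3) changed and brand-new sensors: start a fresh interval
    con.execute("""
        INSERT INTO operational
        SELECT sensor_id, :d, :d, value FROM load WHERE true
        ON CONFLICT(sensor_id) DO UPDATE SET
            start_date = excluded.start_date,
            end_date   = excluded.end_date,
            value      = excluded.value
        WHERE operational.value <> excluded.value
    """, {"d": load_date})
    con.execute("DELETE FROM load")
```

Each statement is a bulk set operation, so a real implementation on a server database could follow the same shape, just with the staging table filled by a bulk CSV import.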
If we store everything as-is, there will be many duplicates that differ only in date. I want to store only the rows whose value differs from the previous one
And what is the difficulty? A huge CSV file arrives: take it, strip out the lines with unchanged readings (keeping only the first of each run), and then insert the result into the database.
Dumping a pile of unnecessary data into the database only to clean it out later is counterproductive. The simplest script that runs through all the lines can handle this.
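Such a cleaning script might look like the minimal Python sketch below. It assumes the last stored value per sensor has been exported from the database into a dict beforehand (the question's option 2); all names are illustrative:

```python
import csv
import io

def changed_rows(csv_file, last_known):
    """Stream the incoming CSV and yield only the rows whose value differs
    from the last known value for that sensor. `last_known` maps
    sensor_id -> last stored value, e.g. exported from the database."""
    for sensor_id, timestamp, value in csv.reader(csv_file):
        if last_known.get(sensor_id) != value:
            last_known[sensor_id] = value
            yield sensor_id, timestamp, value

# A tiny in-memory stand-in for the huge CSV file
data = io.StringIO(
    "sensor_N1,2021-02-01,100\n"
    "sensor_N1,2021-03-11,100\n"   # same value as before: will be dropped
    "sensor_N2,2021-03-11,42\n"
)
```

Because it is a generator reading line by line, memory use stays flat regardless of file size; only the `last_known` dict (one entry per sensor) has to fit in memory.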
1) Look up the previous row on every insert, but that is terribly slow (500 million sensors)
3) Something else?
I will offer my own variant of a handler via a temporary table (PostgreSQL syntax). I have not tested how fast it is, but it can be optimized:
WITH t AS (
    -- rows of the incoming batch, numbered per sensor in time order
    SELECT *, row_number() OVER (PARTITION BY id ORDER BY created_at) AS rn
    FROM temporary_data
),
d AS (
    -- the latest stored row per sensor
    SELECT DISTINCT ON (id) * FROM current_data ORDER BY id, created_at DESC
)
SELECT t.*
FROM t
LEFT JOIN t t2 ON t.id = t2.id AND t.rn = t2.rn + 1  -- previous row within the batch
LEFT JOIN d ON t.rn = 1 AND t.id = d.id              -- stored value for the first batch row
WHERE t.value IS DISTINCT FROM coalesce(t2.value, d.value);  -- NULL-safe, so brand-new sensors are kept too