How to write SQL similar to pandas resample function?

SQL

Kirill Petrov, 2020-08-02 08:39:55

Greetings! Let's say we have a table:

t                    v
2020-08-02 08:01:21  3
2020-08-02 08:01:26  4
2020-08-02 08:01:32  1
2020-08-02 08:02:02  6
2020-08-02 08:04:09  2

Code for generating such a table in Python

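import pandas as pd

# Sample data: timestamps (t) and values (v)
data = {
    't': ['2020-08-02 08:01:21', '2020-08-02 08:01:26', '2020-08-02 08:01:32', '2020-08-02 08:02:02', '2020-08-02 08:04:09'],
    'v': [3, 4, 1, 6, 2]
}
df = pd.DataFrame(data)
df['t'] = pd.to_datetime(df['t'])  # parse the strings into datetimes
df.set_index('t', inplace=True)    # resample() needs a DatetimeIndex
df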

Now this data can be resampled like this:

More Python code

dfResampled = df.resample('1min')  # 1-minute buckets; the 'T' alias is deprecated in recent pandas
pd.DataFrame({
    'min': dfResampled.v.min(),
    'max': dfResampled.v.max(),
    'last': dfResampled.v.last(),
    'first': dfResampled.v.first()
})

And get the result:

t                    min  max  last  first
2020-08-02 08:01:00  1.0  4.0  1.0   3.0
2020-08-02 08:02:00  6.0  6.0  6.0   6.0
2020-08-02 08:03:00  NaN  NaN  NaN   NaN
2020-08-02 08:04:00  2.0  2.0  2.0   2.0

I have a lot of rows, and if I do this processing on the backend, a lot of data has to be pulled from the database, which causes major delays.

/* Affected rows: 0  Found rows: 146,364  Warnings: 0  Duration for 1 query: 0.016 sec. (+ 2.250 sec. network) */

I tried to write the following query:

SELECT DATE_FORMAT(`date`, '%Y-%m-%d %H:%i:00') AS `t`,
       MIN(`v`) AS `min`,
       MAX(`v`) AS `max`
FROM `table`
GROUP BY DATE_FORMAT(`date`, '%Y-%m-%d %H:%i:00')

But this result has no row for 2020-08-02 08:03:00, since there was no data in that minute, and I don't know how to get the first and last values. Also, the resampling interval won't necessarily be 1 minute; it could be 3 minutes, 5 minutes, etc.
In short, I'd appreciate help writing this SQL query correctly. I use MariaDB, but I can switch to any other open-source database.
Thanks in advance!

1 answer(s)

dass45, 2020-08-11
@Recosh

Start the query from a series of minutes and join your table to it; then the minutes with no data will also appear in the result. Such a series is often kept in a separate table or generated by a procedure with a loop.
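
To illustrate the idea, here is a minimal sketch that builds the minute series with a recursive CTE instead of a separate table or a loop, and LEFT JOINs the data to it. It runs against an in-memory SQLite database through Python's sqlite3 module only so that it is self-contained. The table name samples, the CTE names bounds and minutes, and the SQLite date functions are all assumptions for the example; on MariaDB the date arithmetic would have to be written with the server's own functions.

```python
import sqlite3

# SQLite is only a stand-in; the table name "samples" is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (t TEXT, v INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [
        ("2020-08-02 08:01:21", 3),
        ("2020-08-02 08:01:26", 4),
        ("2020-08-02 08:01:32", 1),
        ("2020-08-02 08:02:02", 6),
        ("2020-08-02 08:04:09", 2),
    ],
)

query = """
WITH RECURSIVE
bounds AS (
    -- first and last minute that contain data
    SELECT strftime('%Y-%m-%d %H:%M:00', MIN(t)) AS lo,
           strftime('%Y-%m-%d %H:%M:00', MAX(t)) AS hi
    FROM samples
),
minutes(m) AS (
    -- one row per minute between lo and hi, gaps included
    SELECT lo FROM bounds
    UNION ALL
    SELECT datetime(m, '+1 minute') FROM minutes, bounds WHERE m < hi
)
SELECT minutes.m, MIN(s.v), MAX(s.v)
FROM minutes
LEFT JOIN samples AS s
       ON strftime('%Y-%m-%d %H:%M:00', s.t) = minutes.m
GROUP BY minutes.m
ORDER BY minutes.m
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The minute 08:03:00 comes back with NULL (None in Python) for both aggregates, which corresponds to the NaN row in the pandas output.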
When selecting the minutes you can also add a condition for the resampling step (take every minute, every 3rd, every 5th, etc.), for example using the remainder of a division.
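
One possible way to use the remainder, sketched below, is slightly different from filtering a list of minutes: each timestamp is floored to the start of its N-minute bucket by subtracting the remainder of its Unix time divided by the step. Again this is SQLite via Python's sqlite3 purely for a runnable example; the table name samples and the :step parameter are made up for it. MariaDB has UNIX_TIMESTAMP() and FROM_UNIXTIME() for the same conversions, though there the result depends on the session time zone.

```python
import sqlite3

# SQLite is only a stand-in; the table name "samples" is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (t TEXT, v INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [
        ("2020-08-02 08:01:21", 3),
        ("2020-08-02 08:01:26", 4),
        ("2020-08-02 08:01:32", 1),
        ("2020-08-02 08:02:02", 6),
        ("2020-08-02 08:04:09", 2),
    ],
)

query = """
SELECT datetime(bucket, 'unixepoch') AS t, MIN(v), MAX(v)
FROM (
    -- floor the Unix time to a multiple of :step seconds
    SELECT CAST(strftime('%s', t) AS INTEGER)
           - CAST(strftime('%s', t) AS INTEGER) % :step AS bucket,
           v
    FROM samples
)
GROUP BY bucket
ORDER BY bucket
"""
# 3-minute buckets; pass 5 * 60 for 5-minute buckets, and so on
rows = conn.execute(query, {"step": 3 * 60}).fetchall()
for row in rows:
    print(row)
```

With a 3-minute step the sample data falls into two buckets, 08:00:00 and 08:03:00. This sketch does not fill empty buckets; for that, the same step would also be used when generating the series of bucket start times to join against.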
The first and last values come from an additional join to the same table: in a subquery, select the minimum/maximum time grouped by minute, then join back to the main table on that time to get the value at that moment.
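
A minimal sketch of that join, once more using SQLite via Python's sqlite3 as a stand-in with a hypothetical table samples. It assumes timestamps are unique; with duplicate timestamps the join would return several rows per minute. Databases with window functions (MariaDB has them since 10.2) could probably express this with FIRST_VALUE/LAST_VALUE instead, but the version below sticks to the self-join described above.

```python
import sqlite3

# SQLite is only a stand-in; the table name "samples" is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (t TEXT, v INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [
        ("2020-08-02 08:01:21", 3),
        ("2020-08-02 08:01:26", 4),
        ("2020-08-02 08:01:32", 1),
        ("2020-08-02 08:02:02", 6),
        ("2020-08-02 08:04:09", 2),
    ],
)

query = """
WITH per_minute AS (
    -- per minute: earliest and latest timestamp plus min/max value
    SELECT strftime('%Y-%m-%d %H:%M:00', t) AS m,
           MIN(t) AS first_t,
           MAX(t) AS last_t,
           MIN(v) AS min_v,
           MAX(v) AS max_v
    FROM samples
    GROUP BY m
)
SELECT p.m, p.min_v, p.max_v, l.v AS last_v, f.v AS first_v
FROM per_minute AS p
JOIN samples AS f ON f.t = p.first_t   -- value at the earliest timestamp
JOIN samples AS l ON l.t = p.last_t    -- value at the latest timestamp
ORDER BY p.m
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The columns come out in the same order as the pandas table (min, max, last, first). To also get the empty minutes, per_minute would be LEFT JOINed to a generated series of minutes.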
P.S. Edited the answer because I missed one of the questions in the text.