How to compute only the new rows in pandas instead of the whole table?

Kirill Petrov, 2020-05-29 14:19:44

Python

Let's say we have a table:

import pandas as pd
data = {
    'one': list(range(2, 15, 4)),
    'two': list(range(4, 16, 3))
}
df = pd.DataFrame(data)
df

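There is a data-processing function:

def compute(df):
    # 3-row rolling mean of 'one'; min_periods=1 allows partial windows at the start
    df['rol1'] = df.rolling(3, min_periods=1).one.mean()
    # 3-row rolling median (0.5 quantile) of 'two'
    df['rol2'] = df.rolling(3, min_periods=1).two.quantile(0.5)
    # 2-row rolling minimum of the 'rol2' column computed just above
    df['rol3'] = df.rolling(2, min_periods=1).rol2.min()
    return df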

df = compute(df)
df

As a result, we get:

-	one	two	rol1	rol2	rol3
0	2	4	2.0	4.0	4.0
1	6	7	4.0	5.5	4.0
2	10	10	6.0	7.0	5.5
3	14	13	10.0	10.0	7.0

Great. Now suppose a new row is added, for example:

newData = {'one': 13, 'two': 6}
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
df = pd.concat([df, pd.DataFrame([newData])], ignore_index=True)
df

As a result, the row is added, and its computed columns contain NaN:

-	one	two	rol1	rol2	rol3
0	2	4	2.0	4.0	4.0
1	6	7	4.0	5.5	4.0
2	10	10	6.0	7.0	5.5
3	14	13	10.0	10.0	7.0
4	13	6	NaN	NaN	NaN

Now, how do I tell pandas to compute the values for the last row only? If I call df = compute(df) again, it recalculates the entire table. With large data this takes a long time, and I would like to process the data in real time.
One option is to write a copy of the function that uses tail instead of rolling, but I don't want to copy-paste the same logic. The real algorithm in my program is complicated, and I don't want to duplicate it either.
Thanks in advance for your reply!

2 answer(s)

Sergey Ilyin, 2020-05-29
@sunsexsurf

Pandas can't do that out of the box.
I would add a flag column holding 1/0: the flag is 1 for "old" (already processed) rows, and new data gets 0.
Then you only need to filter on that column to find the rows to compute,
and after the recalculation change their flag to 1.
UPD: I'm not considering the option of storing a copy and comparing the modified table against it, because of your "heavy" calculations (sometimes a copy simply does not fit in memory).
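
A minimal sketch of the flag idea, under a few assumptions that are not in the answer: the flag column is called done, compute_flagged is a hypothetical helper, the index is the default RangeIndex, and unprocessed rows always sit at the tail. Filtering on the flag alone would lose the rolling context, so the sketch also passes LOOKBACK earlier rows to compute; the value 3 is derived from the window sizes in the question, (3 - 1) + (2 - 1), and would have to be recalculated for a different algorithm.

```python
import pandas as pd

# Assumption: earlier rows needed by the nested windows in compute()
# (3-row windows, then a 2-row window over rol2): (3 - 1) + (2 - 1) = 3.
LOOKBACK = 3
COMPUTED = ['rol1', 'rol2', 'rol3']

def compute(df):
    # Unchanged logic from the question.
    df['rol1'] = df.rolling(3, min_periods=1).one.mean()
    df['rol2'] = df.rolling(3, min_periods=1).two.quantile(0.5)
    df['rol3'] = df.rolling(2, min_periods=1).rol2.min()
    return df

def compute_flagged(df):
    """Fill in rows whose 'done' flag is 0, then set the flag to 1.

    Hypothetical helper: assumes a default RangeIndex and that the
    unprocessed rows are at the tail of the frame.
    """
    pending = df.index[df['done'] == 0]
    if pending.empty:
        return df
    start = max(pending[0] - LOOKBACK, 0)
    # Run the same compute() on the pending rows plus their context only.
    chunk = compute(df.iloc[start:].copy())
    # Copy back the pending rows; context rows in the chunk are discarded.
    df.loc[pending, COMPUTED] = chunk.loc[pending, COMPUTED].to_numpy()
    df.loc[pending, 'done'] = 1
    return df

df = compute(pd.DataFrame({'one': [2, 6, 10, 14], 'two': [4, 7, 10, 13]}))
df['done'] = 1  # everything computed so far counts as "old"

new_row = pd.DataFrame([{'one': 13, 'two': 6, 'done': 0}])
df = pd.concat([df, new_row], ignore_index=True)
df = compute_flagged(df)
```

The context rows are recomputed with partial windows inside the chunk, which is why only the pending rows are written back.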

Sergey Pankov, 2020-05-29
@trapwalker

Keep two dataframes:
The first is the size of the window: when you add a row to its tail, remove one from its head.
The second is cumulative: append the computed rows there. This way you keep the same code for bulk processing of accumulated data and for windowed real-time processing.
But this is also not very efficient, especially with a large window size.
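
A rough sketch of the two-dataframe approach, reusing compute from the question unchanged. The names push, window, results and WINDOW are illustrative, not from the answer, and WINDOW = 4 is an assumption derived from the window sizes in the question (three context rows plus the new one).

```python
import pandas as pd

# Assumption: (3 - 1) + (2 - 1) context rows plus the new row itself.
WINDOW = 4

def compute(df):
    # Unchanged logic from the question.
    df['rol1'] = df.rolling(3, min_periods=1).one.mean()
    df['rol2'] = df.rolling(3, min_periods=1).two.quantile(0.5)
    df['rol3'] = df.rolling(2, min_periods=1).rol2.min()
    return df

def push(window, results, row):
    """Slide the raw-data window by one row and append the computed row to results."""
    window = pd.concat([window, pd.DataFrame([row])], ignore_index=True)
    # Keep only the last WINDOW rows, i.e. drop rows from the head.
    window = window.tail(WINDOW).reset_index(drop=True)
    # Same function as in bulk mode, but applied to WINDOW rows only.
    latest = compute(window.copy()).tail(1)
    results = pd.concat([results, latest], ignore_index=True)
    return window, results

history = pd.DataFrame({'one': [2, 6, 10, 14], 'two': [4, 7, 10, 13]})
results = compute(history.copy())  # bulk processing of the accumulated data
window = history.tail(WINDOW)      # raw rows only, no computed columns

window, results = push(window, results, {'one': 13, 'two': 6})
```

Each push still rebuilds small frames and recomputes WINDOW rows, which is the inefficiency mentioned above.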
The second option is to implement your own window function. Roughly speaking, it is a stateful object into which data is pushed row by row and from which the results are retrieved accordingly. The object keeps the window buffer internally, computes everything itself without pandas, and evicts rows that are no longer needed from the buffer.
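
As an illustration of such a stateful object, here is a small sketch that mirrors the three rolling columns from the question using only the standard library. The class name StreamingCompute and its push method are hypothetical. It assumes there are no NaN values in the input, and it does duplicate the logic of compute, which is the trade-off of this option.

```python
from collections import deque
from statistics import median

class StreamingCompute:
    """Row-by-row equivalent of the question's compute(), without pandas."""

    def __init__(self):
        # deque(maxlen=n) evicts the oldest item automatically on append.
        self.one = deque(maxlen=3)   # buffer for the 3-row mean
        self.two = deque(maxlen=3)   # buffer for the 3-row median
        self.rol2 = deque(maxlen=2)  # buffer for the 2-row min over rol2

    def push(self, one, two):
        self.one.append(one)
        self.two.append(two)
        rol1 = sum(self.one) / len(self.one)
        # For a window, the 0.5 quantile with linear interpolation is the median.
        rol2 = median(self.two)
        self.rol2.append(rol2)
        rol3 = min(self.rol2)
        return {'one': one, 'two': two, 'rol1': rol1, 'rol2': rol2, 'rol3': rol3}

stream = StreamingCompute()
data = [(2, 4), (6, 7), (10, 10), (14, 13), (13, 6)]
rows = [stream.push(one, two) for one, two in data]
```

Partial buffers at the start play the role of min_periods=1, so the first rows match the pandas output as well. Each push here costs time proportional to the window size; for large windows the sum could be maintained incrementally instead.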