How to count only new data in pandas and not the whole table?
Let's say we have a table:
import pandas as pd
data = {
'one': list(range(2, 15, 4)),
'two': list(range(4, 16, 3))
}
df = pd.DataFrame(data)
df
   one  two
0    2    4
1    6    7
2   10   10
3   14   13
def compute(df):
    df['rol1'] = df.rolling(3, min_periods=1).one.mean()
    df['rol2'] = df.rolling(3, min_periods=1).two.quantile(0.5)
    df['rol3'] = df.rolling(2, min_periods=1).rol2.min()
    return df
df = compute(df)
df
   one  two  rol1  rol2  rol3
0    2    4   2.0   4.0   4.0
1    6    7   4.0   5.5   4.0
2   10   10   6.0   7.0   5.5
3   14   13  10.0  10.0   7.0
newData = {'one': 13, 'two': 6}
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, pd.DataFrame([newData])], ignore_index=True)
df
   one  two  rol1  rol2  rol3
0    2    4   2.0   4.0   4.0
1    6    7   4.0   5.5   4.0
2   10   10   6.0   7.0   5.5
3   14   13  10.0  10.0   7.0
4   13    6   NaN   NaN   NaN
df = compute(df)
If I run it again, it will recalculate the entire table. With big data that takes quite a lot of time, and I would like to work with the data in real time.
Pandas doesn't know how to do that.
I would add a flag column (1/0): set the flag to 1 for the "old" rows that have already been calculated, and leave it at 0 for the new rows.
Then filter on that column so that the function processes only the rows with flag 0,
and after the recalculation change their flag to 1.
UPD: the option of storing a copy and comparing the modified table against it is not considered because of your "heavy" calculations (sometimes the copy really does not fit in memory).
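A minimal sketch of the flag idea, under some assumptions: the flag column is called done (a name made up for this example), new rows are always appended at the end, and a few already-processed rows are recomputed as context, because a rolling window cannot be evaluated on the new rows in isolation. LOOKBACK = 3 is derived from the window sizes in the question and would have to change with them.

```python
import pandas as pd

COLS = ['rol1', 'rol2', 'rol3']
# Assumption: rol2 looks back 2 rows and rol3 looks back 1 more row of rol2,
# so 3 already-processed rows give enough context for the first new row.
LOOKBACK = 3

def compute(df):
    # Same calculations as in the question, but on a copy of the input.
    df = df.copy()
    df['rol1'] = df['one'].rolling(3, min_periods=1).mean()
    df['rol2'] = df['two'].rolling(3, min_periods=1).quantile(0.5)
    df['rol3'] = df['rol2'].rolling(2, min_periods=1).min()
    return df

def update_new_rows(df):
    # 'done' is a hypothetical flag column: 1 = already calculated, 0 = new.
    new_mask = (df['done'] == 0).to_numpy()
    if not new_mask.any():
        return df
    first_new = int(new_mask.argmax())       # position of the first new row
    start = max(first_new - LOOKBACK, 0)     # include context rows before it
    tail = compute(df.iloc[start:][['one', 'two']])
    # Write results back only for the new rows, then mark them as processed.
    df.loc[new_mask, COLS] = tail.loc[new_mask[start:], COLS].to_numpy()
    df.loc[new_mask, 'done'] = 1
    return df

df = compute(pd.DataFrame({'one': [2, 6, 10, 14], 'two': [4, 7, 10, 13]}))
df['done'] = 1
new_row = pd.DataFrame([{'one': 13, 'two': 6, 'done': 0}])
df = pd.concat([df, new_row], ignore_index=True)
df = update_new_rows(df)
```

Only the last LOOKBACK + 1 rows go through compute here, so the cost of an update no longer depends on the length of the table.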
Keep two dataframes:
The first is the size of the window: when you add a row to the tail, remove one from the head.
The second is cumulative: append the computed rows to it. This way you keep the same code for bulk processing of the accumulated data and for windowed real-time processing.
But this is also not very efficient, especially with a large window size.
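The two-dataframe scheme could look roughly like this. The names push, window and history are invented for the sketch, and WINDOW_ROWS = 4 assumes the window sizes from the question (3 rows of context plus the new row, because rol3 is chained on rol2).

```python
import pandas as pd

WINDOW_ROWS = 4  # assumption: 3 context rows + 1 new row for these window sizes

def compute(df):
    # Same calculations as in the question, but on a copy of the input.
    df = df.copy()
    df['rol1'] = df['one'].rolling(3, min_periods=1).mean()
    df['rol2'] = df['two'].rolling(3, min_periods=1).quantile(0.5)
    df['rol3'] = df['rol2'].rolling(2, min_periods=1).min()
    return df

def push(window, history, row):
    # Add the new row to the tail of the small frame and drop rows from its head.
    window = pd.concat([window, pd.DataFrame([row])], ignore_index=True)
    window = window.tail(WINDOW_ROWS).reset_index(drop=True)
    # Run the unchanged bulk code on the small frame and keep only the last row.
    latest = compute(window).tail(1)
    history = pd.concat([history, latest], ignore_index=True)
    return window, history

raw = pd.DataFrame({'one': [2, 6, 10, 14], 'two': [4, 7, 10, 13]})
history = compute(raw)                                  # bulk pass over old data
window = raw.tail(WINDOW_ROWS).reset_index(drop=True)   # raw rows only

for row in [{'one': 13, 'two': 6}, {'one': 1, 'two': 20}]:
    window, history = push(window, history, row)
```

Appending to history with pd.concat copies the cumulative frame on every call, which is part of why this variant is not very efficient; collecting the result rows in a list and building the frame occasionally would be one way around that.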
The second option is to implement your own window function. Roughly speaking, it is a stateful object into which data is pushed row by row and from which the results are read back. The object keeps the window buffer internally, calculates everything itself without pandas, and evicts rows that are no longer needed from the buffer.
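As an illustration of such a stateful object, here is a small sketch in plain Python. The class name RollingState is invented, the window sizes are hard-coded to those in the question, and statistics.median stands in for quantile(0.5), which gives the same value for these inputs.

```python
from collections import deque
from statistics import fmean, median

class RollingState:
    """Hypothetical stateful replacement for compute(); one push per row."""

    def __init__(self):
        # deque(maxlen=n) drops the oldest value automatically on append.
        self._one = deque(maxlen=3)    # buffer for rol1 (mean over 3 rows)
        self._two = deque(maxlen=3)    # buffer for rol2 (median over 3 rows)
        self._rol2 = deque(maxlen=2)   # buffer for rol3 (min over 2 rol2 values)

    def push(self, one, two):
        self._one.append(one)
        self._two.append(two)
        rol1 = fmean(self._one)
        rol2 = float(median(self._two))
        self._rol2.append(rol2)
        rol3 = min(self._rol2)
        return rol1, rol2, rol3

state = RollingState()
rows = [(2, 4), (6, 7), (10, 10), (14, 13), (13, 6)]
results = [state.push(one, two) for one, two in rows]
```

Each push touches at most three buffered values per column, so the cost per new row is constant regardless of how much data has already been processed. For large windows, recomputing the mean or median from the buffer on every push would itself become the bottleneck, and incremental updates (for example a running sum) would be the next step.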