Python
Alex Shilov @Alex, 2020-09-23 14:35:40

How to quickly iterate through all rows in Python Pandas?

Good day, friends. I have over 3 million rows in a DataFrame. I need to take the first customer, look at all their transactions for each month, put those transactions into another DataFrame while adding some data derived from them, and then go through all the clients in the same way. The 3 million rows cover all operations for all clients.

How I implemented it
I build a DataFrame for a single client and run through all of their operations with itertuples. Then I delete everything related to that client from the DataFrame, take the next one, and repeat. But this algorithm is very slow. Can you suggest what can be done to process the data this way significantly faster?
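A minimal sketch of the approach described above, with hypothetical column names `client` and `amount` (the real columns aren't shown in the question):

```python
import pandas as pd

# Toy stand-in for the 3-million-row DataFrame
df = pd.DataFrame({'client': [1, 1, 2], 'amount': [10.0, 20.0, 5.0]})

results = []
while len(df) > 0:
    client = df['client'].iloc[0]          # take the first client
    part = df[df['client'] == client]      # all of their operations
    for row in part.itertuples():          # slow row-by-row pass
        results.append((row.client, row.amount))
    df = df[df['client'] != client]        # drop that client and repeat
```

Each iteration of the outer loop copies and re-filters the whole DataFrame, which is a large part of why this is slow.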


Thank you in advance!


3 answer(s)
zexer, 2020-09-23
@zexer

Apparently you are creating a single-row DataFrame here, but you could create a regular Series (pd.Series) instead:

datareport = pd.DataFrame({'ODATE': {0:0},
                           'TENOR': {0:0},
                           'VDATE': {0:0},
                           'R': {0:0},
                           'RC': {0:0},
                           'DE': {0:0},
                           'I': {0:0},
                           'DEB': {0:0},
                           'D': {0:0}})

Next, you append another single-row DataFrame to it via pd.concat:
datareport = pd.concat([datareport, pd.DataFrame({'ODATE': {0:row.odate},
                                                          'TENOR': {0:tenor(row.ltt)},
                                                          'VDATE': {0:row.vdate},
                                                          'R': {0:bucket_class_old},
                                                          'RC': {0:buckclass(row.odues)},
                                                          'DE': {0:row.ball},
                                                          'I': {0:0},
                                                          'DEB': {0:ball_old},
                                                          'D': {0:0}})])

and after that you concatenate yet another single-row DataFrame:
datareport = pd.concat([datareport, pd.DataFrame({'ODATE': {0:row.odate},
                                                          'TENOR': {0:tenor(row.ltt)},
                                                          'VDATE': {0:row.vdate},
                                                          'R': {0:bucket_class_old},
                                                          'RC': {0:100},
                                                          'DE': {0:difference},
                                                          'I': {0:0},
                                                          'DEB': {0:ball_old},
                                                          'D': {0:0}})])

You can also use pd.Series instead.
This code completes in an average of 2.5 milliseconds
ser1 = pd.Series({'ODATE': 0,
           'TENOR': 0,
           'VDATE':0,
           'R': 0,
           'RC': 0,
           'DE': 0,
           'I': 0,
           'DEB': 0,
           'D': 0})
ser2 = pd.Series({'ODATE': 0,
           'TENOR': 1,
           'VDATE':2,
           'R': 3,
           'RC': 4,
           'DE': 5,
           'I': 6,
           'DEB': 7,
           'D': 8})
df = pd.concat([ser1, ser2], axis=1).T

While this code completes in an average of 5.2 milliseconds, the result is similar.
datareport1 = pd.DataFrame({'ODATE': {0:0},
                           'TENOR': {0:0},
                           'VDATE': {0:0},
                           'R': {0:0},
                           'RC': {0:0},
                           'DE': {0:0},
                           'I': {0:0},
                           'DEB': {0:0},
                           'D': {0:0}})
datareport2 = pd.DataFrame({'ODATE': {0:1},
                           'TENOR': {0:2},
                           'VDATE': {0:3},
                           'R': {0:4},
                           'RC': {0:5},
                           'DE': {0:6},
                           'I': {0:7},
                           'DEB': {0:8},
                           'D': {0:9}})
df2 = pd.concat([datareport1, datareport2])

So you may be able to gain some speed by moving from single-row DataFrames to Series.
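An even bigger win usually comes from calling pd.concat once at the end rather than once per row, since each concat copies all the data accumulated so far. A sketch of that idea, using dummy values in place of the real row fields:

```python
import pandas as pd

# Collect one Series per row; values here are hypothetical stand-ins
# for row.odate, tenor(row.ltt), etc.
rows = []
for i in range(3):
    rows.append(pd.Series({'ODATE': i, 'TENOR': i + 1, 'VDATE': i + 2,
                           'R': 0, 'RC': 0, 'DE': 0,
                           'I': 0, 'DEB': 0, 'D': 0}))

# One concat at the end instead of one concat per row
df = pd.concat(rows, axis=1).T.reset_index(drop=True)
print(df.shape)  # (3, 9)
```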

Viktor T2, 2020-09-23
@Viktor_T2

You can try a regular database, but keep the data in RAM and index it by client. Google "sqlite in memory".
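A minimal sketch of that suggestion with Python's built-in sqlite3 module; the table layout and values are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # the database lives entirely in RAM
cur = con.cursor()
cur.execute("CREATE TABLE ops (client_id INTEGER, month TEXT, amount REAL)")
cur.execute("CREATE INDEX idx_client ON ops (client_id)")  # index by client
cur.executemany("INSERT INTO ops VALUES (?, ?, ?)",
                [(1, '2020-01', 10.0), (1, '2020-02', 20.0), (2, '2020-01', 5.0)])

# Fetching one client's operations now uses the index instead of a full scan
rows = cur.execute("SELECT month, amount FROM ops WHERE client_id = ? ORDER BY month",
                   (1,)).fetchall()
```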

AlexShilov, 2020-09-23
@AlexShilov

The only optimization that has come to my mind so far is to remove the concatenation operation and instead collect the rows in a plain list:

array_data = []

'''
    The main program code goes here
'''

array_data.append([row['odate'], tenor(row['ltt']), row['vdate'], bucket_class_old, buckclass(row['odues']), row['ball'], 0, ball_old, 0])
array_data.append([row['odate'], tenor(row['ltt']), row['vdate'], bucket_class_old, 100, difference, 0, ball_old, 0])

'''
    The rest of the program code goes here
'''

pandas.DataFrame(array_data)
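A self-contained sketch of this pattern, with hypothetical literal values standing in for `row['odate']`, `tenor(...)`, etc., and the column names from the answer above passed to the constructor:

```python
import pandas as pd

array_data = []
# Stand-ins for the values computed inside the loop
array_data.append(['2020-01-01', 12, '2020-02-01', 1, 2, 100.0, 0, 50.0, 0])
array_data.append(['2020-01-01', 12, '2020-02-01', 1, 100, 25.0, 0, 50.0, 0])

# Build the DataFrame once, after the loop finishes
df = pd.DataFrame(array_data,
                  columns=['ODATE', 'TENOR', 'VDATE', 'R', 'RC',
                           'DE', 'I', 'DEB', 'D'])
```

Appending to a Python list is O(1), so the whole pass avoids the repeated copying that per-row pd.concat incurs.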
