Python
LakeForest, 2021-12-15 01:37:35

How can I read files in Python with multiple threads? How can I speed up the following code?

The text column contains paths to files with text rather than the text itself. I want to write all the texts back into the CSV, but there are a lot of files. The code has been running for over a day and is still only at 70%; the progress bar says about 12 more hours to go. I really want to speed it up, but I have no idea what can be done about it.

import pandas as pd
from tqdm import tqdm

c = 0
for i, row in tqdm(df_new.iterrows(), total=df_new.shape[0]):
    # Skip rows whose path does not point into the texts directory
    if "texts" not in row.text:
        continue
    c += 1
    with open("../../" + row.text, "r") as f:
        # Only the first line of each file is kept
        text = f.readline()
        df_new.loc[i, "text"] = text.replace("\n", "")
        # Checkpoint the whole DataFrame every 10,000 files
        if c % 10000 == 0:
            df_new.to_csv("df_all.csv", index=False)
df_new.to_csv("df_all.csv", index=False)


1 answer(s)
Vindicar, 2021-12-15

Multiple threads are unlikely to help here because of the nature of Python (the GIL).
Multiple processes might (the multiprocessing module).
But first of all, you should make sure the bottleneck is the CPU and not disk throughput.
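A rough sketch of the multiprocessing approach, assuming df_new holds file paths in its text column. The demo below builds a tiny stand-in DataFrame with temporary files; the helper name read_first_line and the sample data are made up for illustration, and the "../../" prefix from the question is dropped because the demo paths are absolute.

import os
import tempfile
from multiprocessing import Pool

import pandas as pd

def read_first_line(path):
    # Worker: read the first line of one file, mirroring the loop body
    # from the question. Rows whose path does not contain "texts" are
    # passed through unchanged.
    if "texts" not in path:
        return path
    with open(path, "r") as f:
        return f.readline().replace("\n", "")

if __name__ == "__main__":
    # Tiny self-contained demo: two files standing in for the real corpus.
    tmp = tempfile.mkdtemp()
    paths = []
    for i in range(2):
        p = os.path.join(tmp, f"texts_{i}.txt")
        with open(p, "w") as f:
            f.write(f"line {i}\nrest is ignored\n")
        paths.append(p)
    df_new = pd.DataFrame({"text": paths})

    with Pool() as pool:  # one worker process per CPU core by default
        df_new["text"] = pool.map(read_first_line, df_new["text"])
    df_new.to_csv("df_all.csv", index=False)

If the profiling shows the job is disk-bound rather than CPU-bound, extra processes will not help much; in that case reducing per-file overhead (or faster storage) matters more than parallelism.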
