How to distribute arrays of data between threads?
Hello! The task seems simple, but I do not understand how to implement it.
There are two arrays of 50k elements each. I need to build a third array containing the non-repeating elements from those two arrays; in other words, remove the duplicates. The script is written, but it runs in a single thread and takes a very long time. How do I distribute all this data between threads?
Here is my code:
import pandas as pd
import json
from time import sleep
from threading import Thread

def get_data(file_name):
    # note: the keyword is sheet_name, not sheet_names
    df = pd.read_excel(file_name, sheet_name=0)
    data = []
    for item in df.to_records(index=False):
        data.append(item[0])
    return data

if __name__ == '__main__':
    test1 = get_data('t1.xlsx')
    test2 = get_data('t3.xlsx')
    result = []
    # with open('result.json', 'w') as file_json:
    for i, ii in enumerate(test1):
        for j, jj in enumerate(test2):
            print(i, j)
            if i == j:
                continue
            if ii.strip().lower() not in jj.strip().lower():
                if ii.strip().lower() not in result:
                    result.append(ii.strip().lower())
    df = pd.DataFrame(result)
    df.to_excel('r.xlsx', index=False, header=None)
    # file_json.write(json.dumps(ii) + '\n')
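For context: the nested loops above do roughly 50,000 × 50,000 = 2.5 billion comparisons, so the script is slow regardless of how many threads you add. A set-based sketch of the same de-duplication (assuming the goal is the unique, normalized values from both lists) runs in linear time:

```python
def unique_values(list1, list2):
    """Return normalized (stripped, lower-cased) values from both
    lists, keeping only the first occurrence of each value."""
    seen = set()
    result = []
    for value in list1 + list2:
        normalized = value.strip().lower()
        if normalized not in seen:  # O(1) set lookup instead of scanning a list
            seen.add(normalized)
            result.append(normalized)
    return result

print(unique_values(['Apple ', 'banana'], ['APPLE', 'Cherry']))
# ['apple', 'banana', 'cherry']
```
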
Not quite what you asked for, but have you tried concatenating the files and removing duplicates with pandas? It will most likely be faster. I.e., something like this:
# df1 is the DataFrame for t1.xlsx, df2 the DataFrame for t3.xlsx
df = pd.concat([df1, df2])
# lower-case and strip whitespace in all the relevant columns
# (you can put the modified data into new columns if the original data matters)
df['column_name'] = df['column_name'].apply(lambda x: x.lower().strip())
# drop duplicates across all columns
# (keep=False drops every row that has a duplicate; keep='first' keeps one copy)
df.drop_duplicates(keep=False, inplace=True)
# drop duplicates by one specific column
#df.drop_duplicates(subset=['letter'], inplace=True)
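A self-contained sketch of that approach on toy data (the column name 'column_name' is a placeholder, and keep='first' is used so that one copy of each value survives; adjust both to your actual sheet):

```python
import pandas as pd

# Toy stand-ins for the two Excel files; in practice you would use
# df1 = pd.read_excel('t1.xlsx', sheet_name=0), and likewise for t3.xlsx.
df1 = pd.DataFrame({'column_name': ['Apple ', 'banana', 'Cherry']})
df2 = pd.DataFrame({'column_name': ['APPLE', 'date', 'banana ']})

# Concatenate, then normalize so 'Apple ' and 'APPLE' compare equal
df = pd.concat([df1, df2], ignore_index=True)
df['column_name'] = df['column_name'].str.strip().str.lower()

# keep='first' keeps one copy of each value
result = df.drop_duplicates(subset=['column_name'], keep='first')
print(result['column_name'].tolist())
# ['apple', 'banana', 'cherry', 'date']
```

These are vectorized operations in pandas, so even with 50k rows per file this finishes in well under a second, with no threading needed.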