How to distribute arrays of data between threads?
Hello! The task seems simple, but I do not understand how to implement it.
There are two arrays of 50k elements each. I need to build a third array containing the non-repeating elements from those two arrays; in other words, remove the duplicates. The script is written, but it runs in a single thread and takes a very long time. How do I distribute all this data between threads?
Here is my code:
import pandas as pd
import json
from time import sleep
from threading import Thread

def get_data(file_name):
    # note: the keyword is sheet_name, not sheet_names
    df = pd.read_excel(file_name, sheet_name=0)
    data = []
    for item in df.to_records(index=False):
        data.append(item[0])
    return data

if __name__ == '__main__':
    test1 = get_data('t1.xlsx')
    test2 = get_data('t3.xlsx')
    result = []
    # with open('result.json', 'w') as file_json:
    for i, ii in enumerate(test1):
        for j, jj in enumerate(test2):
            print(i, j)
            if i == j:
                continue
            if ii.strip().lower() not in jj.strip().lower():
                if ii.strip().lower() not in result:
                    result.append(ii.strip().lower())
    df = pd.DataFrame(result)
    df.to_excel('r.xlsx', index=False, header=None)
    # file_json.write(json.dumps(ii) + '\n')
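For context: the nested loops above do roughly 50,000 × 50,000 = 2.5 billion comparisons, so the script is slow regardless of how many threads you add. A set-based sketch of the same de-duplication (assuming the goal is the unique, normalized values from both lists) runs in linear time:

```python
def unique_values(list1, list2):
    """Return normalized (stripped, lower-cased) values from both
    lists, keeping only the first occurrence of each value."""
    seen = set()
    result = []
    for value in list1 + list2:
        normalized = value.strip().lower()
        if normalized not in seen:  # O(1) set lookup instead of scanning a list
            seen.add(normalized)
            result.append(normalized)
    return result

print(unique_values(['Apple ', 'banana'], ['APPLE', 'Cherry']))
# ['apple', 'banana', 'cherry']
```
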
Not quite what you asked for, but have you tried concatenating the files and removing duplicates with pandas? It will most likely be faster. I.e., something like this:
# df1 is the DataFrame for t1.xlsx, df2 the DataFrame for t3.xlsx
df = pd.concat([df1, df2])
# lower-case and strip whitespace in all the relevant columns
# (you can put the modified data into new columns if the original data matters)
df['column_name'] = df['column_name'].apply(lambda x: x.lower().strip())
# drop duplicates across all columns
# (keep=False drops every row that has a duplicate; keep='first' keeps one copy)
df.drop_duplicates(keep=False, inplace=True)
# drop duplicates by one specific column
#df.drop_duplicates(subset=['letter'], inplace=True)
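A self-contained sketch of that approach on toy data (the column name 'column_name' is a placeholder, and keep='first' is used so that one copy of each value survives; adjust both to your actual sheet):

```python
import pandas as pd

# Toy stand-ins for the two Excel files; in practice you would use
# df1 = pd.read_excel('t1.xlsx', sheet_name=0), and likewise for t3.xlsx.
df1 = pd.DataFrame({'column_name': ['Apple ', 'banana', 'Cherry']})
df2 = pd.DataFrame({'column_name': ['APPLE', 'date', 'banana ']})

# Concatenate, then normalize so 'Apple ' and 'APPLE' compare equal
df = pd.concat([df1, df2], ignore_index=True)
df['column_name'] = df['column_name'].str.strip().str.lower()

# keep='first' keeps one copy of each value
result = df.drop_duplicates(subset=['column_name'], keep='first')
print(result['column_name'].tolist())
# ['apple', 'banana', 'cherry', 'date']
```

These are vectorized operations in pandas, so even with 50k rows per file this finishes in well under a second, with no threading needed.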