Python
Saharman, 2021-05-13 10:09:48

How to optimize gensim corpus building?

Good afternoon!

I need to build a corpus from a text file that has 41,635,827 lines with an average of 5 words per line. The words are already separated by spaces so they can be processed faster. However, the processing still takes a very long time: I calculated that it would need approximately 361 hours. I would like to know how to speed up this code:

from gensim import corpora
from smart_open import smart_open  # newer smart_open versions expose smart_open.open instead

class BoWCorpus(object):
    def __init__(self, path, dictionary):
        self.filepath = path
        self.dictionary = dictionary

    def __iter__(self):
        global mydict  # OPTIONAL, only if updating the source dictionary.
        for line in smart_open(self.filepath, encoding='latin'):
            # Lines are already space-separated, so a plain split is enough.
            tokenized_list = line.strip().split(' ')
            # Build the bag-of-words vector, growing the dictionary
            # with any unseen words along the way.
            bow = self.dictionary.doc2bow(tokenized_list, allow_update=True)
            mydict.merge_with(self.dictionary)
            yield bow

mydict = corpora.Dictionary()
bow_corpus = BoWCorpus('sen_list_alll.txt', dictionary=mydict)

# First full pass over the file, only to print every vector.
for line in bow_corpus:
    print(line)

print('start save corp')
# serialize() iterates the corpus again, i.e. a second full pass.
corpora.MmCorpus.serialize('bow_corpus_all_new2.mm', bow_corpus)
print('corp saved')
mydict.save('mydict_all_new12.dict')
print('dict saved')


1 answer
Dimonchik, 2021-05-13
@dimonchik2013

First, check that you have actually created the dictionary.
Beyond that, you can try loading everything into memory instead of reading line by line: 48 million lines of 5 words each will clearly weigh more than kilobytes, but still not all that much.
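The biggest per-line cost in the question's loop is likely the mydict.merge_with(self.dictionary) call, which rescans the whole dictionary on every one of the ~42 million lines. Here is a minimal sketch of the in-memory idea, assuming enough RAM (holding ~42 million tokenized lines as Python lists can take tens of GB) and reusing the file names and 'latin' encoding from the question; this is an untested suggestion, not a benchmarked fix:

from gensim import corpora

# Tokenize the whole file once and keep it in RAM instead of re-reading
# it line by line. A plain local file needs no smart_open here.
with open('sen_list_alll.txt', encoding='latin') as f:
    docs = [line.strip().split(' ') for line in f]

# Build the dictionary in a single pass; no per-line merge_with needed.
mydict = corpora.Dictionary(docs)

# Serialize the bag-of-words corpus. The dictionary is already complete,
# so doc2bow can run without allow_update.
corpora.MmCorpus.serialize('bow_corpus_all_new2.mm',
                           (mydict.doc2bow(doc) for doc in docs))
mydict.save('mydict_all_new12.dict')

If memory is too tight for that, the same two-pass structure also works streaming: build corpora.Dictionary from a generator over the file first, then yield the doc2bow vectors in a second pass. Dropping the per-line merge is what saves the time either way.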
