Answer the question
In order to leave comments, you need to log in
How to speed up the Python word substitution process?
Good day.
It is necessary to replace words in a small array of text (slightly more than 1 million words), according to the dictionary.
Now I do it clumsily:
text = re.sub(r'\bз\b', '', text)
text = re.sub(r'\bзап\b', '', text)
Answer the question
In order to leave comments, you need to log in
You can start simple:
bad_words_re = re.compile(r'\b(з(ап)?)\b')
bad_words_re.sub('', text)
I would add multithreading, give each thread its own part of the "array" to eat. Well, it’s not entirely clear in what form you store all this goodness, if an array means a list, then maybe you can use a generator with a check for entering the list of search words?
Try to abandon regular expressions in favor of replace.
replace is much faster.
str.replace() should be used whenever it's possible to. It's more explicit, simpler, and faster.
In [1]: import re
In [2]: text = """For python 2.5, 2.6, should I be using string.replace or re.sub for basic text replacements.
In PHP, this was explicitly stated but I can't find a similar note for python.
"""
In [3]: timeit text.replace('e', 'X')
1000000 loops, best of 3: 735 ns per loop
In [4]: timeit re.sub('e', 'X', text)
100000 loops, best of 3: 5.52 us per loop
Didn't find what you were looking for?
Ask your questionAsk a Question
731 491 924 answers to any question