What is the best way to check the text for the presence of words from the database in Django?

Django

StasShk, 2017-04-12 03:24:49

I'm trying to check incoming texts for words from a blacklist. The list itself is stored in the database and will be updated on an ongoing basis. I came up with something like this myself:

for i, wrd in enumerate(text.lower().split()):
  if BadWords.objects.filter(bword=wrd ).exists():
    return ....

But the texts can be quite large and arrive in large numbers, so a faster solution is needed.

1 answer(s)

marazmiki, 2017-04-12
@StasShk

If we set the database aside for a moment, the task looks a little easier. Suppose there are two sets: the set of bad words and the set of words in the text. All that remains is to determine whether these sets intersect. If they do, the text contains at least one bad word :-)

>>> a = { 1, 2, 3 }
>>> b = { 2, 3, 4 }
>>> c = { 5, 6 }
>>>
>>> a & b
{2, 3}
>>> a & c
set()

Now closer to the applied problem. The set of "bad words" is stored in the database (by the way, models are usually named in the singular, BadWord, and not in the plural as you have), but until changes occur it can be considered static. Therefore, you can shamelessly take this set from the cache.

# utils.py
from django.core.cache import cache

def get_bad_words():
    # Return an empty set rather than None while the cache is not filled yet.
    return cache.get('bad_words', set())

and recalculate the cache when creating, editing or deleting entries from BadWords. For example, using signals:

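# models.py
from django.core.cache import cache
from django.db import models

# The BadWords model itself is defined above in this file.
def set_bad_words(**kwargs):
    # timeout=None keeps the set cached until the next change,
    # instead of expiring after the default cache timeout.
    words = set(BadWords.objects.values_list('bword', flat=True))
    cache.set('bad_words', words, timeout=None)

models.signals.post_save.connect(set_bad_words, sender=BadWords)
models.signals.post_delete.connect(set_bad_words, sender=BadWords)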

Now all that remains is to convert the incoming text into a set of words:

# utils.py

def get_words_from_text(text_string):
    return set(text_string.lower().split())

and determine if there are bad words (i.e. if the sets intersect):

# utils.py

def has_bad_words(text_string):
    return bool(get_bad_words() & get_words_from_text(text_string))
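
If only a yes/no answer is needed, `set.isdisjoint` is a possible alternative to `&`: it can stop at the first common element instead of building the whole intersection. Below is a minimal sketch without Django; `BAD_WORDS` is a hypothetical stand-in for whatever `get_bad_words()` returns, and `has_bad_words_fast` is just an illustrative name.

```python
# Hypothetical stand-in for the cached set returned by get_bad_words().
BAD_WORDS = {"spam", "scam"}


def has_bad_words_fast(text_string, bad_words=BAD_WORDS):
    # isdisjoint() accepts any iterable, so the words of the text
    # do not have to be collected into a set first.
    return not bad_words.isdisjoint(text_string.lower().split())


print(has_bad_words_fast("This offer is a SCAM"))   # True
print(has_bad_words_fast("A perfectly fine text"))  # False
```

Whether this is noticeably faster than the intersection depends on the texts; it mainly helps when a bad word tends to appear early in a long text.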

In general, there is still room for refactoring and improvement: it would be good to strip punctuation, stop words and extra whitespace from the text, and to move the signals to apps.py in line with the new application loading rules (or even throw them out altogether). But the idea, I think, is clear.
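
As one possible way to do the punctuation and stop-word cleanup mentioned above, the words can be extracted with a regular expression instead of `split()`. This is only a sketch: `get_clean_words` and the `STOP_WORDS` list are hypothetical names and values chosen for illustration, not part of the code above.

```python
import re

# \w+ matches runs of letters, digits and underscores (Unicode-aware for
# str patterns in Python 3), so punctuation and extra whitespace are dropped.
WORD_RE = re.compile(r"\w+")

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "the", "is"}


def get_clean_words(text_string):
    # Lowercase, extract the words, then discard the stop words.
    words = set(WORD_RE.findall(text_string.lower()))
    return words - STOP_WORDS


print(sorted(get_clean_words("The offer, clearly,   is a SCAM!")))
# ['clearly', 'offer', 'scam']
```

Note that `\w+` also splits on apostrophes and hyphens ("don't" becomes "don" and "t"), so the pattern may need adjusting to match how the words are stored in the blacklist.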