How can you implement text filtering in Python?

vseminelybim, 2021-08-15 16:37:19

Python

I am writing a bot that checks messages for keywords, but I can't come up with a filtering setup that works correctly. For example, take the message: "This text contains the keywords Owl, Wolf, Fox, which allow the message through, but it also contains the word Giraffe, which blocks it." In this case the message should fail validation. That is, even if a message contains hundreds of positive (allowing) words and a single negative one, it should not go any further. I end up with two lists (positive and negative words) and a message string. How do I properly search the string for words from the lists? It can probably be done with loops or by comparing lists (if the string is turned into a list), but I'm missing something.

2 answer(s)

alexbprofit, 2021-08-15
@alexbprofit

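Check whether the text contains at least one of the forbidden words, for example:

banned = ['giraffe']  # forbidden words, stored in lowercase

def text_contains_banned(text):
    # Lowercase the message so that "Giraffe" also matches 'giraffe'.
    lowered = text.lower()
    for word in banned:
        if word in lowered:
            return True  # a forbidden word occurs somewhere in the text
    return False  # no forbidden word found, the message may pass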
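
A substring check like the one above also fires inside longer words. If whole-word matching is what is wanted (the question mentions turning the string into a list), a minimal sketch could look like this. The names `BANNED` and `has_banned_word` are made up for illustration, and treating every `\w+` run as a word is an assumption:

```python
import re

BANNED = {"giraffe"}  # hypothetical blacklist, stored in lowercase


def has_banned_word(text, banned=BANNED):
    """Return True if any whole word of the text is in the banned set."""
    # Split the lowercased message into words, ignoring punctuation.
    words = set(re.findall(r"\w+", text.lower()))
    # The message is rejected when the two sets share at least one word.
    return not words.isdisjoint(banned)
```

With this sketch, "Owl, Wolf, Fox and a Giraffe." is rejected, while "giraffes are tall" is not, because "giraffes" is a different word.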

Vindicar, 2021-08-15
@Vindicar

Well, firstly, this will be an arms race, because users will try to get around your blacklist. So don't expect a "once and for all" solution.
Secondly, does a message have to contain at least one whitelisted word to be let through? It's not very clear what you mean by a whitelist.
Thirdly, both the message and the filter entries must be normalized beforehand. This means not only converting everything to a single case, but also handling homoglyphs (the simplest case is the Russian and the Latin "o") and deleting certain characters (for example, invisible zero-width spaces or combining characters). One way to do this is to replace look-alike characters with a single canonical one before checking, for example the Russian "о" with the Latin "o", both in the message and when preparing the blacklist and whitelist.
Fourthly, you need to think about false positives and false negatives. Roughly speaking, if we don't remove spaces, a user only has to write "b l" to bypass the filter. If we do remove them, "rowing" will give a false positive. If we whitelist "rowing" and let it take precedence over an overlapping blacklist word, we get a false negative on "fucking game". Of course, the word lists depend on the context of the messages, but compiling them will be a long, iterative process.
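
The normalization from the third point could be sketched roughly as follows. This is only an illustration: `HOMOGLYPHS` and `normalize` are hypothetical names, the look-alike table is deliberately tiny, and a real one would need to be much larger.

```python
import unicodedata

# Hypothetical, deliberately tiny table of Cyrillic letters that look like Latin ones.
HOMOGLYPHS = str.maketrans({
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р
    "\u0441": "c",  # Cyrillic с
    "\u0445": "x",  # Cyrillic х
})


def normalize(text):
    """Bring a message or a filter entry to a comparable form."""
    # Decompose characters so that accents become separate combining marks.
    text = unicodedata.normalize("NFKD", text)
    # Drop combining marks (Mn, Me) and invisible format characters (Cf),
    # which include the zero-width space. Side effect: letters such as "й"
    # lose their diacritic and become "и".
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch) not in ("Mn", "Me", "Cf")
    )
    # Case-insensitive form, then map look-alikes to one alphabet.
    return text.casefold().translate(HOMOGLYPHS)
```

The same function would have to be applied to every blacklist and whitelist entry when the lists are prepared, otherwise the normalized message and the raw entries will not match.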
Given the above:
1. normalize the string;
2. find the occurrences of blacklist substrings;
3. if there are any, check whether any whitelist substrings occur that overlap them;
4. drop every blacklist occurrence that overlaps a whitelist occurrence;
5. if any blacklist occurrences remain, treat the message as undesirable; otherwise, consider it acceptable.
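
The five steps above might be sketched like this. The names `find_spans` and `is_acceptable` are invented for the example, the word lists are placeholders, and normalization is reduced to lowercasing to keep the sketch short:

```python
def find_spans(text, patterns):
    """Return (start, end) for every occurrence of every pattern in text."""
    spans = []
    for pat in patterns:
        if not pat:
            continue  # an empty pattern would match everywhere
        start = text.find(pat)
        while start != -1:
            spans.append((start, start + len(pat)))
            start = text.find(pat, start + 1)
    return spans


def is_acceptable(text, blacklist, whitelist):
    """Return True if the message may pass, False if it is undesirable."""
    # Step 1: normalization (assumption: lowercasing stands in for the real thing).
    text = text.lower()
    # Step 2: occurrences of blacklist substrings.
    black = find_spans(text, blacklist)
    if not black:
        return True
    # Step 3: occurrences of whitelist substrings.
    white = find_spans(text, whitelist)
    # Step 4: drop blacklist hits that overlap any whitelist hit.
    remaining = [
        (bs, be) for bs, be in black
        if not any(bs < we and ws < be for ws, we in white)
    ]
    # Step 5: anything left means the message is undesirable.
    return not remaining
```

For example, with the placeholder blacklist `["cat"]` and whitelist `["category"]`, the message "Pick a category" passes because its only blacklist hit lies inside a whitelisted word, while "A cat in a category" is rejected because the standalone "cat" is not covered by any whitelist hit.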