Filtering out similar records?

Sphinx

jiexaspb, 2010-09-14 14:16:54

Hello!
In our project, users add materials: each one is a text string up to 300 characters long.
There are a lot of duplicates. When a new string is added, I would like to check it: if it is at least 90% similar to one that already exists, it should be rejected.
The database is MySQL.
So far, the following solution has come to mind:
- remove all punctuation marks and spaces from the string
- convert it to lower case
- compute the md5 hash of the result
- store the hash in a separate field in the database
- when adding a new string, check whether that hash is already in the database
Is this the best approach, or is there anything better?
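A minimal sketch of the steps listed above, in Python purely for illustration (the project's language is not stated). The names `normalized_hash`, `try_add` and the in-memory `seen` set are hypothetical; `seen` stands in for the indexed hash field in MySQL. Note that this only catches strings that become identical after normalization, not strings that are merely similar.

```python
import hashlib

def normalized_hash(text):
    """MD5 hex digest of the text after lowercasing and dropping
    everything that is not a letter or a digit."""
    # str.isalnum() is False for punctuation and whitespace,
    # and True for letters in any alphabet, not only ASCII.
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum())
    return hashlib.md5(cleaned.encode("utf-8")).hexdigest()

# Assumption: an in-memory set standing in for the hash column in the database.
seen = set()

def try_add(text):
    """Return True if the text was accepted, False if its hash already exists."""
    digest = normalized_hash(text)
    if digest in seen:
        return False
    seen.add(digest)
    return True
```

With this sketch, "Hello, World!" and "hello world" collide, while "Hello, worlds" is accepted as a new string.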
PS There are about 10 thousand records so far, and about 500 new ones are added per day. Using Sphinx is an option, but I did not find similar functionality in it.

4 answer(s)

Gluttton, 2010-09-14
@Gluttton

In my opinion, the proposed approach will filter out only identical records, not similar ones...
I think this is an extremely difficult task, if it is feasible at all, and perhaps it is a job for AI rather than for the database. Suppose there are two messages:
1. How can I filter out similar records in the database?
2. What is the way to prevent duplicate entries in the database?
Are they similar?
In my opinion, it is best to hand this problem over to the users: for example, before publishing, show them a block like “Have you looked here?” containing 5-10 links, ordered by relevance, to the messages that share the most words with the one being published. You can also use tags for this and search for messages not only by words but also by tags (or even only by tags).
That said, this is all theory. I have never had to do this in practice.
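The suggestion block could be sketched roughly like this, in Python for illustration. The function names `words` and `suggest_similar` are hypothetical, and the existing messages are assumed to be available as a plain list; with a real database the lookup would more likely go through a full-text index.

```python
import re

def words(text):
    """Set of lowercase words in the text."""
    return set(re.findall(r"\w+", text.lower()))

def suggest_similar(new_text, existing, limit=5):
    """Return up to `limit` existing messages, ordered by how many
    words they share with the new message (most shared first)."""
    new_words = words(new_text)
    scored = []
    for message in existing:
        shared = len(new_words & words(message))
        if shared:
            scored.append((shared, message))
    # The sort is stable, so ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [message for _, message in scored[:limit]]
```

The result is only a ranked list to show the user; the decision whether the new message is a duplicate stays with them.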

Vladimir Chernyshev, 2010-09-14
@VolCh

Make a website for each material and submit them to Yandex for indexing; if both end up in the index, you can consider them different :)
But seriously, there are services and programs that evaluate the similarity of texts (they are common among SEO specialists and the rewriters who work for them). I have not come across open-source ones, but you could try to negotiate with the authors or use such a service or program as an external service or module.
Off the top of my head, I would solve the problem like this:
- build a list of the words in the material (optionally with the number of occurrences of each word)
- throw out the "garbage" (prepositions, conjunctions, "thank you" and "please")
- the result is a list of "tags"
- look for the material(s) whose tag list most closely matches the current one (for example, loop over the current list, fetch the first N materials for each tag, and take the one(s) that occur most often)
- check how similar the current material is to the one(s) found (the criterion is configurable; for example, treat a match of more than 80% as similar)
- if it is not similar (less than 80% match), publish it
- if it is similar, show the found materials to the user with the question “Did you mean this?”; if the user says "no", publish it; if "yes", do nothing
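The steps above might look roughly like this in Python. Everything here is an assumption made for illustration: the tiny `STOP_WORDS` list, the class name `SimilarityFilter`, the in-memory tag index, and the choice of Jaccard overlap (shared tags divided by all distinct tags) as the "percentage of matches".

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny list of "garbage" words.
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "is",
              "and", "or", "please", "thank", "you"}

def tags(text):
    """Words of the text minus the stop words."""
    return {w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS}

class SimilarityFilter:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.materials = []            # material id -> set of tags
        self.index = defaultdict(set)  # tag -> ids of materials containing it

    def add(self, text):
        """Store a material and index its tags; return its id."""
        material_id = len(self.materials)
        material_tags = tags(text)
        self.materials.append(material_tags)
        for tag in material_tags:
            self.index[tag].add(material_id)
        return material_id

    def find_similar(self, text):
        """Ids of stored materials whose tag overlap reaches the threshold."""
        new_tags = tags(text)
        if not new_tags:
            return []
        # Count how many tags each stored material shares with the new one.
        hits = Counter()
        for tag in new_tags:
            for material_id in self.index.get(tag, ()):
                hits[material_id] += 1
        similar = []
        for material_id, shared in hits.most_common():
            union = len(new_tags | self.materials[material_id])
            if shared / union >= self.threshold:
                similar.append(material_id)
        return similar
```

If `find_similar` returns an empty list, the material is published; otherwise the materials it returns are the ones to show with the “Did you mean this?” question.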
After the initial launch, monitor the quality of the filter (at first this can be done transparently for users, marking similar materials only in the database or admin panel) and, if necessary, adjust the similarity threshold and the dictionary of insignificant words. You can also introduce synonyms and/or reduce words to their stems (I believe open-source tools for this were even described on Habr recently), and take into account phrases and the position of words in the material or sentence... In general, we gradually surpass the duplicate-content detection algorithms of Google and Yandex, sell ours to them, and forget about users who are too lazy to search before publishing :)
Another approach is to build a neural network, train it on the existing data, and keep training it as you go, but here I find it hard to estimate even roughly the resource cost of both the development and the analysis itself. Or, well, develop a semantic analyzer :)

Sandrique, 2010-09-14
@Sandrique

Most likely shingles will suit you - habrahabr.ru/blogs/algorithm/65944/
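For a rough idea of what shingles are, here is a minimal word-level sketch in Python. It is not necessarily the variant described in the linked article: the shingle size of 3, the function names, and the use of plain set overlap instead of hashed shingles are all assumptions made to keep the example short.

```python
import re

def shingles(text, size=3):
    """Set of overlapping word sequences ("shingles") of the given size."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def shingle_similarity(a, b, size=3):
    """Share of shingles the two texts have in common, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

A new string would then be rejected when `shingle_similarity` against some existing string reaches the chosen threshold, for example 0.9.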

lashtal, 2010-09-14
@lashtal

Hamming distance en.wikipedia.org/wiki/Hamming_distance
Levenshtein distance en.wikipedia.org/wiki/Levenshtein_distance
Damerau–Levenshtein distance en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
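As an illustration of how one of these could give a percentage, here is a short Python sketch of the Levenshtein distance with a similarity score derived from it. The `similarity` helper and its formula (1 minus distance divided by the longer length) are an assumption, just one common way to turn the distance into a 0..1 value.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """1.0 for identical strings, approaching 0.0 as they diverge."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Comparing a new 300-character string against every one of the 10 thousand stored strings this way is quadratic per pair, so in practice it would probably be applied only to a short list of candidates found by a cheaper method.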