Help with an algorithm for comparing sentences
Colleagues, I'm turning to you for help. We have several thousand sentences, and we need to group the ones that are semantically similar. My current plan: split each sentence into words, remove function words (stop words), stem the rest, and compute a Soundex key for each stem. Then, using those codes, I somehow need to find the most similar sentences. It's this last step I'm stuck on, and my head is boiling. I'd be glad for a hint on where to dig further, or for other ideas on how to implement this.
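For illustration, a minimal Python sketch of the preprocessing pipeline described above (tokenize, drop stop words, stem, Soundex), assuming English text and NLTK for stop words and stemming; the Soundex implementation is a simplified hand-rolled version, all names are illustrative, and the hard part, grouping sentences by these keys, is left open just as in the question.

```python
import re

from nltk.corpus import stopwords        # requires nltk.download("stopwords")
from nltk.stem import SnowballStemmer

STOP = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

# Letter-to-digit map used by Soundex: b/f/p/v -> 1, c/g/j/k/q/s/x/z -> 2, etc.
SOUNDEX_MAP = str.maketrans("bfpvcgjkqsxzdtlmnr", "111122222222334556")

def soundex(word: str) -> str:
    """Simplified 4-character Soundex code (h/w handling is not strictly classic)."""
    word = word.lower()
    if not word:
        return ""
    first = word[0].upper()
    digits = word.translate(SOUNDEX_MAP)
    code, prev = "", digits[0]
    for ch in digits[1:]:
        if ch.isdigit() and ch != prev:   # keep digits, collapse adjacent repeats
            code += ch
        prev = ch
    return (first + code)[:4].ljust(4, "0")

def sentence_keys(sentence: str) -> frozenset:
    """Tokenize, drop stop words, stem, and return the set of Soundex keys."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    stems = [stemmer.stem(t) for t in tokens if t not in STOP]
    return frozenset(soundex(s) for s in stems if s)

print(sentence_keys("We need to group the sentences that are similar"))
```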
Hmm, I'm not an expert, but perhaps looking into detection of fuzzy duplicate texts will help you?
habrahabr.ru/post/65944/
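For reference, one common technique for fuzzy-duplicate detection is word shingling compared by Jaccard similarity; a minimal hypothetical sketch, not necessarily the method the linked article describes:

```python
def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles (overlapping word n-grams) of the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Two sentences are candidate fuzzy duplicates if their shingle sets overlap enough.
print(jaccard(shingles("the quick brown fox jumps over"),
              shingles("a quick brown fox jumps away")))
```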
I won't offer advice myself, but it would certainly help the answerers to know what language the original sentences are in, and how you plan to handle possible synonyms.
What you want is called clustering. There are many articles on different clustering methods - google them. I once did something similar based on Bayes' theorem, but for that you have to pick the document categories by hand and train the filter on some sample - it worked well.
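As one concrete clustering option (not the commenter's Bayes-based setup, which needs hand-picked categories and a training sample), a minimal sketch using TF-IDF vectors and k-means from scikit-learn; the corpus and cluster count below are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus standing in for the several thousand real sentences.
sentences = [
    "cheap flights to london",
    "low cost flights to london",
    "best pizza recipe",
    "how to make pizza at home",
]

X = TfidfVectorizer().fit_transform(sentences)               # sparse TF-IDF vectors
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)  # n_clusters is a guess

for sentence, label in zip(sentences, labels):
    print(label, sentence)
```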
I don't see much point in using Soundex: you're not looking for similar words, you're looking for similar texts. You can compute CRC32 hashes of the words instead - accuracy barely drops, and the computation speeds up a lot.
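A small sketch of that suggestion, hashing stemmed words with CRC32 from Python's standard zlib module so each sentence becomes a cheap-to-compare set of integers; the example words are illustrative:

```python
import zlib

def crc_keys(stems):
    """Replace each stemmed word with its CRC32 hash; a sentence becomes a set of ints."""
    return frozenset(zlib.crc32(s.encode("utf-8")) for s in stems)

a = crc_keys(["cheap", "flight", "london"])
b = crc_keys(["low", "cost", "flight", "london"])

# Overlap of the hash sets gives a quick, rough similarity measure.
print(len(a & b) / len(a | b))
```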