Help with an algorithm for comparing sentences

PHP

Max, 2012-10-18 15:24:02

Colleagues, I am turning to you for help. We have several thousand sentences, and we need to group those that are semantically similar. My current plan: split every sentence into words, remove function words, stem the remaining words, and compute a soundex key for each. Then, using these keys, I somehow need to find the most similar sentences. That last step is where I am stuck, and my head is spinning. I would be glad for a pointer on where to dig further, or for other ideas on how to implement this.

4 answer(s)

DemiurgeOrion, 2012-10-18
@DemiurgeOrion

Hmm, I'm not an expert, but perhaps techniques for finding fuzzy (near-duplicate) texts would help you?
habrahabr.ru/post/65944/

Alexey Akulovich, 2012-10-18
@AterCattus

I have no advice to offer, but it would certainly help those answering to know what language the original sentences are in, and how possible synonyms should be handled.

Urevic, 2012-10-19
@Urevic

What you want is called clustering. There are many articles on clustering methods; google it. I once did something similar based on Bayes' theorem, but for that you need to define the document categories by hand and train the filter on a sample. It worked well.
I don't see much sense in using soundex: you are looking for similar texts, not similar words. You can compute crc32 of the words instead; accuracy is barely reduced, and the calculations get much faster.
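
To illustrate the crc32 idea, here is a rough sketch in Python (`zlib.crc32` is the standard-library counterpart of PHP's `crc32()`). It assumes that two sentences are compared by the Jaccard overlap of their word-hash sets, which is one possible measure and not something prescribed above. The stop-word list and the function names are made up for the example, and stemming is left out for brevity.

```python
import zlib

# Hypothetical stop-word list; a real one depends on the language of the sentences.
STOP_WORDS = {"the", "a", "an", "of", "is", "on", "in", "and"}

def word_hashes(sentence):
    """Return the set of crc32 hashes of the non-stop words in a sentence."""
    words = (w.strip(".,!?;:\"'()").lower() for w in sentence.split())
    return {zlib.crc32(w.encode("utf-8")) for w in words if w and w not in STOP_WORDS}

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity(s1, s2):
    """Similarity of two sentences in [0, 1], based on shared word hashes."""
    return jaccard(word_hashes(s1), word_hashes(s2))
```

Comparing integers instead of strings is what makes this cheap; the price is a small chance of hash collisions between different words. Pairs whose `similarity` exceeds some threshold (to be tuned on real data) can then be treated as candidates for the same group.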

sergeypid, 2012-10-19
@sergeypid

Try plain k-means clustering on your keys. Just note that the number of clusters has to be set in advance.
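
A minimal sketch of what that could look like, in Python and without external libraries. It makes several assumptions that are not part of the suggestion above: each sentence is represented as a bag-of-words count vector (k-means needs numeric vectors, so the keys themselves cannot be fed in directly), distances are squared Euclidean, and the initial centroids are picked with a simple farthest-point rule so the result is deterministic. All function names are invented for the example.

```python
def vectorize(sentences):
    """Turn sentences into bag-of-words count vectors over a shared vocabulary."""
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({w for words in tokenized for w in words})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for words in tokenized:
        v = [0.0] * len(vocab)
        for w in words:
            v[index[w]] += 1.0
        vectors.append(v)
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Lloyd's k-means; returns one cluster label (0..k-1) per vector."""
    # Farthest-point initialisation: start from the first vector, then keep
    # adding the vector that is farthest from all centroids chosen so far.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

For example, `kmeans(vectorize(sentences), 2)` splits the sentences into two groups. In practice the words would first go through the stop-word removal and stemming described in the question, and `k` would have to be chosen by trial or by some quality measure.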