Are there any good algorithms for semantic comparison of texts?
Hello, I'm trying to figure out how to compare texts, and so far I haven't found anything sensible. The question is whether there are currently any algorithms for such tasks; otherwise everything looks very painful and sad.
The task is essentially this: there is a source document doc_0, and many other documents doc_n come in as input (the subject matter of the texts is varied). You need to say, with some degree of probability, that doc_10, for example, is about the same thing as doc_0 (there are very well rewritten texts about the same topic). It is exactly this comparison that matters. I tried LSI; in general it's an amusing thing, but to me it seems better suited to grouping documents than to a "meaningful" comparison of them. Shingles, n-grams and the like are very ambiguous. Can you please tell me whether such a solution exists, and what it is? And what good books can I read on this topic?
The task is called Semantic Similarity. I have not worked in this area at all, but intuitively I would suggest LSTM/CNN and various variations on the theme of Siamese/Triplet networks.
Apparently you need doc2vec. Once the model has been trained, the n_similarity method can be used. The idea is to represent the texts as vectors, after which you can compute their cosine similarity.
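A minimal sketch of that idea with gensim's Doc2Vec (the corpus, tags and parameter values below are made-up placeholders; the exact attribute name also depends on the gensim version: recent versions expose document vectors via model.dv, older ones via model.docvecs):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Placeholder corpus: tag -> raw text (assumption, not from the original answer).
raw_docs = {
    "doc_0": "the source document text goes here ...",
    "doc_10": "a rewritten document about the same topic ...",
    "doc_42": "an unrelated document about something else ...",
}

# Each document becomes a TaggedDocument: a token list plus a unique tag.
corpus = [
    TaggedDocument(words=simple_preprocess(text), tags=[tag])
    for tag, text in raw_docs.items()
]

model = Doc2Vec(vector_size=100, window=5, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Cosine similarity between the learned document vectors (gensim 4.x API).
print(model.dv.similarity("doc_0", "doc_10"))
print(model.dv.similarity("doc_0", "doc_42"))

# For a document that was not in the training set, infer a vector first.
new_vec = model.infer_vector(simple_preprocess("some new incoming text ..."))
print(model.dv.most_similar([new_vec], topn=3))
```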
Doc2Vec is suitable for a large number of documents (from 5-10 thousand), although much depends on their length. You can increase the vector dimensionality and the number of iterations, or reduce the window, but with a small number of documents this does not help much. In other words, for a small amount of data LSI works better.
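For comparison, a hedged sketch of the LSI route with gensim for a small corpus (texts and num_topics are illustrative assumptions):

```python
from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess

# Placeholder texts; in practice doc_0 and the incoming doc_n go here.
texts = [simple_preprocess(t) for t in [
    "the source document text ...",
    "a rewritten document about the same topic ...",
    "an unrelated document ...",
]]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

# Project the bag-of-words vectors into a low-dimensional LSI space.
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=50)
index = similarities.MatrixSimilarity(lsi[bow_corpus])

# Cosine similarities of doc_0 (texts[0]) against every document in the corpus.
query = lsi[dictionary.doc2bow(texts[0])]
print(list(index[query]))
```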