Are there any good algorithms for semantic comparison of texts?
Hello, I'm trying to figure out how to compare texts, and so far I haven't found anything sensible. The question is whether there are currently any algorithms for such tasks; otherwise everything looks very painful and sad.
The task is essentially this: there is a source document doc_0, and many other documents doc_n come in as input (the subject matter of the texts is varied). You need to say, with some degree of probability, that doc_10, for example, is about the same thing as doc_0 (there are very well rewritten texts about the same topic). It is exactly this comparison that matters. I tried LSI; in general it's an amusing thing, but to me it seems better suited to grouping documents than to a "meaningful" comparison of them. Shingles, n-grams and the like are very ambiguous. Can you please tell me whether such a solution exists, and what it is? And what good books can I read on this topic?
The task is called Semantic Similarity. I have not worked in this area at all, but intuitively I would suggest LSTM/CNN and various variations on the theme of Siamese/Triplet networks.
Apparently you need doc2vec. Once the model has been trained, the n_similarity method can be used. The idea is to represent the texts as vectors, after which you can compute their cosine similarity.
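A minimal sketch of that idea with gensim's Doc2Vec (the corpus, tags and parameter values below are made-up placeholders; the exact attribute name also depends on the gensim version: recent versions expose document vectors via model.dv, older ones via model.docvecs):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Placeholder corpus: tag -> raw text (assumption, not from the original answer).
raw_docs = {
    "doc_0": "the source document text goes here ...",
    "doc_10": "a rewritten document about the same topic ...",
    "doc_42": "an unrelated document about something else ...",
}

# Each document becomes a TaggedDocument: a token list plus a unique tag.
corpus = [
    TaggedDocument(words=simple_preprocess(text), tags=[tag])
    for tag, text in raw_docs.items()
]

model = Doc2Vec(vector_size=100, window=5, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Cosine similarity between the learned document vectors (gensim 4.x API).
print(model.dv.similarity("doc_0", "doc_10"))
print(model.dv.similarity("doc_0", "doc_42"))

# For a document that was not in the training set, infer a vector first.
new_vec = model.infer_vector(simple_preprocess("some new incoming text ..."))
print(model.dv.most_similar([new_vec], topn=3))
```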
Doc2Vec is suitable for a large number of documents (from 5-10 thousand), although much depends on their length. You can increase the vector dimensionality and the number of iterations, or reduce the window, but with a small number of documents this does not help much. In other words, for a small amount of data LSI works better.
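For comparison, a hedged sketch of the LSI route with gensim for a small corpus (texts and num_topics are illustrative assumptions):

```python
from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess

# Placeholder texts; in practice doc_0 and the incoming doc_n go here.
texts = [simple_preprocess(t) for t in [
    "the source document text ...",
    "a rewritten document about the same topic ...",
    "an unrelated document ...",
]]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

# Project the bag-of-words vectors into a low-dimensional LSI space.
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=50)
index = similarities.MatrixSimilarity(lsi[bow_corpus])

# Cosine similarities of doc_0 (texts[0]) against every document in the corpus.
query = lsi[dictionary.doc2bow(texts[0])]
print(list(index[query]))
```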