How to determine text similarity?

Semyon Semyonov, 2016-04-24 19:31:32

Algorithms

Let's say we have tweets or article headlines. I would like to detect that, say, 10 of these news items or tweets refer to the same thing (for example, a company or an event). How is this done? This may be a naive question, but what is this class of problems at least called? I'm new to this.
By the way, I assume news aggregators do something like this, i.e. they group items somehow, right?

3 answers

Dimonchik, 2016-04-25
@dimonchik2013

This takes more than one function: entities are extracted, texts are compared, and so on.
see https://tech.yandex.ru/tomita/
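As a rough illustration of the "texts are compared" step, here is a minimal sketch that scores two headlines by word overlap (Jaccard similarity). It does not use Tomita; the function names `tokens` and `jaccard` are made up for this example, and real systems usually add stemming and stop-word removal on top.

```python
import re

def tokens(text: str) -> set:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Share of distinct words that the two texts have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)

score = jaccard("Apple unveils new iPhone", "New iPhone unveiled by Apple")
print(score)
```

Headlines whose score exceeds some threshold (chosen by experiment) could then be treated as candidates for the same story.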

Alexander Skusnov, 2016-04-24
@AlexSku

There is also the Jaro-Winkler algorithm, a string similarity measure that works best on short strings such as names.
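For reference, a minimal pure-Python sketch of Jaro-Winkler, assuming the commonly used defaults (prefix scale `p = 0.1`, prefix capped at 4 characters). The function names are chosen for this example; a maintained library is likely a better choice in production.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

Note that this compares character sequences, so it suits spotting spelling variants of a company or person name more than comparing whole headlines.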

xmoonlight, 2017-09-26
@xmoonlight

This is called thematic clustering: you keep a record of synonyms and their "weights" relative to each other, which depend on the presence of other specific words nearby in a related chain (publications, comments, or a single sentence).
This can be done by extracting entities (nouns and proper names: a person's full name, titles, etc.) and extracting contextual dependencies.
You can try an approximate search over such chains here.
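A very reduced sketch of the entity-based grouping described above: a hand-written synonym table maps surface forms to a canonical entity, and headlines are grouped by the entities they mention. The table `SYNONYMS` and the functions `entities` and `group_by_entity` are hypothetical and exist only for this example; the context-dependent weights are left out.

```python
import re
from collections import defaultdict

# Hypothetical synonym table: lowercase surface form -> canonical entity.
SYNONYMS = {
    "apple": "Apple",
    "aapl": "Apple",
    "tesla": "Tesla",
    "tsla": "Tesla",
}

def entities(text, synonyms=SYNONYMS):
    """Return the set of canonical entities mentioned in the text."""
    words = re.findall(r"\w+", text.lower())
    return {synonyms[w] for w in words if w in synonyms}

def group_by_entity(texts, synonyms=SYNONYMS):
    """Map each entity to the texts that mention it, in input order."""
    groups = defaultdict(list)
    for text in texts:
        for entity in entities(text, synonyms):
            groups[entity].append(text)
    return dict(groups)

headlines = [
    "Apple unveils new iPhone",
    "AAPL shares jump after launch",
    "Tesla recalls cars",
]
for entity, items in group_by_entity(headlines).items():
    print(entity, items)
```

A plain lookup like this cannot tell the company Apple from the fruit; that is presumably where weights based on neighbouring words would come in.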