MySQL
Stepan, 2012-04-27 12:11:31

Searching for an article in the database and computing % similarity

Is it realistic to do this?

There is a database of articles. When adding a new one, I need to check whether the database already contains something similar.
Reading every article in turn and comparing is too expensive in terms of load.
Is it possible to build some kind of hash to search on?

Answer the question


4 answer(s)
S
sergeypid, 2012-04-27
@sergeypid

For an approximate comparison of texts there is this approach: create a zero 30x30 matrix whose coordinates are the letters of the alphabet. Run through the text and count how many times each consecutive pair of letters occurs. For example, if the letters A and B appear in a row, add 1 to element [0, 1] of the matrix. Then normalize: divide every element of the matrix by the total number of letters in the text. The result is a hash matrix; store it for each article in the database.
For the article being checked, build the same matrix and subtract it from the hash matrix of each article in the database, then compute the sum of the squares of the resulting elements. Calibrate on 20-40 typical articles to derive a threshold value. For short texts (about 100-500 letters) it worked well, try it on your articles!
In theory this is related to Markov chains and n-grams (2-grams :)
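The scheme above can be sketched in Python; a sparse dict stands in for the 30x30 matrix, and the alphabet is Latin here for illustration (this code is a sketch, not from the answer):

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def bigram_matrix(text):
    """Normalized letter-pair frequency 'matrix', stored sparsely as a dict."""
    letters = [ch for ch in text.lower() if ch in INDEX]
    counts = Counter(zip(letters, letters[1:]))
    total = len(letters) or 1
    return {pair: n / total for pair, n in counts.items()}

def distance(m1, m2):
    """Sum of squared element-wise differences between two matrices."""
    keys = set(m1) | set(m2)
    return sum((m1.get(k, 0.0) - m2.get(k, 0.0)) ** 2 for k in keys)
```

The distance of a text to itself is 0, and texts that share most of their letter pairs score much closer to each other than to an unrelated text; the threshold between "similar" and "different" would be calibrated on sample articles, as the answer suggests.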

T
TROODON, 2012-04-27
@TROODON

The simplest solution is to set up a Sphinx / Lucene (Elasticsearch, Solr) search engine and index all the articles. When adding a new article, send a query to the search engine with the fields body = body, title = title and look at the relevance scores of the hits.
A high score means a similar article.
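With Elasticsearch specifically, this kind of check is commonly done with a `more_like_this` query that compares an artificial (not yet indexed) document against the index; a sketch of the request body, with hypothetical field values:

```python
# Hypothetical article being checked for duplicates; in practice this
# request body would be POSTed to /articles/_search on a running cluster.
new_article = {
    "title": "How to compare article similarity",
    "body": "Looking for near-duplicate articles in a database...",
}

query = {
    "query": {
        "more_like_this": {
            "fields": ["title", "body"],      # match against both fields
            "like": [{"doc": new_article}],   # compare the new, unindexed text
            "min_term_freq": 1,               # loosened thresholds so short
            "min_doc_freq": 1,                # texts still produce matches
        }
    }
}
```

The `_score` of each hit in the response is the "occurrence value" the answer refers to: hits scoring well above the rest are candidate duplicates.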

S
SergeiStartsev, 2012-04-27
@SergeiStartsev

When adding each article, you can build (once) a "map" of it. The map can take different forms, depending on how relevant the results need to be (the simplest option is to pick the most frequently used words), and then compare these maps instead of the full texts.
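Taking the simplest variant mentioned (a map of the most frequent words), one possible sketch compares two maps by their Jaccard overlap; the map shape and the similarity measure are illustrative choices, not specified in the answer:

```python
import re
from collections import Counter

def word_map(text, top_n=20):
    """The 'map' of an article: the set of its most frequent words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w, _ in Counter(words).most_common(top_n)}

def similarity(map1, map2):
    """Jaccard overlap between two word maps, from 0.0 to 1.0."""
    if not map1 and not map2:
        return 0.0
    return len(map1 & map2) / len(map1 | map2)
```

The map is computed once per article at insert time and stored, so checking a new article means comparing small word sets rather than re-reading full texts.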

G
gkirok, 2012-04-29
@gkirok

The Nilsimsa algorithm.
Look for perl-Digest-Nilsimsa.
Create a hash for each article you put in the database.
Comparison is bit-by-bit, if I remember correctly; I used it a long time ago.
Accurate (if pieces of text are simply swapped around, it reports ~100% identity).
Relatively fast.
For non-Latin text, transliterate before creating the hash.
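The bit-by-bit comparison of two digests boils down to counting matching bits; a sketch of just that step, assuming two equal-length hex digests as input (producing the Nilsimsa digests themselves requires a library such as Digest::Nilsimsa):

```python
def digest_similarity(hex1, hex2):
    """Count matching bits between two equal-length hex digests.

    Nilsimsa digests are 256 bits (64 hex characters); a score near 256
    means near-identical texts, while around 128 means unrelated texts.
    """
    assert len(hex1) == len(hex2)
    bits = len(hex1) * 4
    # XOR: set bits mark the positions where the two digests differ.
    diff = int(hex1, 16) ^ int(hex2, 16)
    return bits - bin(diff).count("1")
```

Because locality-sensitive digests like Nilsimsa change only a few bits when the input changes slightly, a high bit-match count between a new article's digest and a stored one flags a likely duplicate.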
