X
X
Xp0M0u2010-10-08 09:37:55
Algorithms
Xp0M0u, 2010-10-08 09:37:55

Text string comparison algorithm?

Recommend an algorithm for comparing strings with the principle of work like:
'Ivan Ivanovich Ivanov' = 'Ivanov Ivan Ivanovich'
'Ivan Ivanovich' ~ 'Ivanov Ivanovich'
'Ivan Ivanovich Ivanov goes without pants in the morning' != 'Ivanov Ivan Ivanovich puts on his pants at night'
That is, you need to find the similarity coefficient of the strings, taking into account the fact that the words in the string can be swapped.
UPD: I think I came up with:
For example, let Cij be calculated as a[i] == b[j] ? 1 : 0
a = ['Ivan', 'Ivanov', 'Ivanov']
b = ['Ivanov', 'Ivan', 'Ivanov']
K = (0 + 1 + 0 + 0 + 0 + 1 + 1 + 0 + 0) / ((3 + 3) / 2) = 3 / 3 = 1 — the strings are the same
a = ['Ivan', 'Ivanych']
b = ['Ivanov', 'Ivanych']
K = (0 + 0 + 0 + 1) / ((2 + 2) / 2) = 1 / 2 = 0.5 - similar, but not equal
It seems logical.
Thanks to hamMElion for reminding me to split lines into words %)

Answer the question

In order to leave comments, you need to log in

6 answer(s)
D
dmitryim, 2010-10-08
@Xp0M0u

Additionally, after splitting the string into words, they can be compared using levinshtein(). Then, taking into account the length of the word, get the coefficient of similarity. Thus, it is possible to determine the similarity quite accurately, even if a typo is made in the word, or if it is written a little differently.
Well, an additional bonus is the transliteration of the string and cleaning it from garbage.

H
hamMElion, 2010-10-08
@hamMElion

1. Split both strings into arrays of words (split)
2. Loop searching for elements of one array in another (counting matches = k)
3. Finding the number of matches for the second array from the proportion k1/n1=k2/n2 (n is the number of array elements)
4. Difference |k1-k2| and there is a similarity coefficient

T
tyomitch, 2010-10-08
@tyomitch

Algorithms - even chew an antelope.
Staffwww.dcs.shef.ac.uk/people/S.Chapman/stringmetrics.html has descriptions and links to implementations. Choose the right one.

D
Dmitry Groza, 2011-07-12
@boxfrommars

according to your algorithm, it turns out that the strings "Jay Jay Johanson" and "Jay Q Johanson" are equal. it is necessary to exclude from the arrays of strings already matched

X
xmoonlight, 2016-02-10
@xmoonlight

How to determine the similarity of two strings?

R
RuWeb, 2019-04-20
@RuWeb

Here is a ready-made online service TextTools.ru

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question