PHP
Alex Murfy, 2019-05-30 18:39:45

How to calculate string similarity?

Good day to all.
There are two or more arrays (about 2k elements each) that contain similar data, for example:
1. Russia. Big business. Moscow region, Podolsk, st. Makeeva, 14, apt. 2.
2. Russia - Business > 100 employees. M.O, Podolsk, Makeeva st. 14, 2.
3. RF. Private Business (large). Moscow (region), Podolsk, Makeeva street, house 14, apartment 2.
As output, I need the percentage similarity of these lines and a selection of the most similar ones. There are many algorithms and implementations; please recommend a method that gives maximum performance with tolerable similarity quality.
I.e., I need to go through all elements of the first, second, and subsequent arrays, find the similar ones, and write them to a new array.
Thanks in advance to everyone who responds, and a plus to your karma :)
P.S. If there are ready-made libraries that solve this problem, I would be glad to get links.
P.P.S. Solutions from machine learning, neural networks, semantic analysis, or algorithms like Levenshtein (only more efficient, or in combination with it) are all suitable.


5 answer(s)
Alex Murfy, 2019-05-31
@qxcoder

I found a solution on Toster that might be perfect for my task. Thanks to everyone who responded. Since my lines have a definite structure (a fixed order of fields), I think this option is ideal:
"If you want to do everything yourself, you need to compile a database of all cities with their synonyms and abbreviations (for example, the different ways of writing St. Petersburg) and iterate over it. Then add inexact search and error correction."
Boris Korobkov recommended this in a post (https://toster.ru/q/593468). Thanks to him as well.
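
The dictionary-plus-inexact-search idea from the quote could be sketched roughly as follows. This is only an illustration: the `CANONICAL` and `KNOWN_PLACES` tables and the function names are hypothetical, the entries are made up from the example lines in the question, and the code is ASCII-only because the examples here are transliterated. Real Cyrillic data would need the `mb_*` functions, and `levenshtein()` compares bytes, not characters.

```php
<?php
// Hypothetical dictionary: known spelling variants mapped to one canonical form.
const CANONICAL = [
    'rf'  => 'russia',
    'st'  => 'street',
    'apt' => 'apartment',
];

// Hypothetical list of known names used for typo correction.
const KNOWN_PLACES = ['podolsk', 'moscow', 'makeeva'];

// Replace a token by its canonical form, or by the first known name that is
// within $maxDistance edits of it (inexact search). Unknown tokens pass through.
function canonical_token(string $token, int $maxDistance = 1): string
{
    if (array_key_exists($token, CANONICAL)) {
        return CANONICAL[$token];
    }
    foreach (KNOWN_PLACES as $place) {
        if (levenshtein($token, $place) <= $maxDistance) {
            return $place;
        }
    }
    return $token;
}

// Lowercase a line, split it on anything that is not a letter or digit,
// and canonicalize every token.
function canonicalize(string $line): array
{
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($line), -1, PREG_SPLIT_NO_EMPTY);
    return array_map('canonical_token', $tokens);
}

print_r(canonicalize('RF, Podolsc, Makeeva st. 14, apt 2'));
```

Once every line is reduced to canonical tokens, two lines can be compared token by token instead of character by character, which is usually much more tolerant of reordering and abbreviations.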

Lazy @BojackHorseman PHP, 2019-05-30

A really good solution is something you buy. This is a non-trivial task, to put it mildly. The free options are full-text search engines.

Anton Shamanov, 2019-05-30
@SilenceOfWinter

similar_text — Calculates the degree of similarity of two strings
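
A minimal sketch of how `similar_text()` might be applied to the task in the question: normalize both lines, then pick the candidate with the highest percentage. The helper names `normalize_line` and `best_match` are made up for this example, and the second candidate line is invented for contrast. Note that the PHP manual gives the complexity of `similar_text()` as O(N**3) in the length of the longest string, so comparing every pair from two arrays of 2k elements may turn out to be slow.

```php
<?php
// Lowercase the line and collapse punctuation into single spaces so that
// formatting differences matter less.
function normalize_line(string $s): string
{
    $s = function_exists('mb_strtolower') ? mb_strtolower($s, 'UTF-8') : strtolower($s);
    $s = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $s);
    return trim($s);
}

// Return [key in $candidates, similarity percent] for the candidate
// that is most similar to $needle.
function best_match(string $needle, array $candidates): array
{
    $bestIndex = null;
    $bestPercent = -1.0;
    $needle = normalize_line($needle);
    foreach ($candidates as $i => $candidate) {
        // The third argument receives the similarity as a percentage.
        similar_text($needle, normalize_line($candidate), $percent);
        if ($percent > $bestPercent) {
            $bestPercent = $percent;
            $bestIndex = $i;
        }
    }
    return [$bestIndex, $bestPercent];
}

$first = ['Russia. Big business. Moscow region, Podolsk, st. Makeeva, 14, apt. 2.'];
$second = [
    'Russia - Business > 100 employees. M.O, Podolsk, Makeeva st. 14, 2.',
    'Russia. Small business. Tver, Lenina st. 5.', // invented line for contrast
];

[$index, $percent] = best_match($first[0], $second);
printf("best match: #%d (%.1f%%)\n", $index, $percent);
```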

nekipelov, 2019-05-30
@nekipelov

It looks like the description can be separated from the address. In that case, it is easier to get geo coordinates from the address (for example, via https://tech.yandex.ru/maps/geocoder/) and, if they are the same, consider the lines similar.
A more complex option: write a simple address parser and resolve conflicts through an exact one-to-one address match.
Which of the similar lines to keep apparently does not matter. It can be a random one, or the longest... You know better here.
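
The grouping step could look something like the sketch below. The geocoder itself is deliberately left out: `$geocode` is a hypothetical callable that takes an address string and returns `[lat, lon]` or `null`, and how it talks to a real geocoding service is not shown here. The function names are also made up for the example.

```php
<?php
// Group lines whose geocoded coordinates coincide.
// $geocode is a hypothetical callable: string address -> [lat, lon] or null.
function group_by_coordinates(array $lines, callable $geocode): array
{
    $groups = [];
    foreach ($lines as $line) {
        $coords = $geocode($line);
        if ($coords === null) {
            continue; // the address could not be resolved
        }
        // Round to 5 decimals (roughly a metre) so float noise does not split groups.
        $key = sprintf('%.5f,%.5f', $coords[0], $coords[1]);
        $groups[$key][] = $line;
    }
    return $groups;
}

// Pick the longest line of a group as its representative.
function longest_line(array $group): string
{
    usort($group, fn($a, $b) => strlen($b) <=> strlen($a));
    return $group[0];
}
```

Passing the geocoder in as a callable keeps the grouping logic testable with a stub and makes it easy to add caching, so that each distinct address is resolved only once.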

4tlen, 2019-05-31
@4tlen

To begin with, formulate the similarity criterion clearly; without understanding the problem, you might as well choose a solution by tossing a coin. If these are addresses, then, as already advised, run them through a geocoder and, for example, treat addresses within a radius of X as similar (to ignore differences in apartments, buildings, etc.).
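
The "within a radius of X" check can be sketched with the haversine formula, assuming the coordinates have already been obtained from a geocoder as `[lat, lon]` pairs in degrees. The function names and the choice of metres as the unit are assumptions of this example.

```php
<?php
// Great-circle distance in metres between two [lat, lon] points (haversine formula).
function haversine_m(array $a, array $b): float
{
    $earthRadius = 6371000.0; // mean Earth radius in metres
    $lat1 = deg2rad($a[0]);
    $lat2 = deg2rad($b[0]);
    $dLat = $lat2 - $lat1;
    $dLon = deg2rad($b[1] - $a[1]);
    $h = sin($dLat / 2) ** 2 + cos($lat1) * cos($lat2) * sin($dLon / 2) ** 2;
    // min() guards against rounding pushing sqrt($h) slightly above 1.
    return 2 * $earthRadius * asin(min(1.0, sqrt($h)));
}

// Treat two points as the same address if they are no further apart than the radius.
function is_within_radius(array $a, array $b, float $radiusMetres): bool
{
    return haversine_m($a, $b) <= $radiusMetres;
}
```

The radius then becomes the explicit similarity criterion: a small value separates neighbouring buildings, while a larger one merges them.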
