Computer networks
and7ey, 2011-10-18 21:25:59

How to compare addresses?

Has anyone dealt with the task of comparing addresses?
Given two addresses, you need to determine whether they are the same (and, ideally, estimate how confident we are that they match).

We need solutions for two cases:
1) structured addresses (a separate field for each element of the address - city, street, house, etc.);
2) unstructured addresses (written as a single line, with the order of the elements unknown).

Full normalization of the addresses (i.e., splitting them into fields, correcting errors, etc.), as done in the FACTOR solution, is not required.

9 answer(s)
Wott, 2011-10-19
@Wott

In one project I sent the addresses to Google geocoding and compared the resulting coordinates :)
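A minimal sketch of the comparison step only, assuming each address has already been geocoded into a (latitude, longitude) pair by whatever service is used. The function names `haversine_m` and `same_place` and the 50-metre tolerance are illustrative assumptions, not part of the original answer.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def same_place(a, b, tolerance_m=50.0):
    """Treat two geocoded addresses as equal if they lie within tolerance_m.

    tolerance_m is a hypothetical threshold; tune it to the geocoder's accuracy.
    """
    return haversine_m(a, b) <= tolerance_m
```

The distance itself can double as a rough confidence measure: the closer the two points, the more likely the addresses refer to the same place.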

Kindman, 2011-10-19
@Kindman

Here's how I did it:
1) Downloaded KLADR (100 MB in 5 DBF files).
2) Extracted all the STREETS from it (together with their settlements) - 860 thousand streets.
The result looked something like this:
020010010030001; road; st; Ataevka; d; Ufa; G; Bashkortostan; Rep; Ufimsky; district
For some streets, some of the fields are empty, for example:
010000010000001; Abadzekhskaya; st;;; Maykop; G;;;;
Then, for any user input, I tried to get a list of all matching streets.
For example, for "Myasoedovskaya street" we get:
380190000930002; Myasoedovskaya; st; Ekunchet; P;;; Irkutsk; region; Taishet; district
500340001040001; Myasoedovskaya; st; Kondrevo; with;;; Moscow; region; Stupinsky; district
520170000770009; Myasoedovskaya; st; cabbage; d;;; Nizhny Novgorod; region; Resurrection; district
All that remains is to narrow down the region.
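The lookup described above could be sketched roughly as follows. The field names in `FIELDS` are assumptions inferred from the eleven semicolon-separated columns in the sample lines, and `build_index` / `find_streets` are hypothetical helpers, not the author's actual code.

```python
from collections import defaultdict

# Assumed column layout, inferred from the sample lines: a code, then
# (name, type) pairs for street, locality, city, region and district.
FIELDS = (
    "code", "street", "street_type", "locality", "locality_type",
    "city", "city_type", "region", "region_type", "district", "district_type",
)


def parse_line(line):
    """Split one exported line into a dict; empty fields become ''."""
    parts = [p.strip() for p in line.split(";")]
    return dict(zip(FIELDS, parts))


def build_index(lines):
    """Map lower-cased street name -> list of records with that name."""
    index = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        index[record["street"].lower()].append(record)
    return index


def find_streets(index, user_input):
    """Return every record whose street name equals a word of the input."""
    hits = []
    for word in user_input.lower().replace(",", " ").split():
        hits.extend(index.get(word, []))
    return hits


SAMPLE = [
    "380190000930002; Myasoedovskaya; st; Ekunchet; P;;; Irkutsk; region; Taishet; district",
    "500340001040001; Myasoedovskaya; st; Kondrevo; with;;; Moscow; region; Stupinsky; district",
    "010000010000001; Abadzekhskaya; st;;; Maykop; G;;;;",
]
candidates = find_streets(build_index(SAMPLE), "Myasoedovskaya street")
```

This word-by-word lookup only handles single-word street names; multi-word names would need a phrase match or a fuzzy search.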

valerijfrolov, 2011-10-18
@valerijfrolov

One idea that came to mind: sort both addresses in the same direction and then compare them character by character, though I don't know yet how suitable this method is.

valerijfrolov, 2011-10-18
@valerijfrolov

A-Z or Z-A

knekrasov, 2011-10-18
@knekrasov

The problem is solvable if the fields are structured (through normalization or a strict format for the address line).
In that case, IMHO, it makes sense to compare the address fields the way you compare the digits of two numbers:
from higher-priority fields to lower-priority ones, i.e. from the more general to the more detailed,
for example Country -> Region / district -> locality, etc.
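A rough sketch of this "most significant field first" comparison, assuming each address is a dict keyed by field name. The field list in `FIELD_ORDER` and the scoring (share of leading fields that match) are illustrative assumptions.

```python
# Assumed hierarchy, from the most general field to the most detailed one.
FIELD_ORDER = ("country", "region", "district", "locality", "street", "house")


def compare_structured(a, b, order=FIELD_ORDER):
    """Compare two address dicts field by field, most general field first.

    Returns (score, first_mismatch): score is the share of leading fields
    that match, first_mismatch is the field where comparison stopped
    (None if every field matched). A field missing from both addresses
    counts as a match.
    """
    matched = 0
    for field in order:
        va = a.get(field, "").strip().casefold()
        vb = b.get(field, "").strip().casefold()
        if va != vb:
            return matched / len(order), field
        matched += 1
    return 1.0, None


addr_a = {"country": "Russia", "region": "Moscow", "locality": "Moscow",
          "street": "Lenin", "house": "123"}
addr_b = {"country": "russia", "region": "Moscow", "locality": "Moscow",
          "street": "Lenin", "house": "125"}
score, mismatch = compare_structured(addr_a, addr_b)
```

The score gives a crude confidence value: a mismatch in the house number leaves most of the hierarchy matched, while a mismatch in the region stops the comparison almost immediately.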

Andrew, 2011-10-18
@Morfi

You can feed both addresses to Google or Yandex and compare the normalized results.

bigbaraboom, 2011-10-19
@bigbaraboom

If you work with a database, many DBMSs offer fuzzy string comparison.
Here is an option for PostgreSQL: habrahabr.ru/blogs/postgresql/78566/. One of the algorithms is also described there, in case you do not use a database.

Andrew, 2011-10-19
@OLS

For unstructured addresses, you can probably try this:
- normalize case;
- split into tokens (runs of contiguous characters: "st", "lenin", "123", "A", "8");
- sort the tokens lexicographically;
- compute the edit distance between the compared addresses habrahabr.ru/blogs/algorithm/117063/ (treating the entire token list as the "string" and the individual tokens as its "letters").
For example, after sorting we have
"123", "8", "a", "lenin", "st"
Compare with "moscow, lenin, 123, 8":
"123", "8", "lenin", "moscow"
edit distance - 2
Compare with "Moscow, Lenin, 123-A, 8":
"123", "8", "a", "lenin", "moscow"
edit distance - 1
It is highly desirable to be able to drop from the analysis, or treat as equal when comparing, constants such as "street", "st.", "pl.", "pr-d", "passage", "city".
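The steps above can be sketched as follows. This is one possible reading: tokens are taken to be runs of letters or runs of digits, the distance is plain Levenshtein distance over token lists, and the `stopwords` parameter is a hypothetical way to drop constants such as "st" or "street".

```python
import re


def tokenize(address, stopwords=frozenset()):
    """Lower-case, split into letter runs and digit runs, drop stopwords, sort."""
    tokens = re.findall(r"[^\W\d_]+|\d+", address.casefold())
    return sorted(t for t in tokens if t not in stopwords)


def edit_distance(a, b):
    """Levenshtein distance between two sequences, with whole tokens as 'letters'."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def token_distance(addr_a, addr_b, stopwords=frozenset()):
    """Edit distance between the sorted token lists of two addresses."""
    return edit_distance(tokenize(addr_a, stopwords), tokenize(addr_b, stopwords))


base = "st. Lenin, 123 A, 8"
d1 = token_distance(base, "moscow, lenin, 123, 8")    # 2, as in the example
d2 = token_distance(base, "Moscow, Lenin, 123-A, 8")  # 1, as in the example
```

Dividing the distance by the length of the longer token list would turn it into a normalized score that could serve as a rough confidence estimate.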

Artyom Zubkov, 2011-10-19
@artzub

I'll say this: you cannot do without human intervention.
Of course, you can fit addresses to some kind of template, but the probability of error is very high.
I have two projects in which I solved these problems. Within the same data set, addresses can be structured on the "9 commas" principle (the simplest case: they can be split automatically, with suitable candidates offered to a person for the final decision), or written "however the spirit moves" - that is generally the most unsolvable case =)
What I do:
1. Imported KLADR into my system: converted its data into a tree structure in one table, plus another table with abbreviations.
2. We build our own reference tables: Regions, Districts, Cities, Streets. KLADR is only a data source; you cannot bind real addresses to it, because after an update some KLADR records may become invalid, and their IDs along with them. In other words, we store addresses in our own format.
Search:
1. If the address can be split on 9 commas or fewer, we try to find it in KLADR; not all addresses are in KLADR, so our own reference tables are searched in the same way. Then go to step 3; otherwise go to step 2.
2. If the address cannot be parsed, we show a dialog for selecting from the classifier. The user tries to find the address; if it is found, go to step 5. Otherwise the user tries to assemble an approximately matching one: say the classifier lacks the desired street in the desired city, or the locality in the district (this does happen), so they select another street, form the address, and go to step 3.
3. If the address, or part of it, could be assembled, we show the address editing dialog. The user corrects it and saves; go to step 4.
4. The program searches the input set for similar addresses that differ in some detail, say the house or building number, and offers to bind them to the newly created one. Otherwise, back to step 2.
Well, in general, something like this.
But the easiest way is to shift all responsibility onto the operator, who must import the addresses correctly. All that rests on our conscience is to ease their agony somewhat.
P.S. Have you tried analyzing a pharmaceutical assortment? =) Only a person, or a giant knowledge base with a smart AI, can solve such a problem; your task is in the same category =)
