Where can I find a suitable address parser?

Parsing

Igor, 2020-03-04 03:58:49

The database holds a huge list of addresses stored as strings.
Each string needs to be parsed.

The key task is to sort everything out, putting each part on its own shelf.
I suspect there are ready-made solutions or libraries for this.

Example list:

Addresses

Советский просп., 57, Кемерово
Молодёжный просп., 2, Кемерово
просп. Ленина, 134, Кемерово
Кузнецкий просп., 90, Кемерово
просп. Химиков, 19, Кемерово
просп. Ленина, 124, Кемерово
Кузнецкий просп., 79/2, Кемерово
Кузнецкий просп., 33/1, Кемерово
ул. Кирова, 37, Кемерово
Октябрьский просп., 34, Кемерово
Советский просп., 57, Кемерово
просп. Ленина, 90/1, Кемерово
просп. Ленина, 7, Кемерово
Октябрьский просп., 34, Кемерово
Советский просп., 47, Кемерово
бул. Строителей, 15, Кемерово
Молодёжный просп., 2, Кемерово
Октябрьский просп., 34, Кемерово
просп. Ленина, 135, Кемерово
просп. Ленина, 45, Кемерово
Советский просп., 33, Кемерово
просп. Ленина, 90/1, Кемерово
просп. Ленина, 59А, Кемерово
Октябрьский просп., 34, Кемерово
ул. Сибиряков-Гвардейцев, 11, Кемерово
Октябрьский просп., 34, Кемерово
Октябрьский просп., 34, Кемерово
просп. Ленина, 1, Кемерово
ул. Кирова, 37, Кемерово
Молодёжный просп., 2, Кемерово
просп. Ленина, 103, Кемерово
бул. Строителей, 28, Кемерово
просп. Ленина, 90/1, Кемерово
бул. Строителей, 28, Кемерово
бул. Строителей, 28, Кемерово
ул. Кирова, 16, Кемерово
просп. Ленина, 59А, Кемерово
Октябрьский просп., 34, Кемерово
Октябрьский просп., 9, Кемерово
просп. Ленина, 75, Кемерово
Ленинградский просп., 22, Кемерово
Россия, Кемерово, проспект Ленина
Октябрьский просп., 56, Кемерово
просп. Ленина, 75, Кемерово
ул. Ворошилова, 21, Кемерово
Весенняя ул., 21, Кемерово
Россия, Кемерово, улица Свободы
Октябрьский просп., 34, Кемерово
Октябрьский просп., 34, Кемерово
Октябрьский просп., 30, Кемерово
ул. Свободы, 3, Кемерово
Кузнецкий просп., 33Б, Кемерово
ул. Терешковой, 41Б, Кемерово
Советский просп., 70, Кемерово
Притомский просп., 7/3, Кемерово
ул. Гагарина, 124А, Кемерово
Октябрьский просп., 9, Кемерово
Советский просп., 72, Кемерово
Октябрьский просп., 65, Кемерово
Весенняя ул., 16, Кемерово
бул. Строителей, 55, Кемерово
Кузнецкий просп., 10А, Кемерово
Октябрьский просп., 30, Кемерово
Красная ул., 14А, Кемерово
ул. Сибиряков-Гвардейцев, 26, Кемерово
ул. Кирова, 41, Кемерово
ул. Сибиряков-Гвардейцев, 189/3, корп. 3, Кемерово
ул. Кирова, 41, Кемерово
Октябрьский просп., 30, Кемерово
ул. Тухачевского, 31/3, Кемерово
просп. Ленина, 98, Кемерово
Октябрьский просп., 28, Кемерово
ул. Рутгерса, 32, Кемерово
Ноградская ул., 5, Кемерово
Советский просп., 70, Кемерово
ул. Тухачевского, 22В, Кемерово
просп. Ленина, 49, Кемерово
Кузнецкий просп., 85, Кемерово
Советский просп., 70, Кемерово

I solved a similar problem before, but that solution left much to be desired.
Here is how I did it:
I compiled a list of regular expressions.

JSON

[
  {
    "reg_ex": "бульвар\\s+([А-яёъь 0-9\\.\\-]+)(,|$)",
    "type": "бульвар",
    "group": 1
  },
  {
    "reg_ex": ".*?,([А-яёъь 0-9\\.\\-]+)бульвар",
    "type": "бульвар",
    "group": 1
  },
  {
    "reg_ex": ".*?,([А-яёъь 0-9\\.\\-]+)пер($|,)",
    "type": "переулок",
    "group": 1
  },
  {
    "reg_ex": "улица\\s+([А-яёъь 0-9\\.\\-]+)(,|$)",
    "type": "улица",
    "group": 1
  },
  {
    "reg_ex": "ул\\.\\s+([А-яёъь 0-9\\.\\-]+)(,|$)",
    "type": "улица",
    "group": 1
  },
  {
    "reg_ex": "(ул).\\s+([А-яёъь 0-9\\.\\-]+)($|,|д\\.)",
    "type": "улица",
    "group": 2
  },
  {
    "reg_ex": ".*?,([А-яёъь 0-9\\.\\-]+)улица($|,)",
    "type": "улица",
    "group": 1
  },
  {
    "reg_ex": ".*?,([А-яёъь 0-9\\.\\-]+)ул\\.($|,)",
    "type": "улица",
    "group": 1
  },
  {
    "reg_ex": "ул\\.([А-яёъь 0-9\\.\\-]+)",
    "type": "улица",
    "group": 1
  },
  {
    "reg_ex": ",\\s+улица([A-zА-яёъь0-9\\.\\- ]+)",
    "type": "улица",
    "group": 1
  },
  {
    "reg_ex": "переулок\\s+(.*?)($|,)",
    "type": "переулок",
    "group": 1
  },
  {
    "reg_ex": ".*?,([A-zА-яёъь 0-9\\.\\-]+)переулок",
    "type": "переулок",
    "group": 1
  },
  {
    "reg_ex": "площадь\\s+([A-zА-яёъь 0-9\\.\\-]+)($|,)",
    "type": "площадь",
    "group": 1
  },
  {
    "reg_ex": ".*?,([А-яёъь 0-9\\.\\-]+)площадь",
    "type": "площадь",
    "group": 1
  },
  {
    "reg_ex": "проезд\\s+(.*?)($|,)",
    "type": "проезд",
    "group": 1
  },
  {
    "reg_ex": ",([А-яёъь 0-9\\-]+)проезд",
    "type": "проезд",
    "group": 1
  },
  {
    "reg_ex": ".*?,([A-zА-яёъь 0-9 \\-]+)пр.*?д",
    "type": "проезд",
    "group": 1
  },
  {
    "reg_ex": ".*?,([A-zА-яёъь0-9\\.\\-]+)проезд",
    "type": "проезд",
    "group": 1
  },
  {
    "reg_ex": ".*?,([A-zА-яёъь 0-9 \\-]+)переезд",
    "type": "переезд",
    "group": 1
  },
  {
    "reg_ex": "шоссе\\s+(.*?)($|,)",
    "type": "шоссе",
    "group": 1
  },
  {
    "reg_ex": ",\\s+([А-яA-z0-9 \\.\\-]+)\\s+ш.",
    "type": "шоссе",
    "group": 1
  },
  {
    "reg_ex": ".*?,([A-zА-яёъь 0-9\\.\\-]+)шоссе($|,)",
    "type": "шоссе",
    "group": 1
  },
  {
    "reg_ex": "проспект\\s+(.*?)($|,)",
    "type": "проспект",
    "group": 1
  },
  {
    "reg_ex": ",\\s+([A-zА-я.0-9\\.\\- ]+)\\s+просп.",
    "type": "проспект",
    "group": 1
  },
  {
    "reg_ex": ",\\s+пр.*?т\\s+([A-zА-я.0-9\\.\\- ]+)",
    "type": "проспект",
    "group": 1
  },
  {
    "reg_ex": ".*?,([А-яёъь 0-9\\.\\- ]+)проспект($|,)",
    "type": "проспект",
    "group": 1
  },
  {
    "reg_ex": ",\\s+просп.\\s+([A-zА-я.0-9\\.\\- ]+)",
    "type": "проспект",
    "group": 1
  },
  {
    "reg_ex": ",\\s+([A-zА-яёъь0-9 \\-\\.]+)\\s+пр\\.",
    "type": "проспект",
    "group": 1
  },
  {
    "reg_ex": "дорога\\s+(.*?)($|,)",
    "type": "дорога",
    "group": 1
  },
  {
    "reg_ex": ",\\s+([A-zА-я.0-9\\.\\- ]+)\\s+дорога",
    "type": "дорога",
    "group": 1
  },
  {
    "reg_ex": "набережная\\s+(.*?)($|,)",
    "type": "набережная",
    "group": 1
  },
  {
    "reg_ex": ".*?,([А-яёъь 0-9\\.\\- ]+)набережная($|,)",
    "type": "набережная",
    "group": 1
  },
  {
    "reg_ex": ".*?,([А-яёъь 0-9\\.\\- ]+)магистраль($|,)",
    "type": "магистраль",
    "group": 1
  },
  {
    "reg_ex": "квартал\\s+([А-яёъь 0-9\\.\\- ]+)($|,)",
    "type": "квартал",
    "group": 1
  },
  {
    "reg_ex": ".*?аллея([А-яёъь 0-9\\.\\- ]+)",
    "type": "аллея",
    "group": 1
  },
  {
    "reg_ex": ",\\s+аллея([А-я0-9\\.\\- ]+)",
    "type": "аллея",
    "group": 1
  },
  {
    "reg_ex": ",\\s+([A-zА-я.0-9 ]+)\\s+аллея",
    "type": "аллея",
    "group": 1
  },
  {
    "reg_ex": ".*?,([А-яёъь 0-9 \\-]+)тупик",
    "type": "тупик",
    "group": 1
  },
  {
    "reg_ex": ".*?,([А-яёъь 0-9 \\-]+)парк",
    "type": "парк",
    "group": 1
  },
  {
    "reg_ex": ",\\s+([A-zА-я0-9 ]+)\\s+просек",
    "type": "просек",
    "group": 1
  },
  {
    "reg_ex": ",\\s+([A-zА-я.0-9 \\-\\.]+)\\s+тракт",
    "type": "тракт",
    "group": 1
  },
  {
    "reg_ex": ",\\s+([A-zА-яёъь0-9 \\-\\.]+)\\s+сквер",
    "type": "сквер",
    "group": 1
  },
  {
    "reg_ex": ",\\s+([A-zА-яёъь0-9 \\-\\.]+)\\s+пер\\.",
    "type": "пер",
    "group": 1
  },
  {
    "reg_ex": ",\\s+(.*(линия)[А-я ]+)",
    "type": "линия",
    "group": 1
  },
  {
    "reg_ex": ",\\s+([A-zА-яёъь0-9\\-\\. ]+)\\s+линия",
    "type": "линия",
    "group": 1
  },
  {
    "reg_ex": ",\\s+пр\\s+([A-zА-яёъь0-9\\-\\. ]+)",
    "type": "проезд",
    "group": 1
  },
  {
    "reg_ex": ",\\s+посёлок\\s+([A-zА-яёъь0-9\\-\\. ]+)",
    "type": "посёлок",
    "group": 1
  },
  {
    "reg_ex": ",\\s+пл\\.\\s+([A-zА-яёъь0-9\\-\\. ]+)",
    "type": "площадь",
    "group": 1
  },
  {
    "reg_ex": ",\\s+([A-zА-яёъь0-9\\-\\. ]+)спуск",
    "type": "спуск",
    "group": 1
  },
  {
    "reg_ex": ",\\s+сквер\\s+([A-zА-яёъь0-9\\-\\. ]+)",
    "type": "сквер",
    "group": 1
  },
  {
    "reg_ex": ",\\s+станция\\s+([A-zА-яёъь0-9\\-\\. ]+)",
    "type": "станция",
    "group": 1
  },
  {
    "reg_ex": ",\\s+([A-zА-яёъь0-9\\-\\. ]+)(К|к)вартал",
    "type": "квартал",
    "group": 1
  }
]

Then I ran each address string through an algorithm that used this list.
Unsurprisingly, the accuracy was mediocre.
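
The post does not show the algorithm itself, so the following is only a minimal sketch of one plausible way such a list could be applied: try the patterns in order and take the first match. The function name `match_street` and the first-match-wins rule are assumptions, and only two entries from the list above are included.

```python
import re

# Two entries in the same shape as the JSON list above (excerpt only).
# In practice the full list would be loaded with json.load().
PATTERNS = [
    {"reg_ex": r"ул\.\s+([А-яёъь 0-9\.\-]+)(,|$)", "type": "улица", "group": 1},
    {"reg_ex": r"проспект\s+(.*?)($|,)", "type": "проспект", "group": 1},
]


def match_street(address, patterns=PATTERNS):
    """Return (type, name) from the first pattern that matches, else None.

    First-match-wins is an assumption; the original algorithm is not shown.
    """
    for entry in patterns:
        match = re.search(entry["reg_ex"], address)
        if match:
            return entry["type"], match.group(entry["group"]).strip()
    return None


print(match_street("ул. Кирова, 37, Кемерово"))
print(match_street("Россия, Кемерово, проспект Ленина"))
print(match_street("Советский просп., 57, Кемерово"))  # no pattern in this excerpt fits
```

With an approach like this, every spelling variant that no pattern anticipates falls through to `None`, which may be part of why the accuracy suffers.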

I hope ready-made solutions exist.
If anyone has come across one, please let me know.

Thank you.


3 answers

Anton Kravchenko, 2020-03-04
@AntonKravchenko

Have you tried the dadata.ru service?

Sergey Pankov, 2020-03-04
@trapwalker

The addresses in your example are very clean and tidy.
It is not very clear what you mean by "sort it out" and what kind of shelves you need.
The general approach to processing such arrays of loosely structured data is as follows.

Processing should be done in stages. Each stage should make minimal, non-destructive changes to as much of the dataset as possible.
Each stage must be transparent. The values affected by a modification can be deduplicated and sorted so that they are easy to check by eye. This makes anomalies easy to spot.
Keep detailed statistics of the changes made at each stage. Check every statistical outlier by eye: a minimum word length or word count that is too short, words or word sets that are too long, numbers that are too large or too small...
At each stage, either save the previous state of the dataset or keep the original and have convenient tools for quickly re-running all the stages on the original dataset from scratch. Sometimes it makes sense to rearrange the stages, because a destructive change that went unnoticed turns out to have been made a couple of stages back.
A series of small edits with simple operations is preferable to complex algorithms and regexps. For example, in your case it is much better to use a series of simple, understandable replacements: "st." with "street", "pl." with "square", and so on.
To avoid building complex regexps, it makes sense to split the lines into words and replace whole words, or, as one of the steps, to add a space to the beginning and end of each line. This simplifies the regexps and keeps you from making mistakes that are not immediately noticeable.
The lines affected by replacements can be sorted and deduplicated for checking by eye. It may turn out that you have many Plava-Laguna lanes, whose name is also customarily abbreviated this way.
If you have a street directory, you can set a special flag on every record in the processed dataset whose street matches exactly one directory entry, and leave the streets in those records alone from then on. The rest can be sorted and deduplicated again (counting frequencies along the way) and cleaned up in bulk or one by one, starting with the most frequent cases, using replacements, regexps, and so on.
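
The staged approach described above could be sketched roughly as follows. This is an illustration under assumptions, not the answerer's actual tooling: the abbreviation table and the names `expand_abbreviations` and `run_stages` are hypothetical, and the table would need to grow after reviewing real data.

```python
from collections import Counter

# Hypothetical abbreviation table; extend it after reviewing real data.
ABBREVIATIONS = {"ул.": "улица", "просп.": "проспект", "бул.": "бульвар"}


def expand_abbreviations(line):
    """Replace whole words only, keeping any trailing comma in place."""
    words = []
    for word in line.split():
        core = word.rstrip(",")
        tail = word[len(core):]
        words.append(ABBREVIATIONS.get(core, core) + tail)
    return " ".join(words)


def run_stages(original, stages):
    """Apply stages in order without modifying the original list.

    Returns the final rows plus, for each stage, a Counter of
    (before, after) pairs so every change can be reviewed by eye.
    """
    rows = list(original)
    report = []
    for stage in stages:
        new_rows = [stage(row) for row in rows]
        changes = Counter(
            (old, new) for old, new in zip(rows, new_rows) if old != new
        )
        report.append((stage.__name__, changes))
        rows = new_rows
    return rows, report


original = [
    "Советский просп., 57, Кемерово",
    "ул. Кирова, 37, Кемерово",
    "Весенняя ул., 16, Кемерово",
]
rows, report = run_stages(original, [expand_abbreviations])
for name, changes in report:
    print(name, "changed", sum(changes.values()), "rows")
    for (old, new), count in sorted(changes.items()):
        print(f"  {count}x {old!r} -> {new!r}")
```

Because the original list is never touched, reordering or removing a stage only requires calling `run_stages` again with a different stage list.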

In your case:

Split the addresses on commas and save the parts in separate fields. CSV is better than a DB here: it is simpler. If you find more than two commas (in your case), fail immediately and loudly, and count such cases. If there are few of them, fix them manually; if there are many, add a separate filter with a fix as a separate stage.
Sort and deduplicate all three columns separately (just for inspection), look through the resulting sets by eye, and look for problems. Automate this step; you will need it more than once.
Process the dataset with a series of obvious replacements that reduce diversity. This eliminates abbreviations, multiple spellings of the same thing, and differences in letter case.
Pull the street directory out of FIAS. You can take it in the KLADR format; it may be simpler there, in a separate file. Mark the dataset being cleaned with links to the streets found in the directory. Review the remaining streets, deduplicated and sorted (alphabetically and by frequency). Look for typical problems that can be fixed automatically, and make manual edits for narrow special cases (but do them programmatically, as replacements in a separate step, so that you can reprocess the original dataset).
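
The steps above can be sketched as follows. All function names are hypothetical, the directory is a hand-made set standing in for one extracted from FIAS, and rejecting every line that does not have exactly three parts is an assumption that is slightly stricter than "more than two commas".

```python
from collections import Counter


def split_address(address, expected_parts=3):
    """Split on commas and fail loudly when the shape is unexpected."""
    parts = [part.strip() for part in address.split(",")]
    if len(parts) != expected_parts:
        raise ValueError(
            f"expected {expected_parts} parts, got {len(parts)}: {address!r}"
        )
    return parts


def split_all(addresses):
    """Return (rows, rejects) so odd lines are counted, not silently lost."""
    rows, rejects = [], []
    for address in addresses:
        try:
            rows.append(split_address(address))
        except ValueError:
            rejects.append(address)
    return rows, rejects


def column_frequencies(rows, index):
    """Deduplicated values of one column, most frequent first."""
    return Counter(row[index] for row in rows).most_common()


def flag_known_streets(rows, directory):
    """Pair each row with True when its street is found in the directory."""
    return [(row, row[0] in directory) for row in rows]


addresses = [
    "ул. Кирова, 37, Кемерово",
    "ул. Кирова, 41, Кемерово",
    "просп. Ленина, 7, Кемерово",
    "ул. Сибиряков-Гвардейцев, 189/3, корп. 3, Кемерово",
]
rows, rejects = split_all(addresses)
print(len(rejects), "rejected:", rejects)
print(column_frequencies(rows, 0))
print(flag_known_streets(rows, {"ул. Кирова"}))
```

A line such as "Россия, Кемерово, проспект Ленина" would pass the comma check with its parts in the wrong columns, which is the kind of anomaly the per-column review is meant to surface.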

If your whole dataset looks roughly like the example, I will clean it for you for a couple of thousand rubles, regardless of its size. Well, almost regardless. =)

Vladimir Korotenko, 2020-03-04
@firedragon

Search GitHub for "FIAS", for example:
https://github.com/zabralex85/fias.parser
The upside is that you get an absolutely exact address. The downside is that even the optimized database takes 10 GB.
That said, I shrank the data down to 100 kilobytes, but I only needed regions and cities.