OpenStreetMap
beduin01, 2020-02-14 09:54:37

How to normalize addresses?

I added the OSM tag because the problem can perhaps be solved with its help.

The essence of the problem: there are a lot of malformed addresses of the form "Moscow, Lenina St. 15" or "Magnitogorsk Mira 47". The errors follow common patterns: in some places spaces are missing, in others the word for "city" alternates between different abbreviations and the full form (g., gor., gorod), and so on.

The question: how do I normalize these addresses and bring them to a unified form? I am interested exclusively in offline solutions.

3 answer(s)
Sergey Pankov, 2020-02-14
@trapwalker

After we discovered the https://dadata.ru/ service, we stopped wasting time and money on our own homegrown hacks altogether. The service is excellent.
For us, online mode and processing speed are not critical, so we even fit within the free trial plan.
I believe they also offered an option to install their software inside a closed network, which is exactly the offline mode you need. That option is certainly not free, though.
Before dadata, we solved this problem with a terrible pile of filters, regexp replacements, and manual human-in-the-loop work.
The general scheme works not only for addresses but for any dirty data:
1. Save the input dataset as CSV and NEVER change it.
2. Processing is multi-stage. Each stage consists of a filter and a modifier. The filter decides whether the modifier applies to a given record. The modifier makes its change only if the filter allows it.
3. Keep a debug output that lets you quickly review every change that was made.
4. Each step should make one minimal improvement of a single type to as many rows as possible. The goal of every step is to reduce the variety of problems, increase regularity, and standardize.
5. With huge input datasets you can save intermediate output, but in general the cleaning should look like a pipeline of successive processing steps.
- It often happens that some step silently breaks data, and you notice only when it is too late: the subsequent steps are already implemented and debugged and rely heavily on the output of the breaking one. Because the process is staged and the input is immutable, you can always zip the current state with the output of any earlier step and restore the necessary pieces with one more filter.
- It often happens that one stage, while improving individual records, removes the very features that another stage's filter relies on. Because the process is incremental, the steps can be reordered.
- When a step changes a record, the step must leave its signature in a separate column. This is useful for troubleshooting.
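A minimal sketch of the scheme above, not the actual code we used. The stage names, the `changed_by` column, and the sample rows are all hypothetical and chosen only for illustration: each stage is a filter plus a modifier, the raw rows are never mutated, every change is printed as debug output, and each stage that changes a record appends its signature.

```python
import csv
import io
import re
from typing import Callable, NamedTuple


class Stage(NamedTuple):
    name: str                        # signature written to the "changed_by" column
    applies: Callable[[str], bool]   # filter: may the modifier touch this value?
    modify: Callable[[str], str]     # modifier: one small, uniform improvement


# Hypothetical stages; real ones would be derived from the problems seen in the data.
STAGES = [
    Stage("collapse_spaces",
          lambda a: "  " in a,
          lambda a: re.sub(r" {2,}", " ", a)),
    Stage("space_after_comma",
          lambda a: re.search(r",\S", a) is not None,
          lambda a: re.sub(r",(?=\S)", ", ", a)),
]


def run_pipeline(raw_rows, stages):
    """Return new rows; the input rows are never mutated."""
    rows = [{"address": r["address"], "changed_by": ""} for r in raw_rows]
    for stage in stages:
        next_rows = []
        for row in rows:
            value = row["address"]
            if stage.applies(value):
                new_value = stage.modify(value)
                if new_value != value:
                    # Debug output: show every change a stage makes.
                    print(f"[{stage.name}] {value!r} -> {new_value!r}")
                    signature = ",".join(filter(None, [row["changed_by"], stage.name]))
                    row = {"address": new_value, "changed_by": signature}
            next_rows.append(row)
        rows = next_rows
    return rows


# The raw CSV stays untouched; here it is an in-memory stand-in for the input file.
RAW_CSV = 'address\n"Moscow,Lenina St. 15"\nMagnitogorsk  Mira 47\n'
raw = list(csv.DictReader(io.StringIO(RAW_CSV)))
cleaned = run_pipeline(raw, STAGES)
```

Because the stages are just items in a list, reordering them or inserting a repair stage between two existing ones is a one-line change.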
Tell us more about why online solutions are not being considered. I'm intrigued.

freeExec, 2020-02-14
@freeExec

libpostal has modes both for splitting an address into components and for normalizing it.
https://github.com/openvenues/libpostal
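A rough sketch of calling both modes from Python, assuming the `postal` bindings (pypostal) and the libpostal C library with its data files are installed locally. The wrapper function `parse_and_expand` is a hypothetical helper, not part of the library.

```python
def parse_and_expand(address: str) -> dict:
    """Split an address into labeled components and list its normalized forms."""
    # Imported lazily: the bindings need the libpostal C library and its data files.
    from postal.expand import expand_address
    from postal.parser import parse_address

    return {
        # List of (value, label) pairs, e.g. labels such as "road" or "house_number".
        "components": parse_address(address),
        # List of normalized string variants of the whole address.
        "expansions": expand_address(address),
    }
```

Everything runs locally once the data files are downloaded, which fits the offline requirement in the question. How well it handles a particular set of addresses should be checked on a sample first.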

granty, 2020-02-14
@granty

Use the method of successive passes.
1. Build a reference list of cities, towns, villages, farmsteads, etc. (I relied on OKTMO, the All-Russian Classifier of Territories of Municipal Formations).
Decide how to recognize settlement-type prefixes: railway st. / pos. / smt. / d. / s. / X. (they are listed in OKTMO)
2. Collect a reference list of the correct street names for each city.
Street names must be normalized according to a single convention, either
Cooperative 5th passage
or
5th Cooperative passage
Decide how to expand abbreviations:
per. -> lane
pr. -> passage or avenue
Make the first pass with your patterns. If the city and street are identified correctly, write them (and the remaining house number) to the database.
3. Remove the correctly identified addresses from the original dataset.
Look at what is left, adjust the patterns for the remainder, and make the next pass.
Whatever remains at the end goes to manual review. Be prepared for the fact that some streets and lanes are missing from Yandex / Google maps even though they actually exist. I marked such doubtful street names in the database with a Trusted=0 flag in order to deal with them later.
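The passes above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `STREETS` dictionary stands in for the OKTMO-based city list and the per-city street list, and both regular expressions are hypothetical patterns, not the ones I actually used.

```python
import re

# Hypothetical reference data: city -> set of known street names.
STREETS = {"Moscow": {"Lenina", "Mira"}, "Magnitogorsk": {"Mira"}}

# One pattern per pass; new ones get added after inspecting what is left over.
PASSES = [
    # Pass 1: "City, Street St. 15"
    re.compile(r"^(?P<city>\w+),\s*(?P<street>\w+)\s+St\.?\s*(?P<house>\S+)$"),
    # Pass 2: "City Street 47" (no comma, no street marker)
    re.compile(r"^(?P<city>\w+)\s+(?P<street>\w+)\s+(?P<house>\S+)$"),
]


def run_passes(addresses):
    """Return (resolved records, addresses left for manual review)."""
    resolved = []
    remaining = list(addresses)
    for pattern in PASSES:
        still_unresolved = []
        for raw in remaining:
            m = pattern.match(raw.strip())
            # Accept a match only if the city and street are in the reference lists.
            if m and m["street"] in STREETS.get(m["city"], set()):
                resolved.append({"raw": raw, "city": m["city"],
                                 "street": m["street"], "house": m["house"]})
            else:
                still_unresolved.append(raw)
        remaining = still_unresolved
    return resolved, remaining


resolved, leftover = run_passes(
    ["Moscow, Lenina St. 15", "Magnitogorsk Mira 47", "Mosocw Lenina 3"]
)
```

Here the misspelled "Mosocw" fails the reference-list check in both passes and ends up in `leftover`, which is the part that goes to manual review.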
If you rely on OpenStreetMap / Yandex / Google maps, note that they name some streets slightly differently:
Street 850 years of Moscow
850th anniversary of Moscow
But you still need maps, because addresses without geo-coordinates are of little use.
P.S. House numbers will be a real trap in any case: 40/1 can also appear as 40 k1 or 40 korp.1.
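One cautious way to handle this, as a sketch only: fold the explicit building-suffix spellings into one form and flag the slash form for review instead of guessing. The accepted spellings (`k`, `korp.`) and the canonical `40 k1` output are assumptions for illustration.

```python
import re

# Assumed spellings of the building suffix; extend after inspecting real data.
_KORPUS = re.compile(r"^(?P<num>\d+)\s*(?:k|korp\.?)\s*(?P<korpus>\d+)$", re.IGNORECASE)
_SLASH = re.compile(r"^(?P<num>\d+)/(?P<part>\d+)$")


def normalize_house(raw):
    """Return (canonical_form, needs_review)."""
    text = raw.strip()
    m = _KORPUS.match(text)
    if m:
        return f"{m['num']} k{m['korpus']}", False
    m = _SLASH.match(text)
    if m:
        # "40/1" may or may not denote a building suffix: keep it and flag it.
        return f"{m['num']}/{m['part']}", True
    return text, False
```

For example, `normalize_house("40 korp.1")` gives `("40 k1", False)`, while `normalize_house("40/1")` gives `("40/1", True)` so that a person or a map lookup can decide.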
