Python
skomoroh, 2014-02-02 01:36:08

How to find an address in unformatted text?

I have free-form text that may contain an address (in an unknown form).
I need to extract this address, convert it to the required format, and link it to metro stations.
Addresses may be misspelled or may contain only part of the information (if the city is not specified, assume Moscow).
If the text contained only an address, I could search for it directly through Yandex.Maps or 2GIS.
If cities and streets were always spelled correctly, I could match them against a directory.
I would be grateful for any advice.
Thanks in advance.

3 answer(s)
Evgeny Fedorov, 2014-02-02
@JekFdrv

Regexp.

ondister, 2014-02-02
@ondister

Something tells me that you need a street directory covering the various cities, and then you look for matches against it.
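
A minimal sketch of matching against a directory while tolerating misspellings, using the standard-library difflib. The STREETS list, the closest_street helper, and the 0.75 cutoff are assumptions for illustration; a real directory would have to be loaded from an external source.

```python
import difflib

# Hypothetical mini-directory; a real one would be loaded from a street database.
STREETS = ["Tverskaya", "Arbat", "Leninsky", "Profsoyuznaya", "Myasnitskaya"]

def closest_street(word, streets=STREETS, cutoff=0.75):
    """Return the directory entry most similar to `word`, or None if nothing is close."""
    # Compare in lowercase, but return the directory's own spelling.
    lowered = {s.lower(): s for s in streets}
    hits = difflib.get_close_matches(word.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

print(closest_street("Tverskya"))  # a misspelled street name
print(closest_street("flower"))    # a word that is not a street
```

The cutoff is a trade-off: a lower value tolerates worse typos but also starts matching ordinary words to street names, so it is worth tuning on a sample of the real data.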

maxfox, 2014-02-02
@maxfox

In general, the task is very ambitious. You need to decide which errors you can tolerate: missing some addresses, picking up something that is not an address, etc.
Even a person finds it difficult to find an address in a phrase like "yesterday I was at the flower shop."
It is a different matter if there is something to latch onto: the street name is capitalized, there are markers like "st." or "street", the address must contain a house number, etc.
If such criteria are suitable, then:
1. Look for numbers.
2. Look for occurrences of "st.", "street", "pr.", "avenue", "square".
3. Look for words that start with a capital letter in the middle of a sentence.
Then take the tokens near these positions and run them through Yandex/2GIS/FIAS. How to filter and process the results depends on the results themselves. Do not try to write a universal parser; focus on the features of the material you are working with.
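
The three heuristics above can be sketched roughly as follows. This is only an illustration: the marker list uses English stand-ins, and candidate_spans and the window size are hypothetical names and values, not part of any geocoder API. The fragments it returns are what you would then send to a geocoder.

```python
import re

# Assumed patterns; adapt the marker list to the abbreviations in your own data.
NUMBER = re.compile(r"\b\d+[a-zA-Z]?\b")
STREET_MARKERS = re.compile(r"\b(?:st|street|pr|avenue|square)\b\.?", re.IGNORECASE)
# A capitalized word that follows a lowercase letter or comma plus whitespace,
# i.e. one that does not start a sentence.
CAPITALIZED_MID = re.compile(r"(?<=[a-z,]\s)[A-Z][a-z]+")

def candidate_spans(text, window=30):
    """Return text fragments around every heuristic hit, merging overlapping ones."""
    hits = []
    for pattern in (NUMBER, STREET_MARKERS, CAPITALIZED_MID):
        for m in pattern.finditer(text):
            hits.append((max(0, m.start() - window), min(len(text), m.end() + window)))
    hits.sort()
    merged = []
    for start, end in hits:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [text[s:e].strip() for s, e in merged]

for fragment in candidate_spans("call Vasya when we get there, Tverskaya street 12, red brick house"):
    print(fragment)
```

A phrase with no numbers, markers, or mid-sentence capitalized words produces no candidates at all, which is the cheap first filter before any external lookup.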
I recently solved a similar but slightly simpler problem. There was an Excel database in which one of the columns held customer addresses. But they were recorded in a very free form, i.e. there was junk like "red brick house", "entry under the barrier", "call Vasya when we get there", etc. We managed to filter out the junk, although about 2-3 entries out of 100 had to be processed by hand (because there might be no address at all, only "Zarya factory" or "Petrovich's cafe").
In general, good luck.
