How to find postal addresses in free text?

fatum, 2012-08-18 20:45:35

data mining

I have arbitrary text as input. I also have an address service that can check an address for correctness and bring it to a uniform format.
How should I organize the processing of incoming arbitrary text so that addresses are found in it? For example: the city of Manganets, st. Kibalchicha 4.
I guess this is best done in several stages.
1. Pull out the piece of text around keywords: street, city, house;
2. Next, I want to automate the process and have a system that learns from examples labeled by a human;
Perhaps there are other ideas?
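
Stage 1 could look roughly like the sketch below. It is only an illustration: the `KEYWORDS` list, the `candidate_windows` name, and the `radius` parameter are assumptions made up for this example, and a real marker list would hold the abbreviations of the target language.

```python
import re

# Hypothetical marker list, chosen only for this example.
KEYWORDS = ("city", "street", "st.", "house")


def candidate_windows(text, keywords=KEYWORDS, radius=30):
    """Return fragments of `text` lying within `radius` characters of a keyword."""
    # (?<!\w) keeps "st." from matching the tail of a word such as "first."
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keywords) + ")",
        re.IGNORECASE,
    )
    spans = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - radius)
        end = min(len(text), match.end() + radius)
        if spans and start <= spans[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return [text[s:e] for s, e in spans]
```

Each returned fragment can then be passed to the address service for checking, so the expensive validation runs only on the parts of the text that look promising.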

11 answer(s)

fatum, 2012-08-19
@fatum

So, let me summarize my findings.
NLP is not exactly what I need. It works fine for special cases, but it is not suitable for a diverse set of texts.
I chose the direction of machine learning.
In that case I do not have to create templates myself, find dependencies, and so on.
Therefore, I signed up for 2 courses at coursera.org:
Machine Learning - www.coursera.org/course/ml
Web Intelligence and Big Data - www.coursera.org/course/bigdata

marazmiki, 2012-08-18
@marazmiki

Sorry.

MikhailEdoshin, 2012-08-19
@MikhailEdoshin

Use regular expressions to isolate tokens, that is, the roles of words. You will have roles for special marker words (g., st., d, korp., apt., plus their variants) and roles for other words, which can be subdivided into well-known cities and typical street names (Moscow, Lenin, Mira), numbers, postal codes, and a single UNKNOWN role for all other words.
If the text contains such tokens (a known city, a street marker, or, depending on the text, even just a number), that is a signal: this may be an address. You extract the list of tokens and then parse it. Suppose you get "KNOWN-CITY STREET UNKNOWN NUMBER". Most likely (99%) the UNKNOWN is the street name. Or UNKNOWN UNKNOWN NUMBER HYPHEN NUMBER: this may be (50%) a short form of an address such as "Tygdym, Severnaya, 25-12". The probabilities are illustrative, of course :) There will be relatively few such patterns, and they are much easier to parse. In the simplest case you can build a recognition table: "KNOWN-CITY STREET UNKNOWN NUMBER -> KNOWN-CITY STREET NAME-STREET NUMBER".
If there is no rule for a pattern, record the case so that the developer can add one later. You can also remember newly recognized streets and track which streets belong to which city. This is the simplest possible algorithm, of course, but it will work quite well.
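
A minimal sketch of this role-tagging and recognition-table idea might look as follows. Everything named here (`KNOWN_CITIES`, `MARKERS`, `RULES`, `parse_address`) is hypothetical, the dictionaries are toy data, and reading "25-12" as house and apartment is an assumption. The sketch also expects a fragment that has already been isolated from the surrounding text.

```python
import re

# Toy dictionaries; real ones would come from an address database.
KNOWN_CITIES = {"moscow", "lugansk"}
MARKERS = {"g.": "CITY", "st.": "STREET", "d.": "HOUSE",
           "korp.": "BUILDING", "apt.": "APARTMENT"}

# A token is a run of digits, a hyphen, or a word with an optional trailing period.
TOKEN_RE = re.compile(r"\d+|-|[^\W\d_]+\.?")

# Recognition table: role pattern -> field name per token (None = drop the token).
RULES = {
    ("KNOWN-CITY", "STREET", "UNKNOWN", "NUMBER"):
        ("city", None, "street", "house"),
    ("UNKNOWN", "UNKNOWN", "NUMBER", "HYPHEN", "NUMBER"):
        ("city", "street", "house", None, "apartment"),
}

unmatched_patterns = []  # patterns with no rule yet, kept for the developer


def tokenize(text):
    """Split text into (role, word) pairs."""
    tokens = []
    for raw in TOKEN_RE.findall(text):
        if raw.lower() in MARKERS:
            tokens.append((MARKERS[raw.lower()], raw))
            continue
        word = raw.rstrip(".")  # a period after a non-marker word is punctuation
        if word.lower() in KNOWN_CITIES:
            role = "KNOWN-CITY"
        elif word.isdigit():
            role = "NUMBER"
        elif word == "-":
            role = "HYPHEN"
        else:
            role = "UNKNOWN"
        tokens.append((role, word))
    return tokens


def parse_address(text):
    """Return a dict of address fields, or None if no rule fits the role pattern."""
    tokens = tokenize(text)
    pattern = tuple(role for role, _ in tokens)
    fields = RULES.get(pattern)
    if fields is None:
        unmatched_patterns.append(pattern)
        return None
    return {name: word for name, (_, word) in zip(fields, tokens) if name}
```

With this table, `parse_address("Tygdym, Severnaya, 25-12")` yields a city, street, house, and apartment, while an unrecognized pattern ends up in `unmatched_patterns` for later review.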

No_Time, 2012-08-18
@No_Time

So there's nothing to apologize for. The essence is reflected extremely accurately, the best solution is proposed, so everything is ok =)

parkee, 2012-08-18
@parkee

Someone doesn't know regular expressions very well ;) All of this can be extracted with them, including the whole variety of phone number formats, although of course not with 100% accuracy. The postal code, by the way, can contain letters, as everywhere in the UK, for example. As one option, a city, street, and phone number can be merged into a record about the same place if they occur in the same or neighboring sentences. In general, everything here depends heavily on the specific text; there is no universal answer. The city marker is not abbreviated all that often, and it can also be filtered with a dictionary, although again that depends on the text, the task, and the volume.
And yes, let's not forget about regular expressions ;) They are involved in one way or another in every natural language processing system. If you have time, you can go through that NLP course at nlp-class.org. Everything should become clear.
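
The merging by sentence proximity mentioned above could be sketched like this. It assumes the mentions have already been extracted (for example, with regular expressions) and tagged with the index of the sentence they came from; the `group_by_proximity` name and the tuple layout are made up for illustration.

```python
def group_by_proximity(mentions, max_gap=1):
    """Merge (sentence_index, kind, value) mentions into records about one place.

    Mentions whose sentence indices differ by at most `max_gap` from the
    previous mention go into the same record.
    """
    groups = []
    last_index = None
    for index, kind, value in sorted(mentions):
        if last_index is None or index - last_index > max_gap:
            groups.append({})  # too far from the previous mention: new record
        # Keep the first value seen for each kind within a record.
        groups[-1].setdefault(kind, value)
        last_index = index
    return groups
```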

Juggler, 2012-08-18
@Juggler

Don't take this as advertising, I just use it myself: ahunter.ru

Alexander Khmelev, 2012-08-19
@akhmelev

In theory, without a database of cities and addresses, the problem is quite difficult to solve at all.
For example, "Lugansk Oboronnaya 39" differs little from ordinary text unless you know in advance that it refers to a city and a street.
At the same time, Google works out which address is meant: my.jetscreenshot.com/226/20120819-yhtt-39kb
Conclusion: if the application is not under heavy load, you can simply use the Google Maps API; otherwise you need your own database.
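
To illustrate why a database helps, here is a rough sketch of a lookup against your own data. The `GAZETTEER` mapping and the `find_addresses` function are hypothetical, the data is a single toy entry, and only the simplest "city street number" word order is handled.

```python
import re

# Toy database: city -> set of its street names (lowercased).
GAZETTEER = {"lugansk": {"oboronnaya"}}


def find_addresses(text):
    """Return (city, street, house) triples confirmed by the gazetteer."""
    words = re.findall(r"\w+", text)
    hits = []
    for i in range(len(words) - 2):
        city, street, house = words[i:i + 3]
        streets = GAZETTEER.get(city.lower(), ())
        # Without the gazetteer, these three words look like ordinary text.
        if street.lower() in streets and house.isdigit():
            hits.append((city, street, house))
    return hits
```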

fatum, 2012-08-18
@fatum

In general, I understand how to isolate a string that looks like a postal address; the question is more about the overall algorithm for the task.
I don't need exact, specific solutions. I'm looking for a rough but understandable approach to solving this problem.
For instance, I know there is contextual advertising that finds keywords on a page by itself and turns them into links. This is in the same vein, but much more complicated)))

Nikolai Turnaviotov, 2012-08-18
@foxmuldercp

Well.
Then here you are, for example: g. Blabla, * words of the text * st. Blabla Blabla, 01101, v.123-12-12
There are not that many commonly used abbreviations, so you can try. But the city marker comes in several variants, there may be no period or comma after Blabla in the street name, the postal code seems to be 5 characters around the world, and phone numbers can be written any which way.
And why do you need this, exactly?

fatum, 2012-08-18
@fatum

Oh, forget about regular expressions already)
Trimming the text down to a minimum is not a problem, I understand how to do that.
But what next? Is it actually possible to train a system to find the address, or will it have to go through every case by brute force?

fatum, 2012-08-19
@fatum

If you have something to say, write it in any case.
Thanks