Formatting (normalizing) a postal address to search for duplicates?
The database contains 100,500 records with postal addresses (Moscow, prospekt Mira, 10-204). You need to find duplicates. How can this be done? Are there ready-made solutions, or will it be necessary to reinvent the wheel?
Cheap labor will save you :)
But in general you can normalize the addresses through dadata.ru and then try to match them.
Only the houses, buildings and apartments will take some tinkering.
If the data has no normalization at all to start with, you can do it in a rather strange way: find a maps API, geocode every address, and compare the returned coordinates, for example by hash.
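The coordinate-comparison idea can be sketched roughly like this. It is only an illustration: `group_by_coordinates`, the `geocode` callable and the `precision` parameter are hypothetical names, and the lookup table stands in for a real maps API with made-up coordinates.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Coords = Tuple[float, float]  # (latitude, longitude)


def group_by_coordinates(
    addresses: Iterable[str],
    geocode: Callable[[str], Coords],
    precision: int = 4,
) -> List[List[str]]:
    """Group addresses whose geocoded coordinates round to the same key."""
    buckets: Dict[Coords, List[str]] = defaultdict(list)
    for address in addresses:
        lat, lon = geocode(address)
        # The rounded pair is hashable, so it works as a dictionary key.
        key = (round(lat, precision), round(lon, precision))
        buckets[key].append(address)
    # Only buckets holding more than one address are duplicate candidates.
    return [group for group in buckets.values() if len(group) > 1]


# Stand-in for a real geocoder call; these coordinates are invented.
FAKE_GEOCODER = {
    "Moscow, prospekt Mira, 10": (55.7801, 37.6329),
    "Moskva, pr. Mira, d. 10": (55.78012, 37.63291),
    "Moscow, Tverskaya, 1": (55.7570, 37.6150),
}

duplicates = group_by_coordinates(FAKE_GEOCODER, FAKE_GEOCODER.__getitem__)
print(duplicates)
```

Two caveats with this approach: points that are very close can still land on opposite sides of a rounding boundary, and a geocoder usually resolves an address to a building, so apartment numbers would still have to be compared separately.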
Try the Yandex.Maps API: run each address through its geocoder and compare the output. Documentation.
Well, cheap child labor will save you.
Hire freelancers for a dollar an hour and let them sort it out.
I had a similar task a long time ago (12-13 years ago). I brought the addresses to a normalized form in several passes:
1) Converted everything to one case (upper).
2) Replaced all the variants "pr.", "ave." and "prospect" with a single form. I did the same with apartments, houses, buildings and other things.
3) Based on all this, I then brought the data to 3rd normal form.
4) Sorted out the addresses that could not be normalized.
On a database of 25,000-30,000 addresses, I spent 2 or 3 days.
Clearly the solution is crude and may not be entirely effective, but the alternative was to re-enter all this information by hand, which did not really suit me :)
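A minimal sketch of passes 1, 2 and 4, assuming transliterated addresses like the one in the question. The variant table, the token names (`PR`, `D`, `KV`) and the `EXPECTED` pattern are hypothetical; a real table would be built from the spellings actually found in the data.

```python
import re
from collections import defaultdict

# Hypothetical variant table: every spelling on the left becomes one token.
REPLACEMENTS = [
    (r"\b(PROSPEKT|PROSPECT|PR|AVE)\b\.?", "PR"),
    (r"\b(DOM|D)\b\.?", "D"),
    (r"\b(KVARTIRA|KV|APT)\b\.?", "KV"),
]

# Hypothetical shape of a fully normalized address: "... D <house> [KV <apt>]".
EXPECTED = re.compile(r"^[\w ]+ D \d+\w*( KV \d+)?$")


def normalize(address: str) -> str:
    text = address.upper()                       # pass 1: one case
    for pattern, token in REPLACEMENTS:          # pass 2: unify variants
        text = re.sub(pattern, token + " ", text)
    text = re.sub(r"[^\w\s]", " ", text)         # punctuation to spaces
    return " ".join(text.split())                # collapse whitespace


def find_duplicates(addresses):
    """Group the original strings by their normalized form."""
    groups = defaultdict(list)
    for address in addresses:
        groups[normalize(address)].append(address)
    return [group for group in groups.values() if len(group) > 1]


def unparsed(addresses):
    """Pass 4: addresses that do not fit the expected shape, for manual review."""
    return [a for a in addresses if not EXPECTED.match(normalize(a))]


sample = [
    "Moscow, prospekt Mira, d. 10, kv. 204",
    "MOSCOW, pr. Mira, dom 10, kvartira 204",
    "Moscow, Tverskaya, d. 1",
    "somewhere near the station",
]
print(find_duplicates(sample))
print(unparsed(sample))
```

Keeping the variants in a table rather than in code makes it easy to add a new spelling after each pass over the leftovers, which fits the several-pass workflow described above.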
The Papyrus system (www.petroglif.ru) can do this. But your problem is really a custom job: import, parsing, export. It can easily be done for little money.