Algorithms
psyloss, 2013-07-01 20:16:39

Normalizing postal addresses to search for duplicates?

The database contains 100,500 records with postal addresses (Moscow, prospekt Mira, 10-204). You need to find duplicates. How can this be done? Are there ready-made solutions, or will it be necessary to reinvent the wheel?


6 answer(s)
lyalius, 2014-05-28
@lyalius

Cheap labor will save you :)
But in general, you can normalize the addresses through dadata.ru and then try to match them.
The only part you will have to tinker with is houses, buildings and apartments.

dilix, 2013-07-01
@dilix

If the data has had no normalization at all, there is a rather unusual way to do it: find a maps API, geocode every address, and compare the returned coordinates, for example by hashing them.
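
A minimal sketch of that idea, under stated assumptions: `geocode` is a hypothetical callable you supply (a wrapper around whichever maps API you choose) that returns a `(lat, lon)` pair or `None`, and `precision` is an arbitrary rounding choice, not a recommended value. The rounded coordinates serve as the hash key.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Optional, Tuple

Coords = Tuple[float, float]


def group_by_coordinates(
    addresses: Iterable[str],
    geocode: Callable[[str], Optional[Coords]],  # hypothetical wrapper around a maps API
    precision: int = 5,  # assumed number of decimal places kept; tune for your data
) -> Tuple[List[List[str]], List[str]]:
    """Group addresses whose geocoded coordinates round to the same point."""
    groups = defaultdict(list)
    unresolved = []
    for address in addresses:
        coords = geocode(address)
        if coords is None:
            # The geocoder could not resolve this address; keep it for manual review.
            unresolved.append(address)
            continue
        # The rounded (lat, lon) tuple is hashable and acts as the bucket key.
        key = (round(coords[0], precision), round(coords[1], precision))
        groups[key].append(address)
    duplicates = [group for group in groups.values() if len(group) > 1]
    return duplicates, unresolved
```

Two caveats with this sketch: a geocoder usually resolves an address to a building, so different apartments in the same building would land in one group and the apartment number would still need a separate comparison; and two nearly identical points can round to different keys if they straddle a rounding boundary.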

mrstrictly, 2013-07-01
@mrstrictly

Try the Yandex.Maps API: run each address through the geocoder and compare the output. See the documentation.

Puma Thailand, 2013-07-01
@opium

Well, cheap child labor will save you.
Hire freelancers for a dollar an hour and let them sort it out.

Alexey Ostroverkhov, 2013-07-02
@sharptop

I had a similar task a long time ago (12-13 years back). I brought the addresses to a normalized form in several passes:
1) Converted everything to one case (upper).
2) Replaced all variants such as "pr.", "ave." and "prospect" with a single form. I did the same for apartments, houses, buildings and so on.
3) Based on all this, I then brought the data to third normal form.
4) Dealt separately with the addresses that could not be normalized.
On a database of 25,000-30,000 addresses, this took me 2 or 3 days.
Clearly the solution is crude and may not be the most efficient, but the alternative was to re-enter all this information manually, which did not really suit me :)
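
The first two passes could be sketched roughly like this. It is only an illustration, not the code used back then: the `SYNONYMS` table, the canonical tokens and the function names are assumptions, and a real table would be far longer.

```python
import re
from collections import defaultdict

# Assumed synonym table (illustrative, not exhaustive): variant -> canonical token.
SYNONYMS = {
    "PR": "PROSPEKT",
    "AVE": "PROSPEKT",
    "PROSPECT": "PROSPEKT",
    "UL": "ULITSA",
    "STR": "ULITSA",
    "D": "DOM",
    "KV": "KVARTIRA",
    "APT": "KVARTIRA",
}


def normalize(address: str) -> str:
    # Pass 1: convert everything to one case (upper).
    text = address.upper()
    # Split into runs of word characters, which drops commas, dots and hyphens:
    # "PR." becomes "PR" and "10-204" becomes "10", "204".
    tokens = re.findall(r"\w+", text)
    # Pass 2: replace every known variant with its single canonical form.
    tokens = [SYNONYMS.get(token, token) for token in tokens]
    return " ".join(tokens)


def find_duplicates(addresses):
    """Return groups of original addresses that share a normalized form."""
    groups = defaultdict(list)
    for address in addresses:
        groups[normalize(address)].append(address)
    return [group for group in groups.values() if len(group) > 1]
```

With this table, "Moscow, prospekt Mira, 10-204" and "moscow, pr. Mira, 10-204" both normalize to "MOSCOW PROSPEKT MIRA 10 204" and end up in the same group. Addresses that still fail to match after these passes are the ones left for manual handling.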

antonsobolev, 2013-08-10
@antonsobolev

The Papyrus system (www.petroglif.ru) can do this. That said, your task is a one-off custom job: import, parse, export. It can easily be done for little money.
