PHP
z6Dabrata, 2012-09-13 01:09:55

How to detect duplicates when updating a database via an API?

Good afternoon,

My project updates a database on a third-party server via an API: several tables with several million records each. Each record has a unique identifier.

Unfortunately, the API is not very smart: it only supports adding new records or completely overwriting a table.

New data arrives daily (several tens of thousands of records).
The problem is that some of it may duplicate records that already exist, and the API has no mechanism for detecting this.
If I simply add the records, they get duplicated.

Question: What is the best way to avoid duplication of information?

One option: keep a copy of the database on my own server and use it to detect duplicates.
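Roughly like this (a sketch only; the mirror table and field names are made up for illustration):

<?php
// Local mirror: mirror_records(uid PRIMARY KEY) lists everything
// already pushed to the third-party server.
$pdo = new PDO('mysql:host=localhost;dbname=mirror', 'user', 'pass');
$check = $pdo->prepare('SELECT 1 FROM mirror_records WHERE uid = ?');

$incoming = array(
    array('uid' => 101, 'name' => 'example'),   // today's batch
);

$toSend = array();
foreach ($incoming as $record) {
    $check->execute(array($record['uid']));
    if ($check->fetchColumn() === false) {      // uid not seen before
        $toSend[] = $record;
    }
}
// Push only $toSend through the API, then insert their uids
// into mirror_records so tomorrow's batch sees them.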

But I want something more elegant.

Thank you.


5 answers
vsespb, 2012-09-13
@vsespb

I can't give specific advice. Nothing other than keeping a copy of the database comes to mind. Whether that is good practice or not, I can't say; maybe it is. You would need to know all the details of the API: who provides it, why, for what, and so on.
I can only advise asking the API provider to fix their API, or to clarify whether you are using it as intended. Or let them advise you.

ertaquo, 2012-09-13
@ertaquo

Try computing a hash of each record and storing it in a separate indexed field. Although if the database is already large, populating that field will take quite a lot of time.
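Something like this, as a sketch: a local table dedup_hashes with an indexed hash column; the table name and the md5-over-JSON choice are just for the example.

<?php
// A stable hash for a record: fix the field order first so the
// same data always produces the same digest.
function recordHash(array $record) {
    ksort($record);
    return md5(json_encode($record));
}

$pdo = new PDO('mysql:host=localhost;dbname=mirror', 'user', 'pass');

// hash is a UNIQUE/PRIMARY key, so INSERT IGNORE skips known records.
$insert = $pdo->prepare('INSERT IGNORE INTO dedup_hashes (hash) VALUES (?)');

$incomingRecords = array(
    array('uid' => 101, 'name' => 'example'),
);

$fresh = array();
foreach ($incomingRecords as $record) {
    $insert->execute(array(recordHash($record)));
    if ($insert->rowCount() > 0) {   // 1 = new hash, 0 = already seen
        $fresh[] = $record;
    }
}
// Only $fresh goes to the API.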

tolyjan, 2012-09-13
@tolyjan

Remove duplicate records with something like:
DELETE u1 FROM users u1, users u2 WHERE u1.id > u2.id AND u1.name = u2.name;
The WHERE condition can be adjusted to your needs: if duplicates are defined by the name field, then the condition should compare u1.name = u2.name.
This is a fairly elegant and efficient approach.

AlexeyVD, 2012-09-13
@AlexeyVD

Ideally, if the data in the tables on the server must not be duplicated, it would be logical to create unique keys there on the relevant fields; then duplicates simply could not get in.
Accordingly, if you need to avoid errors when inserting duplicate data, the API should be fixed to use INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE ..., depending on what you need.
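For reference, what the two variants look like from PHP, assuming a table items with a unique key on ext_id (names invented for the example):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=target', 'user', 'pass');
$row = array('ext_id' => 42, 'payload' => 'some data');

// Variant 1: silently skip rows whose unique key already exists.
$skip = $pdo->prepare(
    'INSERT IGNORE INTO items (ext_id, payload) VALUES (?, ?)'
);
$skip->execute(array($row['ext_id'], $row['payload']));

// Variant 2: overwrite the payload when the unique key already exists.
$upsert = $pdo->prepare(
    'INSERT INTO items (ext_id, payload) VALUES (?, ?)
     ON DUPLICATE KEY UPDATE payload = VALUES(payload)'
);
$upsert->execute(array($row['ext_id'], $row['payload']));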

EugeneOZ, 2012-09-13
@EugeneOZ

You can store in your key-value store the IDs of everything you have added, as keys of the form 'prefix:hash', where hash is a hash of the new record's data and prefix is a constant. Give them an expiry of a week (or whatever period suits) so they delete themselves, and before adding, check whether a key named 'prefix:hash' exists. If it does not, add the record. This is not 100% protection against duplicates, but I think it can weed out a very large percentage.
100% protection would be possible if you could add a hash field to the entity and ask the API, before adding, whether a record with that hash already exists. Then you could first check your key-value store (as described) and, only if the key is missing, check with the API (so as not to make unnecessary requests).
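A sketch of the first part, assuming Redis through the phpredis extension; the 'seen:' prefix and the one-week TTL are just this example's choices:

<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$record = array('uid' => 101, 'name' => 'example');
$key = 'seen:' . md5(json_encode($record));   // prefix + hash of the data

// SET with NX+EX is atomic: it succeeds only if the key is new,
// and the key deletes itself after a week (604800 seconds).
$isNew = $redis->set($key, 1, array('nx', 'ex' => 604800));

if ($isNew) {
    // The hash was not seen before: push the record to the API.
}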
