V
Java
Vasily Zhukov, 2012-05-16 19:08:31

How to synchronize large tables?

There is a table in the database that is either almost empty or contains about 2,000,000 rows.

There is a file with a "new version" of this table, containing about 2,000,000 rows (about 100 MB in size). That is, the file contains the primary key, and synchronization is done by it.

The table structures are the same: the primary key is always a number, and the remaining columns contain numbers and strings of limited length (up to 20 characters). The rows in the file are not sorted by the primary key.

Synchronization goes from the file to the database, i.e., as usual, you need to:

1. Mark rows in the database that are not in the file as deleted (there is a deleted column);
2. Update rows in the database whose column values have changed in the file;
3. Add new rows from the file that are not in the database.

It is necessary to make synchronization as fast as possible with the least memory usage in Java.

Right now there is a fast solution using hash tables, but it eats an exorbitant amount of memory; I have looked at HashMap, Trove, and other implementations. That is, the file is loaded into a HashMap and then everything is simple.
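
For reference, the current map-based approach looks roughly like this. This is only a sketch under assumptions: the Row class, the column names a and b, and the table name mytable are placeholders, since the real schema is not shown in the post.

import java.sql.*;
import java.util.*;

// Sketch of the memory-hungry variant: the whole file sits in a HashMap keyed by the
// primary key, which is exactly what eats the memory.
public class HashMapSyncSketch {
    static final class Row {
        final String a; final long b;
        Row(String a, long b) { this.a = a; this.b = b; }
    }

    static void sync(Connection con, Map<Long, Row> fileRows) throws SQLException {
        List<Long> toDelete = new ArrayList<>();          // in the DB but not in the file
        Map<Long, Row> toUpdate = new HashMap<>();        // in both, but with different values
        try (Statement q = con.createStatement()) {
            q.setFetchSize(10_000);
            ResultSet rs = q.executeQuery("select id, a, b from mytable");
            while (rs.next()) {
                long id = rs.getLong(1);
                Row f = fileRows.remove(id);
                if (f == null) toDelete.add(id);
                else if (!f.a.equals(rs.getString(2)) || f.b != rs.getLong(3)) toUpdate.put(id, f);
            }
        }
        // whatever is still left in fileRows exists only in the file and must be inserted;
        // toDelete, toUpdate and fileRows then drive the three write passes (not shown)
    }
}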

There is another solution that uses little memory, based on FileHashMap, where the map values are saved to disk, but it is very slow.

It needs to be both fast and light on memory, i.e. at most about 150 MB (in fact, the entire file can be loaded into memory as a byte array).

What other options are there?


6 answer(s)
V
Vasily Zhukov, 2012-05-17
@vip_delete

I came up with the following solution:
1. Load the file into memory as a byte array, where each K-byte record is one table row.
2. Sort this array with a sorting algorithm that needs no additional memory, i.e. sorts in place or uses O(log N) memory. In practice I chose the quicksort from Arrays.sort (the one described in "Engineering a Sort Function"), modified to sort fixed-size K-byte records. It uses O(log N) memory and is mega fast (an array of 2,000,000 records sorts in 250 ms).
3. Issue the query select id, a, b, c from mytable order by id (loading not everything at once, but in batches via fetchSize).
4. Now we walk over the rows coming from the query (array A, index i) and over the sorted array from the file (array B, index j). For ease of explanation, imagine that these are simply two sorted arrays.
5. i = 0, j = 0
5.1 if A[i].id == B[j].id, then (if A[i] != B[j], add B[j] to temp_update); i++, j++;
5.2 if A[i].id > B[j].id, then B[j] is added to temp_insert; j++;
5.3 if A[i].id < B[j].id, then add A[i] to temp_delete; i++;
5.4 when one of the two sides runs out, the remaining A[i] go to temp_delete and the remaining B[j] go to temp_insert.
6. Execute three queries against these temporary tables; the table in the database is now synchronized.
Total: O(1) memory, and it works just as fast as with the map (the merge loop is sketched below).
If the file is really big, say a billion records, then first sort the file itself with a merge sort and then proceed from step 3.
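
A minimal JDBC sketch of steps 3 through 6 might look like the following. It is not the author's actual code: the FileRow class, the column names a and b, and the temp_delete/temp_update/temp_insert table names are placeholders; the file is assumed to be already decoded and sorted into rows[]; how the temporary tables get filled is not specified in the answer, so plain batched inserts are used here, and a real version would flush the batches periodically rather than only at the end.

import java.sql.*;

// Sorted merge of the database cursor (A) against the sorted file records (B).
public class SortedMergeSketch {

    static final class FileRow {                          // one decoded K-byte record
        final long id; final String a; final long b;
        FileRow(long id, String a, long b) { this.id = id; this.a = a; this.b = b; }
    }

    static void sync(Connection con, FileRow[] rows) throws SQLException {
        con.setAutoCommit(false);
        try (Statement q = con.createStatement();
             PreparedStatement del = con.prepareStatement("insert into temp_delete(id) values (?)");
             PreparedStatement upd = con.prepareStatement("insert into temp_update(id, a, b) values (?, ?, ?)");
             PreparedStatement ins = con.prepareStatement("insert into temp_insert(id, a, b) values (?, ?, ?)")) {

            q.setFetchSize(10_000);                       // step 3: stream the table in batches
            ResultSet rs = q.executeQuery("select id, a, b from mytable order by id");

            boolean hasDb = rs.next();                    // A: current database row
            int j = 0;                                    // B: index into the sorted file rows
            while (hasDb && j < rows.length) {
                long dbId = rs.getLong(1);
                FileRow f = rows[j];
                if (dbId == f.id) {                       // 5.1: same key, update if the payload differs
                    if (!f.a.equals(rs.getString(2)) || f.b != rs.getLong(3)) {
                        upd.setLong(1, f.id); upd.setString(2, f.a); upd.setLong(3, f.b);
                        upd.addBatch();
                    }
                    hasDb = rs.next(); j++;
                } else if (dbId > f.id) {                 // 5.2: file row missing in the DB -> insert
                    ins.setLong(1, f.id); ins.setString(2, f.a); ins.setLong(3, f.b);
                    ins.addBatch();
                    j++;
                } else {                                  // 5.3: DB row missing in the file -> delete
                    del.setLong(1, dbId); del.addBatch();
                    hasDb = rs.next();
                }
            }
            while (hasDb) {                               // 5.4: leftover DB rows are deletions
                del.setLong(1, rs.getLong(1)); del.addBatch();
                hasDb = rs.next();
            }
            for (; j < rows.length; j++) {                // 5.4: leftover file rows are inserts
                ins.setLong(1, rows[j].id); ins.setString(2, rows[j].a); ins.setLong(3, rows[j].b);
                ins.addBatch();
            }
            del.executeBatch(); upd.executeBatch(); ins.executeBatch();
            // step 6: three UPDATE/INSERT statements joining mytable with the temp tables go here
            con.commit();
        }
    }
}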

D
denver, 2012-05-16
@denver

Why not just:

START TRANSACTION;
UPDATE mytable SET is_deleted=1;
INSERT INTO mytable (field1, field2, ...) VALUES
('field1', 'field2', ...), ('field1', 'field2', ...), ('field1', 'field2', ...), ...
ON DUPLICATE KEY UPDATE
is_deleted=0,
field1=VALUES(field1),
field2=VALUES(field2),
...
;
COMMIT;
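
From the Java side this could be driven with a batched prepared statement, roughly as sketched below. The ON DUPLICATE KEY UPDATE syntax assumes MySQL, as in the SQL above; the FileRow class, the id/field1/field2 column names and the batch size are placeholders, not part of the original answer.

import java.sql.*;
import java.util.List;

// Hypothetical JDBC batching of the upsert above.
public class UpsertBatchSketch {
    static final class FileRow {
        final long id; final String field1; final String field2;
        FileRow(long id, String f1, String f2) { this.id = id; this.field1 = f1; this.field2 = f2; }
    }

    static void sync(Connection con, List<FileRow> fileRows) throws SQLException {
        String upsert = "INSERT INTO mytable (id, field1, field2) VALUES (?, ?, ?) "
                      + "ON DUPLICATE KEY UPDATE is_deleted = 0, "
                      + "field1 = VALUES(field1), field2 = VALUES(field2)";
        con.setAutoCommit(false);
        try (Statement st = con.createStatement();
             PreparedStatement ps = con.prepareStatement(upsert)) {
            st.executeUpdate("UPDATE mytable SET is_deleted = 1");   // everything starts as deleted
            int n = 0;
            for (FileRow r : fileRows) {
                ps.setLong(1, r.id); ps.setString(2, r.field1); ps.setString(3, r.field2);
                ps.addBatch();
                if (++n % 10_000 == 0) ps.executeBatch();            // flush every 10k rows
            }
            ps.executeBatch();
            con.commit();
        }
    }
}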


A
Arktos, 2012-05-16
@Arktos

1. Load the data into an array and sort the array (if sorting is even needed?). It will be both faster and use less memory.
2. Why load the entire file at all and only then synchronize? Load one row, synchronize it, load the next one, and so on (a per-row sketch is below). If you think this will take longer, then explain how you quickly synchronize a DB table against an in-memory hash table.
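
A row-by-row variant might look like this sketch. The assumptions are mine, not the answerer's: the file is treated as a simple delimited text file with columns id;a;b, and the column names are placeholders. Note that this covers items 2 and 3 of the question; item 1 (flagging rows missing from the file) still needs a separate pass or a trick like the is_deleted one above.

import java.io.*;
import java.sql.*;

// Sketch of row-by-row synchronization: constant memory, one row in flight at a time.
public class RowByRowSync {
    static void sync(Connection con, BufferedReader file) throws IOException, SQLException {
        try (PreparedStatement sel = con.prepareStatement("select 1 from mytable where id = ?");
             PreparedStatement upd = con.prepareStatement("update mytable set a = ?, b = ?, deleted = 0 where id = ?");
             PreparedStatement ins = con.prepareStatement("insert into mytable (id, a, b, deleted) values (?, ?, ?, 0)")) {
            String line;
            while ((line = file.readLine()) != null) {
                String[] f = line.split(";");                  // assumed field separator
                long id = Long.parseLong(f[0]);
                sel.setLong(1, id);
                boolean exists;
                try (ResultSet rs = sel.executeQuery()) { exists = rs.next(); }
                if (exists) {                                  // existing row -> update it
                    upd.setString(1, f[1]); upd.setLong(2, Long.parseLong(f[2])); upd.setLong(3, id);
                    upd.executeUpdate();
                } else {                                       // new row -> insert it
                    ins.setLong(1, id); ins.setString(2, f[1]); ins.setLong(3, Long.parseLong(f[2]));
                    ins.executeUpdate();
                }
            }
        }
    }
}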

Z
ztxn, 2012-05-17
@ztxn

Why do any hashing or keep a set on the client at all, if you have a primary key?
We try to insert a row; if we get an exception, it means such a row already exists and needs to be updated instead. Of course, the big question is how to handle this exception properly, given that it can be raised by different systems.
I liked the advice in the answer above to set the deleted flag first. Or rather, at first I didn't like it, but as an alternative to uploading the complete set of records to the database side in order to diff the sets there, it may even turn out to be quite reasonable.
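
A sketch of this insert-first idea is below. The table and column names are placeholders; java.sql.SQLIntegrityConstraintViolationException is the standard JDBC subclass for a duplicate-key violation, but whether a given driver actually throws that subclass (rather than a plain SQLException with a vendor code) is exactly the portability concern raised above.

import java.sql.*;

// Sketch: try to insert; on a duplicate-key violation, fall back to an update.
public class InsertFirstSync {
    static void upsert(Connection con, long id, String a, long b) throws SQLException {
        try (PreparedStatement ins = con.prepareStatement(
                 "insert into mytable (id, a, b, deleted) values (?, ?, ?, 0)")) {
            ins.setLong(1, id); ins.setString(2, a); ins.setLong(3, b);
            ins.executeUpdate();
        } catch (SQLIntegrityConstraintViolationException dup) {
            // the primary key already exists -> update the existing row instead
            try (PreparedStatement upd = con.prepareStatement(
                     "update mytable set a = ?, b = ?, deleted = 0 where id = ?")) {
                upd.setString(1, a); upd.setLong(2, b); upd.setLong(3, id);
                upd.executeUpdate();
            }
        }
    }
}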

V
Vasily Zhukov, 2012-05-17
@vip_delete

In short, these solutions don't fit: the bottleneck is the mass setting of the deleted flag on all records. In the 2,000,000-row table almost all rows are not deleted, so setting the flag touches nearly every row and takes about 5 minutes, and synchronization hasn't even started yet :) With the map solution everything would have finished by now.

V
voilesik, 2012-05-21
@voilesik

Perhaps it is worth loading the file into a buffer table (with a structure identical to the target table) and then merging using the database's own tools.
If the key is a positive number, the buffer table may not even be needed: simply load the file into the negative range of the target table's key. The balance between merge speed and commit size (the number of modifications per transaction) can be tuned by processing in relatively small blocks. I think this method might work for you.
I.e., for example, if you have a table like:

create table t_mytable (
  id integer,
  field1 type1,
  ....,
  CONSTRAINT pk_mytable PRIMARY KEY (id));

and there are some values:
1 value1 value... ...
2 value2 value... ...
3 value3 value... ...

You load the data from the file into it by inverting (negating) the key, for example:
-3 valueC value... ...
-1 valueA value... ...
1 value1 value... ...
2 value2 value... ...
3 value3 value... ...

Well, then there are many options for how to remove the missing keys and update the positive part of the keys with the new values. If you're interested I can write it out, but I think you can do it yourself (one possible sketch is below).
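
One way the final merge could look for the negative-key variant is sketched below. This is only an assumption-laden illustration: it uses MySQL-style multi-table UPDATE syntax (the same dialect as the ON DUPLICATE KEY UPDATE answer above), assumes a deleted column on t_mytable (from the question, not shown in the DDL above), shows only field1, and runs as one transaction even though the answer suggests splitting the work into smaller blocks. The buffer-table variant mentioned at the start of the answer works the same way, joining on the buffer table instead of the negative key range.

import java.sql.*;

// Sketch of the "negative key range" merge executed with plain statements.
public class NegativeKeyMergeSketch {
    static void merge(Connection con) throws SQLException {
        String[] steps = {
            // 1. positive rows with no negative twin are gone from the file -> mark deleted
            "UPDATE t_mytable a LEFT JOIN t_mytable b ON b.id = -a.id "
          + "   SET a.deleted = 1 WHERE a.id > 0 AND b.id IS NULL",
            // 2. positive rows that do have a negative twin take its values
            "UPDATE t_mytable a JOIN t_mytable b ON b.id = -a.id "
          + "   SET a.field1 = b.field1, a.deleted = 0 WHERE a.id > 0",
            // 3. negative rows with no positive twin are new -> insert them with the real key
            "INSERT INTO t_mytable (id, field1, deleted) "
          + "   SELECT -b.id, b.field1, 0 FROM t_mytable b "
          + "   LEFT JOIN t_mytable a ON a.id = -b.id WHERE b.id < 0 AND a.id IS NULL",
            // 4. drop the staging rows
            "DELETE FROM t_mytable WHERE id < 0"
        };
        con.setAutoCommit(false);
        try (Statement st = con.createStatement()) {
            for (String sql : steps) st.executeUpdate(sql);
            con.commit();
        }
    }
}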
