S
S
Sergey Sokolov2013-02-28 03:26:18
PHP
Sergey Sokolov, 2013-02-28 03:26:18

How to compress an ordered array of unique natural numbers ogr. range?

The web application compares pairwise sets of positive integers.
Each set does not contain repetitions within itself, any of the numbers is not more than 210 million (28 bits).
There can be from 1 to 5 million of them in a set.
Comparing sets A and B, you need to get sets “unique for A”, “unique for B” and “common core”. In particular, it is easy to answer the questions "Is there a number N in the set S?"
The implementation, alas, is in php and so far on shared hosting. I quickly implemented it by loading the hosting MySQL: for each set, a temporary table with a single index column. In most cases, the tables are larger than what engine=Memory can fit, and on disk tables this is not fast at all, but it works.
How to effectively keep such a set, so that the comparison of two sets is performed quickly, taking the minimum footprint from memory?
It occurred to me to write each set with a bit mask of 2 ^ 28 bits (32Mb) long. Of the 210 million bits, there are only 5 million ones, the rest are 0: they can be written as a number of zeros in a row, for example. Very much like a bicycle. Tell everyone, except me, a well-known algorithm that is effective for compressing binary data in the special case of "many zeros in a row"?
I read about Huffman coding , it looks like it will be inefficient for finding each of the 5 million numbers of the second set inside the first one.

Answer the question

In order to leave comments, you need to log in

2 answer(s)
M
MikhailEdoshin, 2013-02-28
@sergiks

In general, this is run-length encoding. You can find the intersection and difference without unpacking, looking at both sets in parallel, but checking an arbitrary number can also be done only by looking, not very efficiently.

M
MikhailEdoshin, 2013-02-28
@MikhailEdoshin

Considering that there are few numbers, it is easier to store them in a sorted array - 5 million 32-bit numbers will take 19 MB, intersection and difference are also found by parallel scan in O( N + M ), checking for an element occurrence by binary search is O(log N ) .

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question