Algorithm for finding numbers without pairs from a large data stream?

A

ArtemE2013-07-04 17:21:34

Algorithms

ArtemE, 2013-07-04 17:21:34

Good afternoon!
Today the KPI-Open Olympiad was held and there was an interesting task:
Limits: 8 sec. 8Mb

Condition
There are N natural numbers (N is paired from 2 to 10,000,000) modulo no more than 1,000,000,000. Almost all of these numbers have a pair. But two numbers - without a pair. It is necessary to find those same two numbers without a pair.
I pay attention to restrictions.
Example
Input
6
4 8 4 7 9 9
Answer
7 8
Example2
Input
8
7 7 7 5 5 5 5 6
Answer
6 7

The solution with BitMap does not go through memory. It is impossible to keep all the numbers in memory. We think that the solution is based on the xor of all numbers among themselves and the selection of two numbers that, if they were xor with the xor of all numbers, would give 0.
The fact is that not only two numbers are suitable for xor - but some kit. In order to weed out unsuitable ones, most likely, it is necessary to introduce a rule for verification. Perhaps there is an opportunity to find out the sum of these numbers (which are without a pair) or their product?

Reply

Answer the question

In order to leave comments, you need to log in

4 answer(s)

J

jcmvbkbc, 2013-07-04
@ArtemE

uint32_t xor_all = 0;
uint32_t xor_bit[32] = {0};

....

for (i = 0; i < n; ++i) {
    xor_all ^= in;
    for (j = 0; j < 32; ++j) {
        if (in & (1 << j))
            xor_bit[j] ^= in;
    }
}

for (j = 0; j < 32; ++j) {
    if (xor_all & (1 << j)) {
        out1 = xor_bit[j];
        out2 = xor_all ^ xor_bit[j];
        break;
    }
}

The idea is that unpaired numbers must differ by at least one bit.
In addition to the common xor, we will drag one xor for numbers that have the i-th bit set.
At the end, the common xor will give us different bits, one of which we will find one of the unpaired numbers, and then the second.

T

turboNOMAD, 2013-07-04
@turboNOMAD

I would use a set.
The algorithm is as follows: for each number, we check whether it is in the set. If not, add, otherwise remove. This check goes for O(log n) for std::set in C++, or even for a constant for hash containers. (I don't know which language can be used according to the problem statement).
The only problem is if the numbers and their pairs are not located pseudo-randomly, but for example, first all the unique ones, and then all their pairs. In this case, when passing the first half of the array, the size of the set will only increase and we can fly with a memory limit.

N

nicolausYes, 2013-07-04
@nicolausYes

+1, posted the same thing. Depending on the compiler, C++ can use either C++11 containers (unordered_set) or from the tr1 namespace on older compilers, if present. In fact, when solving it head-on, you will have to store at least half of the values, or store it somehow more cunningly.

I thought a little: the memory problem can be circumvented if it is allowed to bypass the numbers not sequentially, but in random order.
You can also use the "divide and conquer" principle, i.e. build, as described above, sets of elements without a pair for small pieces of the original array, and then look for a symmetric difference (union without intersection) of these sets.

So you still save them in an array? Immediately save to the structure used when reading (set, unordered_set, unordered_map - tr1 definitely has it).

A

Arktos, 2013-07-04
@Arktos

We will need to count this array 2 or 3 times. First, we will store these numbers modulo 10^5. Let's get a bool-array of 10^5 elements - how many times each module was encountered. Consider 2 options
1) The required numbers modulo 10^5 are different. Then we will have 2 non-zero cells. We know the modules of the desired numbers by 10^5. The task was reduced to the following - to find from the array of numbers one number that does not have a pair. Solved by xor. We read the array again and calculate 2 xor for the found modules.
2) The desired numbers are modulo 10^5 (all cells are zero). Then they are not equal modulo 10^5+1.