What data structure should be used to count unique IPs in a huge file?
Given a file with IP addresses; addresses can repeat. The file is many times larger than the available RAM. We need to count the number of unique IPs. A simple solution that ignores the scale of the task is to load all the IPs into a hash table; the number of elements in it is the answer. The only problem is that such a hash table will not fit in RAM, so some other data structure is needed. Which one? And in general, what is worth reading on the topic of highload tasks and algorithms?
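For reference, the naive hash-table approach from the question might look like this (a minimal sketch; the file name "ips.txt" and one-address-per-line format are assumptions):

```python
# Naive approach: keep every distinct IP in an in-memory set.
# Works only while the set of unique addresses fits in RAM.
unique_ips = set()
with open("ips.txt") as f:          # assumed: one IP per line
    for line in f:
        unique_ips.add(line.strip())
print(len(unique_ips))
```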
If you can afford a loss of precision, take a look at HyperLogLog. It can reduce memory consumption many times over.
Such a structure is implemented, for example, in Redis, where it takes 12 KB with a standard error of 0.81%.
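A minimal sketch of how this looks with Redis (assuming a local Redis server, the redis-py client, and an "ips.txt" file with one address per line):

```python
import redis

r = redis.Redis()                          # assumes a local Redis instance
with open("ips.txt") as f:                 # assumed: one IP per line
    for line in f:
        r.pfadd("unique-ips", line.strip())  # add to the HyperLogLog
# Approximate number of unique IPs, ~0.81% standard error
print(r.pfcount("unique-ips"))
```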
You don't have to take this seriously, but just for laughs: the file system solves this for you. For each IP, create a file with that name, and then just list the directory.
As already said, it all fits comfortably in memory as a bitmap. But if it didn't fit (for example, if these were not 32-bit IP addresses but 48-bit MAC addresses), you would need some kind of external sorting to get all the addresses sorted; the unique ones are then easy to count in a single pass.
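For IPv4 the bitmap needs 2^32 bits = 512 MiB. A minimal sketch of the bitmap count (the file name and one-address-per-line format are assumptions):

```python
import socket
import struct

# 2^32 bits = 512 MiB: one bit per possible IPv4 address.
seen = bytearray(2 ** 32 // 8)
count = 0
with open("ips.txt") as f:                 # assumed: one IPv4 address per line
    for line in f:
        # Convert the dotted-quad string to a 32-bit integer.
        n = struct.unpack("!I", socket.inet_aton(line.strip()))[0]
        byte, bit = divmod(n, 8)
        if not seen[byte] & (1 << bit):
            seen[byte] |= 1 << bit
            count += 1
print(count)
```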
You can sort in different ways. For example, read the file in chunks as large as fit in memory, sort each chunk however you like, and write it back to disk. The resulting sorted chunks can then be merged, as in merge sort.
You can also use radix sort.
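A rough sketch of the chunk-and-merge variant described above (the chunk size, file name, and use of temporary files are assumptions):

```python
import heapq
import itertools
import tempfile

CHUNK_LINES = 1_000_000                    # assumed to fit comfortably in RAM

def sorted_chunk_files(path):
    """Split the input into sorted chunks, each written to its own temp file."""
    files = []
    with open(path) as f:
        while True:
            chunk = sorted(line.rstrip("\n") + "\n"
                           for line in itertools.islice(f, CHUNK_LINES))
            if not chunk:
                break
            tmp = tempfile.TemporaryFile(mode="w+")
            tmp.writelines(chunk)
            tmp.seek(0)
            files.append(tmp)
    return files

def count_unique(path):
    """k-way merge of the sorted chunks, counting distinct lines in one pass."""
    count = 0
    prev = None
    for line in heapq.merge(*sorted_chunk_files(path)):
        if line != prev:
            count += 1
            prev = line
    return count

print(count_unique("ips.txt"))             # file name is an assumption
```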
If there are also restrictions on disk space and the data does not fit in memory, you can use a Bloom filter. Size it to however much memory you have and use several hash functions. Then, in a single pass, check whether each address read is already in the filter; if not, add it and increment the counter. But this is a probabilistic method, and it may miss some addresses due to false positives in the Bloom filter.
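A rough sketch of the Bloom-filter variant (the filter size, number of hash functions, and use of hashlib for double hashing are assumptions; the result is a lower bound, since false positives can skip genuinely new addresses):

```python
import hashlib

M = 8 * 2 ** 30            # bits in the filter (1 GiB here); tune to available RAM
K = 7                      # number of hash functions

bits = bytearray(M // 8)

def positions(item):
    """Derive K bit positions from two independent hashes (double hashing)."""
    digest = hashlib.sha256(item.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return [(h1 + i * h2) % M for i in range(K)]

count = 0
with open("ips.txt") as f:                 # assumed: one IP per line
    for line in f:
        pos = positions(line.strip())
        if not all(bits[p // 8] & (1 << (p % 8)) for p in pos):
            # Probably not seen before: set its bits and count it.
            for p in pos:
                bits[p // 8] |= 1 << (p % 8)
            count += 1
print(count)
```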