What data structure should be used to count unique IPs in a huge file?
Given a file with IP addresses; addresses can repeat. The file is many times larger than the available RAM. We need to count the number of unique IPs. A simple solution that ignores the scale of the task is to load all the IPs into a hash table; the number of elements in it is the answer. The only problem is that such a hash table will not fit in RAM, so some other data structure is needed. Which one? And in general, what is worth reading on the topic of highload tasks and algorithms?
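For reference, the naive hash-table approach from the question might look like this (a minimal sketch; the file name "ips.txt" and one-address-per-line format are assumptions):

```python
# Naive approach: keep every distinct IP in an in-memory set.
# Works only while the set of unique addresses fits in RAM.
unique_ips = set()
with open("ips.txt") as f:          # assumed: one IP per line
    for line in f:
        unique_ips.add(line.strip())
print(len(unique_ips))
```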
If you can afford a loss of precision, take a look at HyperLogLog. It can reduce memory consumption many times over.
Such a structure is implemented, for example, in Redis, where it takes 12 KB with a standard error of 0.81%.
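A minimal sketch of how this looks with Redis (assuming a local Redis server, the redis-py client, and an "ips.txt" file with one address per line):

```python
import redis

r = redis.Redis()                          # assumes a local Redis instance
with open("ips.txt") as f:                 # assumed: one IP per line
    for line in f:
        r.pfadd("unique-ips", line.strip())  # add to the HyperLogLog
# Approximate number of unique IPs, ~0.81% standard error
print(r.pfcount("unique-ips"))
```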
You don't have to take this seriously, but just for laughs: the file system solves this for you. For each IP, create a file with that name, and then just list the directory.
As already said, it all fits comfortably in memory as a bitmap. But if it didn't fit (for example, if these were not 32-bit IP addresses but 48-bit MAC addresses), you would need some kind of external sorting to get all the addresses sorted; the unique ones are then easy to count in a single pass.
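For IPv4 the bitmap needs 2^32 bits = 512 MiB. A minimal sketch of the bitmap count (the file name and one-address-per-line format are assumptions):

```python
import socket
import struct

# 2^32 bits = 512 MiB: one bit per possible IPv4 address.
seen = bytearray(2 ** 32 // 8)
count = 0
with open("ips.txt") as f:                 # assumed: one IPv4 address per line
    for line in f:
        # Convert the dotted-quad string to a 32-bit integer.
        n = struct.unpack("!I", socket.inet_aton(line.strip()))[0]
        byte, bit = divmod(n, 8)
        if not seen[byte] & (1 << bit):
            seen[byte] |= 1 << bit
            count += 1
print(count)
```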
You can sort in different ways. For example, read the file in chunks as large as fit in memory, sort each chunk however you like, and write it back to disk. The resulting sorted chunks can then be merged, as in merge sort.
You can also use radix sort.
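A rough sketch of the chunk-and-merge variant described above (the chunk size, file name, and use of temporary files are assumptions):

```python
import heapq
import itertools
import tempfile

CHUNK_LINES = 1_000_000                    # assumed to fit comfortably in RAM

def sorted_chunk_files(path):
    """Split the input into sorted chunks, each written to its own temp file."""
    files = []
    with open(path) as f:
        while True:
            chunk = sorted(line.rstrip("\n") + "\n"
                           for line in itertools.islice(f, CHUNK_LINES))
            if not chunk:
                break
            tmp = tempfile.TemporaryFile(mode="w+")
            tmp.writelines(chunk)
            tmp.seek(0)
            files.append(tmp)
    return files

def count_unique(path):
    """k-way merge of the sorted chunks, counting distinct lines in one pass."""
    count = 0
    prev = None
    for line in heapq.merge(*sorted_chunk_files(path)):
        if line != prev:
            count += 1
            prev = line
    return count

print(count_unique("ips.txt"))             # file name is an assumption
```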
If there are also restrictions on disk space and the data does not fit in memory, you can use a Bloom filter. Size it to however much memory you have and use several hash functions. Then, in a single pass, check whether each address read is already in the filter; if not, add it and increment the counter. But this is a probabilistic method, and it may miss some addresses due to false positives in the Bloom filter.
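A rough sketch of the Bloom-filter variant (the filter size, number of hash functions, and use of hashlib for double hashing are assumptions; the result is a lower bound, since false positives can skip genuinely new addresses):

```python
import hashlib

M = 8 * 2 ** 30            # bits in the filter (1 GiB here); tune to available RAM
K = 7                      # number of hash functions

bits = bytearray(M // 8)

def positions(item):
    """Derive K bit positions from two independent hashes (double hashing)."""
    digest = hashlib.sha256(item.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return [(h1 + i * h2) % M for i in range(K)]

count = 0
with open("ips.txt") as f:                 # assumed: one IP per line
    for line in f:
        pos = positions(line.strip())
        if not all(bits[p // 8] & (1 << (p % 8)) for p in pos):
            # Probably not seen before: set its bits and count it.
            for p in pos:
                bits[p // 8] |= 1 << (p % 8)
            count += 1
print(count)
```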