Counting unique values with minimal error?
There are files containing data on almost a billion parts, and each part has a set of characteristics. Within a relatively short time (about 10 seconds) and with minimal error, I need to determine the number of parts that have certain characteristics (include/exclude filters). The characteristics can be represented as IDs.
Are there any known approaches to solving such problems?
I tried solving this with Spark and its cache, but that solution turned out to be too slow.
"There are files"
You just need to load them into a database and index all the necessary fields.
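A minimal sketch of that approach, assuming a hypothetical table layout parts(part_id, characteristic_id) with one row per (part, characteristic) pair; the table, column, and ID values here are illustrative, not from the original answer:

```python
import sqlite3

conn = sqlite3.connect("parts.db")
cur = conn.cursor()

# Index the filter column so queries seek instead of scanning the whole table.
cur.execute("CREATE INDEX IF NOT EXISTS idx_char ON parts(characteristic_id)")

# Count parts that have characteristic 42 but not characteristic 7.
cur.execute("""
    SELECT COUNT(DISTINCT part_id)
    FROM parts
    WHERE characteristic_id = 42
      AND part_id NOT IN (
          SELECT part_id FROM parts WHERE characteristic_id = 7
      )
""")
print(cur.fetchone()[0])
```

Whether a general-purpose database hits the 10-second budget on a billion rows depends heavily on the engine and hardware, so this is the baseline to measure, not a guarantee.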
It all depends on the storage format/representation of this data. You want a custom format of your own: compact (to reduce memory traffic) and convenient only for fast scanning (a single pass over all records), and for nothing else. I would write it with Cython or Numba over a compact NumPy data representation. With such a large number of small records and an essentially trivial processing algorithm, the main performance bottleneck is not the CPU but access to RAM, so little depends on how "clever" the counting algorithm itself is (what tricks could there be?); the compactness of the data structure (even at the cost of less convenient access to it) will play the decisive role.
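A minimal sketch of that idea, assuming each part's characteristics fit into a 64-bit mask (i.e. at most 64 distinct characteristic IDs); the function and variable names are illustrative, not from the original answer:

```python
import numpy as np
from numba import njit

@njit(cache=True)
def count_matching(masks, include, exclude):
    # One linear pass over a compact uint64 array: the work per record is a
    # couple of bitwise ops, so memory bandwidth is the limiting factor.
    n = 0
    for m in masks:
        if (m & include) == include and (m & exclude) == 0:
            n += 1
    return n

# Illustrative data: one packed bitmask per part (8 bytes per record).
masks = np.random.randint(0, 2**63, size=10_000_000, dtype=np.uint64)

# Parts that have characteristics 0 and 2 but not characteristic 1.
print(count_matching(masks, np.uint64(0b101), np.uint64(0b010)))
```

At 8 bytes per part, a billion parts is about 8 GB, so a full scan at typical RAM bandwidth is plausibly within the stated 10-second budget; with more than 64 characteristics you would need several mask words per part or a different packing.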
Take a look at HyperLogLog: it lets you estimate the number of distinct elements in a stream.
https://m.habrahabr.ru/post/119852/
There is an implementation in Redis.
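A minimal sketch of the Redis HyperLogLog commands via redis-py; the key names and IDs are illustrative. One HLL per characteristic gives approximate distinct counts (Redis documents a standard error of about 0.81% at up to 12 KB per key):

```python
import redis

r = redis.Redis()

# While scanning the files, feed part IDs into one HyperLogLog per
# characteristic (illustrative key naming scheme).
r.pfadd("hll:char:42", "part-1", "part-2", "part-3")
r.pfadd("hll:char:7", "part-2")

# Approximate number of distinct parts with characteristic 42.
print(r.pfcount("hll:char:42"))

# PFMERGE computes the union, so an OR over several include-characteristics
# is cheap; note that plain HLLs cannot express exclusions directly.
r.pfmerge("hll:union", "hll:char:42", "hll:char:7")
print(r.pfcount("hll:union"))
```

The union trick covers include filters; for exclude filters you would have to combine HLLs with inclusion-exclusion arithmetic or fall back to an exact structure such as bitmaps.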