Counting unique values with minimal error?
There are files containing data on almost a billion parts, and each part has a set of characteristics. Within a relatively short time (about 10 seconds) and with minimal error, I need to determine the number of parts that have certain characteristics (include/exclude filters). The characteristics can be represented as IDs.
Are there any known approaches to solving such problems?
I tried solving this with Spark and its cache, but that solution turned out to be too slow.
"There are files"
You just need to load them into a database and index all the necessary fields.
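A minimal sketch of that approach, assuming a hypothetical table layout parts(part_id, characteristic_id) with one row per (part, characteristic) pair; the table, column, and ID values here are illustrative, not from the original answer:

```python
import sqlite3

conn = sqlite3.connect("parts.db")
cur = conn.cursor()

# Index the filter column so queries seek instead of scanning the whole table.
cur.execute("CREATE INDEX IF NOT EXISTS idx_char ON parts(characteristic_id)")

# Count parts that have characteristic 42 but not characteristic 7.
cur.execute("""
    SELECT COUNT(DISTINCT part_id)
    FROM parts
    WHERE characteristic_id = 42
      AND part_id NOT IN (
          SELECT part_id FROM parts WHERE characteristic_id = 7
      )
""")
print(cur.fetchone()[0])
```

Whether a general-purpose database hits the 10-second budget on a billion rows depends heavily on the engine and hardware, so this is the baseline to measure, not a guarantee.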
It all depends on the storage format/representation of this data. You want a custom format of your own: compact (to reduce memory traffic) and convenient only for fast scanning (a single pass over all records), and for nothing else. I would write it with Cython or Numba over a compact NumPy data representation. With such a large number of small records and an essentially trivial processing algorithm, the main performance bottleneck is not the CPU but access to RAM, so little depends on how "clever" the counting algorithm itself is (what tricks could there be?); the compactness of the data structure (even at the cost of less convenient access to it) will play the decisive role.
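A minimal sketch of that idea, assuming each part's characteristics fit into a 64-bit mask (i.e. at most 64 distinct characteristic IDs); the function and variable names are illustrative, not from the original answer:

```python
import numpy as np
from numba import njit

@njit(cache=True)
def count_matching(masks, include, exclude):
    # One linear pass over a compact uint64 array: the work per record is a
    # couple of bitwise ops, so memory bandwidth is the limiting factor.
    n = 0
    for m in masks:
        if (m & include) == include and (m & exclude) == 0:
            n += 1
    return n

# Illustrative data: one packed bitmask per part (8 bytes per record).
masks = np.random.randint(0, 2**63, size=10_000_000, dtype=np.uint64)

# Parts that have characteristics 0 and 2 but not characteristic 1.
print(count_matching(masks, np.uint64(0b101), np.uint64(0b010)))
```

At 8 bytes per part, a billion parts is about 8 GB, so a full scan at typical RAM bandwidth is plausibly within the stated 10-second budget; with more than 64 characteristics you would need several mask words per part or a different packing.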
Take a look at HyperLogLog: it lets you estimate the number of distinct elements in a stream.
https://m.habrahabr.ru/post/119852/
There is an implementation in Redis.
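A minimal sketch of the Redis HyperLogLog commands via redis-py; the key names and IDs are illustrative. One HLL per characteristic gives approximate distinct counts (Redis documents a standard error of about 0.81% at up to 12 KB per key):

```python
import redis

r = redis.Redis()

# While scanning the files, feed part IDs into one HyperLogLog per
# characteristic (illustrative key naming scheme).
r.pfadd("hll:char:42", "part-1", "part-2", "part-3")
r.pfadd("hll:char:7", "part-2")

# Approximate number of distinct parts with characteristic 42.
print(r.pfcount("hll:char:42"))

# PFMERGE computes the union, so an OR over several include-characteristics
# is cheap; note that plain HLLs cannot express exclusions directly.
r.pfmerge("hll:union", "hll:char:42", "hll:char:7")
print(r.pfcount("hll:union"))
```

The union trick covers include filters; for exclude filters you would have to combine HLLs with inclusion-exclusion arithmetic or fall back to an exact structure such as bitmaps.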