What is the best way to store a large array of numbers on disk with very fast lookups?
Imagine that there is an array like:
{"1": [1,5,11,15,22,34...], "5": [55,44,22,67]}
That is, there are many keys (1, 5, 1005, ...), each key holds about 20 million numbers, and the total number of numbers can reach 800 million.
This data needs to be stored permanently (so a purely in-RAM option is out), and at the same time I need a fast way to check whether a given number occurs in these huge arrays (as I understand it, indexing should help). That is, I should be able to quickly find out whether the number 505050 is present among 20 million numbers.
What solution would you recommend?
A MySQL table is currently being used, but it is quite slow and already shows poor performance. Redis is faster, but at just 40 million numbers it already takes 3.5 GB of RAM, and we haven't even reached 100 million...
As I understand it, there must be faster, more specialized databases suited to this rather narrow task, but a session of googling in English did not help me much.
Of course, it somehow needs to work with PHP (so if there is some kind of library for working with such a storage, that would be a huge plus) :-D
Could you advise in which direction to look for a solution? I would be grateful for the names of storage systems that would help solve this problem. The priority is really lookup speed, since there will be few writes but a lot of lookup operations. About 10 GB of RAM can be allocated to the task (but, as I said, by my calculations Redis cannot fit into that).
Again, let's take Redis as an example. There is a big_array1 key that stores 20 million numbers as a SET.
The most frequent query we run is:
sismember big_array1 1234567
That is, we find out whether the number 1234567 is stored in big_array1.
This works fine in Redis, but it takes too much RAM.
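For context, in PHP the current approach looks roughly like this (a minimal sketch, assuming the phpredis extension; the key and value come from the example above):

<?php
// Minimal sketch of the current approach, assuming the phpredis extension.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Writes: add numbers to the set.
$redis->sAdd('big_array1', 1234567);

// Hot path: the same check as "sismember big_array1 1234567".
var_dump($redis->sIsMember('big_array1', 1234567)); // bool(true)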
If false positives are acceptable (say, with a probability of 0.1%), then you can use a Bloom filter. That way you can fit 800 million values into about 1.34 GB. The smaller the error rate, the more space is needed.
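As a sanity check of that figure: the standard sizing formula m = -n * ln(p) / (ln 2)^2 gives about 14.4 bits per element at p = 0.001, so 800 million values come out to roughly 1.34 GiB, matching the number above. One possible way to use a Bloom filter from PHP is through the RedisBloom module; this is only a sketch under the assumption that the module is loaded and phpredis is used, and the key name is made up:

<?php
// Sketch: requires a Redis server with the RedisBloom module and the phpredis extension.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Reserve a filter for ~800 million elements with a 0.1% false-positive rate.
$redis->rawCommand('BF.RESERVE', 'numbers_bloom', '0.001', '800000000');

// Writes: add numbers to the filter.
$redis->rawCommand('BF.ADD', 'numbers_bloom', '1234567');

// Hot path: 0 means definitely absent, 1 means present with a 0.1% chance of a false positive.
var_dump($redis->rawCommand('BF.EXISTS', 'numbers_bloom', '1234567'));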
If you stay with a relational database and insertion speed is not critical, you can try to reorganize the table: turn rows of the form
{"1": [1,5,11,15,22,34...], "5": [55,44,22, 67]}
into the form {1: ["1"], 5: ["1"], ..., 22: ["1", "5"]}, i.e. the value to check becomes the key and the former keys become the values - swap the keys and the values.
Build a single index on that column.
This consumes fewer resources and speeds up the search, though access by the former entry keys ("1", "5", ...) becomes harder.
You can also try creating two tables: the first holds just the numbers to be searched (lookup cost is O(1) plus the height of the index tree), and the second holds the entry keys with a foreign key to the first.
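A rough sketch of that two-table variant for MySQL via PDO (table and column names are illustrative, not part of the original answer):

<?php
// Sketch of the inverted two-table layout; all names are illustrative.
$pdo = new PDO('mysql:host=127.0.0.1;dbname=test;charset=utf8mb4', 'user', 'password');

// Table 1: one row per distinct number; the primary key doubles as the lookup index.
$pdo->exec('CREATE TABLE IF NOT EXISTS numbers (
    num BIGINT UNSIGNED NOT NULL PRIMARY KEY
) ENGINE=InnoDB');

// Table 2: which original entry keys ("1", "5", ...) each number belongs to.
$pdo->exec('CREATE TABLE IF NOT EXISTS number_keys (
    num BIGINT UNSIGNED NOT NULL,
    entry_key INT UNSIGNED NOT NULL,
    PRIMARY KEY (num, entry_key),
    FOREIGN KEY (num) REFERENCES numbers (num)
) ENGINE=InnoDB');

// Hot path 1: does the number exist at all? A single primary-key lookup.
$stmt = $pdo->prepare('SELECT 1 FROM numbers WHERE num = ? LIMIT 1');
$stmt->execute([505050]);
var_dump((bool) $stmt->fetchColumn());

// Hot path 2: is the number inside a specific original key (e.g. key "1")?
$stmt = $pdo->prepare('SELECT 1 FROM number_keys WHERE num = ? AND entry_key = ? LIMIT 1');
$stmt->execute([505050, 1]);
var_dump((bool) $stmt->fetchColumn());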