Storage method for 2M records
The task is as follows: you need to store about 2 million distinct records of the form “domain name; 3-byte number”, and it must be possible to perform the operation “get N random (but distinct) records” with N ≤ 2 million.
It is desirable that the storage be easily accessible from PHP and Python. The storage should consume as little RAM as possible and be fast.
What would you suggest to use in such a case?
That is, I take it the data won't be refilled? If not, you can roll your own solution based on flat files with fixed-length fields: that gives fast access with an ordinary seek(). If necessary, [s]sprinkle with salt[/s] split the records into groups of N and store each group in a separate file named by (id / N), so the file system will partially solve the random-access problem. Developing the idea further, you can spread the files across folders (squid stores its cache this way, for example).
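A minimal Python sketch of this approach, assuming records are padded to a fixed length; the file name and record size here are illustrative:

import os
import random

RECORD_LEN = 64              # assumed fixed record length (padded domain + 3-byte number)
DATA_FILE = "records.dat"    # hypothetical flat file of fixed-length records

def get_random_records(n):
    """Return n distinct random records by seeking to fixed offsets."""
    total = os.path.getsize(DATA_FILE) // RECORD_LEN
    indexes = random.sample(range(total), n)   # n distinct record indexes
    out = []
    with open(DATA_FILE, "rb") as f:
        for i in indexes:
            f.seek(i * RECORD_LEN)             # O(1) jump straight to the record
            out.append(f.read(RECORD_LEN).rstrip(b"\x00"))
    return out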
And if the data will change regularly, then you won't come up with anything better than SQL. 2 million records is not that many, especially since you don't need to select by key.
“The storage should consume as little RAM as possible and be fast.”

If “fast”, then all the data must be in RAM. How much will that take? Assume the average domain name is 30 characters (surely an overestimate). Stored as plain text, that is 30 bytes, plus 3 bytes for the number and 1 byte for the string length, for a total of 34 bytes per record. Two million records then need about 65 MB, and that much RAM will have to be allocated for fast operation. So all we can do is minimize the memory overhead imposed by whichever “engine” we choose for the storage. Hence the conclusion: the best storage is a simple array loaded from a plain text file. We generate a sequence of pseudo-random numbers and select from the array by index.
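To illustrate, a Python sketch of the array approach (the file name and one-record-per-line format are assumptions); random.sample already yields distinct elements:

import random

# Load the whole dataset into a plain list once at startup,
# e.g. one "domain;number" record per line.
with open("records.txt", encoding="ascii") as f:
    records = f.read().splitlines()

def pick(n):
    """Return n distinct random records by sampling the list."""
    return random.sample(records, n)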
In fact, the task breaks down into several smaller subtasks:
1) storing 2 million domain names;
2) generating a non-repeating sequence of 2 million pseudo-random numbers.
One more clarification is also needed:
must the “domain names” in the store themselves be unique, or can they repeat?
In other words, can two or more three-byte numbers correspond to the same domain name at once?
And are “blank” values allowed for a domain name?
The answers to these questions greatly affect the memory footprint.
Balance speed against space.
To the first answer one can add: store the records compactly within each individual file, with variable-length fields.
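For example, a length-prefixed record layout in Python; a sketch under the assumption of ASCII domain names up to 255 bytes, with illustrative function names:

import struct

def pack_record(domain: str, value: int) -> bytes:
    """1-byte length prefix, domain bytes, then the 3-byte big-endian number."""
    d = domain.encode("ascii")
    return struct.pack("B", len(d)) + d + value.to_bytes(3, "big")

def unpack_records(blob: bytes):
    """Iterate over records packed back to back in a byte string."""
    pos = 0
    while pos < len(blob):
        dlen = blob[pos]
        domain = blob[pos + 1 : pos + 1 + dlen].decode("ascii")
        value = int.from_bytes(blob[pos + 1 + dlen : pos + 4 + dlen], "big")
        yield domain, value
        pos += dlen + 4

The trade-off is that variable-length records lose O(1) seeking, which is why grouping them into small per-file buckets, as suggested above, matters.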
take any SQL database and don't suffer.
Do you need to reinvent the wheel or solve a problem?
2M short records is nothing: I have 6M rows in one table with 30 columns, and it works with a bang.
Of course, you can easily implement this on files right now. But think a little ahead: you will surely need to add something else, some extra feature, even just a registration date. Then what? Rewrite everything, convert all the files? Why?
All this has already been done for you by the database developers.
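As a sketch of how little ceremony this takes, here is a possible schema in SQLite via Python's built-in sqlite3 module (table and column names are illustrative; in MySQL the number would fit a 3-byte MEDIUMINT UNSIGNED):

import sqlite3

con = sqlite3.connect("records.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS records (
        id     INTEGER PRIMARY KEY,
        domain TEXT NOT NULL,
        value  INTEGER NOT NULL   -- the 3-byte number, 0..16777215
    )
""")
con.execute("INSERT INTO records (domain, value) VALUES (?, ?)",
            ("example.com", 12345))
con.commit()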
I don't know about other databases, but the only way I know of to get a random sample in MySQL is:
SELECT * FROM table ORDER BY rand() LIMIT N;
This is _very_ slow because the entire table is scanned ("Using temporary; Using filesort" with all the consequences).
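One common workaround, not from this thread: pick random ids in application code and fetch only those rows. A Python/sqlite3 sketch, reusing the hypothetical records table above and assuming a dense id column from 1 to total with no gaps:

import random
import sqlite3

def sample_rows(con: sqlite3.Connection, total: int, n: int):
    """Fetch n rows by n distinct random ids instead of ORDER BY rand()."""
    ids = random.sample(range(1, total + 1), n)
    cur = con.cursor()
    return [cur.execute("SELECT domain, value FROM records WHERE id = ?",
                        (i,)).fetchone()
            for i in ids]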
Unfortunately, a “simple” array that is reloaded from a file on every script run is the slowest of all possible solutions, since most of the time goes to parsing the file contents and building the dynamic array structure in memory.