Is there a "database" in which it would be possible to sample N random records from a set filtered by conditions?
Can you please tell me whether there is a "database" in which it is possible to select N random records from a set filtered according to given conditions?
Pseudocode:
SELECT id, name, param1, param2
FROM table
WHERE param1 = some_value
AND param2 > another_one
ORDER BY RANDOM()
LIMIT @N
If you need an honest selection of truly random, evenly distributed 10,000 rows from a five-million-row table, then nothing will handle this task better than SQL, except perhaps a self-written database tailored specifically to this case. You just need to put indexes on the relevant fields to speed up the search.
My advice: add memory to the server and index the fields. Everything will be fine; use the standard RAND() and optimize the query until it runs in an acceptable time.
For rare cases it also helps to select an array of IDs by the condition and then pick the required number of random IDs from that array.
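A rough MySQL sketch of that idea, done entirely in SQL instead of pulling the IDs into the application (assumptions: the table and column names come from the question's pseudocode, renamed to table_name because table is a reserved word; 10,000 is the figure mentioned above; and there is an index covering param1, param2, id so the inner query can work from the index alone):

SELECT t.id, t.name, t.param1, t.param2
FROM table_name AS t
JOIN (
    -- pick random ids among the rows that match the filter;
    -- only ids are sorted here, not full rows
    SELECT id
    FROM table_name
    WHERE param1 = some_value
      AND param2 > another_one
    ORDER BY RAND()
    LIMIT 10000
) AS sample USING (id);

The outer join then fetches the full rows only for the sampled ids, which is usually cheaper than sorting the entire filtered result set with its wide rows.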
And why can't you just make the selection first (filter by condition) and then pick from it as many random records as you need?
Maybe this article will be useful: akinas.com/pages/en/blog/mysql_random_row
Then, it seems to me, the simplest, fastest, and most correct solution is to write your own application (or a service, if that is more convenient) that stores all this data in a raw file and produces the desired list in one pass. Such requests could then be executed "relatively quickly", in less than a minute, regardless of how many requests are running at the same time.
And the selection will be as fair as possible. In general, you can treat it as an interesting interview problem :)
Then, given that queries can be anything, you will have to read the entire table for every query, if only to apply the filter (whether you use a self-written database, Oracle, or MySQL), i.e. read 5 GB, which is the slowest point in the whole system. Taking the sequential read speed (random access plus an SSD will not help here, it seems to me), the read time is about 5 GB / 150 MB/s, i.e. roughly 30 seconds.
If we assume that records rarely change, the data can additionally be compressed (depending on the real data) by storing variable-length records; perhaps another 15% (or more) can be squeezed out, bringing it down to about 25 seconds. You can apply RAID and speed things up another two times or so (or more, depending on the implementation), giving about 15 seconds per request. No standard caches help with sequential reads, but you can try to simply fill all the free memory with chunks of the data and read those chunks from RAM instead of disk; that is another 2 GB or so held in memory and, at a glance, a saving of 30-40 percent. All in all, it seems to me that about 10 seconds is a fairly good guaranteed and achievable result for almost any filter and query. If the filtering criteria are known in advance, you can try to optimize differently.
I don't know how to do it in MySQL, but I solved this problem in SQL Server. The idea is this: an extra field is added to the query in which a unique-value function is calculated, and that field is used in the ORDER BY, whatever your selection conditions are. Then take TOP 10000, and that's it. True, I don't know how fast it will run, although I think 10,000 rows is not much for a query in SQL Server...
Example
SELECT TOP 10000 [StudentKID]
      ,[SKPersRecN]
      ,[SKSurname]
      ,[SKName]
      ,[SKSecondName]
      ,[FacultyID]
      ,NEWID() AS t  -- a unique random value per row, used as the sort key
FROM [StudentsKredit]
WHERE FacultyID = 'IUB'
ORDER BY t
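For comparison, a sketch of what the same trick would presumably look like in MySQL, reusing the table and columns from the example above; here RAND() plays the role of NEWID() as the random sort key, and LIMIT replaces TOP:

SELECT StudentKID, SKPersRecN, SKSurname, SKName, SKSecondName, FacultyID
FROM StudentsKredit
WHERE FacultyID = 'IUB'
ORDER BY RAND()   -- RAND() is evaluated per row, so the sort order is random
LIMIT 10000;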