ORDER BY `probability`?
Good afternoon, Habr!
The table holds about ten thousand records. I need to select N records at random, but so that the probability of any particular record landing in the sample follows a formula that uses one of the record's fields as a variable.
For example: the records are restaurant addresses. Restaurants flagged "there is always a long queue for the toilet" should land in the sample, on average, half as often as all the others.
If all records were "equal", the probability for each would be 1 divided by the number of records in the table. But once a record's probability is multiplied by a coefficient, the rule "the probabilities of all records sum to 100%" stops holding, and that is bad: you can no longer use the computed probability to predict, from request statistics, how many times each record will be sampled per day.
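The usual fix for the "sum is no longer 100%" problem is renormalization: give each record a weight (its coefficient), then divide every weight by the total so the probabilities sum to 1 again. A minimal Python sketch (field names and the 0.5 coefficient are just the example from the question):

```python
# Renormalizing per-record weights so the probabilities sum to 1.
# Rows with the queue flag get weight 0.5 (shown half as often);
# the data and field names here are illustrative.
records = [
    {"id": 1, "has_queue": False},
    {"id": 2, "has_queue": True},
    {"id": 3, "has_queue": False},
]
weights = [0.5 if r["has_queue"] else 1.0 for r in records]
total = sum(weights)
probs = [w / total for w in weights]  # now the probabilities sum to 1.0
```

With these normalized probabilities you can again predict expected daily counts as `probs[i] * requests_per_day`.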
Where should I dig? The model is extremely common, both in ad networks (banners with higher CTR are shown more often) and in games (cheaper loot drops from monsters more often).
For the first time I seriously feel the lack of a mathematical education while doing web programming =(
I would add an extra field to the table holding a computed coefficient (with a random component mixed in) and sort by it. To keep the sort random, recalculate the field every n minutes.
I don't know the right way to do it, but here's what I came up with:
Use the standard random.
As mentioned above, assign each record a rating using whatever formula you like.
Then join the table to itself several times so that records are duplicated, and for each successive copy require an increasing minimum rating in the join condition. Records with a higher rating will have more duplicates, and therefore a higher chance of being picked by random.
I'm not sure I've conveyed the idea; if you reply, I'll try to describe it in more detail with examples.
In general, I usually ask database questions on sql.ru; there are more specialized experts there.
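If it helps, the duplication idea can be sketched outside SQL: replicating each record in proportion to its (integer) rating and then picking uniformly has the same effect as the self-joins. A toy Python sketch (the data is made up):

```python
import random

# Each record appears 'rating' times in the pool, so a uniform pick
# selects it with probability rating / sum_of_ratings.
# The ids and integer ratings below are illustrative.
records = {"a": 1, "b": 2, "c": 3}  # id -> integer rating
pool = [rid for rid, rating in records.items() for _ in range(rating)]
chosen = random.choice(pool)  # "c" is three times as likely as "a"
```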
A stupid solution, if there are few records and a full scan on every select doesn't scare you:
Store in each record:
- its own priority (any positive number)
- the sum of the priorities of all preceding elements (in insertion order, e.g. by id)
And store the sum of all priorities somewhere (in fact, it can be derived from the record with the maximum id).
You could normalize, but then fast inserts become impossible.
A select reduces to generating several random numbers in the range [0, sum_of_all_priorities) and picking the elements whose range [sum, sum + own_priority) contains at least one of the generated numbers.
Insert: simply append at the end, updating the total priority sum.
Delete: simply remove the record.
Update (priority change): delete + insert if the priority increased; a plain update if it decreased.
There is an unpleasant side effect: a select may return fewer elements than requested if several numbers fall into the same range, or into a "hole" left by a removed element.
You can either select with a small margin up front, or select more as needed.
Periodically (after many deletions) it makes sense to rebuild the priorities.
In total, every operation is amortized O(1), except the select, which is sad.
To optimize the select you could use a spatial index (I don't know about support in modern databases), since the query is essentially "does a point belong to a segment".
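The select step described above, sketched in Python under the assumption of integer priorities (the data is illustrative):

```python
import random

# Each record stores (id, priority, start), where start is the sum of
# the priorities of all earlier records, exactly as described above.
records = [("a", 5), ("b", 1), ("c", 4)]  # (id, priority), illustrative
table, start = [], 0
for rid, prio in records:
    table.append((rid, prio, start))  # start = running prefix sum
    start += prio
total = start  # sum of all priorities (here 10)

def select(k):
    """Pick up to k records; may return fewer on range collisions."""
    points = [random.randrange(total) for _ in range(k)]
    # full scan: keep records whose [start, start + priority) range
    # contains at least one generated point
    return [rid for rid, prio, s in table
            if any(s <= p < s + prio for p in points)]
```

With contiguous ranges (no deletions yet), a single-point select always hits exactly one record; the "fewer than k" effect appears only with multiple points or holes.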
An honest and fast, but complicated, solution:
Build a tree whose leaves hold ids from the table and their priorities, and whose intermediate nodes hold the sums of the priorities in their subtrees.
Selecting one element (a binary tree is assumed below for simplicity):
Generate a number from 0 to the total priority sum (stored in the root) and walk down from the root:
- if the number is less than the priority sum of the left subtree, go left;
- otherwise go right, subtracting the left subtree's priority sum from the number;
- on reaching a leaf, return its id.
Update/insert/delete are ordinary tree operations, updating the priority sums in all intermediate vertices.
(You need a fast way to find a leaf by its id. You can keep the tree sorted by id, or skip the bother and use a hash; then the order of leaves in the tree is arbitrary, which simplifies everything a lot: no rebalancing is needed, and you can insert in place of any removed node.)
Every operation costs O(log N); a sample costs O(K log N), where K is the sample size.
There is also the problem that one element can be selected several times; to get around it, you can delete selected elements right after selection (or just zero their priority) and reinsert them at the very end.
How to store all this?
If everything fits into memory, great: run a separate daemon that builds the tree from the original table at startup, and off you go.
If not, then either in the database (though the efficiency of tree operations on relational databases is a big question) or in files, i.e. effectively writing your own database engine...
But that is obvious overkill =) there are probably ready-made solutions, everyone does this already.
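For what it's worth, the descent described above can be sketched with a flat array-based segment tree. This minimal layout assumes the number of leaves is a power of two; the ids and weights are made up:

```python
import random

# Binary tree over priorities: leaves hold the weights, each internal
# node holds the sum of its subtree.  Flat-array layout: node i has
# children 2i and 2i+1, leaves live at indices [n, 2n).
ids     = ["a", "b", "c", "d"]
weights = [ 1,   3,   4,   2 ]  # illustrative priorities

n = len(weights)                 # assumed to be a power of two here
tree = [0] * (2 * n)
tree[n:] = weights
for i in range(n - 1, 0, -1):    # internal node = sum of its children
    tree[i] = tree[2 * i] + tree[2 * i + 1]

def pick():
    """One O(log N) weighted pick: descend from the root."""
    x = random.uniform(0, tree[1])   # tree[1] is the total priority sum
    i = 1
    while i < n:                     # go left if x falls in left subtree
        if x < tree[2 * i]:
            i = 2 * i
        else:
            x -= tree[2 * i]         # otherwise skip past the left sum
            i = 2 * i + 1
    return ids[i - n]
```

Updating a weight means changing one leaf and the sums on its root path, which matches the O(log N) claim above.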
Offhand, this comes to mind:
You have 10 thousand records and need to show, say, ten at random.
Restaurants with a long queue should be shown, say, half as often as the rest.
Make a selection: randomly pull 20 restaurants with a long queue and twice as many (40) without one.
Then pick 10 of those 60 at random.
To each record we add a field with its weight (for example, a ves field). We then select records schematically as follows:
select top 10 *
from [table]
where rand() < ves / sum(ves)
where sum(ves) is the sum of the weights over the whole table.
A record with weight 1 will land in the sample roughly three times less often than a record with weight 3.
10,000 rows is very little; you can just fetch everything and do the weighting in your favorite language.
You can precompute the result in advance via cron and show only the cached result.
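If everything is pulled into memory, Python's standard library already does weighted selection via random.choices (note it samples with replacement); a sketch with made-up data:

```python
import random

# Weighted in-memory sampling: rows with the queue flag get half weight.
# The table contents here are illustrative.
rows = [{"id": i, "queue": i % 3 == 0} for i in range(10_000)]
weights = [0.5 if r["queue"] else 1.0 for r in rows]

# k picks, with replacement; each row's chance is weight / sum(weights)
picked = random.choices(rows, weights=weights, k=10)
```

For ten distinct rows you would need to draw repeatedly and drop duplicates, since random.choices can return the same row twice.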
You could try Sphinx and tweak its ranking system to suit your needs.
http://habrahabr.ru/blogs/sphinx/62287/
For starters, a crutch of this form is working now:
SELECT
*,
(`flag` + RAND() * 0.33) AS probability /* powered by gypsy magic (a bigger RAND() coefficient gives a weaker effect) */
FROM
`table`
ORDER BY
`probability` DESC
But, firstly, it is wildly inelegant, and secondly, it does not let me compute probabilities for all records that total 100% and build forecasts on those probabilities.
Another crude option: select not N records but N*coeff from the table (2N in your example). In that sample every restaurant has an equal chance of appearing. Then, knowing the coefficient's value, pull the records you need from this sample in the right proportion...
It is better to do this kind of thing not in the database but with post-sorting mechanisms,
or, as mentioned above, fold the notorious toilets into a rating, assigning them a certain "weight" so the selection comes out the way you need.
In the general case, unsolvable =)
Consider the degenerate case: the table contains exactly N records (the same N you must select). Then every record is selected with probability 1.0, entirely regardless of the given weights.
The first thing that comes to mind with this formulation of the problem is to choose an appropriate distribution function and number the restaurants so that the ones that should be shown more often have a higher probability of occurrence.
I just don't know how a hand-rolled random generator would be bolted onto the database...
1. Determine the maximum range of rand(); let it be [0..M] (most likely M = 1).
2. Let N be the number of records.
3. For each record, introduce a rating R.
4. Compute the sum of all ratings: S = Sum(R).
5. Give each row two fields, D1 and D2, for the start and end of its range.
6. Going from the first record to the last, split up the range:
D1(1) = 0; D2(1) = M*(R(1) / S)
D1(2) = D2(1); D2(2) = D1(2) + M*(R(2) / S)
...
D1(i) = D2(i-1); D2(i) = D1(i) + M*(R(i) / S)
7. And then:
select top 10 * from [table]
order by case when rand() between D1 and D2 then 0 else 1 end
Another quick-and-dirty solution.
Let the record weights be x_1, x_2, ..., x_n.
For each record add "start" and "end" fields. The i-th record starts at x_1+...+x_{i-1} and ends at x_1+...+x_i - 1. Now generate a random number R from 0 to x_1+...+x_n - 1 and select the record with start <= R <= end. Repeat as many times as needed.
You will have to check for duplicates, but unless some records heavily outweigh the rest, they will be quite rare.
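The start/end bookkeeping above maps directly onto a cumulative-sum array plus binary search; a Python sketch with made-up weights:

```python
import bisect
import random
from itertools import accumulate

# Cumulative ends: record i covers integers [cum[i-1], cum[i] - 1],
# matching the start/end fields described above.  Weights illustrative.
weights = [3, 1, 6]              # x_1 .. x_n
cum = list(accumulate(weights))  # [3, 4, 10]

def draw():
    """One weighted draw: R in [0, x_1+...+x_n - 1], binary search."""
    r = random.randrange(cum[-1])
    return bisect.bisect_right(cum, r)  # index of the record covering r
```

Here record 0 is drawn with probability 3/10, record 1 with 1/10, record 2 with 6/10.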