Answer the question
In order to leave comments, you need to log in
Sort by similarity?
Hey!
There is a base, let's say a million people. Each can have up to 1000 in theory and up to 100 real parameters in 0/1 format. That is, either the user has something, or not. It is necessary to make a quick sort in the search results by similarity to the initiator of the search request (which also has these parameters). For example, the "finder" has a, c, d, f, k, m, r, s, y, z from the alphabet, you need to sort the results so that the first one is the one with the maximum number of similar letters.
It is clear that comparing the author with each of the million in the search results any database will be bent, so I'm looking for smarter ways to do this.
Engine on Rails, MySQL database, search apparently on Sphinx.
The lamer himself is in these matters, so if the question is completely stupid - just poke where to read, you will be rewarded :)
Answer the question
In order to leave comments, you need to log in
Suggestion 1 : Since the movie fields are bits, it makes sense to store them as numbers. The maximum integer in mysql is an 8 byte BIGINT. That is, if there are a thousand films in total, then one and a half to two dozen such numbers will be required in each entry. Let N be the number of such numbers -1, userF0, ..., userFN - these numbers in the record of the selected user. Then searching for 10 similar users in a table with fields (user_id, f0, ..., fN) will look like this:
SELECT user_id FROM таблица
ORDER BY (BIT_COUNT(f0 & userF0) + ... + BIT_COUNT(fN & userFN)) DESC LIMIT 10;
Disadvantages of the approach: when querying, all records will be run, when adding new films, you need to call ALTER TABLE. I can't vouch for speed either. SELECT tmp.user_id FROM (SELECT user_id FROM таблица1
UNION ALL
SELECT user_id FROM таблица2
UNION ALL
...
UNION ALL
SELECT user_id FROM таблицаN) AS tmp
GROUP BY tmp.user_id ORDER BY COUNT(tmp.*) DESC LIMIT 10;
Cons of the approach: a large number of subqueries, grouping. SELECT user_id FROM user_films
WHERE film_id IN (film_id0, ..., film_idN)
GROUP BY user_id ORDER BY COUNT(*) DESC LIMIT 10;
Cons of the approach: grouping. Didn't find what you were looking for?
Ask your questionAsk a Question
731 491 924 answers to any question