O
O
overbeat2011-03-23 17:19:06
MySQL
overbeat, 2011-03-23 17:19:06

Sort by similarity?

Hey!
There is a base, let's say a million people. Each can have up to 1000 in theory and up to 100 real parameters in 0/1 format. That is, either the user has something, or not. It is necessary to make a quick sort in the search results by similarity to the initiator of the search request (which also has these parameters). For example, the "finder" has a, c, d, f, k, m, r, s, y, z from the alphabet, you need to sort the results so that the first one is the one with the maximum number of similar letters.
It is clear that comparing the author with each of the million in the search results any database will be bent, so I'm looking for smarter ways to do this.
Engine on Rails, MySQL database, search apparently on Sphinx.
The lamer himself is in these matters, so if the question is completely stupid - just poke where to read, you will be rewarded :)

Answer the question

In order to leave comments, you need to log in

2 answer(s)
D
Dzuba, 2011-03-23
@overbeat

Suggestion 1 : Since the movie fields are bits, it makes sense to store them as numbers. The maximum integer in mysql is an 8 byte BIGINT. That is, if there are a thousand films in total, then one and a half to two dozen such numbers will be required in each entry. Let N be the number of such numbers -1, userF0, ..., userFN - these numbers in the record of the selected user. Then searching for 10 similar users in a table with fields (user_id, f0, ..., fN) will look like this:

SELECT user_id FROM таблица
ORDER BY (BIT_COUNT(f0 & userF0) + ... + BIT_COUNT(fN & userFN)) DESC LIMIT 10;
Disadvantages of the approach: when querying, all records will be run, when adding new films, you need to call ALTER TABLE. I can't vouch for speed either.
Suggestion 2 : create 1 table with users and as many tables as there are movies, in each of which to store a list of id users who have selected a movie. Then the search for similar users will be reduced to:
SELECT tmp.user_id FROM (SELECT user_id FROM таблица1
    UNION ALL
    SELECT user_id FROM таблица2
    UNION ALL
    ...
    UNION ALL
    SELECT user_id FROM таблицаN) AS tmp
GROUP BY tmp.user_id ORDER BY COUNT(tmp.*) DESC LIMIT 10;
Cons of the approach: a large number of subqueries, grouping.
Suggestion 3 : create 1 table with users (users) and 1 table with user-films (user_films), i.e. with records about user preferences of the following type (user_id, film_id). Then for the list of films of the selected user (film_id0, ..., film_idN), the search for similar users will be reduced to:
SELECT user_id FROM user_films
WHERE film_id IN (film_id0, ..., film_idN)
GROUP BY user_id ORDER BY COUNT(*) DESC LIMIT 10;
Cons of the approach: grouping.
Although with an indexed film_id field, it may not be very slow.

A
AlexeyK, 2011-03-23
@AlexeyK

The base won't fold because it will look for Sphinx, according to what you say.
I can only suggest writing 0/1 values ​​as letters in a separate field, something like a hash "acdghjxyz" for each user, and then search using a simple string similarity algorithm.

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question