Finding similar data in MySQL?

MySQL

Aizharik, 2015-12-25 16:08:48

Good evening!
I'm looking for a better way to search for similar profiles.
Each user has their own profile.
The table has only 4 text fields:
the full name and a short "about me" text.
The rest are int fields (ranging from 0 to 3, from 0 to 2, or holding either 0 or 1),
enum('N','Y') fields,
and a profile rating from -1000 to 1000 (not displayed; calculated inside the site).
That is, the table looks like this:

name  | last name | about | location | col1 | col2 | col3 | col4 etc.
------------------------------------------------------------------------
Vasya | pupkin    | text  | loc_id   | 0    | 1    | 3    | 0 etc.

I can also pull out a certain number of profiles that the user has liked.
My first thought was to take the most popular profile in a location and build the selection around it. Then I decided this was not a very good idea. I started studying Sphinx and realized that I would need at least a few days of practice to use it properly, and that MySQL should be enough to begin with.
While writing this question, I got another idea: is it possible in MySQL to "average" the search results, dropping the less similar profiles from the result and keeping the ones that are similar to each other? Or am I digging in the wrong direction?
Any advice?
And is it worth turning away from Sphinx?

1 answer(s)

Ololesha Ololoev, 2015-01-01
@alexeygrigorev

This problem is called "k-nearest neighbors", and there are many ways to solve it.
In general, they all boil down to this:

First, define a distance function between the elements of the table (i.e. users). This can be Euclidean distance or any other measure (for example, cosine or Jaccard distance over the words in the description).
Then, for each user, find the N closest users according to this function.
Comparing every user with every other user is expensive, so indexes are often used to speed this up. At the database level this can be an R-tree or a quadtree, and at the application level, Locality Sensitive Hashing.
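The first two steps can be sketched at the application level roughly like this. It is only an illustration under assumptions: the `profiles` data, the `cols`/`about` keys and the `text_weight` parameter are made up for the example, and summing Euclidean distance over the int columns with Jaccard distance over the words of the description is just one possible choice of distance function. The search is brute force, with no index.

```python
import heapq
import math

# Hypothetical sample data: the int columns (col1..col4) and the "about" text.
profiles = {
    "vasya": {"cols": [0, 1, 3, 0], "about": "likes hiking and chess"},
    "petya": {"cols": [0, 1, 2, 0], "about": "chess and hiking fan"},
    "masha": {"cols": [3, 0, 0, 1], "about": "enjoys cooking"},
}


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_distance(text_a, text_b):
    """1 - |A & B| / |A | B| over the sets of lowercase words."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def distance(p, q, text_weight=1.0):
    """Combined distance; text_weight is an arbitrary illustrative knob."""
    numeric = euclidean(p["cols"], q["cols"])
    textual = jaccard_distance(p["about"], q["about"])
    return numeric + text_weight * textual


def nearest(user, n):
    """Brute force: compare `user` with every other profile, keep the n closest."""
    candidates = (
        (distance(profiles[user], other), name)
        for name, other in profiles.items()
        if name != user
    )
    return [name for _, name in heapq.nsmallest(n, candidates)]


print(nearest("vasya", 2))
```

The two parts of the distance live on different scales (the int columns can differ by several units, Jaccard distance never exceeds 1), so in practice the columns would probably need normalizing or weighting. Once the table grows, the full scan in `nearest` is the part that an index or Locality Sensitive Hashing would replace.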