Similarity search in relational and NoSQL databases?

P

pancakes2012-06-04 10:30:02

NoSQL

pancakes, 2012-06-04 10:30:02

Even while studying at the magistracy, I came across a whole class of tasks, which boils down to finding similar objects. This is also called similarity search or nearest neighbor problem. The number of applications in which this method can be applied is amazing: spell check, document comparison, pattern recognition and search for similar images , recommender systems and cluster data analysis ... I found a number of libraries and descriptions of algorithms for solving the similarity search problem, such as Kd -trees, perceptual hashes , etc.

Me interests, whether it is possible to solve similar problems by means of relational databases. So far I've only been able to find an add-on for PostgreSQL, which allows efficient search by similarity. What other commercial or non-commercial products exist that offer universal solutions for solving the similarity search problem, and is it possible to find somewhere a comparative analysis of their advantages and disadvantages?

And finally, I will give an example of a specific problem that I would like to solve using similarity search methods:

Let's say there is a web service, like a job exchange, that allows people to find the right job for them, and allows employers to quickly find the most suitable employees for the available vacancies. Job seekers post their resumes, where they indicate data about themselves in a structured form (fill out forms), indicate their skills, etc. Employers, on the other hand, post information about open vacancies. The service, upon request, can find for the selected vacancy

Given:

An array of data about potential employees (let's say about IT specialists) - their resumes containing the name, age and set of skills, indicating the degree of mastery of each skill (for simplicity, a number from zero to one hundred units of "experience").
An array of data about open vacancies. For each vacancy, the desired characteristics of the employee are indicated, for simplicity: a set of required skills.

To find:

Find the k most suitable resumes for the selected vacancy.
Find the k most suitable vacancies based on the selected resume.

Thoughts on the solution:
Since the problem of similarity search is being solved, it will be necessary to calculate a measure of the distance between data elements. In our case, there are two classes of data elements: resume and vacancy. For simplicity, both of these classes of objects can be represented as skill vectors (comparison by such parameters as age, gender, etc. will be omitted for simplicity and in order not to be accused of discrimination).

Next, you need to determine the full list of available skills list, since the number of available skills will determine the dimension of the vector representing the resume / vacancy. Skills that are missing in the resume / vacancy, in the vector will correspond to elements equal to zero.

Further, one can compare resumes and vacancies by comparing the received skill vectors corresponding to them, similar to how documents are compared in the Vector Model .

The distance between two elements can be defined as the angle between two corresponding vectors:

where p _i and p _j are the vectors representing the resume / vacancy,
w _k,i and w _{k, j} is the value of the k-th skill of the compared vectors,
n is the number of all possible skills .

The point, of course, is not in creating such a web service, it just seems to me that this task is a good example on which you can check how well one or another DBMS copes with similarity search. If you know of existing products (relational or NoSQL databases) that can effectively solve this problem, let me know.

Reply

Answer the question

In order to leave comments, you need to log in

2 answer(s)

Z

ztxn, 2012-06-04
@ztxn

No, not that, then full compliance with the requirements will be equal in rank to complete non-compliance. Still need to think

Z

ztxn, 2012-06-04
@ztxn

I'm not sure that I correctly understood the task, the terms of the subject area little known to me are used in the formulation. But the most suitable resumes for the vacancy in Rsubd are selected somehow like this

select навыки_вакансии.вакансия
       ,требования_резюме.резюме
       ,count(*) as ранг
from навыки_вакансии,требования_резюме
where  навыки_вакансии.показатель = требования_резюме.показатель
group by навыки_вакансии.вакансия, требования_резюме.резюме
order by ранг desc