What data storage/processing technology should be used to implement the object similarity algorithm (recommendations)?

K

kunya2014-04-22 16:39:07

MongoDB

kunya, 2014-04-22 16:39:07

Hey!
Faced a performance problem when implementing the recommendation algorithm.
Input data:
There are about 60,000 records with user ratings, of the form {user_name, item, score} as a CSV file on disk, as well as in MongoDB.
Task:
For each pair of items, calculate their similarity based on user ratings.
Problem: Performance. I tried to work with the database and pull out the scores for each pair, but because all possible pairs n (n-1) / 2, it turns out too long; Lots of database access. Then I thought about loading everything into memory and calculating there, but it’s not clear how to scale with an increase in the number of estimates (what if all the data does not fit in memory?). I write in Node.js, if that matters.
Maybe there are special technologies for such a task? Prompt, in what direction to look:
- more effective implementations of the algorithm, with a small number of calls to the database?
- write some tricky MapReduce / Aggregate?
- look at databases based on graphs?
- something other?
Thank you!

Reply

Answer the question

In order to leave comments, you need to log in

3 answer(s)

A

Andrew, 2014-04-23
@kunya

In terms of theory, I would suggest a "k-Nearest Neighbors" method between buyers. At the same time, keep the value of k somewhere in the range of 20-50, and offer weighted proportionally to the degree of proximity those products that were also positively evaluated by k neighbors.
In terms of practice, I would create an auxiliary table where for each buyer his k neighbors are stored (which is why I propose to limit k to about 30) and the normalized distance to them. This table will be enough to offer recommendations with high speed.
And the table itself should be recalculated in a separate stream at lightly loaded times once a week or once a month (depending on your turnover). Let it take an hour or two. Upon readiness - pour into the battle table of neighbors.
If you are interested in my opinion about the distance formula between buyers or the weighted average offer algorithm - ask ...

_

_ _, 2014-04-22
@AMar4enko

What is similarity? Linear relationship of one score to another?
You can do a MapReduce to do the initial calculation and update the ratings, push the rating into an item and search for a range.
The main thing is not to do a complete recalculation each time.

M

mir0shnik, 2014-05-22
@mir0shnik

PredictionIO