Social media
Dunadan, 2012-03-23 16:59:42

How to design an "Event Feed" for a social project?

The problem statement is as follows:
1) Each user has friends (in the extreme case, everyone is friends with everyone)
2) Each user can join an unlimited number of groups
A user can generate content himself (on a personal "Wall", as on VKontakte and FB), and content can also be generated by components within the project.
I would like to build a combined feed of updates: the user's statuses/thoughts, those of his friends, internal events, and group events, shown in a single scrollable list.
The backend runs on MySQL, but to me it is fairly obvious that the "usual" relational approach is not really applicable here because of the constraints involved: with, say, 2500+ friends, a single WHERE clause grows to monstrous proportions.
So I am looking towards NoSQL solutions, but first I just want to understand the algorithm behind such a feed, and only then pick the tools.
Should I denormalize, i.e. maintain a separate feed for each user (but then leaving a group or unfriending someone can make updating that list take a huge amount of time)?
Maybe there are articles on this, or recognized implementation options?

7 answers
egorinsk, 2012-03-23
@egorinsk

On Facebook and VKontakte, the feed data is generated by a daemon written in a compiled language, which also caches a lot of information in memory. Can you manage that? If you don't know C, you can try writing it in Node or Java, or in PHP as a last resort; if you design it for scaling, that will also be OK for the time being.
The algorithm is roughly the following:
1) select the ids of the user's friends
2) select the latest events for each of them
3) weed out the events hidden by privacy settings
4) glue the lists together, sort them by date, and send them to the browser
And with growing load and friend counts, all of this will quickly bring a MySQL server down. A simple back-of-the-envelope calculation: the average user generates 10 events per day, so with even a few thousand users the events table quickly jumps to millions of rows. And constructions like JOIN, firstly, load the server, and secondly, do not shard.
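
A minimal Python sketch of that read-time merge; friend_ids_of(), recent_events() and visible_to() are hypothetical stand-ins for the storage layer, not a real API:

import heapq
from itertools import islice

def build_feed(user_id, limit=50):
    friend_ids = friend_ids_of(user_id)        # 1) ids of the user's friends
    # 2) each friend's latest events; each list is already sorted newest-first
    streams = [recent_events(fid, limit) for fid in friend_ids]
    # 4) glue the lists together, keeping the newest-first order
    merged = heapq.merge(*streams, key=lambda e: e["created_at"], reverse=True)
    # 3) weed out events hidden by privacy settings
    visible = (e for e in merged if visible_to(user_id, e))
    return list(islice(visible, limit))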

Anatoly, 2012-03-23
@taliban

There is such a thing as a cache. Caches come in different kinds; a cache is not just a query result saved to a file somewhere. A simple example is karma on Habr: I'm sure they store all the data in their database, who gave whom a plus or a minus, but all we see is 3 (+4 −1). That is a kind of cache of the needed data kept in a separate place. Build the relations specifically for this feed, and don't be afraid to duplicate data; duplicates sometimes help performance. Start by making an activity table:
id, user_id, event_desc, event_date
and write a row into it for each activity. This impossibly simple table can display all of a user's activity, in a primitive form, with a single query. Develop the idea further as convenient =)
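
For illustration, a minimal self-contained sketch of that table in Python (sqlite3 here only to keep it runnable; the MySQL DDL would be analogous, with event_date as DATETIME):

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE activity (
        id         INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL,
        event_desc TEXT    NOT NULL,
        event_date TEXT    NOT NULL
    )
""")
# This index makes "one user's activity, newest first" a single cheap query.
db.execute("CREATE INDEX idx_activity_user_date ON activity (user_id, event_date)")

db.execute("INSERT INTO activity (user_id, event_desc, event_date) VALUES (?, ?, ?)",
           (42, "posted a status", "2012-03-23 17:00:00"))
rows = db.execute(
    "SELECT event_desc, event_date FROM activity"
    " WHERE user_id = ? ORDER BY event_date DESC LIMIT 20", (42,)).fetchall()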

Panacea, 2012-03-23
@Panacea

To be honest, I did not understand the task. It doesn't seem to be anything complicated.
In pseudocode it looks like this:
select news.*
from news, users_users as friends
where friends.left_user = :currentUser
  and friends.right_user = news.user_id
union
select news.*
from news, groups_users
where groups_users.user_id = :currentUser
  and groups_users.group_id = news.group_id
Or did I miss something in the conditions?

Zelgadis, 2012-03-24
@Zelgadis

Here is another case where an RDBMS is not really needed, and you should use something like MongoDB / CouchDB, or Neo4j if the data is highly connected.
First build an array of all the ObjectIds the user is associated with (one query), and then in one pass take all the events whose author is in that array. Or, via DBRef, find the events that have a link to the current user. The first option will be faster, but the second is less code.
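
A minimal pymongo sketch of the first option; the collection and field names (subscriptions, targets, events, author_id) are assumptions, not from the answer:

from pymongo import MongoClient, DESCENDING

db = MongoClient()["social"]

def feed(user_id, limit=50):
    # one query: every ObjectId the user is associated with (friends, groups, ...)
    sub = db.subscriptions.find_one({"user_id": user_id}) or {"targets": []}
    # one more query: all events whose author is in that array
    return list(db.events.find({"author_id": {"$in": sub["targets"]}})
                         .sort("created_at", DESCENDING)
                         .limit(limit))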

denver, 2012-03-25
@denver

Facebook and Twitter use denormalization, and they don't "clean up" the feed after you leave a group or unfriend someone. Simply put: whatever got into the feed stays there. Besides, the FB feed seems to be limited in time, which is reasonable. Twitter also had a problem with the huge feeds of users with 20k+ subscriptions; such people were simply banned at one point. In general, store the data the way it is most convenient to read it; there should be few ways of reading it. Normalization is not for large volumes. For that matter, even denormalized MySQL is not for large volumes, unless it is split into instances (say, at most 100 users with all their data/feeds per database).
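
A minimal sketch of that write-time (fan-out) denormalization; followers_of() and the in-memory dict are hypothetical stand-ins for a real store, and the cap implements the "feed limited in time/size" idea:

FEED_LIMIT = 1000          # assumption: keep only the newest N entries
feeds = {}                 # user_id -> newest-first list of event ids

def publish(author_id, event_id):
    # copy the event into every follower's feed at write time
    for follower_id in followers_of(author_id):
        feed = feeds.setdefault(follower_id, [])
        feed.insert(0, event_id)
        del feed[FEED_LIMIT:]  # cap instead of ever "cleaning up" on unfriend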

mrspd, 2012-03-25
@mrspd

I was interested in this myself recently and ran into this article: myforce.ru/tyeoriya-i-praktika/delaem-lentu-obnovlenij-na-mongodb-php/

vitalybaev, 2012-03-26
@vitalybaev

Take a look at Redis and its Sorted Sets. Each set is one person's feed, with event IDs as the elements and, say, the event time as the score. Redis can handle about 200,000 requests per second, keeps the set sorted on the fly, and supports both normal limit/offset queries and queries of the form score < x.
The event object itself is a serialized array (or whatever is more convenient for you) and is fetched by its ID from memcache, for example. Essentially, as comrade denver wrote, use denormalization.
The described scheme works successfully on a fairly high-traffic site and easily copes with the case when a person with 30,000 followers creates a post.
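
A minimal redis-py sketch of this scheme, using the timestamp-as-score variant the answer allows; the feed:{user_id} key name and the 10,000-entry cap are assumptions:

import time
import redis

r = redis.Redis()

def push_to_feed(user_id, event_id, ts=None):
    # member = event id, score = event time; Redis keeps the set sorted on the fly
    r.zadd("feed:%d" % user_id, {str(event_id): ts or time.time()})
    # trim to the newest 10,000 entries
    r.zremrangebyrank("feed:%d" % user_id, 0, -10001)

def read_feed(user_id, offset=0, limit=20):
    # normal limit/offset paging, newest first
    return r.zrevrange("feed:%d" % user_id, offset, offset + limit - 1)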
