Social network wall: data sampling

excalibur2013-03-30 10:54:15

PHP

excalibur, 2013-03-30 10:54:15

We will talk about the social network wall, for example, as on VKontakte or Facebook, i.e. a wall post can contain different types of content: news, blog post, multiple images, poll, etc.

I've settled on the following database structure for now:

Wall table:

id | description
1 | "This is the description of the first entry on the wall"
2 | "And this is the second post on the wall"

Table "wall_element":

id | wall_id | components | element | element_id
1 | 1 | image | item | 1
2 | 1 | image | item | 2
3 | 1 | image | item | 3
4 | 2 | catalog | item | 1
5 | 2 | catalog | item | 2
6 | 2 | catalog | shop | 1
7 | 2 | catalog | shop | 2

Examples of records, like examples of table structures, are abstract. I think everything is clear with the “wall” table. The "wall_element" table contains a list of attached different content types that are added to the wall post. The fields in the table mean the following:

wall_id is the id of the post on the wall,
component is the name of the component. My components are, for example, a catalog, image, news, blog ...
element is a structural unit of the component, for example, in the “catalog” component there can be several such structural units, for example, “item” or “product” is the product itself , "shop" is a store, "catagory" is a product category, etc. Also with the "image" component, which is responsible for loading and displaying photos.
element_id is the id of the structural unit of the component, i.e. id of the product, store, photo, blog post, news, etc. Structural units of a component also have their own tables, such as catalog_product, catalog_shop, image_item, blog_item, news_item, etc. which contain information about them.

Question: How to pull information from the database correctly? One request is not enough here, and if it is possible somehow, then you don’t want to, since the request will be heavy. From the "wall_element" table, I will get the id of, for example, all the photos, or all the videos that are attached to the wall entry. What kind of requests are best to pull out these id? And what queries to pull data from the image_item table, i.e. How can I retrieve the photo data?

PS Of course, I know how to get everything out, but other points of view and preferably more experienced developers are needed.

UPD: All the same, I’ll write as I see it:

1) I’ll make the first query, with which I pull out 10-20 records from the “wall” table - the query is very easy, even with several million records there should be no problems.

2) In a cycle for these 10-20 records, I pull out data from the “wall_element” table. In each iteration, I will receive a list of photos, blog posts, news, etc. Now we need to process this data, i.e. give them to each component they belong to and get html back. But you need to somehow group those same 20 photos - and get information from the “image_item” table or 20 requests (very easy, since the request will go by id) or somehow more elegantly with one request ... This is how I collect in one iteration html of different structural units of the component…

I.e. there are pluses here that the requests are easy, and the minuses are that there will be a lot of requests for rendering the page ... Caching, of course, has not been canceled, but first you need to do everything as correctly as possible ...

Answer the question

In order to leave comments, you need to log in

8 answer(s)

Dmitry Guketlev, 2013-03-30
@Yavanosta

UNION is not so bad. Especially UNION ALL where it is not required to select unique records. At the very least, it will be no slower than executing each request separately, due to the fact that it can theoretically be executed in parallel on several cores.
But if it doesn't fit, let's get away from the stove. You postulate the following "we can't select everything in one query so as not to interfere with the model".
This means that the minimum number of requests is equal to the number_of_data_types. To do this, you need to execute separately the queries that I have in UNION, but then manually sort this mess by dates in the code and group by wall_id. This is a bad way.
I offer this option:

SELECT
  w.id,
  w.description,
  COUNT(i.id) i_cnt,
  COUNT(bp.id) bp_cnt
FROM wall w

INNER JOIN wall_element we_i
        ON we_i.wall_id = w.id
INNER JOIN image i
        ON i.id = we_i.element_id

INNER JOIN wall_element we_bp
        ON we_bp.wall_id = w.id
INNER JOIN blog_post bp
        ON bp.id = we_bp.element_id

ORDER BY w.timestamp

GROUP BY w.id, w.description

We get a list of posts, and the number of content items associated with each post. Next, we share the responsibility in the code as follows:
1. The main code executes this request, and runs through the results. Looks at what is in a particular entry. For example, it sees that the picture and the product are from the store
2. The main code calls the appropriate renderers, passing them the wall_id record identifier.
3. The renderer itself (with its own request) already gets the data that it needs from where it wants to, into the model that it likes. This completely decouples renderers from each other and from the calling code.
In total, we have requests: number_of_posts_on_the_wall * average_number_of_content_units + 1. I think that on average, 1 post on the wall will have a link to 1 unit of content (for example, five photos that the image renderer can pull out with one request).
We have, on average, the number of_records_on_the_wall + 1 by request. It is quite acceptable.

jarvis, 2013-03-30
@jarvis

I would do this: (read to the end)
1) create a wall table with the fields id, author_user_id, author_user_name, owner_user_id, wall_content, where
-id is the identifier of the entry in the table
- author_user_id is the identifier of the user who created the post
-author_user_name is his name, enter here same, in order not to make an extra request to the user tables
-owner_user_id - the identifier of the owner of the wall
-wall_record_content - the content of the record. is a json with the fields title, description, images, videos, polls and whatever you have there. example

content: {
title: SuperProPost,
Mysuper_Puper_Images: [{
Photo_link1, Photo_link1

}],
Mysuper_Puper_Video_links[{
videolink1, videolink2....
}]
и т.д,
}

The structure will actually be a bit more complicated, but I think you can handle it.
And then, in the script itself, process this json and form a template. When writing and changing the script, generate this json and write it to the database.
As a result, everything will be received in one request, the load on the database will be minimal. No jones.

Select  * from wall_1234(лучше использовать шардинг чтобы не было по 20 млн записей в одной таблице) where owner_id=Owner_id_from request Limit 0, 20 (к примеру)

Allocate a separate wall_groupId table for every 50 users. Thus, all the data for the wall will be received, attention, in ONE query from the table for 100-300 thousand records, for which the database will thank us
. I hope the idea is clear, correct it to your needs.

jarvis, 2013-03-31
@jarvis

From the tables you don't have to pull for the whole wall, it's too hard.
About "caching"? What's the problem with caching the response to my request? The same memcashed will perfectly cache the results of queries

Select  * from wall_1234 where owner_id=Owner_id_from_request Limit 0, 20

Okay, let me explain the mechanism in more detail. The “wall_record_content” field is the whole point. It is extremely necessary for us and it is it that allows us to display data very easily. But for this we will have to pay in that it will be a little more difficult to enter them. But this is not scary, in the social network they read much more often than they write.
Begin.
The user creates a post on a friend's wall.
Preparing a request to the server. First ~~, boil the water,~~ include the user's personal data as the author (name and id in the application) and the friend's id for the id_owner_user field in the request. Great, we are great, but that's not all.
The user then enters a description. Wow. We also insert this data into the description parameter of our request. Move on.
The user attaches photos and videos. In order to be able to view the photo, we need the address of a small picture (hereinafter I will call it a “preview”) and the address of the page where we will view the photo in full size. Let's say we have several ways to attach a photo: drag-and-drop if the photo has already been added to the server, or immediately from the local computer (or phone)
If the photo was previously uploaded to the server (that is, this data is already recorded in the database), then simply using ajax request we get this data from the database. (It may not be necessary if we already have the necessary data for the photo on the page). The received data is inserted into the Post_images array parameter.
If not, then FIRST WE UPLOAD THE PHOTO TO THE SERVER using ajax request (it will be easier with it) and enter the photo data into the images table. When it's done, we return the data with the ajax request. We include them in the Post_images parameter. Not bad.
Now we have a parameter that contains data about the photos that are attached to the post.
Same with video and audio. Super!
Now we have a fully formed Post request with parameters author_user_id, author_user_name, owner_id, post_description, Post_images, Post_videos. We send to the server. And don't forget about CSRF protection. We love our users and care about them.
On the server, we receive the request data, process it and write it to the database as I wrote in the first post.
An example of json-a that will get into the database.

content: {
description: Срочно оцени эти супер-крутые фотки! ,
Mysuper_Puper_Images: [{
Photo_link1{
preview:"images.site.com/dsfgdfgsdg.jpg"
link:"www.site.com/viewphoto/345345345345"
},
Photo_link2{
preview:"images.site.com/dsfgdfgsdg.jpg"
link:"www.site.com/viewphoto/345345345345"
}

}],
Mysuper_Puper_Video_links[{
videolink1{}, videolink2{}
}]
и т.д,
}

That's all, now we have a user's post in the database.
Friend Pavel Nedurov goes online. First of all, he goes to his wall to see what pretty girls have written to him during the day. When loading his wall, we form a request

Select  * from wall_1234 where owner_id=Owner_Павел НедуровID Limit 0, 20

Our script receives from the database (or memcashed) the result of this request. We process this data and display it in our template, not forgetting to escape everything to prevent xss attacks. We love users!
As a result, the page will display:
1) the text of the entry
2) small pictures, by clicking on the preview, we go to the page where you can view the large version.
3) small pictures with a frame from the video, by clicking we go to the page where you can view the video (hmm, somewhere I already heard this)
Pavel liked the post of his friend! But here's the bad luck, pretty girls didn't write anything today. Don't worry bro, your time will come. And this concludes our program. In the next transfer times we will make a news feed. Stay in touch.
As a result, thanks to my method, it is very easy and fast to display records, because we get all the data in one request. And easy to cache. And indexes can be used too. Pictures are stored in the images table, videos are stored in videos. Everything can be easily viewed. But with the implementation of the input of the post, you will have to sweat a little, but the result is worth it.
PS Don't take anything personally, it's just my style.

Nikita Gusakov, 2013-03-30
@hell0w0rd

Use JOINs or what do you want to hear?

Dmitry Guketlev, 2013-03-30
@Yavanosta

How can you pull in "those 20 photos that you added to the wall" if you do not have a user ID anywhere.
If just all the photos on the wall, then like this:

SELECT
  w.id,
  w.description,
  we.*
FROM
  wall w
INNER JOIN wall_element we
  ON w.id = we.wall_id

Is this what you wanted?

Nikita Gusakov, 2013-03-30
@hell0w0rd

What you write in the question, and if you do so, is Hindu code. You don't have to solve SQL problems with PHP or another language. Especially requests to a DB in a cycle/recursion, still somehow.
You should have both with classes - each class is responsible for its own - and in databases, each table is responsible for its own. In the picture table you store everything you need according to the pictures, in the video table, you understand. But in the table of posts there should be a connection with all the components of the post, as well as a connection with the author.
Martin Gruber - Understanding SQL - they say that the translation is not the best. However, I really liked the book, and I recommend it to you.

Dmitry Guketlev, 2013-03-30
@Yavanosta

By the way, sorry for the offtopic, but I just looked at your site briefly. First looked at this:
excalibur.com.ua/blog/php-c3/%D0%B3%D0%B5%D0%BD%D0%B5%D1%80%D0%B0%D1%82%D0%BE %D1%80-%D0%BF%D0%B0%D1%80%D0%BE%D0%BB%D0%B5%D0%B9-%D0%BD%D0%B0-php-i22.html
You are trying determine the one that works faster, and both INSIDE the loop do count($arr) and strlen($chars). Didn't you think that if you take out the calculation of the length of the array and the length of the string in a loop, then your function will speed up by about $length times? And after that you reproach someone for lack of experience?

Stepan, 2013-03-30
@L3n1n

If I understood correctly.
With one request through joins you pull out all the data. A field which, for example, description= NULL indicates that this is not a record, but a record element (photo, video ...). In the loop, when you process a request, you immediately see how to render it.
That is, the whole idea is that you reduce the result of the request to one format, which you already direct with the code where you need it.