Database
TitanFighter, 2017-10-04 23:23:17

Who can advise on theoretical questions about the architecture of sites such as social networks and dating sites (I'm interested in the back end: databases, image storage)?

Good day.
When work hands you boring projects of a kind you've already done once or twice, with no need for any technology you haven't used before, you go stale. So for self-development I sometimes write something for myself, which helps me grow (thanks to this you can find either a new job or a raise, tested :)). I would like to try the technologies that high-load sites use.
I have been interested in this topic for a long time and have already read a lot from different sources (Habr, Stack Overflow, Medium and many more). Each piece is clear on its own, but I can't put it all together into a coherent picture.
I'm looking for advice, preferably from your own practice. First of all I'm interested in a development approach where the technologies and structure are chosen at the design stage so that they can be scaled later (it is better to ask those who have already been through this than to redo or rewrite something afterwards). At a minimum this includes replication and sharding, since I've read that "growing up" was very painful for some projects.
My biggest problem right now is which databases to use and in which cases. I have read and googled a lot on this too. Everyone answers "depends on the requirements". That's why I created this topic, which lays out those requirements :)
For example, this site describes the use cases of a whole bunch of databases, and as soon as I opened it my eyes went wide. Take them all and use them.
Time to get down to specifics. Take any social network (VK, Facebook, etc.) plus some dating sites (Mamba, Badoo).
Questions:
1. Which database is preferable for storing user profiles? SQL or NoSQL? PostgreSQL, MongoDB, other options? Maybe even Elasticsearch?
2. Which database should store frequently changing information, such as:
a) likes
b) a feed of top users (usually at the top of the site)
c) a constantly changing sorting of users who pay money to be the very first.
Redis? I know that it can save everything to disk before the server restarts, but what happens when the server freezes? The user paid money, the server hung, and the user is not very happy :) Then maybe Cassandra? If it is Redis after all, how do you decide how often (and maybe by what method) to "back up" the data by saving it to the main database?
d) which database should collect data about the user, for example for security (which IP they come from, how often they move between pages, what if they open 100 pages at once?) and for analytics?
3. Chats. I confess I haven't googled this yet. Where should chats be stored?
4. Geo data, or rather searching for users by coordinates within a certain area, how far they are from each other, etc. PostGIS? Maybe Elasticsearch?
5. Searching for users by age, gender, zodiac sign, school, city, again by geo radius, and a bunch of other attributes. Elasticsearch? I've worked with Solr, but something about it didn't click. I'm not drawn to it, there's no "oh, this is a cool product" feeling... Maybe I just haven't worked with it enough.
Regarding items 4 and 5, or rather Elasticsearch: is it needed, or can everything rest on the main database? If it is needed, where exactly? This raises a question: if, say, the main database holds the profiles, does it make sense to also load them into Elasticsearch, ending up with a second profile store (duplication), or is it better to put profiles directly into Elasticsearch (i.e. use it as the main database)? If profiles are stored in both, how should they be synchronized, and how often? For example, I worked on a project with the Python / Django / Haystack / Solr stack, and the Haystack documentation says it is better not to index Solr in real time, since real-time indexing loads the system heavily and periodic indexing is the better solution.
6. What would you use for image storage? Facebook Haystack Image Data Store?
I'm sorry for writing so much, but "as the question, so the answer". I wanted to show the scope and depth of the questions in order to get a "deeper" answer.
In short:
1. What database would you use for "main"\"static" data?
2. What database would you use for dynamic data? If in-memory, how would you "back it up" in case of a freeze? A replica? Is it worth periodically flushing the data to some on-disk database? If yes, how do you determine the interval?
3. Where to save chats?
4. What system should be used for storing geo data and for searching by user requests?
5. Do you need a system like Elasticsearch at all? If so, where is the line between what to take from the database and what to take from Elasticsearch?
6. Where\how to store photos?
And the common point for all items: I want to work with systems that don't cause a big headache when the project "grows up" (replication, sharding).
Thank you!


4 answer(s)
Andrey Shatokhin, 2017-10-05
@TitanFighter

1. Any relational database: MySQL, PostgreSQL, or whichever one you know.
2. Redis Cluster, but not as a primary trusted store, only as a cache.
3. Like everyone else: in a relational database.
4. All search goes in Elasticsearch, and all geo filtering too. And don't forget about lookup filters (a kind of join).
5. Elasticsearch is not a trusted store either. Update the mapping a couple of times and you'll see why.
6. GlusterFS/Ceph if you want to store it yourself; Amazon S3 if you have enough money for it.
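
Points 1 and 2 above can be sketched as a write-through pattern: the relational database is the trusted store, and the cache is a disposable copy that can be rebuilt after a freeze. This is only a minimal sketch under assumptions. It uses Python's built-in `sqlite3` in place of MySQL/PostgreSQL and a plain dict in place of Redis so that it runs without any servers; the table `paid_placements` and the class `PlacementStore` are hypothetical names.

```python
import sqlite3


class PlacementStore:
    """Paid "top of the list" placements: database first, cache second."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}  # stand-in for Redis (assumption): user_id -> amount
        conn.execute(
            "CREATE TABLE IF NOT EXISTS paid_placements ("
            "user_id INTEGER PRIMARY KEY, amount INTEGER NOT NULL)"
        )

    def record_payment(self, user_id, amount):
        # 1. Commit to the trusted store first, so a cache crash loses nothing.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO paid_placements (user_id, amount) "
                "VALUES (?, ?)",
                (user_id, amount),
            )
        # 2. Only after the commit succeeds, update the cache.
        self.cache[user_id] = amount

    def top(self, limit=10):
        # Serve the ranking from the cache, highest payment first.
        ranked = sorted(self.cache.items(), key=lambda kv: kv[1], reverse=True)
        return [user_id for user_id, _ in ranked[:limit]]

    def rebuild_cache(self):
        # After a cache restart or freeze, reload everything from the database.
        rows = self.conn.execute(
            "SELECT user_id, amount FROM paid_placements"
        ).fetchall()
        self.cache = dict(rows)
```

With this ordering there is no "how often to back up" interval to tune for payments: the cache never holds data that the database lacks. For high-volume, low-value data such as likes, batching writes may be a better trade-off.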

Stalker_RED, 2017-10-05
@Stalker_RED

You wrote a huge wall of text, but didn't you even try to read how the very giants you list did it? These are not secret techniques: almost all of them have dev blogs, many give conference talks, and there are lots of lectures on high load.
And here is the "deep" answer: if you think that there is a magic universal recipe like "all chats are stored in %databasename%" and "all photos are stored in %storagename%", then you are mistaken. You have to take the specifics of the project into account, compare different approaches on real data, and so on.

Vyacheslav Uspensky, 2017-10-05
@Kwisatz

For half of the optimization questions, google YAGNI.
For the database, take PostgreSQL or Oracle. Both have the tools to solve all your questions. Naturally, the database costs a lot of money.
Don't look at NoSQL at all. The only use case for NoSQL is fact tables (like a user's feed), where the data itself matters more than the relationships between entities.
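
As a rough illustration of keeping user search in the main database, here is a sketch that filters profiles by attributes and by radius. It uses `sqlite3` only so that it runs anywhere; with PostgreSQL the distance check would normally be delegated to PostGIS and a spatial index. The `profiles` table, its columns, and the function `find_nearby` are hypothetical.

```python
import math
import sqlite3

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def find_nearby(conn, lat, lon, radius_km, gender, min_age, max_age):
    # Cheap bounding-box prefilter that ordinary indexes can serve.
    # Simplification: ignores wrap-around at the antimeridian and the poles.
    dlat = math.degrees(radius_km / EARTH_RADIUS_KM)
    dlon = dlat / max(math.cos(math.radians(lat)), 1e-6)
    rows = conn.execute(
        "SELECT id, lat, lon FROM profiles "
        "WHERE gender = ? AND age BETWEEN ? AND ? "
        "AND lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
        (gender, min_age, max_age,
         lat - dlat, lat + dlat, lon - dlon, lon + dlon),
    ).fetchall()
    # Exact distance check on the remaining candidates, nearest first.
    hits = [(haversine_km(lat, lon, r[1], r[2]), r[0]) for r in rows]
    return [pid for dist, pid in sorted(hits) if dist <= radius_km]
```

Whether this stays fast enough at scale depends on the data; that is exactly the kind of thing to measure before adding a separate search system.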

Anton Tikhomirov, 2017-10-12
@Acuna

I'll tell you about storing pictures. In VK it is really simple: they just have a bunch of their own servers so as not to clog the hosting provider's channels with traffic, and the database stores direct links to the pictures, tied to the id of each post. The beauty of this method is that when moving to new servers you can recreate the same folder structure on them and no links need to change. Mere mortals, of course, don't use a VPS or even dedicated servers to store anything at all: the traffic would clog the provider's channels at as little as 1000 hosts a day, and disk space is unimaginably expensive. They use storage systems such as Amazon S3 or Google Cloud Storage, where you pay for the space occupied and for outgoing traffic (users viewing photos is, in effect, downloading them). It works out about 10 times cheaper. Seriously.
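
The "same folder structure on every server" idea can be sketched as follows. This is an illustration under assumptions, not VK's actual scheme: the hash-based directory layout, the function names, and the example host names are all hypothetical. In this variation the database keeps only the relative path and the host is prepended at read time, so moving files to another server (or to S3-style storage) changes configuration rather than stored rows.

```python
import hashlib


def relative_path(image_id, ext="jpg"):
    """Derive a stable relative path from the image id.

    Two levels of hash-based directories keep any single folder small.
    """
    digest = hashlib.sha1(str(image_id).encode("ascii")).hexdigest()
    return f"{digest[:2]}/{digest[2:4]}/{image_id}.{ext}"


def public_url(base_url, path):
    """Join the configurable host with the path stored in the database."""
    return f"{base_url.rstrip('/')}/{path}"


# The same stored path resolves on either host (hypothetical host names).
path = relative_path(12345)
old_url = public_url("https://img1.example.com", path)
new_url = public_url("https://img2.example.com/", path)
```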
