Which database should I choose for a messenger?
Hello!
I'm deciding which database to use for a messenger. I understand that my messenger will be used by one and a half people, but I'm writing it for practice, to study technologies that are new to me, and to put it in my portfolio, so I'm pretending that it will have a million simultaneous users. I'm trying to design the server side so that it scales easily by simply adding new servers.
In general, the server-side code is almost ready. I used Dart (a weird choice, I know), WebRTC, Redis, RabbitMQ and Postgres. Group chats work, video calls work, and everything scales fine except the database. As far as I know, Postgres does not scale to multiple nodes without jumping through hoops. I am not a database engineer, I don't know much about this topic, and I don't think I can figure out how to scale Postgres properly. Too stupid :) I picked PG just because I don't know much about it and, to be honest, I thought it scaled out of the box, so I didn't even google anything.
Now I'm looking for other options. At first I looked at Cassandra. It is easy to scale and is used in many similar projects. You can use the group id as the partition key, so that all messages from a group are stored on one node. But Cassandra has no joins. When a user comes online, you need to pull unread messages from all the groups he is a member of, and without joins that means making a bunch of separate requests, which is obviously not a great idea.
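To make the concern concrete, here is a minimal in-memory sketch of the access pattern described above. It is not Cassandra code: `PartitionedMessageStore`, `read_after` and `unread_for_user` are hypothetical names, and a plain dictionary stands in for partitions keyed by group id. It only shows that the number of reads grows with the number of groups the user belongs to.

```python
from collections import defaultdict


class PartitionedMessageStore:
    """Toy stand-in for a table partitioned by group_id (hypothetical, in-memory)."""

    def __init__(self):
        # partition key (group_id) -> list of (message_id, text)
        self.partitions = defaultdict(list)
        self.queries = 0  # counts partition reads, to make the cost visible

    def append(self, group_id, message_id, text):
        self.partitions[group_id].append((message_id, text))

    def read_after(self, group_id, last_read_id):
        """One single-partition query: messages newer than last_read_id."""
        self.queries += 1
        return [m for m in self.partitions[group_id] if m[0] > last_read_id]


def unread_for_user(store, last_read_by_group):
    """Without joins, the caller issues one query per group the user is in."""
    return {
        group_id: store.read_after(group_id, last_read_id)
        for group_id, last_read_id in last_read_by_group.items()
    }


store = PartitionedMessageStore()
store.append("g1", 1, "hi")
store.append("g1", 2, "anyone here?")
store.append("g2", 1, "standup at 10")

# The user has read message 1 in g1 and nothing in g2 or g3.
unread = unread_for_user(store, {"g1": 1, "g2": 0, "g3": 0})
```

Three groups cost three reads here. In a real cluster those reads can be issued in parallel, so whether this is a problem depends on how many groups a typical user has.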
Then Google led me to CockroachDB. It scales easily, uses Postgres syntax, and everything seems perfect. It is not the fastest database, but I don't need super fast inserts, because if the user is online, the message is sent to him before it is written to the database.
What would you advise? What should I choose? Maybe some other option? There are so many different databases that my head is spinning.
And please, no advice along the lines of: I don't need any of this, a single Postgres instance is plenty, premature optimization is stupid, launch in production first and only think about scaling if Postgres suddenly stops coping, and so on. I know that I don't need any of this. This is, one might say, just practice: training in implementing a more or less complex architecture.
Popular NoSQL solutions do not offer joins, because a join leads to network round trips to several shards anyway, even if that is hidden from the user. So in a distributed system you will not get joins the way you do in Postgres. Moreover, if you try to shard Postgres, you will run into the same problem of joins between shards, and you will have to give such joins up.
The problem is that you are approaching data storage in NoSQL the same way as in an RDBMS. That is a mistake: in distributed systems it is quite acceptable to store redundant data. For example, you can write a new message to several shards: to the shard where the group's messages are stored and to the shards of the recipients. This can be driven by an event that is generated when a message is created; the event goes to RabbitMQ, and from there to subscribers that write the message to the necessary shards.
This way, you can always read a user's new messages from a single shard. People do this in different ways; the main idea is to make collecting the data for a given application screen as simple as possible.
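The write path described above can be sketched roughly as follows. This is a minimal in-memory illustration under stated assumptions, not a real RabbitMQ or database client: a `deque` stands in for the queue, a list of dictionaries stands in for the shards, and `NUM_SHARDS`, `shard_for`, `send_message`, `run_subscriber` and `read_inbox` are hypothetical names.

```python
import zlib
from collections import defaultdict, deque

NUM_SHARDS = 4  # assumed shard count, for illustration only


def shard_for(key):
    # Stable hash, so the same key always maps to the same shard.
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


# shard index -> {key -> list of messages}; stands in for separate database nodes
shards = [defaultdict(list) for _ in range(NUM_SHARDS)]
events = deque()  # stands in for a message queue such as RabbitMQ

group_members = {"g1": ["alice", "bob", "carol"]}


def send_message(group_id, sender, text):
    """Producer side: only publish an event; the shard writes happen later."""
    events.append({"group": group_id, "sender": sender, "text": text})


def run_subscriber():
    """Consumer side: copy each message to the group's shard and to each recipient's shard."""
    while events:
        ev = events.popleft()
        group_key = "group:" + ev["group"]
        shards[shard_for(group_key)][group_key].append(ev)
        for user in group_members[ev["group"]]:
            if user != ev["sender"]:
                inbox_key = "inbox:" + user
                shards[shard_for(inbox_key)][inbox_key].append(ev)


def read_inbox(user):
    """A single-shard read returns all of the user's new messages."""
    key = "inbox:" + user
    return list(shards[shard_for(key)][key])


send_message("g1", "alice", "hello")
before = read_inbox("bob")  # the subscriber has not run yet
run_subscriber()
after = read_inbox("bob")
```

The message is stored once per recipient plus once for the group, which is the redundancy being traded for a cheap read. Bob's inbox stays empty until the subscriber has processed the event, so a reader can briefly see stale data.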
For example, social networks are advised to precompute the news feed. The system accepts a post from a user, and then a service spreads this post in the background to all users (shards) that can see it. As a result, displaying the news feed becomes a trivial task. You also need to be prepared for the fact that distributed systems adopt eventual consistency instead of transactional consistency. Simply put, not all users will see a new post in their feed instantly but only after a while, and for large projects like Facebook or Amazon this is OK. Because of this, on Facebook you can sometimes refresh the feed every second and at some point get a new post that was added 1 minute ago.
You can pick any popular database with out-of-the-box sharding support that you like or are more familiar with. If you are familiar with Cassandra and use it well, or you know Mongo, take that. If you don't know either, read up on the pros and cons of both systems and decide for yourself what suits you best.
Nobody has said anything about CockroachDB. After all, it meets the asker's requirements, and it really does work with joins and transactions.
Master-slave replication is useful if one server fails, but it has its limits in terms of load; and then what will you do? A more or less complex architecture means building the database layer into the application architecture so that it scales horizontally, for example by sharding. Therefore, I recommend choosing a system that you know how to work with and that supports replication, but implementing it in such a way that you scale by adding servers to your service rather than to the database server.
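One way to put the shard choice into the application itself is sketched below. The answer does not name a routing scheme; rendezvous (highest-random-weight) hashing is used here purely as an illustration, and the server names and the `pick_node` helper are hypothetical.

```python
import hashlib


def _score(node, key):
    # Deterministic pseudo-random weight for a (node, key) pair.
    digest = hashlib.sha256(f"{node}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_node(nodes, key):
    """Rendezvous hashing: the node with the highest score for this key owns it."""
    return max(nodes, key=lambda node: _score(node, key))


nodes = ["db-1", "db-2", "db-3"]  # hypothetical database server names
keys = [f"group:{i}" for i in range(1000)]

before = {k: pick_node(nodes, k) for k in keys}
after = {k: pick_node(nodes + ["db-4"], k) for k in keys}

# Keys whose owner changed after adding a server.
moved = [k for k in keys if before[k] != after[k]]
```

With this scheme, adding `db-4` reassigns only the keys that now score highest on the new server, and every other key stays where it was. The data for the moved keys still has to be copied to the new server, which this sketch does not cover.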
Absolute nonsense.
That is, yes, for chats I would probably use something else, like Cassandra or even DynamoDB.
But your reasoning has no basis. PostgreSQL is the most versatile choice; it is great for sharding, scaling, replication and other scary words.
First, the most important thing to do is to stop saying SCALING. You have used the word more times than all the inhabitants of our country in its entire history.
Second, about scaling: read this answer of mine and this discussion.
Hello.
This is an architecture level question, not a programmer or developer level question.
And frankly, I don't quite understand what exactly you want.
To learn DevOps practices? That is one approach, and it means a deep study of scaling some huge database.
To show how cool you are as a developer? That is not what is required of a developer.
To solve a specific problem, namely handling 1M users online, whether in chats or in anything else? Then it is all quite ordinary.
Everyone who comes to HighLoad and DevConf (there is an excellent master class from Borodin there) talks about the same approach: a spot architecture. Badoo is an example: they have MySQL and 450M users.
At ManyChat, these are called galaxies. They have close to 200M chat users daily, and they run PostgreSQL.
PG has two cool open-source forks that support horizontal scaling to hundreds of nodes and petabytes of data: Citus DB for OLTP workloads (QPS ~= TPS, which seems to be your case) and Greenplum for OLAP (few QPS, many TPS, analytics, joins, window functions, etc.). Both are fully PG-compatible in syntax.