PostgreSQL
alexdora, 2022-01-16 13:25:57

What database should we choose for big data?

Good afternoon, everyone.

We ran into a dilemma during development. There is incoming data at 100-150k rps (read: over 100,000 individual inserts per second). Right now it all works like this:
Everything is split across more than 800 tables in the database (the table itself acts as a kind of pointer to a data pool, like an index), and inside the tables there are indexes on time (split by year / day).
And all of this is in MySQL...
Back in the first versions we already implemented a buffer that combines 100 inserts into one large statement and sends it to the database, because MySQL, to put it mildly, simply couldn't cope if you poured the inserts in one at a time (a rough sketch of that buffer is below). Back then there were fewer requests. Now it's time to update, and we decided to come up with something different.
Splitting the tables was the first update: it let us get rid of one index and put everything into several tables instead of one large one. That added convenience, reduced disk space and sped the system up (fewer indexes, faster inserts).
Bottom line: we decided to move off MySQL; in our humble opinion it is not suited to the task.
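Roughly, the buffer mentioned above looks something like this (a simplified sketch with a made-up `events` schema, using pymysql; the real code is more involved):

```python
# Simplified sketch of the batching insert buffer (hypothetical schema), using pymysql.
# Rows are accumulated in memory and flushed as one multi-row INSERT,
# instead of firing 100k+ single-row INSERTs per second at MySQL.
import pymysql

BATCH_SIZE = 100

class InsertBuffer:
    def __init__(self, conn):
        self.conn = conn
        self.rows = []

    def add(self, pool_id, ts, value):
        self.rows.append((pool_id, ts, value))
        if len(self.rows) >= BATCH_SIZE:
            self.flush()

    def flush(self):
        if not self.rows:
            return
        with self.conn.cursor() as cur:
            # executemany combines plain INSERT ... VALUES into one multi-row statement
            cur.executemany(
                "INSERT INTO events (pool_id, ts, value) VALUES (%s, %s, %s)",
                self.rows,
            )
        self.conn.commit()
        self.rows.clear()

conn = pymysql.connect(host="db-host", user="app", password="...", database="metrics")
buf = InsertBuffer(conn)
buf.add(42, "2022-01-16 13:25:57", 1.5)
buf.flush()
```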
The main wish list:
1. Reduce the amount of space the data occupies
2. Get rid of the self-written buffer and just insert directly
3. Clustering (we currently have 3 separate 12 TB MySQL databases in "manual" mode: the servers where each data pool lives are hard-coded by hand in the configs)
4. Reads from the database are single large queries, for example: give me the data from such-and-such a pool for such-and-such a time period. Query time just shouldn't be absurd; up to a second is fine. The selects are large, but there are only a few of them; mostly chunks of data are sent off for recalculation at night. (A sketch of this kind of query follows after the list.)
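To illustrate point 4, the read pattern is roughly this (same made-up schema as above, simplified):

```python
# Sketch of the read pattern from point 4: one large range select
# per data pool and time window (hypothetical schema, using pymysql).
import pymysql

conn = pymysql.connect(host="db-host", user="app", password="...", database="metrics")

def fetch_pool_range(pool_id, start_ts, end_ts):
    """Return all rows for one data pool within a time window."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT ts, value FROM events "
            "WHERE pool_id = %s AND ts >= %s AND ts < %s "
            "ORDER BY ts",
            (pool_id, start_ts, end_ts),
        )
        return cur.fetchall()

# e.g. the nightly recalculation pulls one day of data for a pool
rows = fetch_pool_range(42, "2022-01-15 00:00:00", "2022-01-16 00:00:00")
```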

We asked friends of friends; they said such tasks are usually solved with Cassandra or PostgreSQL.

I read up on Cassandra and like everything about it (built-in clustering of sorts), but I still don't understand how things stand with the indexes, specifically the 128-bit key. If I understood correctly, that immediately rules out point 1, and it's also unclear how it handles insert throughput. By the looks of it, it is built for a scenario with, say, 1,000 different clients reading and writing. We don't have clients like that: we have one service that writes the data and one application that makes specific read requests.

I have never worked with PostgreSQL, but I know what it is. Maybe someone who works with it can simply comment on how this marvel behaves under the conditions described, namely how it digests individual inserts in large volumes.

In general, if you have any thoughts on a task like this, I'd be glad to hear any comment. Otherwise ideas are already surfacing to store everything in the file system and keep the pointers in MySQL :) That would probably be the most economical and possibly the fastest, but it means writing a separate driver, and oh how I don't want to.

I've posted a follow-up in a separate post. Thanks to everyone.

3 answers
Alexander Filippenko, 2022-01-16
@alexdora

Sounds like a task for ClickHouse; it seems they recently fixed the performance of single inserts without a buffer. Even if not, there is a scheme where you write all events to Kafka and read them from there via a materialized view into a ClickHouse table. That scheme definitely works and will handle the loads you describe. Plus excellent data compression.
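As a rough illustration of that Kafka → materialized view scheme (table names, columns and Kafka settings below are assumptions, not a ready-made config), driven from Python via clickhouse-driver:

```python
# Rough sketch of the Kafka -> materialized view pipeline in ClickHouse
# (illustrative table/column names and Kafka settings), using clickhouse-driver.
from clickhouse_driver import Client

client = Client("clickhouse-host")

# 1. A Kafka engine table that ClickHouse uses to consume the event stream.
client.execute("""
    CREATE TABLE IF NOT EXISTS events_queue (
        pool_id UInt32,
        ts DateTime,
        value Float64
    ) ENGINE = Kafka
    SETTINGS kafka_broker_list = 'kafka:9092',
             kafka_topic_list = 'events',
             kafka_group_name = 'clickhouse-events',
             kafka_format = 'JSONEachRow'
""")

# 2. The target MergeTree table, partitioned by month and ordered for
#    "one pool over a time range" selects.
client.execute("""
    CREATE TABLE IF NOT EXISTS events (
        pool_id UInt32,
        ts DateTime,
        value Float64
    ) ENGINE = MergeTree
    PARTITION BY toYYYYMM(ts)
    ORDER BY (pool_id, ts)
""")

# 3. A materialized view that continuously moves rows from the Kafka table into MergeTree.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS events_mv TO events AS
    SELECT pool_id, ts, value FROM events_queue
""")
```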
Plain PostgreSQL performance is not much different from MySQL, but with the TimescaleDB extension inserts become very fast; maybe that will be enough for you. You get full-fledged SQL and can edit data without problems. It compresses well too.
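A minimal sketch of the TimescaleDB route with psycopg2 (made-up schema; the create_hypertable call is the important part):

```python
# Minimal TimescaleDB sketch (hypothetical schema), using psycopg2:
# a hypertable chunked on the timestamp column, plus a batched insert.
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect(host="db-host", dbname="metrics", user="app", password="...")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        pool_id integer     NOT NULL,
        ts      timestamptz NOT NULL,
        value   double precision
    )
""")
# Turn the table into a TimescaleDB hypertable partitioned by time.
cur.execute("SELECT create_hypertable('events', 'ts', if_not_exists => TRUE)")

# execute_values expands the rows into one multi-row INSERT.
rows = [(42, "2022-01-16 13:25:57+00", 1.5), (42, "2022-01-16 13:25:58+00", 2.0)]
execute_values(cur, "INSERT INTO events (pool_id, ts, value) VALUES %s", rows)
conn.commit()
```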
ScyllaDB has already been mentioned.

lonely_guy, 2022-01-17
@lonely_guy

TiDB. MySQL-compatible, with sharding out of the box.
Works great in HTAP scenarios.

Ustas4, 2022-01-18
@Ustas4

Don't flame me, but Oracle will help you here. Use partitions instead of an index. For insertion there is bulk (array) insert.
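As a rough illustration of that suggestion (table name, columns and the daily interval partition are assumptions), using python-oracledb:

```python
# Rough sketch of the Oracle suggestion: an interval-partitioned table
# instead of relying on a plain index, plus array (bulk) inserts via executemany.
# Schema and partition interval here are illustrative assumptions.
import oracledb

conn = oracledb.connect(user="app", password="...", dsn="db-host/metrics")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE events (
        pool_id NUMBER(10)    NOT NULL,
        ts      TIMESTAMP     NOT NULL,
        value   BINARY_DOUBLE
    )
    PARTITION BY RANGE (ts) INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
    (PARTITION p_initial VALUES LESS THAN (TIMESTAMP '2022-01-01 00:00:00'))
""")

# executemany performs an array insert: one round trip for the whole batch.
rows = [(42, "2022-01-16 13:25:57", 1.5), (42, "2022-01-16 13:25:58", 2.0)]
cur.executemany(
    "INSERT INTO events (pool_id, ts, value) "
    "VALUES (:1, TO_TIMESTAMP(:2, 'YYYY-MM-DD HH24:MI:SS'), :3)",
    rows,
)
conn.commit()
```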
