Recommend a suitable database?
Good afternoon.
I would like to hear the opinion of database experts, or of anyone who has already worked with databases for Big Data.
Given
There is an analytical system that runs on several sites. For each page it saves statistics on every element (for example, how many times the element was visible, how many times it was clicked, and so on; the specifics are not important here). All data is aggregated in memory and then, once a day, dumped into a table in the database. Analysts can then generate a report for a given period for a specific site + page + segment, or for a given period for a specific site + page + individual element.
The table structure is something like this:
1. Site ID
2. Page ID
3. Element ID
4. Date (day)
5. User segment ID (all users are divided into a number of different segments)
6+. All other data fields
We are currently using Amazon Aurora (MySQL 5.7) with the InnoDB engine.
Fields 1-5 together form the primary key.
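For reference, a minimal sketch of what such a table might look like in MySQL; the column names and metric fields are assumptions, not the actual schema:

-- Hypothetical reconstruction of the stats table (all names are assumptions)
CREATE TABLE element_stats (
    site_id     INT UNSIGNED    NOT NULL,
    page_id     INT UNSIGNED    NOT NULL,
    element_id  INT UNSIGNED    NOT NULL,
    stat_date   DATE            NOT NULL,
    segment_id  INT UNSIGNED    NOT NULL,
    views       BIGINT UNSIGNED NOT NULL DEFAULT 0,  -- example "6+" metric columns
    clicks      BIGINT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (site_id, page_id, element_id, stat_date, segment_id)
) ENGINE=InnoDB;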
About 80M rows are written to the table every day (and that is not the ceiling; in the future the volume may grow considerably).
Writes are already batched: each INSERT statement carries 1,000 rows.
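So the loader issues multi-row statements roughly like this (two rows shown instead of 1,000; the values are made up):

-- One INSERT carries a whole batch of rows
INSERT INTO element_stats
    (site_id, page_id, element_id, stat_date, segment_id, views, clicks)
VALUES
    (1, 10, 100, '2021-05-01', 3, 524, 17),
    (1, 10, 101, '2021-05-01', 3, 498, 25);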
Problems of the current approach
1. When the table is empty, writing a day's worth of data takes several hours, but after a month, as the index grows, it slows to about 12 hours. In other words, with twice as much data the load will no longer fit into a 24-hour window: the previous day's write would not finish before the next day's has to start.
2. Report generation speed leaves much to be desired. For a page with many elements, generating a one-month report can take 2-3 minutes, which is too long.
3. Sometimes we need to delete all data for a specific site. Because of the sheer volume, such a delete blocks not only this table but the entire database, which contains many other tables under constant read/write load. We tried to solve this with partitioning, but it did not really help (see the sketch right after this list).
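For context on point 3: the layout that makes per-site deletion cheap is LIST partitioning on site_id, so that removing a site becomes a metadata-only DROP PARTITION instead of a massive DELETE. A sketch against the hypothetical table above (partition names and site IDs are made up):

-- LIST-partition by site so each site's data lives in its own partition
ALTER TABLE element_stats
PARTITION BY LIST (site_id) (
    PARTITION p_site1 VALUES IN (1),
    PARTITION p_site2 VALUES IN (2),
    PARTITION p_site3 VALUES IN (3)
);

-- Deleting a site's data is then near-instant and does not scan rows
ALTER TABLE element_stats DROP PARTITION p_site2;

This only works because site_id is part of the primary key; if the partitioning that was tried was by date, it would not help site-level deletes.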
What would you suggest?
Rows that have to be aggregated on the fly when a report is generated are just about the hardest workload for a database, all the more so at 80M records a day.
What you have is a task for OLAP.
So that is the direction to dig in; there are plenty of solutions.
Lately ClickHouse has been widely used for OLAP workloads: it has a rather steep learning curve, but it is free and the performance is worth it.
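To make the suggestion concrete, here is a sketch of the same table in ClickHouse, reusing the hypothetical column names from the schema sketched in the question:

-- ClickHouse variant of the stats table (a sketch; names are assumptions)
CREATE TABLE element_stats
(
    site_id    UInt32,
    page_id    UInt32,
    element_id UInt32,
    stat_date  Date,
    segment_id UInt32,
    views      UInt64,
    clicks     UInt64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(stat_date)
ORDER BY (site_id, page_id, stat_date, segment_id, element_id);

-- A monthly per-element report becomes a plain aggregation scan
SELECT element_id, sum(views) AS views, sum(clicks) AS clicks
FROM element_stats
WHERE site_id = 1
  AND page_id = 10
  AND stat_date BETWEEN '2021-05-01' AND '2021-05-31'
GROUP BY element_id;

The sort key keeps each (site, page) range physically together, so such queries read only the relevant slice of the data.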
Maybe you should look towards NoSQL (MongoDB, Firebase, etc.)?
https://habr.com/en/sandbox/113232/
You don't really need a database.
For some reason, in interviews programmers hit a dead end 100% of the time on the question of what MySQL is actually for, and when you ask why exactly it is used, transactions almost never come up. Yet that is precisely the reason to use it.
Your case needs no transactions at all: it is just plain inserts.
So an ordinary log file is enough for you; it is easy to parse and requires no database software at all.
Log rotation and the rest of it were invented long before you.
Also, both Yandex and Google have ready-made tools for exactly this (especially Google, with its reports and the like; I don't understand why you even need to keep this data yourself).
If fields 6+ are never filtered on, but only written and read back, you can try packing fields 3 and 6+ into a single blob (JSON, for example). The write load will drop; reports will involve more fiddling, but fewer database calls...
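A minimal sketch of that idea in MySQL 5.7, assuming one row per site/page/day/segment with all per-element metrics packed into a JSON column (table and column names are hypothetical):

-- One row per (site, page, day, segment); per-element metrics go into a JSON blob
CREATE TABLE page_stats (
    site_id    INT UNSIGNED NOT NULL,
    page_id    INT UNSIGNED NOT NULL,
    stat_date  DATE         NOT NULL,
    segment_id INT UNSIGNED NOT NULL,
    elements   JSON         NOT NULL,  -- e.g. {"100": {"views": 524, "clicks": 17}, ...}
    PRIMARY KEY (site_id, page_id, stat_date, segment_id)
) ENGINE=InnoDB;

INSERT INTO page_stats VALUES
    (1, 10, '2021-05-01', 3, '{"100": {"views": 524, "clicks": 17}}');

The daily row count drops by a factor of the number of elements per page; the reporting code then unpacks the JSON itself.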