Recommend a suitable database?
Good afternoon.
I would like to hear the opinion of database experts, or of anyone who has already worked with databases for Big Data.
Given
There is an analytical system that runs on several sites. For each page it saves statistics on every element (for example, how many times the element was visible, how many times it was clicked, and so on; the specifics are not important here). All data is aggregated in memory and then, once a day, dumped into a table in the database. Analysts can then generate a report for a given period for a specific site + page + segment, or for a given period for a specific site + page + individual element.
The table structure is something like this:
1. Site ID
2. Page ID
3. Element ID
4. Date (day)
5. User segment ID (all users are divided into a number of different segments)
6+. All other data fields
We are currently using Amazon Aurora (MySQL 5.7) with the InnoDB engine.
Fields 1-5 together form the primary key.
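For reference, a minimal sketch of what such a table might look like in MySQL; the column names and metric fields are assumptions, not the actual schema:

-- Hypothetical reconstruction of the stats table (all names are assumptions)
CREATE TABLE element_stats (
    site_id     INT UNSIGNED    NOT NULL,
    page_id     INT UNSIGNED    NOT NULL,
    element_id  INT UNSIGNED    NOT NULL,
    stat_date   DATE            NOT NULL,
    segment_id  INT UNSIGNED    NOT NULL,
    views       BIGINT UNSIGNED NOT NULL DEFAULT 0,  -- example "6+" metric columns
    clicks      BIGINT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (site_id, page_id, element_id, stat_date, segment_id)
) ENGINE=InnoDB;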
About 80M rows are written to the table every day (and that is not the ceiling; in the future the volume may grow considerably).
Writes are already batched: each INSERT statement carries 1,000 rows.
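So the loader issues multi-row statements roughly like this (two rows shown instead of 1,000; the values are made up):

-- One INSERT carries a whole batch of rows
INSERT INTO element_stats
    (site_id, page_id, element_id, stat_date, segment_id, views, clicks)
VALUES
    (1, 10, 100, '2021-05-01', 3, 524, 17),
    (1, 10, 101, '2021-05-01', 3, 498, 25);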
Problems of the current approach
1. When the table is empty, writing a day's worth of data takes several hours, but after a month, as the index grows, it slows to about 12 hours. In other words, with twice as much data the load will no longer fit into a 24-hour window: the previous day's write would not finish before the next day's has to start.
2. Report generation speed leaves much to be desired. For a page with many elements, generating a one-month report can take 2-3 minutes, which is too long.
3. Sometimes we need to delete all data for a specific site. Because of the sheer volume, such a delete blocks not only this table but the entire database, which contains many other tables under constant read/write load. We tried to solve this with partitioning, but it did not really help (see the sketch right after this list).
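For context on point 3: the layout that makes per-site deletion cheap is LIST partitioning on site_id, so that removing a site becomes a metadata-only DROP PARTITION instead of a massive DELETE. A sketch against the hypothetical table above (partition names and site IDs are made up):

-- LIST-partition by site so each site's data lives in its own partition
ALTER TABLE element_stats
PARTITION BY LIST (site_id) (
    PARTITION p_site1 VALUES IN (1),
    PARTITION p_site2 VALUES IN (2),
    PARTITION p_site3 VALUES IN (3)
);

-- Deleting a site's data is then near-instant and does not scan rows
ALTER TABLE element_stats DROP PARTITION p_site2;

This only works because site_id is part of the primary key; if the partitioning that was tried was by date, it would not help site-level deletes.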
What would you suggest?
Rows that have to be aggregated on the fly when a report is generated are just about the hardest workload for a database, all the more so at 80M records a day.
What you have is a task for OLAP.
So that is the direction to dig in; there are plenty of solutions.
Lately ClickHouse has been widely used for OLAP workloads: it has a rather steep learning curve, but it is free and the performance is worth it.
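To make the suggestion concrete, here is a sketch of the same table in ClickHouse, reusing the hypothetical column names from the schema sketched in the question:

-- ClickHouse variant of the stats table (a sketch; names are assumptions)
CREATE TABLE element_stats
(
    site_id    UInt32,
    page_id    UInt32,
    element_id UInt32,
    stat_date  Date,
    segment_id UInt32,
    views      UInt64,
    clicks     UInt64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(stat_date)
ORDER BY (site_id, page_id, stat_date, segment_id, element_id);

-- A monthly per-element report becomes a plain aggregation scan
SELECT element_id, sum(views) AS views, sum(clicks) AS clicks
FROM element_stats
WHERE site_id = 1
  AND page_id = 10
  AND stat_date BETWEEN '2021-05-01' AND '2021-05-31'
GROUP BY element_id;

The sort key keeps each (site, page) range physically together, so such queries read only the relevant slice of the data.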
Maybe you should look towards NoSQL (MongoDB, Firebase, etc.)?
https://habr.com/en/sandbox/113232/
You don't really need a database.
For some reason, in interviews programmers hit a dead end 100% of the time on the question of what MySQL is actually for, and when you ask why exactly it is used, transactions almost never come up. Yet that is precisely the reason to use it.
Your case needs no transactions at all: it is just plain inserts.
So an ordinary log file is enough for you; it is easy to parse and requires no database software at all.
Log rotation and the rest of it were invented long before you.
Also, both Yandex and Google have ready-made tools for exactly this (especially Google, with its reports and the like; I don't understand why you even need to keep this data yourself).
If fields 6+ are never filtered on, but only written and read back, you can try packing fields 3 and 6+ into a single blob (JSON, for example). The write load will drop; reports will involve more fiddling, but fewer database calls...
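A minimal sketch of that idea in MySQL 5.7, assuming one row per site/page/day/segment with all per-element metrics packed into a JSON column (table and column names are hypothetical):

-- One row per (site, page, day, segment); per-element metrics go into a JSON blob
CREATE TABLE page_stats (
    site_id    INT UNSIGNED NOT NULL,
    page_id    INT UNSIGNED NOT NULL,
    stat_date  DATE         NOT NULL,
    segment_id INT UNSIGNED NOT NULL,
    elements   JSON         NOT NULL,  -- e.g. {"100": {"views": 524, "clicks": 17}, ...}
    PRIMARY KEY (site_id, page_id, stat_date, segment_id)
) ENGINE=InnoDB;

INSERT INTO page_stats VALUES
    (1, 10, '2021-05-01', 3, '{"100": {"views": 524, "clicks": 17}}');

The daily row count drops by a factor of the number of elements per page; the reporting code then unpacks the JSON itself.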