Python
lloyd67, 2018-08-01 09:29:04

How to aggregate a large number of records from the database?

In general, the essence is this: there is raw data from users (date, which section was opened, what action was performed, time spent in sections, etc.) that is continuously written to a table (currently MySQL). There are roughly 200 million records, and the count is growing steadily; by the end of the year it will be 400-500 million.
Based on this data, various reports need to be built for weeks, months, quarters, years, etc. (Sum, Avg, Count, Count Distinct, and so on).
Clearly the queries cannot run against the raw table directly, since they are too heavy, so the data is first aggregated into other tables. The problem is that if aggregating certain data over a month takes several minutes, then over a year it takes hours or even days (these are MySQL queries; there are indexes, but they do not save the day here).
What solutions are there? In general, what is typically used to collect and analyze this kind of statistical data? Is MySQL even appropriate for storing a constant stream of data like this (I suspect not)?
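For illustration, the monthly rollup I mean looks roughly like the sketch below; the events table and its columns here are simplified placeholders, not the real schema.

# Rough sketch of a monthly rollup into a pre-aggregated table
# (table and column names are assumptions).
import mysql.connector  # pip install mysql-connector-python

ROLLUP_SQL = """
    INSERT INTO events_monthly (year, month, section, views, users, total_duration)
    SELECT YEAR(created_at)        AS y,
           MONTH(created_at)       AS m,
           section,
           COUNT(*)                AS views,
           COUNT(DISTINCT user_id) AS users,
           SUM(duration)           AS total_duration
    FROM events
    WHERE created_at >= %s AND created_at < %s
    GROUP BY y, m, section
"""

conn = mysql.connector.connect(host="localhost", user="stats",
                               password="secret", database="app")
cur = conn.cursor()
cur.execute(ROLLUP_SQL, ("2018-07-01", "2018-08-01"))  # one month at a time
conn.commit()
conn.close()

This is fine month by month, but the same GROUP BY over a year of raw rows is where it becomes hours or days.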



6 answer(s)
Alexey Cheremisin, 2018-08-01
@lloyd67

I recommend Elasticsearch. Throw your data into it (in your case Logstash would also work). Split the indexes by month or by week and organize them by year/month/day through aliases. You can build reports either in Kibana, or pull aggregated data from your own applications (see the sketch below). It is convenient to rotate indexes, and to archive and delete old ones.
And there will be happiness.
Well, yes, if you only use this data for analytics, then MySQL is not needed here at all!
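A rough sketch of pulling such a report from an application with the official Python client; the index pattern, field names and exact parameters are assumptions and depend on your Elasticsearch version.

# Sketch: monthly report aggregation over monthly indices matched by "events-*".
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

body = {
    "size": 0,  # aggregations only, no individual hits
    "query": {"range": {"@timestamp": {"gte": "now-1y/M"}}},
    "aggs": {
        "per_month": {
            # on older Elasticsearch versions this parameter is "interval"
            "date_histogram": {"field": "@timestamp", "calendar_interval": "month"},
            "aggs": {
                "unique_users": {"cardinality": {"field": "user_id"}},
                "avg_duration": {"avg": {"field": "duration"}},
            },
        }
    },
}

# newer 8.x clients prefer query=/aggs= keyword arguments instead of body=
resp = es.search(index="events-*", body=body)
for b in resp["aggregations"]["per_month"]["buckets"]:
    print(b["key_as_string"], b["doc_count"],
          b["unique_users"]["value"], b["avg_duration"]["value"])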

asd111, 2018-08-01
@asd111

Take Yandex ClickHouse. It is built precisely for reports on large volumes, and queries run directly against the raw data. With it you can scan a billion records in 5-20 seconds (Core i5, SSD, 16 GB RAM). That is acceptable time for reporting.
https://clickhouse.yandex/
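For illustration, such a report run directly from Python via the clickhouse-driver package might look like this sketch (the events table and its columns are assumptions):

# Sketch: direct aggregation in ClickHouse, no pre-aggregated tables needed.
from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client("localhost")

rows = client.execute("""
    SELECT toStartOfMonth(created_at) AS month,
           section,
           count()            AS views,
           uniqExact(user_id) AS users,
           avg(duration)      AS avg_duration
    FROM events
    WHERE created_at >= '2018-01-01'
    GROUP BY month, section
    ORDER BY month, section
""")

for month, section, views, users, avg_duration in rows:
    print(month, section, views, users, avg_duration)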

sim3x, 2018-08-01
@sim3x

Why is the raw data stored in MySQL at all?
Normally it is dumped into plain text files and compressed with an archiver, and only the extracted info is put into MySQL in 3NF.
Then the queries will take seconds (on adequate hardware and settings).
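A rough sketch of that split, assuming JSON events, per-day gzip files and a hypothetical normalized events_norm table (in practice you would batch the inserts):

# Sketch: raw events go to compressed files, only normalized rows go to MySQL.
import gzip
import json
import os
from datetime import datetime

import mysql.connector  # pip install mysql-connector-python

def store_event(event: dict, conn) -> None:
    # 1. Raw payload: append to a per-day gzip file (cheap to keep, easy to archive/rotate).
    os.makedirs("raw", exist_ok=True)
    day = datetime.utcnow().strftime("%Y-%m-%d")
    with gzip.open(f"raw/events-{day}.jsonl.gz", "at") as f:
        f.write(json.dumps(event) + "\n")

    # 2. Normalized part: only the columns the reports actually need.
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO events_norm (user_id, section_id, action_id, duration, created_at) "
        "VALUES (%s, %s, %s, %s, %s)",
        (event["user_id"], event["section_id"], event["action_id"],
         event["duration"], event["created_at"]),
    )
    conn.commit()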

lega, 2018-08-01
@lega


I have already described one of the solutions on Toster a couple of times (designed with growth of up to 2000 million records per day in mind). In short:
* across the sections you can use NoSQL (MongoDB with sharding, although one server is probably enough for you)
* pack the data into chunks (saves up to 95% of the space)
* once a period is over, run tasks that fill a "cache" - they build the reports across all sections plus intermediate results, so that the user gets the result instantly when clicking around the interface (a rough sketch of this step is below)
I made the solution in Python; where the calculation took too long I added C++ inserts, which made it roughly 70x faster, and Python processed over 10 million records per second in a single thread, including fetching from the database.
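A rough sketch of the "fill the cache after the period closes" step with pymongo; the collection and field names are assumptions:

# Sketch: after a period closes, pre-compute its report into a "cache" collection
# that the UI reads directly.
from datetime import datetime
from pymongo import MongoClient  # pip install pymongo

events = MongoClient("mongodb://localhost:27017")["stats"]["events"]

events.aggregate([
    {"$match": {"created_at": {"$gte": datetime(2018, 7, 1),
                               "$lt": datetime(2018, 8, 1)}}},
    {"$group": {
        "_id": {"day": {"$dateToString": {"format": "%Y-%m-%d", "date": "$created_at"}},
                "section": "$section"},
        "views": {"$sum": 1},
        "users": {"$addToSet": "$user_id"},  # fine for a sketch; huge cardinalities need another approach
        "total_duration": {"$sum": "$duration"},
    }},
    {"$project": {"views": 1, "total_duration": 1, "users": {"$size": "$users"}}},
    {"$out": "report_cache_2018_07"},  # the UI then reads this small collection instantly
], allowDiskUse=True)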

Supme, 2018-08-09
@Supme

So these are metrics, then: InfluxDB, Prometheus, ClickHouse, etc.

santaatnas, 2018-08-01
@santaatnas

Look towards Hadoop and/or ClickHouse and you will be happy.
