MySQL
Andrey Nikiforov, 2016-02-28 18:13:29

What is the best way to store data for quick access?

Hello!
Initial data: a MySQL table with 31 million rows, weighing 40 GB, with all the necessary indexes, of course. The table grows by 100k records per day.
There are no problems with writes: records are inserted evenly over 12 hours, and the server copes with that.
The problem is reading from the table. It stores site check data, and the main queries aggregate the data for a specific site. Because of the table's size, even indexed queries are not fast.
One idea was to keep only the current state of the checks in MySQL and offload the historical data to another storage, but I still need to be able to run analytics on the archived data.
Please recommend a storage that is well suited for this volume of data and for analytics on it, or advise how to speed up the current setup.


6 answers
werw, 2016-02-28
@werw

If you really need fast access to the accumulated data, then you also need to accumulate it constantly and incrementally over the course of the day, storing the results in a dedicated table.
This can be done either directly when the primary table is updated or after the fact, depending on the nature of the data, the algorithm, and the requirements for availability and freshness of the results.
If the requirements are not strict, i.e. the data is needed only occasionally and not very quickly, so a separate table is not worth building, then take a closer look at your indexes and queries. Maybe the wrong indexes are being used? 40 GB is not a big deal for modern hardware. What does the server say in the query plan (EXPLAIN)?
With strict speed requirements, you can aggregate directly in RAM, for example with Tarantool; it will be quite fast. The aggregated data set is surely several times smaller than the main one, so with a 40 GB database, setting aside 4 GB of RAM for aggregated data is not a problem on current servers.
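A rough sketch of such a dedicated summary table and its incremental roll-up, assuming a hypothetical checks(site_id, checked_at, status, response_time) schema (the real column names are not given in the question):

    -- Summary table, filled incrementally during the day.
    CREATE TABLE checks_daily_summary (
        site_id     INT UNSIGNED  NOT NULL,
        day         DATE          NOT NULL,
        checks_cnt  INT UNSIGNED  NOT NULL DEFAULT 0,
        errors_cnt  INT UNSIGNED  NOT NULL DEFAULT 0,
        avg_time_ms DECIMAL(10,2) NULL,
        PRIMARY KEY (site_id, day)
    ) ENGINE=InnoDB;

    -- Periodic roll-up touching only today's rows.
    INSERT INTO checks_daily_summary (site_id, day, checks_cnt, errors_cnt, avg_time_ms)
    SELECT site_id, DATE(checked_at), COUNT(*), SUM(status <> 'ok'), AVG(response_time)
    FROM checks
    WHERE checked_at >= CURDATE()
    GROUP BY site_id, DATE(checked_at)
    ON DUPLICATE KEY UPDATE
        checks_cnt  = VALUES(checks_cnt),
        errors_cnt  = VALUES(errors_cnt),
        avg_time_ms = VALUES(avg_time_ms);

    -- And to see the query plan mentioned above:
    EXPLAIN SELECT COUNT(*) FROM checks
    WHERE site_id = 123 AND checked_at >= '2016-02-01';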

Walt Disney, 2016-02-29
@ruFelix

The straightforward approach:
write the data into daily or weekly tables; once a table is full, transfer its data to the common table and immediately roll it up into tables that hold already aggregated data.
The results will then consist of two queries: a simple select over the already aggregated data, plus the aggregating query itself over the current daily table.
In other words, if you get rid of online queries over the entire dataset by any means, this scheme will serve you for a very long time (until hardware resources are completely exhausted), regardless of which database you use.
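A sketch of what those two queries might look like, assuming a hypothetical checks_agg table for the aggregated history and a daily table named checks_20160228 (both names are illustrative):

    -- Cheap read of already aggregated history...
    SELECT day, checks_cnt, errors_cnt
    FROM checks_agg
    WHERE site_id = 123 AND day < CURDATE()

    UNION ALL

    -- ...plus on-the-fly aggregation of today's small table only.
    SELECT CURDATE(), COUNT(*), SUM(status <> 'ok')
    FROM checks_20160228
    WHERE site_id = 123;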

Igor, 2016-02-28
@unitby

Try partitioning, or restructure the logic in another way (for example, write to a new table every day and use MERGE tables).
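For the partitioning option, a minimal sketch of range partitioning by month (the table layout is an assumption; note that classic MERGE tables require MyISAM):

    -- Range-partition the history by month so that date-bounded queries
    -- read only the relevant partitions (partition pruning).
    CREATE TABLE checks (
        site_id       INT UNSIGNED NOT NULL,
        checked_at    DATETIME     NOT NULL,
        status        VARCHAR(16)  NOT NULL,
        response_time INT UNSIGNED NULL,
        KEY idx_site_time (site_id, checked_at)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(checked_at)) (
        PARTITION p2016_01 VALUES LESS THAN (TO_DAYS('2016-02-01')),
        PARTITION p2016_02 VALUES LESS THAN (TO_DAYS('2016-03-01')),
        PARTITION p_max    VALUES LESS THAN MAXVALUE
    );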

Dimonchik, 2016-02-28
@dimonchik2013

Move part of the logic to insert time: update the flags/fields that have to be updated anyway in a trigger, and you can maintain running totals there as well.
Although it is generally strange that a mere 31 million records is already a problem.
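For example, a trigger that keeps a per-site running summary up to date at insert time (the table and column names are assumptions; a summary table like checks_daily_summary(site_id, day, checks_cnt, errors_cnt) is presumed to exist):

    DELIMITER //
    CREATE TRIGGER checks_after_insert
    AFTER INSERT ON checks
    FOR EACH ROW
    BEGIN
        -- Maintain a per-site, per-day running total at insert time,
        -- so analytic reads never have to scan the big table.
        INSERT INTO checks_daily_summary (site_id, day, checks_cnt, errors_cnt)
        VALUES (NEW.site_id, DATE(NEW.checked_at), 1, NEW.status <> 'ok')
        ON DUPLICATE KEY UPDATE
            checks_cnt = checks_cnt + 1,
            errors_cnt = errors_cnt + (NEW.status <> 'ok');
    END//
    DELIMITER ;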

Zakharov Alexander, 2016-02-29
@AlexZaharow

At one point I worked with a test set of millions of rows in a single table and tried Elasticsearch. In aggregation speed it was roughly on par with the commercial edition of MSSQL (I had not realized that the free and commercial editions of MSSQL differ so much in performance, and Elasticsearch ran selections no slower than commercial MSSQL). But getting into Elasticsearch aggregations is not easy.

Draconian, 2016-02-29
@Draconian

I suspect you either need to add indexes or check the existing ones, because selects on indexed fields should run very quickly at this size.
Otherwise, Walt Disney gave the right advice: split this table into two, an archive table (ideally partitioned) and an operational table that holds data for a certain period, moving the data into the archive once the period expires.
You can additionally keep a "super-operational" table with the already aggregated data you need, updated by triggers after each insert into the operational table. That way, as soon as the operational table is updated, you already have all the analytics for the current period (day/week, etc.).
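The periodic transfer from the operational table into the archive could look roughly like this (the table names and the 30-day window are assumptions; in practice the copy and delete are usually done in smaller batches):

    -- Nightly job: move rows older than the operational window into the archive.
    START TRANSACTION;

    INSERT INTO checks_archive
    SELECT * FROM checks_operational
    WHERE checked_at < CURDATE() - INTERVAL 30 DAY;

    DELETE FROM checks_operational
    WHERE checked_at < CURDATE() - INTERVAL 30 DAY;

    COMMIT;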
As for analytics on the archive: in my experience, the customer always wants the operational reports to run fast, while the aggregated monthly/yearly reports from the archive just need to get built at all. :-)
