Database
dom1n1k, 2016-03-09 16:13:45

How to store a lot of statistical data in the database?

The situation is hypothetical.
Let's take a service like Yandex Metrica or Google Analytics. Information flows into them from millions of sites, amounting to trillions of events. How do they store this information in the database?
It is clear that keeping a separate record for every hit or click is unrealistic. That would take an enormous amount of space, and building reports would take forever. The data must somehow be consolidated and stored in a processed form, probably with several levels of detail: by week, month, and so on. All of that seems fairly self-explanatory (at a high level, at least, if you don't go into the details).
But here is what haunts me: when building a report, the user can choose any time interval. For example, show the distribution of browser versions within the Windows operating system from November 11 to December 26. How do they do it? Does that mean they still store the raw data too? Or "almost raw" data, with minimal processing?
Is there somewhere I can read up on the theory behind this?
It is precisely this applied question that interests me: how to keep the ability to view statistics over an arbitrary time interval while spending as few hardware resources as possible.
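To make the question concrete, here is a minimal sketch (in Python, with invented numbers) of the kind of rollup I have in mind: one row per (day, OS, browser) combination instead of one row per hit, which still allows a report over an arbitrary date range.

```python
from collections import defaultdict
from datetime import date

# Aggregate table: one counter per (day, os, browser) instead of one row per hit.
rollup = defaultdict(int)
rollup[(date(2015, 11, 11), "Windows", "Chrome 46")] += 1200
rollup[(date(2015, 11, 12), "Windows", "Chrome 46")] += 1150
rollup[(date(2015, 12, 26), "Windows", "Firefox 42")] += 800

def report(start, end, os_filter):
    """Browser breakdown for an arbitrary date range, summed from the daily rows."""
    result = defaultdict(int)
    for (day, os_name, browser), hits in rollup.items():
        if start <= day <= end and os_name == os_filter:
            result[browser] += hits
    return dict(result)

print(report(date(2015, 11, 11), date(2015, 12, 26), "Windows"))
# {'Chrome 46': 2350, 'Firefox 42': 800}
```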

4 answers
s0ci0pat, 2016-03-09
@s0ci0pat

Real-time data is written to the database, while reports are built from data warehouses, where the data has already been partially aggregated into cubes from which various slices can be built.
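A minimal sketch of that idea, using pandas as a stand-in for the warehouse (the event data below is invented): the raw stream is aggregated once into a "cube" over the reporting dimensions, and a report is just a slice of that cube.

```python
import pandas as pd

# Hypothetical raw event stream (in reality this lives in the real-time store).
events = pd.DataFrame({
    "day":     pd.to_datetime(["2015-11-11", "2015-11-11", "2015-12-26"]),
    "os":      ["Windows", "Windows", "Linux"],
    "browser": ["Chrome 46", "Firefox 42", "Chrome 46"],
    "hits":    [100, 40, 25],
})

# "Cube": pre-aggregate once over all reporting dimensions.
cube = events.groupby(["day", "os", "browser"], as_index=False)["hits"].sum()

# "Slice": filter by date range and OS, then break down by browser.
mask = (cube["day"] >= "2015-11-11") & (cube["day"] <= "2015-12-26") & (cube["os"] == "Windows")
print(cube[mask].groupby("browser")["hits"].sum())
```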

Dimonchik, 2016-03-09
@dimonchik2013

www.ozon.ru/context/detail/id/19383907
If you have worked with big data (from 10k events per day, though at least 30-50k is better), you should have seen obvious discrepancies like 2+3 = 4 and 3+2 = 6. In other words, it counts, but not down to the last molecule.
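Discrepancies like those are typical of approximate counting structures such as HyperLogLog, which large analytics systems commonly use for distinct counts (whether Metrica or GA use exactly this one is an assumption). A minimal sketch of the idea:

```python
import hashlib

class HyperLogLog:
    """Toy HyperLogLog: approximate distinct counting, which is why big-data
    counters come out close to, but rarely exactly equal to, the true value."""
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - self.p)        # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        return int(alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers))

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"visitor-{i}")
print(hll.count())   # close to 100000, almost never exactly 100000
```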

doktr, 2016-03-10
@doktr

Homogeneous data is spread across multiple servers in a cluster. Relatively speaking, one server holds the data for February 2011 and another the data for March 2014, so it makes little difference how old the statistics you need are: everything is pulled out on request in roughly the same amount of time. The processing is organized using MapReduce or a similar technology. If such a huge amount of information were stored in relational databases, queries would be terribly slow and would take hours instead of fractions of a second.
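A minimal sketch of that scheme, with two in-memory lists standing in for the servers (shard names and data invented): each shard aggregates its own portion locally (the map step), and the partial results are then merged (the reduce step).

```python
from collections import Counter
from functools import reduce

# Hypothetical shards: each "server" holds the events for a different period.
shard_2011_02 = [("Chrome", 1), ("Firefox", 1), ("Chrome", 1)]
shard_2014_03 = [("Chrome", 1), ("Safari", 1)]

def map_shard(events):
    """Map step: aggregate locally on the server that holds the data."""
    counts = Counter()
    for browser, hits in events:
        counts[browser] += hits
    return counts

def reduce_counts(a, b):
    """Reduce step: merge the per-shard partial results."""
    return a + b   # Counter addition sums matching keys

partials = [map_shard(s) for s in (shard_2011_02, shard_2014_03)]
print(reduce(reduce_counts, partials))
# Counter({'Chrome': 3, 'Firefox': 1, 'Safari': 1})
```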

Here_and_Now, 2016-03-17
@Here_and_Now

In the premium version of Google Analytics, you can upload hit-level data into their cloud. I don't know whether they enable this feature only for premium users, but it seems to me that Google stores hit-level information for all of GA.
Yes, GA itself uses one pre-aggregation technique or another for reporting speed (the answers above are most likely correct), but Google also uses this information in its core business.
Besides, the volume there is small compared with their other services (YouTube).
Also, MapReduce outside the walls of Google? As far as I know from articles on Habr and TechCrunch, companies like Google and Facebook use their own storage systems, which are a generation ahead of their open-source counterparts. While those systems are in use they remain closed, and you hear about them only at conferences and in the scientific papers that come out of the company. Then, when it's time for a new system, Google open-sources the old one :)
