How to store and quickly process a large amount (> 10-100M) of advertising system statistics?
I am developing an advertising network. I need to collect and store impression statistics and be able to build quick reports on them for analysis. For each impression, the following needs to be saved:
Once I did a similar project. The whole point boiled down to the following:
There was a main table into which all the raw traffic logs were written.
Every 10-20 minutes a cron job ran, built all the necessary reports from this table, and wrote them into separate per-day / per-hour tables.
At 00:00 server time the statistics were frozen: everything accumulated in the main table for the day was dumped into an archive table, and the main table was emptied (TRUNCATE).
In the end everything worked as it should, and specific details could always be looked up in the archive table.
That is, in effect we have the following tables: the main log table, the *_hourly_report / *_daily_report tables, and the archive,
where *_ is the report type.
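As an illustration only, here is a rough sketch of such a cron job in Python against a MySQL-style database. The table and column names (impressions_log, impressions_hourly_report, impressions_archive, placement_id, created_at) are hypothetical, and a UNIQUE key on (hour, placement_id) is assumed for the upsert:

```python
# Rough sketch of the cron-driven aggregation described above.
# All table/column names are hypothetical; conn is any DB-API
# connection, e.g. pymysql.connect(...).

AGGREGATE_SQL = """
    INSERT INTO impressions_hourly_report (hour, placement_id, impressions)
    SELECT DATE_FORMAT(created_at, '%Y-%m-%d %H:00:00') AS hour,
           placement_id,
           COUNT(*) AS impressions
    FROM impressions_log
    GROUP BY hour, placement_id
    ON DUPLICATE KEY UPDATE impressions = VALUES(impressions)
"""

def aggregate(conn):
    # Run from cron every 10-20 minutes: re-aggregate the main table
    # (it only holds the current day) into the hourly report table.
    with conn.cursor() as cur:
        cur.execute(AGGREGATE_SQL)
    conn.commit()

def rotate(conn):
    # Run at 00:00: dump the day's raw rows into the archive, then empty the main table.
    with conn.cursor() as cur:
        cur.execute("INSERT INTO impressions_archive SELECT * FROM impressions_log")
        cur.execute("TRUNCATE TABLE impressions_log")
    conn.commit()
```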
PS
Of course, you can write directly to *_daily_report if you really need real-time stats, but then each incoming request costs N extra table updates, which will also hurt performance. You decide.
Another option is to keep the current aggregates for the last 10-20 minutes in memory and flush them to the database periodically. That way you can get real-time aggregated data.
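A minimal sketch of that in-memory variant, assuming a single Python process; the class name and the flush_callback hook are made up for the example:

```python
# Keep aggregates in memory and flush them to the database every few minutes.
import threading
from collections import Counter

class RealtimeAggregator:
    def __init__(self, flush_callback, interval_sec=600):
        self._counters = Counter()              # (hour, placement_id) -> impressions
        self._lock = threading.Lock()
        self._flush_callback = flush_callback   # e.g. a bulk upsert into the DB
        self._interval = interval_sec

    def record_impression(self, hour, placement_id):
        with self._lock:
            self._counters[(hour, placement_id)] += 1

    def flush(self):
        # Swap the counters out under the lock, then persist the snapshot.
        with self._lock:
            snapshot, self._counters = self._counters, Counter()
        if snapshot:
            self._flush_callback(snapshot)

    def start(self):
        # Re-arm a timer so flush() runs roughly every interval.
        def tick():
            self.flush()
            threading.Timer(self._interval, tick).start()
        threading.Timer(self._interval, tick).start()
```

Real-time reads can then be served from the in-memory counters plus whatever has already been flushed.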
The main problem is that users of such systems want to receive:
1. Real-time data (with minimal delay)
2. Aggregated data in various (predefined) breakdowns
3. The ability to build additional breakdowns from historical data (which means processing enormous volumes for past periods)
I did something like this:
Every hour I formed chunks keyed by several dimensions; in your case you could take, for example, "date-time" and "ad placement number" (depending on the queries). As a result one "column" effectively disappears: instead of the full date you can store the offset from the start of the hour.
The chunks were compressed as much as possible and distributed across shards.
As a result, the stored volume was 5-10 times smaller than the original.
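Purely as an illustration of the chunk idea, a sketch assuming fixed-width records packed with struct and compressed with zlib; the field layout (offset_sec, campaign_id, event_type) is invented for the example:

```python
# One chunk = all events for a given (hour, placement) pair, packed as
# fixed-width records; the timestamp is stored as seconds from the start
# of the hour, so no full date column is needed inside the chunk.
import struct
import zlib

RECORD = struct.Struct("<HIB")  # offset_sec (0-3599), campaign_id, event_type

def pack_chunk(events):
    # events: iterable of (offset_sec, campaign_id, event_type) tuples
    raw = b"".join(RECORD.pack(*e) for e in events)
    return zlib.compress(raw, 9)  # compress as hard as possible

def unpack_chunk(blob):
    raw = zlib.decompress(blob)
    return [RECORD.unpack_from(raw, i) for i in range(0, len(raw), RECORD.size)]
```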
Report generation: a service in Python fetched the chunks for the requested period from the database, unpacked them, and passed them to a C extension, which did the computation and returned the result.
Speed-wise, about 250 million "events" were processed and computed in 2-4 seconds on a single core of a DO virtual machine.
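The reporting path could look roughly like this, with plain Python standing in for the C extension mentioned above (fetch_chunks is a hypothetical function that yields the compressed chunks for the period; unpack_chunk is from the sketch above):

```python
from collections import Counter

def build_report(fetch_chunks, period_start, period_end):
    # fetch_chunks(period_start, period_end) yields (placement_id, compressed_blob) rows.
    totals = Counter()
    for placement_id, blob in fetch_chunks(period_start, period_end):
        for offset_sec, campaign_id, event_type in unpack_chunk(blob):
            totals[(placement_id, campaign_id, event_type)] += 1
    return totals
```

In the real system the inner loop was the C extension, which is what made 250 million events in a few seconds feasible.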
Every hour (i.e. every time period) results / caches were generated for each report in every breakdown, which made it possible to serve results to clients instantly. Compared to the original data they took up very little space.
That is, fresh data was produced every hour, but besides that there were also real-time reports: a copy of the input stream went to that server, which kept running counters ("counted the numbers") and reset them every hour.
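The hourly caches can be as simple as one key-value entry per (report, hour, breakdown); a sketch assuming some store with a set() method, such as Redis (the key scheme is invented):

```python
import json

def cache_hourly_results(store, report_type, hour_iso, results_by_section):
    # store: any key-value store exposing set(key, value), e.g. redis.Redis()
    for section, result in results_by_section.items():
        store.set(f"report:{report_type}:{hour_iso}:{section}", json.dumps(result))
```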