big data
zozo30, 2015-03-13 01:23:39

How to store and quickly process a large amount (> 10-100M) of advertising system statistics?

I am developing an advertising network; I need to collect and store impression statistics and be able to build quick reports on them for analysis. For each impression the following needs to be saved:

  • unique user ID (string of 20 characters)
  • ad block number (what was shown)
  • ad space number (where it was shown)
  • country (where the visitor was from)
  • device type (desktop/mobile/tablet)
  • date and time
  • whether there was a click (this flag is updated to true if there was a click on the ad)

I need the ability to get reports with filtering and aggregation by any of these fields (except the user ID and the click flag). A report must show the number of impressions (hits), the number of unique users (uniques) and the number of clicks.
Right now I put all of this into a regular table (PostgreSQL) and build the report with an SQL query, counting uniques as count(DISTINCT user_id). It works, but there are already more than a million impressions a day, and a report over a week already takes about half a minute to build, i.e. very slow, and this is only a test run; the production data volume will be about a hundred times larger.
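For illustration, a minimal sketch of that kind of report query run from Python with psycopg2; the table and column names (impressions, user_id, shown_at, clicked, country) are assumptions, not the actual schema:

# Sketch of a per-country report: hits, uniques and clicks for a date range.
# Table and column names are illustrative assumptions.
import psycopg2

REPORT_SQL = """
    SELECT country,
           count(*)                        AS hits,
           count(DISTINCT user_id)         AS uniques,
           count(*) FILTER (WHERE clicked) AS clicks   -- FILTER needs PostgreSQL 9.4+
    FROM impressions
    WHERE shown_at >= %(since)s AND shown_at < %(until)s
    GROUP BY country
    ORDER BY hits DESC
"""

def country_report(conn, since, until):
    with conn.cursor() as cur:
        cur.execute(REPORT_SQL, {"since": since, "until": until})
        return cur.fetchall()

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=ads")  # connection string is illustrative
    for row in country_report(conn, "2015-03-06", "2015-03-13"):
        print(row)

count(DISTINCT user_id) over the raw table is exactly the part that gets slower as the table grows.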
I have been thinking about somehow "flattening" (pre-aggregating) these statistics, but then it is not clear how to count the uniques.
Thank you very much for your help.


3 answers
D', 2015-03-13
@Denormalization

Once I did a similar project. The whole point boiled down to the following:
There was a main table where all the traffic logs were thrown.
Every 10-20 minutes a cron job ran, built all the necessary reports from this table, and wrote them into separate tables by day/hour.
At 00:00 server time the statistics were frozen: everything in the main table for the day was dumped into an archive table, and the main table was emptied (truncate).
In the end, everything worked as it should, and it was always possible to clarify specific details from the archive table.
That is, in the end we have the following tables: the main log table, its archive, and the *_daily_report / *_hourly_report tables, where *_ is the report type.
PS
Of course, you can write directly to *_daily_report if you really need realtime stats, but then for each request there will be +N table updates, which will also affect performance. You decide.
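A minimal sketch of what such a cron rollup might look like in Python with psycopg2, assuming hypothetical table names (traffic_log for the main table, traffic_log_archive for the archive, placement_daily_report for one of the *_daily_report tables):

# Hedged sketch of the periodic rollup described above.
# ROLLUP runs from cron every 10-20 minutes, ARCHIVE once at 00:00.
# All table and column names are illustrative assumptions.
import psycopg2

ROLLUP = """
    DELETE FROM placement_daily_report WHERE day = current_date;
    INSERT INTO placement_daily_report (day, placement_id, hits, uniques, clicks)
    SELECT current_date,
           placement_id,
           count(*),
           count(DISTINCT user_id),
           count(*) FILTER (WHERE clicked)
    FROM traffic_log
    WHERE shown_at >= current_date
    GROUP BY placement_id;
"""

ARCHIVE = """
    INSERT INTO traffic_log_archive SELECT * FROM traffic_log;
    TRUNCATE traffic_log;
"""

def run(sql):
    # one transaction per job: either the whole step applies or none of it
    with psycopg2.connect("dbname=ads") as conn, conn.cursor() as cur:
        cur.execute(sql)

if __name__ == "__main__":
    run(ROLLUP)  # the ARCHIVE step would be a separate 00:00 cron entry

The DELETE-then-INSERT recompute is just one way to refresh the report tables; the point is that the heavy grouping runs every 10-20 minutes instead of on every report request.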

Sergey, 2015-03-13
@begemot_sun

Another option is to keep the current aggregated statistics in memory for 10-20 minutes, flushing them to the database as needed. That way you can get real-time aggregated data.
The main problem is that users of such systems want:
1. Real-time data (with minimal delay)
2. Aggregated data in various predefined slices
3. The ability to build additional slices over historical data (which means reprocessing enormous volumes for previous periods)
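Going back to the first suggestion, a minimal sketch of holding 10-20 minutes of aggregates in memory and flushing them periodically; the event field names, the grouping key and the flush() contract are all assumptions:

# Hedged sketch of an in-memory aggregation window that is flushed
# to the database every 10-20 minutes by some external scheduler.
from collections import defaultdict

class LiveAggregator:
    def __init__(self):
        self.hits = defaultdict(int)
        self.clicks = defaultdict(int)
        self.users = defaultdict(set)   # per-key user ids, for uniques in this window

    def add(self, event):
        key = (event["placement_id"], event["country"], event["device_type"])
        self.hits[key] += 1
        self.users[key].add(event["user_id"])
        if event.get("clicked"):
            self.clicks[key] += 1

    def flush(self):
        """Return this window's rows (key, hits, uniques, clicks) and reset."""
        rows = [(k, self.hits[k], len(self.users[k]), self.clicks[k])
                for k in self.hits]
        self.__init__()
        return rows

The catch, as the question itself points out, is that uniques from different windows cannot simply be summed, so the per-window user sets (or the raw user IDs) still have to reach the database if cross-window unique counts are needed.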

lega, 2015-03-16
@lega

I did something like this:
Every hour I formed chunks keyed by several dimensions; in your case you could take, for example, "date/time" and "ad space number" (depending on the queries). As a result one "column" "disappears": instead of the full date you can store the offset from the beginning of the hour.
The chunks were compressed as much as possible and spread across shards.
As a result, the stored volume was 5-10 times less than the original one.
Report generation: a Python service pulled the chunks for the requested period from the database, unpacked them, and passed them to a C extension, which did the counting and returned the result.
In terms of speed, about 250 million "events" were processed in 2-4 seconds on a single core of a DO virtual machine.
Every hour (i.e. every time period) results/caches were generated for every report in all slices, which made it possible to serve results to clients instantly. They took up very little space compared to the original data.
That is, fresh data appeared every hour, but on top of that there were also real-time reports: a copy of the input stream went to this server, which "counted the numbers" and reset them every hour.
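A minimal sketch of that kind of hourly chunk packing in Python, under the assumption that one chunk holds an hour of events for a single ad space (so neither the full timestamp nor the space number is stored per record); the 30-byte record layout and zlib compression are illustrative choices, not what was actually used:

# Hedged sketch: pack one hour of events for one ad space into a compressed blob,
# storing a 2-byte offset from the start of the hour instead of a full timestamp.
# The record layout (offset, 20-char user id, block id, country, device, clicked)
# is an assumption for illustration.
import struct, zlib

REC = struct.Struct("<H20sIHBB")   # 30 bytes per event, no padding

def pack_chunk(hour_start_ts, events):
    """events: iterable of (ts, user_id, block_id, country_code, device_code, clicked)."""
    buf = bytearray()
    for ts, user_id, block_id, country, device, clicked in events:
        buf += REC.pack(int(ts - hour_start_ts),    # seconds into the hour, < 3600
                        user_id.encode("ascii"),    # fixed 20-character user id
                        block_id, country, device,
                        1 if clicked else 0)
    return zlib.compress(bytes(buf), 9)

def unpack_chunk(hour_start_ts, blob):
    raw = zlib.decompress(blob)
    for off in range(0, len(raw), REC.size):
        offset, uid, block_id, country, device, clicked = REC.unpack_from(raw, off)
        yield (hour_start_ts + offset, uid.decode("ascii"),
               block_id, country, device, bool(clicked))

A reporting service then pulls the chunks for the requested period, unpacks them, and counts hits, uniques and clicks over the decoded records; that per-record pass is what the C extension mentioned above speeds up.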
