How, and with what tools, do you aggregate large amounts of data (possibly in real time)?

PostgreSQL

raiboon, 2015-01-04 13:29:26

There are several servers running PostgreSQL. Logs are written to a table on each of them, at about a thousand records per second.
I need to build statistics on this data. A statistic is a GROUP BY over several arbitrary fields, with field values summed over an arbitrary time period.
The problem: on each server this takes an unacceptably long time. For an interval of a year, the query runs for several minutes on each server (and that is still fast, because every field is indexed). Group statistics are outright pain and suffering because of the large volume of data already stored, which is also growing rapidly. And all of this has to be returned to the user in a reasonable time.
What solutions are there for aggregating data like this?
I'm looking at Hadoop (for me it's more of a magic word), but I don't know what exactly to use from its ecosystem.
I see something similar in InfluxDB, but as I understand it, its main emphasis is on time series, and aggregation by custom fields will not be any faster.
As a bonus, are there any real-time aggregation solutions?

4 answers

Sergey, 2015-01-04
@begemot_sun

Look into OLAP. It is exactly this: aggregating data along various slices (dimensions).
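
To illustrate the OLAP idea, here is a minimal sketch in Python of a pre-aggregated rollup: raw rows are collapsed once into hourly sums per combination of dimensions, and later GROUP BY queries read only the rollup. The log schema (`ts`, `country`, `status`, `bytes_sent`) and the function names are assumptions made up for this example; they are not from the question.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log schema: ts, country, status, bytes_sent.
DIMENSIONS = ("country", "status")


def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def build_rollup(records):
    """Collapse raw rows into one summed value per (hour, country, status)."""
    rollup = defaultdict(int)
    for rec in records:
        key = (hour_bucket(rec["ts"]),) + tuple(rec[d] for d in DIMENSIONS)
        rollup[key] += rec["bytes_sent"]
    return rollup


def query(rollup, group_by, start, end):
    """GROUP BY any subset of DIMENSIONS over [start, end), reading only the rollup."""
    # Position 0 of each key is the hour bucket, so dimensions start at index 1.
    idx = [DIMENSIONS.index(d) + 1 for d in group_by]
    result = defaultdict(int)
    for key, total in rollup.items():
        if start <= key[0] < end:
            result[tuple(key[i] for i in idx)] += total
    return dict(result)


records = [
    {"ts": datetime(2015, 1, 4, 13, 5), "country": "RU", "status": 200, "bytes_sent": 100},
    {"ts": datetime(2015, 1, 4, 13, 40), "country": "RU", "status": 200, "bytes_sent": 50},
    {"ts": datetime(2015, 1, 4, 14, 1), "country": "DE", "status": 500, "bytes_sent": 7},
]
rollup = build_rollup(records)
by_country = query(rollup, ["country"], datetime(2015, 1, 4), datetime(2015, 1, 5))
```

The rollup only helps when the set of dimensions is known in advance and their combinations are far fewer than the raw rows; a year at hourly granularity is at most 8,760 buckets per combination, however many log lines went in.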

Yuri Shikanov, 2015-01-04
@dizballanze

Apache Storm

Alexey Cheremisin, 2015-01-04
@leahch

What does "real-time aggregation" mean? You cannot aggregate in real time on conditions that are not known in advance... If at least some of the parameters are known, you can use key/value stores like Redis or MongoDB. And if you need reports in the form of graphs, I strongly recommend looking at Graphite. At the very least, you can spread the aggregation across a cluster using map/reduce.
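
A rough sketch of the map/reduce idea in plain Python, with no framework: each server computes its own partial GROUP BY sums (map), and the partial results are then merged by key (reduce). The field names (`user`, `action`, `amount`) and both function names are hypothetical; in practice the map step would be a SQL query run on each PostgreSQL server.

```python
from collections import Counter


def map_server(rows, group_by, value_field):
    """Map step: one server's local GROUP BY group_by, SUM(value_field)."""
    partial = Counter()
    for row in rows:
        partial[tuple(row[f] for f in group_by)] += row[value_field]
    return partial


def reduce_partials(partials):
    """Reduce step: merge per-server sums by adding values with equal keys."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts instead of replacing them
    return dict(total)


# Made-up rows standing in for the log tables on two servers.
server_a = [
    {"user": "u1", "action": "view", "amount": 3},
    {"user": "u2", "action": "click", "amount": 1},
]
server_b = [
    {"user": "u1", "action": "view", "amount": 4},
]

partials = [map_server(rows, ("user", "action"), "amount") for rows in (server_a, server_b)]
merged = reduce_partials(partials)
```

This works because SUM is associative: only the small per-server results cross the network, not the raw log rows.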

realfreeman, 2015-02-27
@realfreeman

Hello.
As an option, you really can use Hadoop, but you will not get anything even close to real time with it. On the other hand, it is simple and quick to implement (you can, of course, try Hive on Spark).
Consider Cassandra as another option.