Python
raiboon, 2015-04-14 19:02:00

How do I aggregate data from multiple PostgreSQL databases?

There are dozens of independent Postgres instances, each with approximately the following table structure:
On one:
name | date | url | shows | clicks
alex | 04/21/2015 | 1 | 21 | 42
max | 04/21/2015 | 4 | 34 | 21
max | 04/22/2015 | 4 | 34 | 21
On the other:
name | date | url | shows | clicks
alex | 04/21/2015 | 1 | 1 | 1
max | 04/21/2015 | 4 | 1 | 1
shows and clicks on each server grow every second, and rows with new name/url values keep being added; that part is clear.
How can I quickly and easily pull all the data from these servers, group it and sum it, so that the so-called master postgres ends up with a table of similar structure:
name | date | url | shows | clicks
alex | 04/21/2015 | 2 | 22 | 43
max | 04/21/2015 | 8 | 35 | 22
max | 04/22/2015 | 4 | 34 | 21
Right now this is done by a Python script that slowly walks through the list of all the Postgres servers, pulls the data for the current date, sums it all up, deletes the current day's data in the master database and inserts the new rows. All of this is extremely slow, and with each new Postgres instance it gets slower still.
I would like, if not real-time processing, then at least minimal delays for recalculation. When there were only a couple of databases it was tolerable, but now that there are more than a dozen servers you can go to sleep while everything is being aggregated.
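For illustration, a minimal sketch of the kind of script described above, using psycopg2; the table name (stats), the column names and the connection strings are assumptions, not something stated in the question:

import datetime
from collections import defaultdict

import psycopg2

SOURCE_DSNS = [
    "host=10.0.0.1 dbname=stats user=postgres",
    "host=10.0.0.2 dbname=stats user=postgres",
    # ... one DSN per independent postgres
]
MASTER_DSN = "host=10.0.0.100 dbname=stats user=postgres"

def aggregate_today():
    today = datetime.date.today()
    # (name, date) -> [url, shows, clicks], summed across all servers
    totals = defaultdict(lambda: [0, 0, 0])

    # 1. walk every server one by one and sum today's rows in memory
    for dsn in SOURCE_DSNS:
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT name, date, url, shows, clicks FROM stats WHERE date = %s",
                (today,),
            )
            for name, date, url, shows, clicks in cur:
                acc = totals[(name, date)]
                acc[0] += url
                acc[1] += shows
                acc[2] += clicks

    # 2. replace today's data in the master with the merged totals
    with psycopg2.connect(MASTER_DSN) as conn, conn.cursor() as cur:
        cur.execute("DELETE FROM stats WHERE date = %s", (today,))
        cur.executemany(
            "INSERT INTO stats (name, date, url, shows, clicks)"
            " VALUES (%s, %s, %s, %s, %s)",
            [(n, d, acc[0], acc[1], acc[2]) for (n, d), acc in totals.items()],
        )

if __name__ == "__main__":
    aggregate_today()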

3 answers
sim3x, 2015-04-14
@sim3x

"Want" and "need" are two very different things.

And I would like, if not real-time processing, then a delay of no more than an hour, and no linear growth in complexity with the number of Postgres instances.
This is unlikely to work without something like Hadoop.
Here is a funny solution
stackoverflow.com/a/3200176/1346222
-- wipe the local copy, then re-pull the table from the remote server over dblink
truncate table tableA;

insert into tableA
select *
from dblink('hostaddr=xxx.xxx.xxx.xxx dbname=mydb user=postgres',
            'select a, b from tableA')
       as t1(a text, b text);  -- dblink returns generic records, so the column list is required

You could also play around with WAL files and replication.
And it would help to see the overall design of your service.

Sergey, 2015-04-14
@begemot_sun

As a last resort, implement a service in something like Erlang, where parallel work is very easy and simple.
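In Python, the easy parallelism that Erlang is praised for here can be approximated with a thread pool: fetch from all the source servers concurrently instead of one after another, so the total wait stops growing linearly with the number of servers. A sketch, reusing the same assumed stats table and DSN list as in the example above:

import datetime
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

import psycopg2

def fetch_today(dsn, today):
    # one query per server; runs inside a worker thread
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT name, date, url, shows, clicks FROM stats WHERE date = %s",
            (today,),
        )
        return cur.fetchall()

def aggregate_parallel(source_dsns):
    today = datetime.date.today()
    totals = defaultdict(lambda: [0, 0, 0])
    # the queries are I/O bound, so plain threads are enough to overlap the waiting
    with ThreadPoolExecutor(max_workers=len(source_dsns)) as pool:
        for rows in pool.map(lambda dsn: fetch_today(dsn, today), source_dsns):
            for name, date, url, shows, clicks in rows:
                acc = totals[(name, date)]
                acc[0] += url
                acc[1] += shows
                acc[2] += clicks
    return totals  # write into the master the same way as before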

lega, 2015-04-16
@lega

In your case, pulling sorted data from all servers in parallel and merging it (one pass over each server, one write per resulting row) is probably the optimal approach, provided the percentage of overlapping keys is high.
In general, why not do sharding? Build a key from (for example) 3 fields and route the data to the appropriate servers (say, all 'alex' rows to server 1 and 'max' to server 2) so that there are no overlaps; then there is no need to merge the data at all, and it also saves memory.
It is also not clear why you need a master database at all; quite possibly it could have been avoided.
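For illustration, a minimal sketch of the routing described in this answer, assuming all writes go through the application layer; the shard_for() helper and the DSN list are hypothetical names, not part of the answer. Rows for a given name always land on the same server, so the cross-server merge disappears:

import zlib

SHARDS = [
    "host=10.0.0.1 dbname=stats user=postgres",
    "host=10.0.0.2 dbname=stats user=postgres",
    # ... one DSN per shard
]

def shard_for(name):
    # stable hash (unlike Python's built-in hash(), which is randomized per process),
    # so the same name always maps to the same shard across restarts
    return SHARDS[zlib.crc32(name.encode("utf-8")) % len(SHARDS)]

# usage: every write for "alex" goes to one and the same server
# dsn = shard_for("alex")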
