Database
Ivanoff-i, 2016-11-30 11:22:51

How can you speed up a low-selectivity fetch from a table with hundreds of millions of records?

I have a table with hundreds of millions of records, about 200 GB in total. It has many columns, including text ones. Some columns have low selectivity, for example the city column. The task is to select every row where the city equals a given value; for St. Petersburg, for instance, that is about 10 million records. All of them need to be written out to a file, i.e. a query like COPY (SELECT several fields ...) TO 'file.txt'. Right now this export takes half an hour, and no indexes help. However, if I select only the id instead of several fields (SELECT id ... WHERE city = ...), it finishes in a few seconds. And if I move the St. Petersburg records into a separate materialized view, the same SELECT of several fields takes half a minute instead of half an hour.
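Roughly, the queries in question look like this (table and column names here are simplified; the real table has many more fields):

    -- slow: about half an hour, writes ~10 million wide rows to a file
    COPY (
        SELECT id, name, address, description   -- several fields, including text
        FROM big_table
        WHERE city = 'Saint Petersburg'
    ) TO '/tmp/spb.txt';

    -- fast: a few seconds when only the id is selected
    SELECT id FROM big_table WHERE city = 'Saint Petersburg';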

  1. Is it really necessary to create a separate table for each city?
  2. If I do create separate tables per city, what happens when I need to filter by other columns rather than by city?
  3. I read a little about PgPool 2 and its parallel query capability. If I partition by id and run parallel queries against all partitions at once, is that an option? And can pgpool do this on a single machine?
  4. What other ways are there to optimize this?
  5. Can a single machine handle this at all? I have read people claiming that a couple of billion records in Postgres fly on one machine, even with fairly complex queries. How is that possible?

2 answers

xmoonlight, 2016-11-30
@xmoonlight

Normalize the database to 3NF (third normal form).
For this particular case: all cities should go into a separate table, a list of cities with their IDs, and the big table should reference them by id.
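A minimal sketch of that normalization, assuming a hypothetical big_table with a text city column (all names here are illustrative, not from the original schema):

    -- dictionary of cities
    CREATE TABLE cities (
        id   serial PRIMARY KEY,
        name text NOT NULL UNIQUE
    );

    -- the wide table references the city by a small integer key
    ALTER TABLE big_table ADD COLUMN city_id integer REFERENCES cities (id);
    CREATE INDEX ON big_table (city_id);

An integer key is cheaper to store, compare and index than a repeated text value, although normalization alone may not speed up the export itself.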

Melkij, 2016-11-30
@melkij

What indexes do you have? What is the table structure?
What does EXPLAIN (ANALYZE, BUFFERS) show?
1. Not needed.
2. See 1.
3. Only if you are limited by CPU rather than by disk. If you are limited by disk, it will only make things worse.
4. First find out how the existing table behaves, then decide. For example, a BRIN index on the city id: on low-selectivity fields it gives a distinctly compact index (see the sketch below).
5. 200 GB is quite a normal database size. It is not even astronomically expensive to fit it entirely into shared_buffers.
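A rough sketch of point 4, assuming the city has already been replaced with an integer city_id column (names are illustrative); note that BRIN only pays off when rows with the same city_id sit physically close together on disk:

    -- BRIN stores only the min/max city_id per range of table blocks,
    -- so the index stays tiny even on a 200 GB table
    CREATE INDEX big_table_city_brin ON big_table USING brin (city_id);

    -- see which plan is chosen and how many buffers are read
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id, name, address
    FROM big_table
    WHERE city_id = 42;

If the rows are not physically grouped by city, a plain btree index (or clustering the table first) is the safer choice.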
