How to force Postgres to calculate values in columns after sorting?

D

detinkin2018-04-05 15:36:23

PostgreSQL

detinkin, 2018-04-05 15:36:23

I have a large query with many calculated columns in SELECT. This query also has a sort by one of the calculated columns and a limit on getting the first 100 rows. So, for some reason, Postgres calculates column values for all rows, and not just for the first 100. I understand that it must calculate for all rows the column by which sorting is done - this is logical. But why calculate all the rest?
I'll explain with an example.
Create a test table And fill it with random data
CREATE TABLE test_main(col1 INTEGER);

DO
$do$
BEGIN
  FOR r IN 1..100000 LOOP
    INSERT INTO test_main(col1) VALUES (trunc(random()*1000));
  END LOOP;
END
$do$;

Then we will create two additional tables

CREATE TABLE test_main_agg1(
  col1 INTEGER,
  val INTEGER
);
CREATE TABLE test_main_agg2(
  col1 INTEGER,
  val INTEGER
);

Let's fill them in too.

DO
$do$
DECLARE
 r test_main%rowtype;
BEGIN
  FOR r IN SELECT * FROM test_main LOOP
    FOR i IN 1..5 LOOP
      INSERT INTO test_main_agg1(col1, val) VALUES (r.col1, trunc(random()*1000));
      INSERT INTO test_main_agg2(col1, val) VALUES (r.col1, trunc(random()*1000));
    END LOOP;
  END LOOP;
END
$do$;

And of course we will create indexes

CREATE INDEX test_main_indx ON test_main(col1);
CREATE INDEX test_main_agg1_val_indx ON test_main_agg1(col1,val);
CREATE INDEX test_main_agg2_val_indx ON test_main_agg2(col1,val);

Now, let's try to run the following query:

SELECT col1,
       (SELECT MAX(val) FROM test_main_agg1 g WHERE g.col1=m.col1) max_val1,
       (SELECT MAX(val) FROM test_main_agg2 g WHERE g.col1=m.col1) max_val2
  FROM test_main m
 LIMIT 100;

The query will be executed quickly thanks to indexes. If we add to the query ORDER BY col1, then it will still be executed quickly. However, if we add ORDER BY max_val1, then the execution will take about two seconds. I looked in and saw these lines:EXPLAIN ANALYZE

SubPlan 4
 -> Result (cost=4.06..4.07 rows=1 width=0) (actual time=0.011..0.011 rows=1 loops=100000)
  InitPlan 3 (returns $3)
   -> Limit (cost=0.42..4.06 rows=1 width=4) (actual time=0.010..0.010 rows=1 loops=100000)
    -> Index Only Scan Backward using test_main_agg2_val_indx on test_main_agg2 g_1 (cost=0.42..1818.25 rows=500 width=4) (actual time=0.010..0.010 rows=1 loops=100000)
     Index Cond: ((col1 = m.col1) AND (val IS NOT NULL))
     Heap Fetches: 100000

That is, Postgres calculates the value max_val2for all 100,000 rows, not for the first 100. Why? After all, sorting is by field max_val1, and for 100,000 rows it is enough to calculate only it?
Maybe Postgres has some hints or something to tell it to calculate column values after applying ORDER BYand LIMIT?