How to organize the attribute storage structure?

A

Alexey Kovalenko2021-06-12 23:58:50

Database design

Alexey Kovalenko, 2021-06-12 23:58:50

How to organize the storage structure of object attributes for many different types of goods and services in a relational database: smartphones, hotels, movies, etc.?
Object attributes cannot be put into separate columns, as there are too many of them.

The first option is to use postgreSQL's noSQL features, i.e. store attribute values as json. But there is an unacceptable problem with indexing. GIN indexes work for an exact match. For example:

SELECT
  *
FROM
  products
WHERE
  products.attributes @> '{"length": 10}'

but do not work for selection by range

SELECT
  *
FROM
  products
WHERE
 (attributes->>'length')::int4 > 10

The documentation says that in order to range search a jsonb field, you need to create a Btree index on a specific JSON key. This option is not suitable, since there are a lot of attributes

. The second option is to use the EAV pattern (object - attribute - value). Here there are difficulties with different types of data and with complex conditions. For example, for 2.5 million records, the following query will be quite heavy, despite BTree indexing

SELECT
  product_id
FROM 
  products_attributes
WHERE 
  attribute_id = 348612462852833281 AND value BETWEEN 1 AND 10

INTERSECT

SELECT
  product_id
FROM 
  products_attributes
WHERE 
  attribute_id = 348612464425861121 AND value BETWEEN 10 AND 20

INTERSECT

SELECT
  product_id
FROM 
  products_attributes
WHERE 
  attribute_id = 372655158259580929 AND value BETWEEN 20 AND 30

The third option is to use a noSql database to store the attributes. But the main data is stored in a relational database and you will need to somehow transfer the result of a query from a noSQL database to a request to a relational database. Passing hundreds of thousands of object IDs is not a good idea.

How can this problem be solved?

Reply

Answer the question

In order to leave comments, you need to log in

2 answer(s)

G

galaxy, 2021-06-13
@kovalit

In general, it is rather pointless to argue in isolation from the data. It is necessary to look at specific query plans and execution time.
For example, for the statistics that you cited here somewhere (it seems that the answer has already been lost), I get quite acceptable results (in my opinion, it’s not a fact that you will be satisfied):

Schema and Data

create table products (
  id serial primary key,
  name text not null
);

create table products_attributes (
  id serial primary key,
  attribute_id int not null,
  product_id int not null references products (id),
  value double precision
);


-- 10 млн продуктов
insert into products (name)
select 'p' || n from generate_series(1, 10000000) n;

-- в среднем по 10 аттрибутов на продукт
-- всего 20 разных аттрибутов
-- значения - случайные из 1-1000
insert into products_attributes(attribute_id, product_id, value)
select a,
       p.id,
       trunc(random() * 1000)
  from products p, generate_series(1, 20) a
 where random() < 0.5;


create index ix_attr_attribute_id on products_attributes(attribute_id);
create index ix_attr_product_id on products_attributes(product_id);
create unique index uk_attr_attr_product on products_attributes(product_id, attribute_id);
create index ix_attr_value on products_attributes(value);

-- немного статистики
select attribute_id, count(*) from products_attributes group by 1 order by 1;

 attribute_id |  count
--------------+---------
            1 | 5001345
            2 | 5001937
            3 | 4998754
            4 | 4998706
            5 | 4999357
            6 | 5004465
            7 | 4999215
...

select product_id, count(*) from products_attributes group by 1 order by 2 desc limit 20;
 product_id | count
------------+-------
    4769292 |    20
    5366802 |    20
    7241348 |    20
    3019891 |    20
    7789046 |    20
    1688646 |    19
    1585970 |    19
...

SELECT count(*) FROM products_attributes WHERE attribute_id = 1 AND value BETWEEN 1 AND 400;
  count
---------
 1999212
(1 row)

SELECT count(*) FROM products_attributes WHERE attribute_id = 2 AND value BETWEEN 1 AND 400;
  count
---------
 1999385
(1 row)

SELECT count(*) FROM products_attributes WHERE attribute_id = 3 AND value BETWEEN 20 AND 30;
 count
-------
 55318
(1 row)

explain analyze
select * from products
 where id in (
   select product_id from products_attributes
    where attribute_id = 1 AND value BETWEEN 1 AND 400
 )
 and id in (
   select product_id from products_attributes
    where attribute_id = 2 AND value BETWEEN 1 AND 400
 )
 and id in (
   select product_id from products_attributes
    where attribute_id = 3 AND value BETWEEN 20 AND 30
 );

No limit, 2189 entries - 3.8 sec:

Plan

QUERY PLAN                          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=164603.86..347381.12 rows=2047 width=12) (actual time=680.688..3753.927 rows=2189 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Nested Loop  (cost=163603.86..346176.42 rows=853 width=12) (actual time=652.714..3713.617 rows=730 loops=3)
         ->  Nested Loop  (cost=163603.43..341334.66 rows=762 width=12) (actual time=652.439..3294.369 rows=730 loops=3)
               ->  Parallel Hash Join  (cost=163602.86..313081.39 rows=4167 width=8) (actual time=649.437..1742.210 rows=3689 loops=3)
                     Hash Cond: (products_attributes_1.product_id = products_attributes_2.product_id)
                     ->  Parallel Index Scan using ix_attr_attribute_id on products_attributes products_attributes_1  (cost=0.57..147444.27 rows=775173 width=4) (actual time=0.577..880.079 rows=666462 loops=3)
                           Index Cond: (attribute_id = 2)
                           Filter: ((value >= '1'::double precision) AND (value <= '400'::double precision))
                           Rows Removed by Filter: 1000851
                     ->  Parallel Hash  (cost=163306.17..163306.17 rows=23690 width=4) (actual time=647.483..647.483 rows=18439 loops=3)
                           Buckets: 65536  Batches: 1  Memory Usage: 2752kB
                           ->  Parallel Index Scan using ix_attr_attribute_id on products_attributes products_attributes_2  (cost=0.57..163306.17 rows=23690 width=4) (actual time=18.296..639.541 rows=18439 loops=3)
                                 Index Cond: (attribute_id = 3)
                                 Filter: ((value >= '20'::double precision) AND (value <= '30'::double precision))
                                 Rows Removed by Filter: 1647812
               ->  Index Scan using uk_attr_attr_product on products_attributes  (cost=0.57..6.78 rows=1 width=4) (actual time=0.420..0.420 rows=0 loops=11066)
                     Index Cond: ((product_id = products_attributes_1.product_id) AND (attribute_id = 1))
                     Filter: ((value >= '1'::double precision) AND (value <= '400'::double precision))
                     Rows Removed by Filter: 0
         ->  Index Scan using products_pkey on products  (cost=0.43..6.35 rows=1 width=12) (actual time=0.572..0.572 rows=1 loops=2189)
               Index Cond: (id = products_attributes.product_id)
 Planning Time: 4.481 ms
 JIT:
   Functions: 93
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 14.220 ms, Inlining 0.000 ms, Optimization 2.852 ms, Emission 50.889 ms, Total 67.961 ms
 Execution Time: 3762.035 ms
(29 rows)

The same, but with a limit (LIMIT 20):

Request

explain analyze
select * from products
 where id in (
   select product_id from products_attributes
    where attribute_id = 1 AND value BETWEEN 1 AND 400
 )
 and id in (
   select product_id from products_attributes
    where attribute_id = 2 AND value BETWEEN 1 AND 400
 )
 and id in (
   select product_id from products_attributes
    where attribute_id = 3 AND value BETWEEN 20 AND 30
 ) limit 20;

20 entries - 48ms:

spoiler

QUERY PLAN                                   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2.14..7292.14 rows=20 width=12) (actual time=3.743..47.724 rows=20 loops=1)
   ->  Nested Loop  (cost=2.14..746133.39 rows=2047 width=12) (actual time=3.741..47.714 rows=20 loops=1)
         ->  Nested Loop  (cost=1.70..734511.89 rows=1829 width=12) (actual time=3.313..44.710 rows=20 loops=1)
               ->  Nested Loop  (cost=1.14..666709.45 rows=10000 width=8) (actual time=2.377..42.276 rows=95 loops=1)
                     ->  Index Scan using ix_attr_attribute_id on products_attributes products_attributes_2  (cost=0.57..209246.27 rows=56856 width=4) (actual time=1.250..14.969 rows=511 loops=1)
                           Index Cond: (attribute_id = 3)
                           Filter: ((value >= '20'::double precision) AND (value <= '30'::double precision))
                           Rows Removed by Filter: 46859
                     ->  Index Scan using uk_attr_attr_product on products_attributes products_attributes_1  (cost=0.57..8.05 rows=1 width=4) (actual time=0.053..0.053 rows=0 loops=511)
                           Index Cond: ((product_id = products_attributes_2.product_id) AND (attribute_id = 2))
                           Filter: ((value >= '1'::double precision) AND (value <= '400'::double precision))
                           Rows Removed by Filter: 0
               ->  Index Scan using uk_attr_attr_product on products_attributes  (cost=0.57..6.78 rows=1 width=4) (actual time=0.025..0.025 rows=0 loops=95)
                     Index Cond: ((product_id = products_attributes_1.product_id) AND (attribute_id = 1))
                     Filter: ((value >= '1'::double precision) AND (value <= '400'::double precision))
                     Rows Removed by Filter: 0
         ->  Index Scan using products_pkey on products  (cost=0.43..6.35 rows=1 width=12) (actual time=0.149..0.149 rows=1 loops=20)
               Index Cond: (id = products_attributes.product_id)
 Planning Time: 9.959 ms
 Execution Time: 47.796 ms
(20 rows)

As you can see, the selectivity estimates of the scheduler are adequate. He starts with the most highly selective condition. The limit is carried inside the deepest joins (seen by the number of entries).

I

Ivan Shumov, 2021-06-13
@inoise

Stop doing wild things, for example, and find out that you should use search engines like elasticsearch for search. And you can also store a relational database in json