What database architecture suits filters like those on Yandex Market?

MySQL

seva_str, 2011-02-19 07:47:27

We are designing a structure for product characteristics so that we can implement filters like those on Yandex Market.
It is a high-load project, so queries five A4 pages long will not do.
market.yandex.ru/guru.xml?cmd=-rr=9 ,0,0,0-v…
Notice that when you tick any item in the filter, the whole filter is rebuilt, and the options that would no longer return any products are removed:
my.jetscreenshot.com/5783/20110219-smd2-59k...
And it all works very fast.
The question: are there articles on this, or is there someone who can design a table architecture for such filters?
We are willing to pay well.

4 answer(s)

Evgeny Bezymyannikov, 2011-02-19
@psman

We built something similar: a filter with 100-120 checkboxes and about 50 other controls (selects, ranges).
The checkboxes were packed into a single 128-bit number, which already saved a lot of time.
Some of the ranges (price 100-200, 200-500, 500-1000, etc.) were also converted to small numbers and combined into other fields. The resulting index is 300-350 bytes in size. We computed a 128-bit "md5" from it, and the "search table" was just an id, our index, the product id, and the 128-bit md5. The first selection is made by the 128-bit number from an in-memory table; some of the products it returns are not the ones needed (5 percent at most). The 100-300 products selected this way are then checked against the full index, and the output is exactly what we need.
We also implemented a fuzzy search, so that with a maximum price limit of $200 a product priced at 220-230 (±10-15%) is still shown.
There are about 12 million products in the database (machine parts, auto parts, etc.).
The lookup table is a couple of orders of magnitude smaller than the original one. A search takes a matter of milliseconds.
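
The two-stage idea above (a cheap test on a compact number, then an exact check of the few survivors) can be sketched roughly as follows. This is not the exact scheme from the answer: the md5-style signature is replaced by a plain bitmask test, and the feature names, price buckets, and the `Product` class are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical checkbox features; each one owns a fixed bit position.
FEATURES = ["abs", "airbag", "turbo", "diesel"]
BIT = {name: 1 << i for i, name in enumerate(FEATURES)}

# Hypothetical price buckets (upper bounds), so a price becomes a small number.
PRICE_BUCKETS = [100, 200, 500, 1000]


def price_bucket(price):
    """Index of the first bucket whose upper bound covers the price."""
    for i, upper in enumerate(PRICE_BUCKETS):
        if price <= upper:
            return i
    return len(PRICE_BUCKETS)  # everything above the last bound


def feature_mask(features):
    """Pack checkbox names into one integer (Python ints are not size-limited)."""
    mask = 0
    for name in features:
        mask |= BIT[name]
    return mask


@dataclass(frozen=True)
class Product:
    product_id: int
    features: frozenset
    price: float


def build_search_rows(products):
    """Compact search table: (product_id, feature mask, price bucket)."""
    return [(p.product_id, feature_mask(p.features), price_bucket(p.price))
            for p in products]


def search(products_by_id, rows, wanted, max_price, tolerance=0.0):
    """Two-stage search: coarse integer prefilter, then an exact check."""
    limit = max_price * (1 + tolerance)  # fuzzy upper bound, e.g. +15%
    wanted_mask = feature_mask(wanted)
    max_bucket = price_bucket(limit)
    # Stage 1: cheap test on the compact rows; coarse buckets let through
    # some products that are actually too expensive.
    candidates = [pid for pid, mask, bucket in rows
                  if (mask & wanted_mask) == wanted_mask and bucket <= max_bucket]
    # Stage 2: exact check of the few candidates against the full record.
    return [pid for pid in candidates if products_by_id[pid].price <= limit]


products = [
    Product(1, frozenset({"abs", "turbo"}), 150),
    Product(2, frozenset({"abs"}), 120),
    Product(3, frozenset({"abs", "turbo"}), 190),
    Product(4, frozenset({"abs", "turbo", "diesel"}), 450),
]
by_id = {p.product_id: p for p in products}
rows = build_search_rows(products)
print(search(by_id, rows, {"abs", "turbo"}, max_price=160))
print(search(by_id, rows, {"abs", "turbo"}, max_price=160, tolerance=0.2))
```

In the first call product 3 passes stage 1 because it shares a price bucket with the limit, and is rejected only by the exact check; with a 20% tolerance it is returned as well.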

coxx, 2011-02-19
@coxx

You should use Sphinx. Look at the talk on how the product catalog is organized at dostavka.ru. Sphinx is also used in the product catalog on gorbushka.ru.

archibaldtelepov, 2011-02-20
@archibaldtelepov

What is wrong with the standard relational solution, something like
product(product_id [PK], name, ...);
property(property_id [PK], name, ...);
value(value_id [PK], value);
product_property_value(product_id, property_id, value_id, primary key(property_id, value_id, product_id));
Or is there any data showing that it will be slow at your volumes?
On one online project with an active assortment of 250,000 products and 60,000 unique visitors per day, it does not slow things down much.
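
A minimal sketch of how this schema could be queried for "all selected values must match". It uses Python's built-in sqlite3 as a stand-in for MySQL, and the sample data, the numeric ids, and the `find_products` helper are assumptions made for illustration. Instead of one join per selected property, it counts matching rows per product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE property (property_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE value    (value_id    INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE product_property_value (
    product_id  INTEGER NOT NULL REFERENCES product(product_id),
    property_id INTEGER NOT NULL REFERENCES property(property_id),
    value_id    INTEGER NOT NULL REFERENCES value(value_id),
    PRIMARY KEY (property_id, value_id, product_id)
);
""")

# Hypothetical sample data: two properties, four values, three products.
conn.executemany("INSERT INTO property VALUES (?, ?)",
                 [(1, "HDD size"), (2, "screen size")])
conn.executemany("INSERT INTO value VALUES (?, ?)",
                 [(10, "500 GB"), (11, "1 TB"), (20, "15 in"), (21, "17 in")])
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [(1, "Laptop A"), (2, "Laptop B"), (3, "Laptop C")])
conn.executemany("INSERT INTO product_property_value VALUES (?, ?, ?)",
                 [(1, 1, 10), (1, 2, 20),
                  (2, 1, 11), (2, 2, 21),
                  (3, 1, 10), (3, 2, 21)])


def find_products(conn, wanted):
    """Return products matching every (property_id, value_id) pair in wanted.

    Assumes the pairs are distinct; the primary key guarantees that each
    pair matches a product at most once, so COUNT(*) can be compared with
    the number of pairs.
    """
    if not wanted:
        return conn.execute(
            "SELECT product_id, name FROM product ORDER BY product_id").fetchall()
    conditions = " OR ".join(
        "(ppv.property_id = ? AND ppv.value_id = ?)" for _ in wanted)
    sql = f"""
        SELECT p.product_id, p.name
        FROM product p
        JOIN product_property_value ppv ON ppv.product_id = p.product_id
        WHERE {conditions}
        GROUP BY p.product_id, p.name
        HAVING COUNT(*) = ?
        ORDER BY p.product_id
    """
    params = [x for pair in wanted for x in pair] + [len(wanted)]
    return conn.execute(sql, params).fetchall()


print(find_products(conn, [(1, 10), (2, 21)]))  # 500 GB and 17 in
print(find_products(conn, [(1, 10)]))           # 500 GB only
```

Each condition starts with property_id and value_id, so it can be resolved through the leading columns of the primary key (property_id, value_id, product_id).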

archibaldtelepov, 2011-04-12
@archibaldtelepov

Sure.
I'll start from the end. The value table exists so that selection goes by a numeric identifier: the "lid color" property stores the number 100 rather than the word "white", which would otherwise end up written one way in some rows and another way in others.
Next, in the parameter selection interface a person asks: "show me all the products that have a hard disk capacity of 2 GB and a screen width of 100 meters."
We select the products; with an index on ppv(property_id, value_id) this should be fast. Each selected property needs its own join, because a single ppv row cannot match two properties at once:
select distinct p.name, p.code, p.price from product p inner join product_property_value hdd on hdd.product_id = p.product_id inner join product_property_value scr on scr.product_id = p.product_id where hdd.property_id = HDD_SIZE_PROPERTY_ID and hdd.value_id = VID_100GB and scr.property_id = SCREEN_SIZE_PROPERTY_ID and scr.value_id = VID_100M
Here, of course, a reasonable question arises: what to do with requests like "monitor width is more than 17 inches"? The reasonable answer is that if we have such requests, there are several options:
1) Don't overthink it: join the value table into the query above, with an index built on its value field. Note that the comparison is against the value itself, not against a value_id:
select distinct p.name, p.code, p.price from product p inner join product_property_value ppv using (product_id) inner join value v on v.value_id = ppv.value_id where ppv.property_id = SCREEN_SIZE_PROPERTY_ID and v.value > 17
2) Split the value table into, say, three:
value_int for integer values
value_string for text
value_decimal for non-integer values
In the property table we add a flag for the type of the property's values, and when building the query above we join the matching table.
From all this we see that the main problem is range selections, right?
Selections by one or more exact values are handled fine with value_id.
3) The following speed optimization:
for every property whose values can be selected by range, we add the fields max_value_id and min_value_id to the property table. They point to the rows in value that hold the maximum and minimum values of the property, respectively.
Clearly, the value identifiers must then be ordered by the property's values.
With this approach you can use a value_id between ... condition even when searching by a range of values, without touching the value table during selection, which is good.
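
Option 3 can be sketched roughly like this, again with sqlite3 standing in for MySQL. The id range 100..104, the sample sizes, and the `products_above` helper are assumptions for illustration. The threshold value is translated into a value_id in one small lookup, after which the product selection uses only value_id BETWEEN on product_property_value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE property (property_id INTEGER PRIMARY KEY, name TEXT,
                       min_value_id INTEGER, max_value_id INTEGER);
CREATE TABLE value (value_id INTEGER PRIMARY KEY, value REAL);
CREATE TABLE product_property_value (
    product_id INTEGER, property_id INTEGER, value_id INTEGER,
    PRIMARY KEY (property_id, value_id, product_id));
""")

SCREEN = 1        # hypothetical property_id for "screen size"
FIRST_ID = 100    # hypothetical start of the id range for this property
sizes = [24.0, 15.0, 19.0, 17.0, 21.0]

# Assign value_ids in ascending order of the value, so that id order
# equals value order within the property's id range.
for offset, size in enumerate(sorted(sizes)):
    conn.execute("INSERT INTO value VALUES (?, ?)", (FIRST_ID + offset, size))
conn.execute("INSERT INTO property VALUES (?, ?, ?, ?)",
             (SCREEN, "screen size, in", FIRST_ID, FIRST_ID + len(sizes) - 1))

# product_id -> value_id: 15 in, 17 in, 19 in, 24 in
conn.executemany("INSERT INTO product_property_value VALUES (?, ?, ?)",
                 [(1, SCREEN, 100), (2, SCREEN, 101),
                  (3, SCREEN, 102), (4, SCREEN, 104)])


def products_above(conn, property_id, threshold):
    """Product ids whose value for the property is greater than threshold."""
    lo_id, hi_id = conn.execute(
        "SELECT min_value_id, max_value_id FROM property WHERE property_id = ?",
        (property_id,)).fetchone()
    # One small lookup translates the threshold into the first matching id.
    (start_id,) = conn.execute(
        "SELECT MIN(value_id) FROM value "
        "WHERE value_id BETWEEN ? AND ? AND value > ?",
        (lo_id, hi_id, threshold)).fetchone()
    if start_id is None:
        return []
    # The product selection itself never touches the value table.
    rows = conn.execute(
        "SELECT product_id FROM product_property_value "
        "WHERE property_id = ? AND value_id BETWEEN ? AND ? "
        "ORDER BY product_id",
        (property_id, start_id, hi_id))
    return [row[0] for row in rows]


print(products_above(conn, SCREEN, 17))
```

The trade-off is in maintenance: inserting a new value between two existing ones means either leaving gaps in the id range in advance or renumbering the ids that follow it.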