PHP
Sergey Kornienko, 2014-07-22 14:50:02

Selecting, sorting, and counting data by multiple keys: how to solve it?

Hello, I have a problem choosing storage for my data:
there is a MySQL table with assorted data (let's call it Items), and there are two tables, Regions and Categories. Both contain trees of unlimited nesting organized by Joe Celko's nested-set method (left/right keys).
Items has pointer fields to these tables: Region_ID and Category_ID.
The task is to store the number of items per category and region, where the count must include all children of that category and region. Example:
Regions:

ID | Parent_ID | Level | Name
1  | 0         | 1     | Russia
2  | 1         | 2     | Moscow
3  | 1         | 2     | Peter

Categories:

ID | Parent_ID | Level | Name
1  | 0         | 1     | IT
2  | 1         | 2     | Providers
3  | 1         | 2     | Hosting

Suppose the items are:

5 Providers in Moscow
10 Hosters in Moscow
12 Providers in St. Petersburg
20 Hosters in St. Petersburg
8 items under IT in St. Petersburg (2nd-level category not specified)
3 Providers in Russia (city not specified)
Accordingly:
There will be 20 Providers in Russia (summing the Providers in Moscow, St. Petersburg, and those without a city: 5 + 12 + 3).
For St. Petersburg / IT there will be 40 (summing all the Hosters, Providers, and plain IT items there: 12 + 20 + 8).
If we present everything in the form of a Counters table:

Region | Category  | Count
Russia | IT        | 58
Russia | Providers | 20
Russia | Hosting   | 30
Moscow | IT        | 15
Moscow | Providers | 5
Moscow | Hosting   | 10
Peter  | IT        | 40
Peter  | Providers | 12
Peter  | Hosting   | 20
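
For reference, one such counter can be computed on the fly from the nested-set keys; a rough sketch in PHP, assuming the usual lft/rgt columns of Celko's method (the column and connection names here are illustrative):

<?php
// Count items in a category subtree and a region subtree (nested sets).
// Assumes Celko-style lft/rgt columns on Categories and Regions.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

$sql = "
    SELECT COUNT(*)
    FROM Items i
    JOIN Categories c  ON c.ID = i.Category_ID
    JOIN Categories ct ON ct.ID = :cat  -- subtree root
    JOIN Regions    r  ON r.ID = i.Region_ID
    JOIN Regions    rt ON rt.ID = :reg  -- subtree root
    WHERE c.lft BETWEEN ct.lft AND ct.rgt
      AND r.lft BETWEEN rt.lft AND rt.rgt
";
$stmt = $pdo->prepare($sql);
$stmt->execute([':cat' => 2, ':reg' => 1]); // Providers in Russia -> 20
echo $stmt->fetchColumn();

Doing this per request scans every matching item row, which is exactly why I want the counters precomputed.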

If we have 10,000 categories and 18,000 regions, the counters table comes to 180 million rows, which doesn't seem like much, but every new category adds another 18,000 rows (one per region).
If we store it in MySQL, a query against this table takes a long time...
Please advise a fast storage engine that can handle more than 1 billion rows with acceptable response times (no more than 0.01 seconds to fetch a single row, and no more than 0.1 seconds to select the counters of all regions).
Or suggest another scheme for storing the counters and selecting them quickly.


2 answers
Dmitry Entelis, 2014-07-22
@DmitriyEntelis

I won't pretend my point of view is absolutely correct; just various thoughts out loud, written in the order they occurred to me :)
1. Storing pre-calculated counters seems wrong to me from the start, because it has O(N²) complexity. The FIAS database alone lists ~55,000 territorial units in the Russian Federation down to the village level; what if you want to expand beyond that?
2. If we talk about SQL, you can make selections like the sketch below. To get the element ids to substitute into them, you can cache the ids of all children right in the categories and regions tables. Russia-level queries will spoil things for you, though, because essentially all the ids will be in there :)
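A rough sketch of such a selection (the child_ids cache column and its comma-separated format are just an illustration, not something from the question):

<?php
// Count items using cached child-id lists on the tree tables.
// child_ids is a hypothetical column holding all descendant ids as CSV.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

function subtreeIds(PDO $pdo, string $table, int $id): array {
    $stmt = $pdo->prepare("SELECT child_ids FROM $table WHERE ID = ?");
    $stmt->execute([$id]);
    $csv = (string)$stmt->fetchColumn();
    $ids = $csv === '' ? [] : array_map('intval', explode(',', $csv));
    $ids[] = $id; // include the node itself
    return $ids;
}

// Ids are integers, so inlining them into the IN (...) list is safe here.
$catIn = implode(',', subtreeIds($pdo, 'Categories', 2)); // Providers
$regIn = implode(',', subtreeIds($pdo, 'Regions', 1));    // Russia

echo $pdo->query(
    "SELECT COUNT(*) FROM Items
     WHERE Category_ID IN ($catIn) AND Region_ID IN ($regIn)"
)->fetchColumn();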
Considering that an IN query starts to behave poorly at around a thousand values inside the condition, I would shard the items table by the category_id and region_id fields.
3. Do you really need infinite nesting of categories and regions? Wouldn't ten levels be enough? Yes, this will inflate the items table, but that table can simply be sharded by item_id.
4. It's not very clear how many items you have, but if they also run into the hundreds of millions, I would look somewhere in the direction of Hadoop MapReduce.
5. You can simply take Redis to store the counters:
instagram-engineering.tumblr.com/post/12202313862/...
The guys there write that storing 1M keys took them ~70MB, and that can be optimized further.
The post is two years old and memory has become cheaper since then; by the upper estimate you would need only ~12GB, and again, nobody has cancelled sharding.
But see point 1: given the O(N²) complexity this can be a quick temporary solution, yet it won't hold up for long if the regions and categories grow.
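A rough sketch of the Redis variant (the key scheme and the phpredis client are my assumptions): on every item insert, bump a counter for each (ancestor region, ancestor category) pair, so a read becomes a single GET.

<?php
// Redis counters for every (region, category) ancestor pair (phpredis).
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// $regionPath / $categoryPath: node id plus all its ancestor ids,
// e.g. Hosting in Peter -> regions [3, 1], categories [3, 1].
function bumpCounters(Redis $redis, array $regionPath, array $categoryPath): void {
    foreach ($regionPath as $r) {
        foreach ($categoryPath as $c) {
            $redis->incrBy("cnt:$r:$c", 1);
        }
    }
}

bumpCounters($redis, [3, 1], [3, 1]); // one new Hoster in Peter

echo (int)$redis->get('cnt:1:3'); // Hosters anywhere in Russia

Each insert costs (region depth × category depth) increments, which stays cheap while the trees are shallow; reads are O(1) per counter.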
Phew. Let's discuss :)

olamedia., 2014-07-22
@w999d

For searching, it is better to use a search-oriented service, for example Solr. Dig towards /dataimport, or implement a daemon, for example on nodejs, for on-demand indexing (triggered via an HTTP request or a queue such as rabbitmq). The search itself will work quickly; the only question is how fast you can index.
