Answer the question
In order to leave comments, you need to log in
Sampling, sorting, counting data by multiple keys, how to solve?
Hello, there is a problem with the choice of storage for data:
we have a table in MySQL with different data (let's call it Items) and we have two tables Regions (Regions) and Categories (Categories). Categories and Regions contain an infinite nesting tree organized according to the Joe Celko method (left right keys)
in Items there are fields pointers to these tables - Region_ID and Category_ID
there is a task - to store the number of items in the category and region, and the number should be taken into account all children of the category and region (example):
Regions:
ID | Parent_ID | Level | Name |
---|---|---|---|
one | 0 | one | Russia |
2 | one | 2 | Moscow |
3 | one | 2 | Peter |
ID | Parent_ID | Level | Name |
---|---|---|---|
one | 0 | one | IT |
2 | one | 2 | Provider |
3 | one | 2 | Hosting |
region | Category | Count |
---|---|---|
Russia | IT | 58 |
Russia | Providers | eighteen |
Russia | Hosting | thirty |
Moscow | IT | fifteen |
Moscow | Providers | 5 |
Moscow | Hosting | ten |
Peter | IT | 40 |
Peter | Providers | 12 |
Peter | Hosting | 20 |
Answer the question
In order to leave comments, you need to log in
I will not pretend to be absolutely correct in my point of view, but still, different thoughts out loud.
I wrote in the order in which it occurred to me :)
1. It seems to me that the solution with the storage of pre-calculated counters is initially not correct, because it has complexity N 2 . According to the FIAS database, there are only ~55,000 territorial units in the Russian Federation starting from the village, and if you want to expand beyond?
2. If we talk about SQL, you can make selections like
Accordingly, in order to get the id of the elements for substitution, you can cache the id of the children in the categories and regions tables. You will be spoiled by queries of the Russia level - because there will be essentially all the id's there)
Considering that the IN query starts to behave not very well for ~ thousands of values inside the condition, I would shard the items table by the category_id and region_id fields
3 Do you really need the possibility of infinite nesting of categories and regions? Is ten levels not enough? Yes, this will inflate the items table, but it can simply be sharded by item_id
4. It’s not very clear how many items you have, but if there are also hundreds of millions, I would look somewhere in the direction of Hadoop MapReduce
5. You can stupidly take redis to store counters.
instagram-engineering.tumblr.com/post/12202313862/...
Here the guys write that it took them ~ 70MB to store 1kk keys, and this can be optimized.
Post two years ago, memory has fallen in price since then. You will only need 12Gb according to the upper estimate, again, no one has canceled sharding.
True, see p1, given the complexity of N 2 , this can be a quick temporary solution, but it will not last long if you increase the regions and categories.
Phew. Let's discuss :)
For searching, it is better to use search-oriented services. For example Solr. Dig towards /dataimport, or implement a daemon, for example, on nodejs - for on-demand indexing (via an http request or a queue, for example, rabbitmq). The search will work quickly. The question is only for fast indexing.
Didn't find what you were looking for?
Ask your questionAsk a Question
731 491 924 answers to any question