MySQL
Alexey Sundukov, 2016-09-21 13:01:48

Which implementations can quickly search for the intersection of sets (tagging system)?

I would like to ask how others solve the problem of searching by tags, and with what software.
Two main conditions:
1) Task - the system must answer, within ~1-50 ms, queries of the form "find all documents that have (Tag-1 or Tag-30 or Tag-100500 or ...) and (Tag-50 or Tag-1000 or ...) and ...". There can be up to two dozen tags inside one OR group, and up to a dozen AND-ed groups.
2) Since the data is updated frequently, new changes must become searchable with minimal delay.
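To make the query shape concrete, here is a minimal sketch of the semantics only (plain Python sets, not a storage engine); the tag names and document IDs are made up for illustration:

```python
# Each tag maps to the set of document IDs carrying it. The query
# "(Tag-1 or Tag-30) and (Tag-50 or Tag-1000)" is an intersection of unions.
tag_index = {
    "tag-1": {10, 42, 99},
    "tag-30": {7, 42},
    "tag-50": {42, 99},
    "tag-1000": {7, 10},
}

def search(groups, index):
    """groups: list of OR-groups, e.g. [["tag-1", "tag-30"], ["tag-50", "tag-1000"]]."""
    result = None
    for group in groups:
        # union within an OR group ...
        union = set().union(*(index.get(t, set()) for t in group))
        # ... intersection between AND-ed groups
        result = union if result is None else result & union
    return result if result else set()

print(search([["tag-1", "tag-30"], ["tag-50", "tag-1000"]], tag_index))
```

Any real backend (Redis, Sphinx, Elasticsearch) ultimately has to compute exactly this intersection-of-unions, just over a compressed on-disk or in-memory representation.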
I tried doing it on Redis using the SET and BITMAP types (as in "Quick catalog filter for online stores on... "), but because document IDs are very sparse, the bitmaps waste memory on the "holes". In general, for large sets Redis does not fit well out of the box.
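The memory problem with plain bitmaps under sparse IDs can be shown with back-of-the-envelope arithmetic (the ID range and tag cardinality below are hypothetical, not taken from the question):

```python
# A plain Redis bitmap is a byte string: setting a bit at offset N forces
# the string to grow to about N/8 bytes, no matter how few bits are set.
max_doc_id = 100_000_000      # hypothetical largest document ID
docs_with_tag = 1_000         # hypothetical documents actually carrying the tag

plain_bitmap_bytes = max_doc_id // 8   # what the bitmap costs per tag
ideal_bytes = docs_with_tag * 4        # the same set as a list of 32-bit IDs

print(plain_bitmap_bytes, ideal_bytes)
```

With these numbers the bitmap is over three thousand times larger than the information it encodes, which is exactly the overhead that compressed structures like roaring bitmaps are designed to remove.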
Now a Sphinx-based variant is running. Tag IDs are stored in sql_attr_multi, which meets the required search speed. The update requirement is handled with a main + delta index scheme: the index we actually query is declared as distributed over the two. This works well in principle, but sometimes there are many new changes and the delta index starts to slow down. Rebuilding the main index takes a few minutes (it currently holds about 3.5M document IDs). That does not sound long, but the document count is planned to grow by dozens of times, and the rebuild time will grow with it.
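For readers unfamiliar with the scheme, here is a hedged sketch of what such a Sphinx configuration looks like; the source/index names, paths and SQL are placeholders, not the author's actual config:

```ini
source tags_main
{
    type            = mysql
    sql_query       = SELECT id, title FROM documents
    # multi-valued attribute holding all tag IDs of a document
    sql_attr_multi  = uint tag_ids from query; \
        SELECT doc_id, tag_id FROM document_tags
}

index tags_main
{
    source = tags_main
    path   = /var/lib/sphinx/tags_main
}

# tags_delta: a source with the same schema (assumed defined elsewhere),
# selecting only rows changed since the last full rebuild
index tags_delta : tags_main
{
    source = tags_delta
    path   = /var/lib/sphinx/tags_delta
}

# queries go to the distributed index, which merges both parts
index tags
{
    type  = distributed
    local = tags_main
    local = tags_delta
}
```

Searches hit the `tags` index; the delta stays small and cheap to rebuild, and only occasionally is everything folded back into `tags_main`.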
I would like to know whether there are other solutions (C? Tarantool? Elasticsearch?) and what others use.


2 answers
Alexey Sundukov, 2017-01-12
@alekciy

I'll leave this as a note for the record. At the moment the Sphinx scheme is the fastest. The delta-index issue was solved simply by rotating it more often: the application monitors its size, and as soon as it exceeds 20k documents, rotation starts. This gives the required query speed even on complex queries.
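The trigger described above can be sketched in a few lines; the indexer invocation and the way the delta size is obtained are assumptions, not the author's actual code:

```python
import subprocess

DELTA_LIMIT = 20_000  # documents, per the answer above

def maybe_rotate(delta_doc_count: int) -> bool:
    """Kick off a rebuild-and-rotate once the delta index is too big."""
    if delta_doc_count <= DELTA_LIMIT:
        return False
    # rebuild the indexes and hot-swap them without restarting searchd
    subprocess.run(["indexer", "--all", "--rotate"], check=True)
    return True
```

The point of the threshold is to keep the delta small enough that queries over the distributed index never slow down, at the cost of more frequent (but still cheap) rotations.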

yspb, 2018-04-06
@yspb

Since version 4, Redis can load external modules.
For example, you can add support for JSON and roaring bitmaps:
https://github.com/RedisLabsModules/rejson
https://github.com/aviggiano/redis-roaring
The latter reduces the memory usage of sparse bitmaps.
Nothing is faster than AND, OR and XOR operations on bit masks.
I believe Sphinx uses a similar compression mechanism internally.
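To illustrate why bitmask algebra is so fast: the whole "(Tag-1 or Tag-30) and (Tag-50 or Tag-1000)" query collapses to word-wide OR/AND instructions over the masks. Below, Python integers stand in for (roaring) bitmaps; the tag contents are made-up illustration data:

```python
# bit i set in a mask means "document i has this tag"
def bitmap(doc_ids):
    mask = 0
    for i in doc_ids:
        mask |= 1 << i
    return mask

tag_1, tag_30 = bitmap({10, 42, 99}), bitmap({7, 42})
tag_50, tag_1000 = bitmap({42, 99}), bitmap({7, 10})

# OR within each group, AND between groups -- the whole query in one line
hits = (tag_1 | tag_30) & (tag_50 | tag_1000)

# decode the result mask back into document IDs
docs = {i for i in range(hits.bit_length()) if hits >> i & 1}
print(sorted(docs))
```

A roaring bitmap does the same algebra but stores each mask in compressed chunks, so sparse document IDs no longer cost memory proportional to the ID range.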
