V
V
vad42011-10-20 15:27:40
Algorithms
vad4, 2011-10-20 15:27:40

Fast search algorithm

There is a large array of objects. An object has several attributes. The attribute can take a boolean value or a number (either integer or real).
Example:

Object 1:
attribute 1: true
attribute 2: 2
attribute 3: 22.43

Object 2:
attribute 1: false
attribute 2: 5
attribute 3: 10.7

and so on.

How to quickly implement the ability to search by attributes in such an array? Moreover, you can set search ranges, that is: show me all objects that have attribute 1: false, attribute 2: from 1 to 90, and so on.
The task is similar to database queries. It is interesting how it is implemented there.
One thing came to mind so far: to make binary search trees for each attribute.

Answer the question

In order to leave comments, you need to log in

7 answer(s)
S
SabMakc, 2011-10-20
@SabMakc

Binary trees for each element - the complexity is still obtained as O (N), because results still need to be combined. Even if they are presented as sets, it is still necessary to go through the entire array to merge.

I
ivsedm, 2011-10-20
@ivsedm

With the proposed option, it still remains difficult to merge search results for each of the attributes. In this case, the order in which the attributes are compared will also be of great importance. For example, if you select all records by attribute 1, we get a giant heap of half of all records (if the distribution is even), then they need to be intersected with the results for attribute 2, and so on.
Plus, there is also the problem of how to quickly make the intersection of the results.
In this case, I did a binary search on the most frequent and most diverse attribute, listing the results. And then I already selected elements from this list that fit other conditions, and removed the unsuitable ones from the list. As a result, only the query result remains.

X
xappymah, 2011-10-20
@xappymah

As one of the search optimization options: factorization of all objects by boolean attributes. That is, the objects are divided into non-overlapping groups with different combinations of Boolean parameters.
There is an array with the number of elements equal to the number of groups.
Each group is combined into a list and clings to the corresponding element of the array by the head.
Thus, having a query with Boolean parameters, it is possible in almost constant time (limited from above) to find all such factor groups and linearly walk through them, selecting elements by other parameters.
The lists themselves can also be factorized by some frequently used attribute (you need a set of indices to find the appropriate factor groups) or simply sorted.
But all of the above is nothing more than search engine optimization. Worst case would still be linear time.
(By the way, why does the author not like the linear complexity?)

A
Anatoly, 2011-10-20
@taliban

Each language has a sorting function that takes an array as one parameter, and a handler function for the second, which should return -1/0/1, and depending on this result, sorting occurs as the developer wants. Why don't you take advantage of this?

X
xappymah, 2011-10-21
@xappymah

As an extension of my previous suggestion, an optimization to narrow the search set.
We discard the boolean attributes, considering them integer for universality.
We have attributes A1, A2… An. Each of these attributes has a range of values.
For each attribute, we divide the corresponding range into k_i pieces, each of which is denoted by Ai_1, Ai_2...Ai_k_i.
Next, we have the set of all ALL objects with the mentioned attributes.
Take the attribute A1 and divide the set ALL into non-overlapping subsets ALL_A1_j, where 1 <= j <= k_1, such that ALL_A1_j contains objects whose value of attribute A1 is in the range A1_j.
Further, with each of these subsets, a similar partition is made by attribute A2. Then each subset is split by the values ​​of attribute A3 and so on up to An.
Thus, there is a partition tree of the set ALL by ranges of values ​​of all attributes.
Now, we can assume that the desired ranges for all attributes were specified during the search: if some attribute was not specified during the search, then the desired range is the set of all attribute values.
For each attribute Ai, at the beginning, such ranges of Ai_k are selected that intersect with the desired one.
After that, a detour is made through the partition tree in depth: that is, starting from the root ALL, we go to ALL_A1_k1, from which to ALL_A1_k1_A2_k1, and so on, until we get a certain subset of objects, from which the necessary ones are selected in a linear way, after that we return a step back and move on.
In the worst case, the partition tree traversal may have linear complexity, and in fact, slightly worse than a simple linear enumeration.
But with a successful choice of the order of attributes and ranges of partitions, the search circle can be reduced many times over (up to the degenerate case, when exactly one subset, which is found in constant time, corresponds to the search parameters).

D
dbmaster, 2011-10-21
@dbmaster

Very interesting and relevant task! Thank you!
Nothing is said about memory and server limits, so my solution is as follows:

  • assign each object number 1,2,3,4,5
  • build an independent index for each attribute (if the attribute values ​​match, sort by object number)

we split the query into parts and for each index we can find out how many objects will result in o(log(n)) = o(log(100M))=19 operations if we filter only by this one attribute.
then you can make a selection by one attribute and check all other conditions
or
merge lists of object numbers

M
MikhailEdoshin, 2011-10-21
@MikhailEdoshin

Oh my favorite challenge. There is such a UB-Tree (plus useful links on Wikipedia ) from the same material from the same Rudolf Bauer that invented the B-Tree. In my opinion, however, it is too puzzling, and I myself have not experienced it.

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question