MySQL
imaginationunicorn, 2020-03-13 14:49:26

How to organize pagination if the database and the search engine are separate services?

Hello everyone.

How can you organize pagination for queries filtered simultaneously by a full-text query and by attribute conditions, when the full-text search engine and the database are separate services? I just cannot see the solution, although it feels like it is very close.

Details:
A MySQL server (MariaDB 10.3+) stores records of the Document type with a huge set of parameters to filter by: many-to-many relations (filtering by the IDs of several other models), greater-than/less-than comparisons on columns, IS TRUE/FALSE checks, string comparisons, and so on. Every Document also has associated text content, and results must be filtered by that content along with the parameter filters. The content is stored in Sphinx / Manticore (the search engines). The problem is organizing pagination.

For example, you need to select documents matching:
attribute1 IN (1,2,3)
attribute2 = 'example_string'
attribute3 IS TRUE
attribute4 > 500
full-text search query "jojoba oil extract firming essential shampoo"

Expected result: a list of documents filtered by all the parameters, sorted by the relevance scores the search engine assigns.

Problem: pagination.

In the current version, all IDs are fetched from the search engine and then embedded into a database query of the form WHERE id IN (ID list) ORDER BY FIELD(id, ID list). The problem is that any query to the search engine matches more than 100,000 documents, and returning them all takes the engine quite a long time (1-2 seconds). After that, the database query with such a clause also runs longer than it would if the same data were simply joined in from another table.
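
For illustration, the current scheme looks roughly like this (table and column names are assumptions, not taken from the question):

    -- Step 1 (SphinxQL): fetch every matching ID from the search engine.
    -- SELECT id FROM documents_index WHERE MATCH('jojoba oil extract ...');

    -- Step 2 (MySQL): paste the 100k IDs into the query, preserving the
    -- engine's relevance order via ORDER BY FIELD().
    SELECT d.*
    FROM documents d
    WHERE d.id IN (17, 93, 215 /* ...100,000 more IDs... */)
      AND d.attribute4 > 500
    ORDER BY FIELD(d.id, 17, 93, 215 /* same IDs, same order */)
    LIMIT 20 OFFSET 40;

Both the huge IN list and the FIELD() call must be parsed and evaluated on every request, which is where the time goes.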

The reverse approach, taking the filtered list of IDs from the database and building a WHERE id IN (ID list) condition in the search engine, does not work either: again the processing is slow, and the number of IDs there also exceeds 100,000 per request.

Options already tried:
- FEDERATED / FEDERATEDX tables: could not get them working reliably; operation is unstable, and the slightest network hiccup produces errors
- SphinxSE table: same story, and the developers mark it as obsolete and unsupported
- adding all the filter parameters to the search engine index: a terrible option, since the engine filters on its plain (non-text) fields very slowly, plus it creates the need to keep two copies of the data up to date, in the database and in the search engine
- PostgreSQL with its built-in full-text search: weak handling of Russian words and very little information on how to use it

There are about 100 million documents in the database overall. Speed is a must. Why do I think it is achievable? Because any complex query completes in under 0.15 s on the database side, and in about 0.15 s in the search engine as well. The hard part is combining the results of the two filters. I also tried computing the intersection of the ID sets from the two results, but here the search engine becomes the bottleneck: it cannot quickly return all found IDs once their number exceeds 1,000 (the max_matches limit; anyone who uses Sphinx / Manticore knows this quirk).

Constraint: the search engine must be Sphinx / Manticore. Everything else gives painfully low-quality search on the kind of content these documents contain.


4 answers
Boris Korobkov, 2020-03-13
@imaginationunicorn

All attributes should be indexed in, and filtered by, the search engine:
sphinxsearch.com/docs/current.html#attributes
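
A minimal sketch of what that could look like in sphinx.conf, reusing the attribute names from the question (the source name and SQL queries are hypothetical):

    source documents_src
    {
        # the main query feeds the full-text content plus the plain attributes
        sql_query       = SELECT id, attribute2, attribute3, attribute4, content FROM documents
        sql_attr_string = attribute2
        sql_attr_bool   = attribute3
        sql_attr_uint   = attribute4
        # the many-to-many relation becomes a multi-value attribute (MVA),
        # fetched by a separate query
        sql_attr_multi  = uint attribute1 from query; SELECT document_id, related_id FROM document_relations
    }

With that in place, all of the question's filters can run inside a single full-text query.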

rPman, 2020-03-14
@rPman

Adding thousands of identifiers to a SQL query via WHERE IN (...) is bad practice. What if you first fill an in-memory table with the identifiers (from the full-text query against the external engine), and then filter by your attributes in the SQL database, joining against that temporary table?
P.S. Do pagination like this: get the list of IDs of all matching records (filtered and sorted as needed), save it in a temporary table tied to the user's session, and then load the data from it in windows. That way the heavy query runs once per session instead of being re-run with LIMIT on every page.
If materializing the entire list is impossible, or even the current window loads slowly (who knows, maybe you are indexing the whole Internet), then preload several pages ahead and, as the user navigates, fetch the next ones with a background service. This multiplies the load on the server, but significantly improves responsiveness for the user.
And of course, cache (it makes sense to cache not the final result but the intermediate one that creates the greatest overhead). Memory costs much less than CPU time.
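
A minimal sketch of the idea, assuming hypothetical table names (documents is the main table; the IDs come from the search engine once per session):

    -- In-memory table holding the engine's matches for this session,
    -- with their position in the relevance order.
    CREATE TEMPORARY TABLE session_matches (
        id       BIGINT UNSIGNED NOT NULL,
        rank_pos INT UNSIGNED NOT NULL,
        PRIMARY KEY (id),
        KEY (rank_pos)
    ) ENGINE=MEMORY;

    -- Bulk-insert the IDs returned by Sphinx/Manticore:
    -- INSERT INTO session_matches (id, rank_pos) VALUES (17, 1), (93, 2), ...;

    -- Page queries then join instead of parsing a giant IN list.
    SELECT d.*
    FROM session_matches m
    JOIN documents d ON d.id = m.id
    WHERE d.attribute4 > 500
    ORDER BY m.rank_pos      -- keeps the engine's relevance order
    LIMIT 20 OFFSET 40;

The heavy work happens once when the table is filled; every page after that is a cheap indexed join.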

batyrmastyr, 2020-03-19
@batyrmastyr

1. Try doing the whole search in Sphinx, using distributed indexes:
index rss {
    type = distributed
    agent_persistent = 127.0.0.1:9312:rss1
    ...
    agent_persistent = 127.0.0.1:9312:rssN
    agent_persistent = 127.0.0.1:9312:rssUpdates
}
Put all the documents into the first N parts, distributing them more or less evenly (for example, by creation date or by ID), and reindex those rarely, once a day or week.
Into the last part, rssUpdates, select only the documents that have changed since the last full indexing, and set sql_query_killlist to exclude the obsolete copies of those documents in the other indexes from the results; a sketch of such a source follows.
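
A sketch of what the rssUpdates source could look like, assuming hypothetical column names, a parent source rss_base_src, and a bookkeeping table that records the time of the last full reindex:

    source rssUpdates_src : rss_base_src
    {
        # only documents changed since the last full indexing
        sql_query = SELECT id, title, content FROM documents \
            WHERE updated_at > (SELECT last_indexed_at FROM index_state)
        # kill-list: evict stale copies of these documents from rss1..rssN
        sql_query_killlist = SELECT id FROM documents \
            WHERE updated_at > (SELECT last_indexed_at FROM index_state)
    }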
2.1. The exact number of results can be found fairly quickly from SHOW META: run "SELECT ... WHERE <conditions, no sorting> LIMIT 1; SHOW META LIKE 'total_found';" and look at the result of the second query.
2.2. Then make the "honest" query, with filtering and sorting, to get the desired page, but cap max_matches: "SELECT ... WHERE ... LIMIT ... OPTION max_matches = page * page_size". In the typical situation, where users stay on the first pages, the speed will increase significantly (see the sketch below).
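
A hypothetical SphinxQL session for page 3 at 20 results per page (the index name and query text are assumptions):

    -- cheap count: no sorting, window of one row, then read the counter
    SELECT id FROM rss WHERE MATCH('jojoba oil extract firming essential shampoo') LIMIT 1;
    SHOW META LIKE 'total_found';

    -- the honest page query: max_matches is capped at exactly the window
    -- needed, so the engine never maintains a 100k-entry sorted buffer
    SELECT id FROM rss
    WHERE MATCH('jojoba oil extract firming essential shampoo')
    ORDER BY WEIGHT() DESC
    LIMIT 40, 20
    OPTION max_matches=60;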
3. The third Sphinx added CREATE INDEX for regular, non-text columns. It may help.
4. Postgres now has foreign data wrappers, through which you can talk to MySQL, for example. I have not tried connecting to Sphinx that way, but the scheme "data in Postgres, full-text index from Sphinx attached via JOIN" is worth trying.
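
This is untested against Sphinx, as said above, but since SphinxQL speaks the MySQL wire protocol, a speculative sketch with the mysql_fdw extension might look like this (all names and options are assumptions):

    -- PostgreSQL side, with mysql_fdw installed
    CREATE EXTENSION mysql_fdw;
    CREATE SERVER sphinx_server FOREIGN DATA WRAPPER mysql_fdw
        OPTIONS (host '127.0.0.1', port '9306');   -- the SphinxQL listener
    CREATE USER MAPPING FOR CURRENT_USER SERVER sphinx_server
        OPTIONS (username '', password '');
    CREATE FOREIGN TABLE sphinx_matches (id bigint)
        SERVER sphinx_server OPTIONS (dbname 'dummy', table_name 'documents_index');

    -- then filter and paginate entirely in Postgres
    -- SELECT d.* FROM documents d JOIN sphinx_matches m ON m.id = d.id ... LIMIT 20;

Whether the wrapper can push the MATCH() condition down to Sphinx is the open question; without pushdown it would pull the whole index over, which defeats the purpose.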

ManticoreSearch, 2020-03-19
@ManticoreSearch

Every Document also has associated text content, and results must be filtered by that content simultaneously with the parameter filters

What functionality is Sphinx / Manticore missing that prevents filtering in one pass, by both the full text and everything else? Could you give an example?
- adding all the filter parameters to the search engine index: a terrible option, since the engine filters on its plain fields very slowly

Could you give an example of a query where "the search engine performs filtering on its usual fields very slowly"? I don't understand what you mean by "its usual fields" and why that would suddenly be slow.
In my opinion this is the best option for you. You can filter by full text in one place and by everything else in another while operating on hundreds of thousands of documents, but good performance will not always be achievable, because everything comes down to the fact that whatever was filtered in one place has to be shipped to the other. If you do everything in one place, that transfer disappears.
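
For illustration, with all attributes stored in the index, the whole request from the question collapses into a single query. A sketch, assuming an index named documents_index with the question's attribute names (attribute1 as a multi-value attribute, attribute3 as a boolean):

    SELECT id FROM documents_index
    WHERE MATCH('jojoba oil extract firming essential shampoo')
      AND attribute1 IN (1,2,3)
      AND attribute2 = 'example_string'
      AND attribute3 = 1
      AND attribute4 > 500
    ORDER BY WEIGHT() DESC
    LIMIT 0, 20;    -- pagination is just LIMIT; relevance order comes for free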
plus it creates the need to keep two copies of the data up to date, in the database and in the search engine

Aren't you doing that already? You just additionally need to synchronize the other fields you want to filter on together with the id / full-text field. If you mean that synchronizing document updates is hard, then for plain indexes there is the main+delta scheme for such cases ( https://play.manticoresearch.com/maindelta/ ), and for RT indexes it depends on the application.
