Search engines
cicatrix, 2017-10-11 16:58:51

Full text search in documents?

There are several tens of thousands of unstructured MS Word documents in Russian, in both DOC and DOCX formats. I need full-text search across all of them that takes Russian morphology into account (i.e., the results should include all forms of a word or, conversely, only one specific word form). It would also be desirable to specify that, for example, word A must appear near word B (say, within 10 words).
Right now I use a self-written tool based on regular expressions, but it is not quite what I need. Puggle does not handle Russian morphology well.
I heard that Yandex had some kind of product for this, but I could not find it.
Does anyone know of similar products?

1 answer(s)
Roman Mirilaczvili, 2017-10-20
@cicatrix

The Sphinx search engine can search Russian text and has its own query language, SphinxQL.
It cannot index documents by itself; you need additional components to extract the text from them. sphinxsearch.com/forum/view.html?id=8289
DocFetcher, on the other hand, can search documents directly; it indexes them with Apache Lucene, which supports Russian morphology.
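
To give an idea of what the text extraction step for Sphinx might look like, here is a minimal Python sketch that uses only the standard library. It relies on the fact that a DOCX file is a ZIP archive whose main text lives in word/document.xml. The function name extract_docx_text is made up for this illustration, and the sketch does not cover the older binary DOC format, headers, footers, or footnotes; those would need a separate converter.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the WordprocessingML elements inside word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_docx_text(source):
    """Return the body text of a .docx file, one line per paragraph.

    `source` may be a file path or a binary file-like object.
    Hypothetical helper for illustration; not part of Sphinx or DocFetcher.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each <w:p> is a paragraph; its visible text is split across <w:t> runs.
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling extract_docx_text on each DOCX file yields plain text that can then be handed to whichever indexer is chosen.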
