Fridary, 2016-03-23 14:03:40
ruby

Best techniques for database searching via regex?

I have a database of a million text documents containing information about court decisions: names, types of lawsuits, and so on.
I need to build a web service in which users can search this database by the data/parameters found in these judgments; the work is entirely with text (as I understand it, I should take Sphinx or Elasticsearch). But I need more than a simple phrase search: I need a complex, regex-like search.
For example, a user enters "Ivanova" into a "Find a judge:" field on the site.
The search would then be something like "/Judge at the meeting: (.*?)/s", where $1 = "Ivanova".
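A sketch of the kind of extraction described above, in Python (the marker text and the terminator characters are assumptions taken from the example; real documents would need more robust patterns):

```python
import re

# Pattern mirroring the example: capture the judge's name after the
# "Judge at the meeting:" marker. re.DOTALL corresponds to the /s flag.
JUDGE_RE = re.compile(r"Judge at the meeting: (.*?)[.,\n]", re.DOTALL)

def find_judge(text):
    """Return the captured judge name, or None if the marker is absent."""
    m = JUDGE_RE.search(text)
    return m.group(1).strip() if m else None

print(find_judge("... Judge at the meeting: Ivanova, with participation ..."))  # → Ivanova
```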
Questions:
1) Which database (PostgreSQL, MongoDB, Redis, etc.) should I store the data in, and in which language should I write the service (PHP, Ruby, Python, etc.) so that everything works quickly over such a large set? What methods and technologies can make such a search fast across a million documents?
2) If such a regex search is impossible, would it be fast enough to first search across the whole million documents and then apply the regex filter to each returned document on the backend rather than in SQL (i.e. deciding whether to display the document based on whether judge Ivanova matches)?


6 answers
Melkij, 2016-03-23
@melkij

1) No: it is still a full scan, and a regex is not the cheapest finite state machine to evaluate.
2) No, it won't.
It will only work quickly if you first parse the documents and normalize them into an index.
Need to find a judge? Parse your documents and build a mapping of which judges appear in which documents. Do the same for every other field you want to search by.
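The parse-and-index step described here might look like this (a minimal stdlib-only sketch; the marker regex, schema, and sample documents are illustrative assumptions):

```python
import re
import sqlite3

# Hypothetical marker; real court documents would need several patterns.
JUDGE_RE = re.compile(r"Judge at the meeting: (.*?)[.,\n]")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_judge (doc_id INTEGER, judge TEXT)")
conn.execute("CREATE INDEX idx_judge ON doc_judge (judge)")

documents = {
    1: "Judge at the meeting: Ivanova, with the secretary ...",
    2: "Judge at the meeting: Petrov, case No ...",
}

# One pass over the corpus at import time, instead of a regex
# full scan on every user query.
for doc_id, text in documents.items():
    m = JUDGE_RE.search(text)
    if m:
        conn.execute("INSERT INTO doc_judge VALUES (?, ?)", (doc_id, m.group(1)))

# The user's query then becomes a cheap indexed lookup.
rows = conn.execute(
    "SELECT doc_id FROM doc_judge WHERE judge = ?", ("Ivanova",)
).fetchall()
print(rows)  # → [(1,)]
```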

Dimonchik, 2016-03-23
@dimonchik2013

sphinxsearch + postgres
asynchronous
python (aiohttp)

Alexey Cheremisin, 2016-03-23
@leahch

I think ES is exactly the right fit here. What's more, the search runs over word terms, so you can match not only "Judge at the meeting: Ivanova" but also "Judges in the chairmanship of Ivanova", "Judge Ivanova", and so on.
Give it a try; ES has its quirks, but it isn't that scary.
PS. At a minimum, ES will let you reduce the set of candidate documents to a reasonable minimum, over which you can then run a regexp in your program.
I.e. first we select from ES all the docs mentioning judges, meetings, and Ivanova, and then filter the results with our own regexp.
Oh yes, I almost forgot: there are also scripts that can do exactly that on the ES side itself! And the Lucene query syntax supports "exact match" of phrases.
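A sketch of the two-stage approach this answer describes, with the ES query DSL as plain Python data (the field names and the hit list are hypothetical; a real setup would send the query to ES via a client such as elasticsearch-py):

```python
import re

# Stage 1: a full-text query ES can answer from its inverted index,
# cutting a million documents down to a handful of candidates.
query = {
    "query": {
        "match": {"text": "judge Ivanova"}
    }
}

# Stage 2: hits returned by ES (mocked here) are re-checked with the
# exact regex, so only true "Judge at the meeting: Ivanova" docs remain.
hits = [
    {"_id": "1", "text": "Judge at the meeting: Ivanova, secretary ..."},
    {"_id": "2", "text": "The judge mentioned Ivanova in passing."},
]
pattern = re.compile(r"Judge at the meeting: Ivanova")
exact = [h for h in hits if pattern.search(h["text"])]
print([h["_id"] for h in exact])  # → ['1']
```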

asd111, 2016-03-23
@asd111

Take Elastic. It has fuzzy search ("Iavnov" instead of "Ivanov"), faceted search (search by document attributes, e.g. the judge's surname), regular-expression search (so you won't need to post-process the results: what you wanted to do with a regex can go straight into the search query), search by synonyms ("car" in the search box matching "automobile" in the text), and so on.
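The regexp and fuzzy queries mentioned here exist as first-class clauses in the Elasticsearch query DSL; a sketch of both as plain Python dicts (the field name "judge" is an assumption, and note a regexp query matches individual terms, not whole phrases):

```python
# Native ES regexp query: matches terms in the "judge" field.
regexp_query = {
    "query": {
        "regexp": {
            "judge": "ivanov.*"
        }
    }
}

# Fuzzy query for the misspelling example from the answer
# ("iavnov" still finding "ivanov").
fuzzy_query = {
    "query": {
        "fuzzy": {
            "judge": {"value": "iavnov", "fuzziness": "AUTO"}
        }
    }
}
```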
The search technology you describe is called full-text search. It is fast on millions of documents; this is exactly what Elastic, Sphinx, and similar engines implement. They are used on many sites to search large volumes of data: Stack Overflow, Instagram, etc.
For the database you can take Postgres, since sharding and replication work well there.
As for the language, take the one you know best; the search engine will do most of the work, so give the main server the extra power. They say the new PHP 7 uses half the RAM it used to, but in my opinion PHP frameworks are well behind the likes of Python's Django in ease of use and long-term maintainability of already written code.

taaadm, 2016-03-23
@taaadm

Perhaps you should first load all these files into a database and work with that: parse each file for the possible parameters, save them in the database, create indexes, and then write the web service on top of the database.
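The last step of this answer, the web service over the indexed table, could be sketched as a minimal stdlib-only WSGI endpoint (the table contents and the query parameter name are assumptions):

```python
import sqlite3
from urllib.parse import parse_qs

# Index table filled by the parsing step (one sample row for the demo).
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE doc_judge (doc_id INTEGER, judge TEXT)")
conn.execute("CREATE INDEX idx_judge ON doc_judge (judge)")
conn.execute("INSERT INTO doc_judge VALUES (1, 'Ivanova')")

def app(environ, start_response):
    """WSGI endpoint: /?judge=Ivanova -> matching document ids."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    judge = params.get("judge", [""])[0]
    rows = conn.execute(
        "SELECT doc_id FROM doc_judge WHERE judge = ?", (judge,)
    ).fetchall()
    body = ",".join(str(r[0]) for r in rows).encode()
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]

# To serve: from wsgiref.simple_server import make_server
#           make_server("", 8000, app).serve_forever()
```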

Dmitry Mironov, 2016-04-01
@MironovDV

I would use Sphinx; it lets you fine-tune the search and was created precisely for such tasks. Of course, it takes some knowledge, or a specialist. The language doesn't matter much; I know PHP, so I would do it in PHP. It makes no sense to advise further without knowing exactly what you need: what data, how much, whether it is updated, how often new documents are added, etc.
I forgot about the database. Again, your choice; I would pick MySQL. There is also an option without a database at all, where Sphinx indexes your text files directly. I don't know whether that option suits you.
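For the MySQL variant he describes, a minimal sphinx.conf fragment would look roughly like this (all names, paths, and credentials are placeholders, not a working config):

```
source court_docs
{
    type      = mysql
    sql_host  = localhost
    sql_user  = sphinx
    sql_pass  = secret
    sql_db    = court
    # id + full text of each decision; adjust to the real schema
    sql_query = SELECT id, body FROM decisions
}

index court_docs
{
    source = court_docs
    path   = /var/lib/sphinx/court_docs
}
```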
