big data
datahub4, 2019-12-04 20:19:09

What to choose for full-text search on a large amount of data?

Good afternoon.
I have an ambitious (at least for me) task.
There are ~50M PDF documents, the average size of each is ~1MB, the minimum is 10KB, the maximum is 50MB.
The total volume comes to about 50 TB.
95% of the data in a document is text.
Full-text search needs to be provided over the entire data set: given a phrase, show the documents where it occurs and (optionally) snippets, i.e. the surrounding text where the phrase was found in the document.
Adding data to the database is rare and non-critical, that is, it can be performed for a long time and with low priority. Deleting/changing data does not happen at all.
System requirements, in order of priority:
1 The ability to run it all on the cheapest and most widely available hardware is critical; the infrastructure budget is limited
2 Search speed
3 Reliability and fault tolerance
4 Ease of scaling
I have read up on Elastic, Mongo, Postgres, and Cassandra on my own and only got more confused.
If anyone has experience with similar tasks, please share what technologies this could be implemented with.
Thanks in advance to everyone who responds.

7 answers
Roman Mirilaczvili, 2019-12-05
@datahub4

Sphinx/Manticore Search may be suitable for both cost and data volume.
Elastic will eat all the memory you give it and not even choke.
There are other players.
Apache Solr. SolrCloud provides sharding and replication. Solr can parse (and search) various document formats.
elasticsearch vs. Solr vs. Sphinx: Best Open Sourc...
You can use the Apache Tika framework to extract text and metadata yourself.
Apache Hadoop for storing the PDFs.
Such a volume of data will not be easy to process. There will be a lot of trouble with the infrastructure and operation of the software.
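
To make the Tika suggestion concrete, here is a minimal sketch of an extract-and-index pipeline in Python. It assumes the tika-python bindings (which call a local Tika server) and a recent official elasticsearch client; the index name "docs", the field names, and the data path are made up for illustration:

```python
# Sketch only: extract text from PDFs with Apache Tika and index it into Elasticsearch.
# Assumes `pip install tika elasticsearch`, a Java runtime for Tika, and an
# Elasticsearch node at localhost:9200. Index and field names are hypothetical.
from pathlib import Path
from tika import parser
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

for pdf in Path("/data/pdfs").rglob("*.pdf"):
    parsed = parser.from_file(str(pdf))            # {"content": ..., "metadata": ...}
    text = (parsed.get("content") or "").strip()
    if text:
        es.index(index="docs", id=str(pdf), document={"path": str(pdf), "content": text})

# Phrase search with highlighted snippets (the "text environment" the question asks for):
resp = es.search(
    index="docs",
    query={"match_phrase": {"content": "your phrase here"}},
    highlight={"fields": {"content": {}}},
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["path"], hit.get("highlight", {}).get("content"))
```

The same extraction step feeds Solr or Sphinx/Manticore just as well; only the indexing call changes.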

Alexey Kartashov, 2019-12-05
@antixrist

Why didn't anyone mention Sphinx?

Vladimir Korotenko, 2019-12-04
@firedragon

I recommend Elastic, but we used Lucene.Net as its underlying engine. That said, the native FTS engines in Postgres, Oracle, and MSSQL are also quite good.
The main sticking point is morphology, or rather the dictionaries, at least for Cyrillic and German.
https://habr.com/en/post/280488/
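
For comparison, the built-in Postgres FTS path looks roughly like this; a hedged sketch assuming a hypothetical documents(id, body) table already filled with extracted text, queried via psycopg2. The text search configuration ('english' here) is exactly where the morphology/dictionary issue mentioned above comes in:

```python
# Sketch: Postgres full-text search with snippets, via psycopg2.
# Table name, column names, and the DSN are hypothetical; a GIN index on
# to_tsvector('english', body) is what keeps this usable at scale.
import psycopg2

conn = psycopg2.connect("dbname=docs user=postgres")
phrase = "your phrase here"

with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT id,
               ts_headline('english', body, phraseto_tsquery('english', %s)) AS snippet
        FROM documents
        WHERE to_tsvector('english', body) @@ phraseto_tsquery('english', %s)
        LIMIT 20
        """,
        (phrase, phrase),
    )
    for doc_id, snippet in cur.fetchall():
        print(doc_id, snippet)
```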

xmoonlight, 2019-12-04
@xmoonlight

I recommend doing it by hand, without the usual off-the-shelf tools (Elastic, Mongo, Postgres, Cassandra). Decide what data you have, then how to link it.
Usually, one node is one syllable (of any word).
Next, build a graph in a single pass over the text: insert the syllables and add links between them (left to right, the IDs of neighboring nodes), and separately store locations: node ID, location ID (link, file, document, URL, etc.).
Search: the path through the nodes yields all the locations at once (this is practically instant, because everything is looked up by ID). A rough sketch of the idea is below.
1 The ability to run it all on the cheapest and most accessible hardware is critical. infrastructure budget is limited
2 Search speed
3 Reliability and fault tolerance
4 Ease of scaling
All requirements are met 100%.
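
A very simplified sketch of this idea in Python, with words standing in for syllables (real syllable splitting is language-specific) and the neighbor links expressed as (document, position) postings, so that a phrase search is a walk along consecutive node positions. All names are illustrative:

```python
# Sketch of the hand-rolled "graph of nodes" idea as a positional index:
# each token maps to the (doc_id, position) pairs where it occurs, and a
# phrase is found by following consecutive positions, i.e. a path of nodes.
from collections import defaultdict

index = defaultdict(set)          # token -> {(doc_id, position), ...}

def add_document(doc_id, text):
    for pos, token in enumerate(text.lower().split()):
        index[token].add((doc_id, pos))

def search_phrase(phrase):
    tokens = phrase.lower().split()
    if not tokens:
        return set()
    # Start from all occurrences of the first node, then follow the path:
    # each next node must occur in the same document at the next position.
    hits = index[tokens[0]]
    for offset, token in enumerate(tokens[1:], start=1):
        hits = {(d, p) for (d, p) in hits if (d, p + offset) in index[token]}
    return {d for d, _ in hits}

add_document("doc1.pdf", "full text search over a large amount of data")
add_document("doc2.pdf", "a large amount of something else")
print(search_phrase("large amount of data"))   # -> {'doc1.pdf'}
```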

Alexey Prikazchikov, 2019-12-05
@alexprik07

I don't know; I would index the existing documents, and index new or changed ones as they are added (changed). That is, extract the text and run it through a database with Sphinx, the way search engines do. In any case, searching the extracted text and returning a list of links is faster than searching through the files themselves. Yes, the data will be redundant, but the speed will be significantly higher, and on top of that you get indexes and so on.

akimdi, 2019-12-05
@akimdi

Have you already tried it?

Gnusi, 2019-12-05
@Gnusi

Have you tried ArangoSearch in ArangoDB?
