Muiron, 2017-07-04 16:24:09

Which DBMS to choose for a large volume of documents?

I would like some advice on this.
I am building my own system for analyzing a large amount of information and am wondering how to store it.
The situation:
- There is a relational database with the document attributes (date, number, issuing authority, signatory, etc.). Tens of millions of rows, 9 GB in MS SQL Server with maximum compression.
- There is the data to be processed: the texts of the documents. Completely unstructured; the texts do contain references to other entities, but building a relational model out of them would be very hard, and it's not clear it will even be needed. Volume: about 2 TB.
- Text processing happens in ElasticSearch, which is why there is a temptation to store the texts only there.
- The main use of the database is searching by document attributes and by text content, followed by processing of that text. Inserts are rare and done manually via queries. Updates are rare and happen mostly during development. Data is never deleted.
The options I see:
1) Write everything into an ordinary relational database (currently MS SQL Server). But the performance of such a monster worries me, and the data still has to be exported to ElasticSearch anyway.
2) Keep the current structure and store the texts, keyed by document id, in a NoSQL database (MongoDB, for example). But having to maintain that link bothers me.
3) Keep the current structure and store the texts, keyed by document id, directly in ElasticSearch. This looks like the ideal option, but I have read that ElasticSearch is not the most reliable or suitable tool to use as a primary database. And I have no desire to restore 2 TB from backups.
4) Go all in, denormalize the data (luckily there is nothing very complicated in it) and store both the attributes and the text of each document in ElasticSearch (a rough sketch of such an index is at the end of this post).
Which option would you recommend?
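
To make options 3 and 4 concrete, here is a minimal sketch of the kind of index this would mean. The host, index name, and field names are placeholders, and it assumes a recent ElasticSearch version (no mapping types) driven from Python with the requests library:

```python
import requests

ES = "http://localhost:9200"  # placeholder address of the ES node

# Hypothetical index: one ES document per source document,
# reusing the id of the row in the relational database.
mapping = {
    "mappings": {
        "properties": {
            "doc_id":    {"type": "keyword"},  # id of the row in MS SQL Server
            "doc_date":  {"type": "date"},
            "authority": {"type": "keyword"},
            "signatory": {"type": "keyword"},
            "body":      {"type": "text"},     # the full unstructured text
        }
    }
}
requests.put(f"{ES}/documents", json=mapping).raise_for_status()

# Index one document, using the relational id as the ES _id.
doc = {
    "doc_id": "12345",
    "doc_date": "2017-07-04",
    "authority": "Some Authority",
    "signatory": "J. Doe",
    "body": "Full text of the document ...",
}
requests.put(f"{ES}/documents/_doc/12345", json=doc).raise_for_status()

# The main workload from the question: filter by attributes, search the text.
query = {
    "query": {
        "bool": {
            "filter": [{"term": {"authority": "Some Authority"}}],
            "must":   [{"match": {"body": "search phrase"}}],
        }
    }
}
hits = requests.post(f"{ES}/documents/_search", json=query).json()
```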


1 answer
terrier, 2017-07-04

Well, look: the problem with ES as a primary database is that it can occasionally lose updates, behaves badly under split-brain, has no ACID, replicas sometimes act unpredictably, and so on (some of these bugs get fixed, new ones get added).
Returning to your task, though: you have only 2 terabytes of data and almost no writes, only reads. And I assume you don't have 40 machines in the cluster, right? :) So a scenario like "the master died during a split-brain, two replicas each declared themselves master, while things were being repaired they wrote 5 terabytes of conflicting data onto them, and now we are all ***ed, where are the backups" doesn't really threaten you.
So why not store these unfortunate 2 TB in Elastic, plus keep a backup/snapshot somewhere else for protection against alien invasion (your data is almost immutable anyway).
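
For the backup/snapshot part, a minimal sketch under the same assumptions as above (Python with requests, placeholder repository name and path; the location must already be whitelisted via path.repo in elasticsearch.yml):

```python
import requests

ES = "http://localhost:9200"  # placeholder address of the ES node

# Register a shared-filesystem snapshot repository
# (the location must be allowed via path.repo in elasticsearch.yml).
repo = {"type": "fs", "settings": {"location": "/mnt/es_backups"}}
requests.put(f"{ES}/_snapshot/docs_backup", json=repo).raise_for_status()

# Take a snapshot of the documents index and wait for it to finish.
snapshot = {"indices": "documents", "include_global_state": False}
requests.put(
    f"{ES}/_snapshot/docs_backup/snapshot_2017_07_04?wait_for_completion=true",
    json=snapshot,
).raise_for_status()
```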
