Designing a DBMS for storing large data volumes?
I am facing the problem of designing a database for a large data set: we are talking about billions of records (currently 200 million). At such volumes, standard solutions start to degrade in insert/read speed, which is very important to us.
Acceptable response time is up to 20 seconds; the faster, the better, of course.
The data is stored in a single data center. The current load is 10-50 requests/sec; in the near future it will be around 100 requests/sec.
We currently use MongoDB. The data structure looks like this (I will describe it in Mongo terms): a document with about 80 fields of types string, datetime, int, float, null, and boolean. Each record has a unique key of type string (30 characters long). Searches are run on 30 of the fields and their possible combinations. We need to read the data in real time and perform all kinds of aggregation operations on it. On this volume, the count operation takes a very long time.
I would like to know what approaches are used to solve this kind of task.
I would be glad to hear good advice on data organization and structure.
"for a large data array, we are talking about billions of records (currently 200 million)"
They say it is precisely in such cases that relational databases show their advantage.
You can run the search in Elasticsearch, have it return document identifiers, and then quickly fetch the documents from Mongo by those identifiers.
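A minimal sketch of this pattern, assuming the official Python clients (elasticsearch 8.x and pymongo); the index, collection, and field names are invented, and it assumes the 30-character document key is mirrored as the Elasticsearch _id:

```python
# Sketch: filter in Elasticsearch, fetch full documents from MongoDB.
# All names below (index "documents", field "status", db "mydb") are
# illustrative assumptions, not taken from the question.
from elasticsearch import Elasticsearch
from pymongo import MongoClient

es = Elasticsearch("http://localhost:9200")
mongo = MongoClient("mongodb://localhost:27017")
collection = mongo["mydb"]["documents"]

# 1. Search only in Elasticsearch and return just the hit ids (no _source).
resp = es.search(
    index="documents",
    query={"bool": {"filter": [{"term": {"status": "active"}}]}},
    source=False,
    size=1000,
)
ids = [hit["_id"] for hit in resp["hits"]["hits"]]

# 2. Fetch the full documents from MongoDB by their keys in one round trip,
#    assuming the unique string key is stored as _id.
docs = list(collection.find({"_id": {"$in": ids}}))
```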
Well, OLAP has already been mentioned to you.
I can't speak for MongoDB specifically, but the general directions are obvious:
- query plans (are indexes being used, or is the whole collection scanned? see the sketch after this list)
- disk operations (it may make sense to buy SSDs with better IOPS)
- scaling (set up several slave replicas and distribute the "search" load between them)
- pre-aggregation (caching the results of these operations)
- application logic (perhaps you can do without some operations)
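On the first point, here is a minimal sketch of inspecting a query plan with PyMongo; the collection and field names are invented, since the real schema is not shown in the question.

```python
# Sketch: check whether a MongoDB query uses an index or scans everything.
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["documents"]  # hypothetical db/collection names

# explain() shows whether the query uses an index (IXSCAN) or scans
# the whole collection (COLLSCAN).
plan = coll.find(
    {"status": "active", "created_at": {"$gte": datetime(2020, 1, 1)}}
).explain()
print(plan["queryPlanner"]["winningPlan"])

# If the winning plan is a COLLSCAN, a compound index on the filtered
# fields is usually the first thing to try.
coll.create_index([("status", 1), ("created_at", 1)])
```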
I would also look toward ClickHouse or another column-store DBMS (instead of building a classic DWH snowflake schema).
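A rough sketch of what a column-store layout could look like in ClickHouse, using the third-party clickhouse-driver Python package; the table and column names are invented, since the real document has about 80 fields that are not listed in the question:

```python
# Sketch: a MergeTree table plus an aggregation query in ClickHouse.
from datetime import datetime
from clickhouse_driver import Client

client = Client(host="localhost")

client.execute("""
CREATE TABLE IF NOT EXISTS events
(
    doc_key     FixedString(30),
    created_at  DateTime,
    status      String,
    amount      Float64
)
ENGINE = MergeTree
ORDER BY (status, created_at)
""")

# Counts and aggregations over a few columns are exactly what a
# column store is good at.
rows = client.execute(
    "SELECT status, count(), avg(amount) FROM events "
    "WHERE created_at >= %(since)s GROUP BY status",
    {"since": datetime(2020, 1, 1)},
)
```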
For fun, I would also try writing this table of "about 80 fields" into partitioned Parquet and reading the needed columns into Apache Arrow tables on demand (with a language binding to taste; it seems bindings exist for every language). I think the performance would be comparable to ClickHouse, and certainly better than MongoDB. Here are benchmarks from two years ago. If a cluster is not needed, then Spark is not needed there either.
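A minimal sketch of that idea with pyarrow; the column and partition names are invented for illustration:

```python
# Sketch: write a wide table as partitioned Parquet, then read back
# only the columns a query actually needs.
import pyarrow as pa
import pyarrow.parquet as pq

# Partition by a low-cardinality column, e.g. the event date
# (hypothetical columns standing in for the ~80 real fields).
table = pa.table({
    "doc_key": ["a" * 30, "b" * 30],
    "event_date": ["2020-01-01", "2020-01-02"],
    "amount": [1.5, 2.5],
})
pq.write_to_dataset(table, root_path="dataset", partition_cols=["event_date"])

# Read back just two columns; the others are never touched on disk.
subset = pq.read_table("dataset", columns=["doc_key", "amount"])
print(subset.num_rows, subset.column_names)
```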