Designing a DBMS for storing large data volumes?
I am facing the problem of designing a database for a large data set: we are talking about billions of records (currently 200 million). At such volumes, standard solutions start to degrade in insert/read speed, which is very important to us.
Acceptable response time is up to 20 seconds; the faster, the better, of course.
The data is stored in a single data center. The current load is 10-50 requests/sec; in the near future it will be around 100 requests/sec.
We currently use MongoDB. The data structure looks like this (I will describe it in Mongo terms): a document with about 80 fields of types string, datetime, int, float, null, and boolean. Each record has a unique key of type string (30 characters long). Searches are run on 30 of the fields and their possible combinations. We need to read the data in real time and perform all kinds of aggregation operations on it. On this volume, the count operation takes a very long time.
I would like to know what approaches are used to solve this kind of task.
I would be glad to hear good advice on data organization and structure.
"for a large data array, we are talking about billions of records (currently 200 million)"
They say it is precisely in such cases that relational databases show their advantage.
You can run the search in Elasticsearch, have it return document identifiers, and then quickly fetch the documents from Mongo by those identifiers.
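A minimal sketch of this pattern, assuming the official Python clients (elasticsearch 8.x and pymongo); the index, collection, and field names are invented, and it assumes the 30-character document key is mirrored as the Elasticsearch _id:

```python
# Sketch: filter in Elasticsearch, fetch full documents from MongoDB.
# All names below (index "documents", field "status", db "mydb") are
# illustrative assumptions, not taken from the question.
from elasticsearch import Elasticsearch
from pymongo import MongoClient

es = Elasticsearch("http://localhost:9200")
mongo = MongoClient("mongodb://localhost:27017")
collection = mongo["mydb"]["documents"]

# 1. Search only in Elasticsearch and return just the hit ids (no _source).
resp = es.search(
    index="documents",
    query={"bool": {"filter": [{"term": {"status": "active"}}]}},
    source=False,
    size=1000,
)
ids = [hit["_id"] for hit in resp["hits"]["hits"]]

# 2. Fetch the full documents from MongoDB by their keys in one round trip,
#    assuming the unique string key is stored as _id.
docs = list(collection.find({"_id": {"$in": ids}}))
```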
Well, OLAP has already been mentioned to you.
I can't speak for MongoDB specifically, but the general directions are obvious:
- query plans (are indexes being used, or is the whole collection scanned? see the sketch after this list)
- disk operations (it may make sense to buy SSDs with better IOPS)
- scaling (set up several slave replicas and distribute the "search" load between them)
- pre-aggregation (caching the results of these operations)
- application logic (perhaps you can do without some operations)
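On the first point, here is a minimal sketch of inspecting a query plan with PyMongo; the collection and field names are invented, since the real schema is not shown in the question.

```python
# Sketch: check whether a MongoDB query uses an index or scans everything.
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["documents"]  # hypothetical db/collection names

# explain() shows whether the query uses an index (IXSCAN) or scans
# the whole collection (COLLSCAN).
plan = coll.find(
    {"status": "active", "created_at": {"$gte": datetime(2020, 1, 1)}}
).explain()
print(plan["queryPlanner"]["winningPlan"])

# If the winning plan is a COLLSCAN, a compound index on the filtered
# fields is usually the first thing to try.
coll.create_index([("status", 1), ("created_at", 1)])
```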
I would also look toward ClickHouse or another column-store DBMS (instead of building a classic DWH snowflake schema).
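A rough sketch of what a column-store layout could look like in ClickHouse, using the third-party clickhouse-driver Python package; the table and column names are invented, since the real document has about 80 fields that are not listed in the question:

```python
# Sketch: a MergeTree table plus an aggregation query in ClickHouse.
from datetime import datetime
from clickhouse_driver import Client

client = Client(host="localhost")

client.execute("""
CREATE TABLE IF NOT EXISTS events
(
    doc_key     FixedString(30),
    created_at  DateTime,
    status      String,
    amount      Float64
)
ENGINE = MergeTree
ORDER BY (status, created_at)
""")

# Counts and aggregations over a few columns are exactly what a
# column store is good at.
rows = client.execute(
    "SELECT status, count(), avg(amount) FROM events "
    "WHERE created_at >= %(since)s GROUP BY status",
    {"since": datetime(2020, 1, 1)},
)
```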
For fun, I would also try writing this table of "about 80 fields" into partitioned Parquet and reading the needed columns into Apache Arrow tables on demand (with a language binding to taste; it seems bindings exist for every language). I think the performance would be comparable to ClickHouse, and certainly better than MongoDB. Here are benchmarks from two years ago. If a cluster is not needed, then Spark is not needed there either.
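A minimal sketch of that idea with pyarrow; the column and partition names are invented for illustration:

```python
# Sketch: write a wide table as partitioned Parquet, then read back
# only the columns a query actually needs.
import pyarrow as pa
import pyarrow.parquet as pq

# Partition by a low-cardinality column, e.g. the event date
# (hypothetical columns standing in for the ~80 real fields).
table = pa.table({
    "doc_key": ["a" * 30, "b" * 30],
    "event_date": ["2020-01-01", "2020-01-02"],
    "amount": [1.5, 2.5],
})
pq.write_to_dataset(table, root_path="dataset", partition_cols=["event_date"])

# Read back just two columns; the others are never touched on disk.
subset = pq.read_table("dataset", columns=["doc_key", "amount"])
print(subset.num_rows, subset.column_names)
```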