PostgreSQL
Insurgent2018, 2020-10-20 01:35:39

Database choice: PostgreSQL or MongoDB?

Greetings, I need advice. Four times a month we plan to process files, each about 3-4 billion lines of JSON. The data for one month will be written to one table (collection), and the data will not change afterwards.

The list of keys is limited:
{
    "key1": Int64,
    "key2": string,
    "key3": Int64,
    "key4": string,
    "key5": string,
    "key6": string
}
Key3 is planned to store an IPv4 address.
Moreover, when parsing a JSON line: if key3 satisfies the conditions of a certain logic, then key6 is not needed; and vice versa, if key3 does not satisfy them, then it is not needed.
Initially, I loaded one dataset (one file, 294 GB, 2.9 billion records) into MongoDB (ver. 4.4.1). Key3 (IPv4) was saved as Int64.
With the logic above, everything turned out quite acceptable: the database size is 99 GB, plus the indexes on 3 keys (key3 Int64, key4 string, key5 string) take 70 GB. That is, the predicted size of one load is about 170 GB. On mediocre hardware, MongoDB query times on the indexed keys are acceptable.
Then I thought it over and decided that the data essentially has a tabular representation, and there is an excellent tool for tables: PostgreSQL.
Created the following table:
CREATE TABLE dataset_10_2020
(
    id   bigserial NOT NULL,
    key1 timestamp without time zone NOT NULL,
    key2 character varying(128),
    key3 inet,
    key4 character varying(64),
    key5 character varying(256),
    key6 character varying(256),
    key7 bytea,
    PRIMARY KEY (id)
);
In this case, if the value of key3 does not satisfy the logic, key3 is written as NULL and the raw data destined for key3 is stored in key7; conversely, if key3 is a valid IPv4 address, the value is written to key3 and key7 is NULL.
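The routing described above can be sketched like this (a minimal sketch: the field names follow the table, but the function name and the exact validation rule are assumptions, using "is a valid IPv4 address" as the condition):

```python
import ipaddress
import json

def parse_line(line: str) -> dict:
    """Parse one JSON line and route key3: a valid IPv4 address goes
    into key3 (inet column), anything else is kept raw in key7 (bytea)."""
    rec = json.loads(line)
    raw = rec.get("key3")
    row = {k: rec.get(k) for k in ("key1", "key2", "key4", "key5", "key6")}
    try:
        # valid IPv4: store the dotted-quad text for the inet column
        row["key3"] = str(ipaddress.IPv4Address(raw))
        row["key7"] = None
    except (ipaddress.AddressValueError, ValueError):
        # not a valid IPv4: key3 becomes NULL, raw value goes to key7
        row["key3"] = None
        row["key7"] = str(raw).encode() if raw is not None else None
    return row
```

Each returned dict maps directly onto one row of dataset_10_2020, so it can be fed to a batched INSERT or COPY.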
I started loading the data. Somewhere around the middle (1.67 billion records loaded) I decided to estimate the database size: 170 GB, and that is without the indexes. I built a clustered index on the unique id field, and the total size became 224 GB. And this, I repeat, is without the three mandatory indexes.
And, as they say, I paused. Is such a gap in data volume normal? Roughly the same data takes 170 GB in MongoDB, but in PostgreSQL it will come to about 300 GB. What could be wrong?
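The gap is not necessarily a mistake: PostgreSQL stores an uncompressed heap with a fixed per-row cost (a 23-byte tuple header padded to 24 bytes by alignment, a 4-byte line pointer in the page, plus the 8-byte bigserial id that MongoDB does not have as a separate column), while MongoDB's WiredTiger engine block-compresses collections with snappy by default. A back-of-envelope estimate of the fixed per-row overhead alone, under these assumptions:

```python
ROWS = 1.67e9        # records loaded so far
TUPLE_HEADER = 24    # 23-byte heap tuple header, padded to 8-byte alignment
ITEM_POINTER = 4     # per-row line pointer in the page header
ID_COLUMN = 8        # bigserial primary key column

overhead_gb = ROWS * (TUPLE_HEADER + ITEM_POINTER + ID_COLUMN) / 1024**3
print(f"fixed per-row overhead: ~{overhead_gb:.0f} GB")  # ~56 GB
```

Roughly 56 GB of the 170 GB is pure row bookkeeping before any column data, and the varchar columns are stored uncompressed on top of that, which together plausibly accounts for the difference.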


1 answer
Sergey Gornostaev, 2020-10-20
@sergey-gornostaev

There is a simple rule for determining whether you need MongoDB: do you have three separate servers for it?
