MySQL
dbratus, 2011-12-08 12:19:08

On what amounts of data do relational databases stop working?

We are designing a system in which one of the tables should accumulate more than 52.5 billion records per year, with a total volume of about 2.7 TB (counting only the payload). That is, a huge number of records with a payload of 52 bytes each. Now we are thinking about the data warehouse. The customer proposes to shove it all into MS SQL 2008, and I really do not like that idea; that is, I am almost certain that MS SQL will not cope, but I need proof. Hence the question: are there independent published figures (in English) on the maximum loads that various relational databases, including MS SQL, can handle? I have seen many comparisons of MySQL with MongoDB, Cassandra, etc., but I cannot find comparisons with MS SQL.
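
For reference, the 2.7 TB is just the raw payload arithmetic, before any per-row, page or index overhead; a quick back-of-the-envelope check in T-SQL:

-- 52.5 billion rows per year * 52 bytes per row, payload only
SELECT 52.5e9 * 52 / 1e12 AS payload_tb;  -- about 2.73 (decimal) terabytes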

Thanks in advance.


12 answer(s)
Grigory Peretyaka, 2011-12-08
@Peretyaka

What do you mean it won't cope?
The size of a table in MS SQL is limited only by disk size.
Another question is that data processing may be slow and errors may crop up, but that is either a configuration problem or a mismatch between the data volumes being requested and the amount of RAM. The first problem is easily solved with Google or a fairly cheap specialist, and the second will have to be fixed on the client side anyway, regardless of the database.
If key-value storage suits you, then of course such engines will be an order of magnitude faster, and there are plenty of popular ones.
Here I should say that if some other non-SQL model describes your data more naturally, then it is better to use it. But such databases still cannot compare in popularity with relational ones, and there is no comprehensive information on all the possible problems. Besides, in my opinion, their performance is not that well proven either, and they may well "not cope" suddenly and for unclear reasons. In general, I would recommend that option only if you have some truly pathological case that cannot be solved with a relational database at all. And at volumes like these I would not experiment just for the sake of it.
And in general, all these benchmarks are bullshit. The only real test is to create your table on two engines, fill it with demo data and hit it with real queries under a load close to the expected one. Even that does not give the full picture; there are still nuances such as reliability, hot backups or even a mirror (if losing even the most recent data is critical), scalability, and so on.
And I can understand the customer: you install something fashionable and NoSQL for him now, even if the performance is several times better (though there are questions there too), and then, if anything goes wrong, he will have to urgently look for specialists in that database, who will also charge exorbitant prices.

dbmaster, 2011-12-08
@dbmaster

As my old friend used to say, you need to sit down and do the math.

Zorkus, 2011-12-09
@Zorkus

Well, generally speaking, 2.7 TB in itself is not that much (telecoms run far larger databases). We ran databases of about 3 terabytes on Oracle, first on two ordinary mid-range servers in a RAC, then on an Exadata DB Machine Quarter Rack (http://www.oracle.com/us/products/database/exadata-database-machine/overview/index.html - go read about it, a beast of a machine), and everything worked fine.
Key issues:
- partitioning: splitting tables into separate partitions that sit on different disks in the RAID array (critical partitions holding the hottest data can go on SSD); see the sketch after this list
- will there be many "live" queries that aggregate data at a high level? Queries against a table of several billion records are quite fast if they filter strictly on the partition keys, if the table is partitioned correctly, and if the partitions are spread across different disks. Queries like "compute the average price over 5 billion orders" will, of course, quickly bring the system to its knees, simply because of the insane IO.
- Disks. Estimate the cost of a proper SAN, and look into whether MS SQL has anything comparable to Oracle ASM (Automatic Storage Management).
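
For reference, a minimal sketch of what date-range partitioning could look like on the MS SQL side; the partition function, scheme, filegroup names and monthly boundaries are all made up for illustration, and the column types follow the structure dbratus posts further down the thread:

-- Filegroups FG2011, FG2012_01, ... are assumed to exist already
-- (ALTER DATABASE ... ADD FILEGROUP / ADD FILE), each on its own disks.
CREATE PARTITION FUNCTION pf_by_month (datetime2(3))
AS RANGE RIGHT FOR VALUES ('2012-01-01', '2012-02-01', '2012-03-01');

CREATE PARTITION SCHEME ps_by_month
AS PARTITION pf_by_month TO (FG2011, FG2012_01, FG2012_02, FG2012_03);

CREATE TABLE dbo.DATA_TABLE (
    DATASOURCE_ID uniqueidentifier NOT NULL,
    [TIMESTAMP]   datetime2(3)     NOT NULL,
    [VALUE]       float            NOT NULL
) ON ps_by_month ([TIMESTAMP]);

Partitioning by DATASOURCE_ID instead of (or in addition to) time is the other obvious choice, depending on which queries dominate.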

dbmaster, 2011-12-08
@dbmaster

I agree with pietrovich.
52,500 million records per year ~= 4,375 million per month ~= 145.83 million per day ~= 6.08 million per hour.
Manage to sustain inserting 6 million records per hour, given the distribution across DATASOURCE_ID, and you will be happy. In principle, you can spread different data sources across different servers and achieve scale-out that way.
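For scale, 6.08 million rows per hour is only about 1,700 rows per second, which is easier to sustain with batched loads than with row-by-row INSERTs. A rough sketch, assuming the DATA_TABLE from the question and a hypothetical staging file per data source:

-- Load a pre-accumulated batch in one statement (the file path is made up);
-- TABLOCK enables minimally logged bulk loads under a suitable recovery model,
-- and BATCHSIZE bounds the number of rows committed per transaction.
BULK INSERT dbo.DATA_TABLE
FROM 'D:\feeds\datasource_batch.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK, BATCHSIZE = 100000);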
Hint: Microsoft allows you to try the Enterprise version for half a year - the customer does not have to pay right away.
although you seem to be looking for arguments against (

sl_bug, 2011-12-08
@sl_bug

It would be more appropriate to talk about the data itself and how it will be used. But in any case: sharding.

dbratus, 2011-12-08
@dbratus

The data structure is as follows:
DATASOURCE_ID: UUID
TIMESTAMP: DATETIME
VALUE: DOUBLE
Selections of the form below should return data in at most 1-2 seconds.

SELECT *
FROM DATA_TABLE
WHERE
DATASOURCE_ID = :ID AND
TIMESTAMP = (SELECT MAX(TIMESTAMP) FROM DATA_TABLE WHERE DATASOURCE_ID = :ID)
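
For what it's worth, a query of this shape can stay well inside the 1-2 second budget if there is a composite index leading on DATASOURCE_ID with TIMESTAMP descending; the index name and the TOP (1) rewrite below are an illustrative sketch, not something from the thread:

-- Covering index: all three columns of the table are in the index,
-- so the lookup never has to touch the base table.
CREATE INDEX IX_DATA_TABLE_SOURCE_TS
ON dbo.DATA_TABLE (DATASOURCE_ID, [TIMESTAMP] DESC)
INCLUDE ([VALUE]);

-- Equivalent single-seek form of the query above (@ID is the same parameter):
SELECT TOP (1) *
FROM dbo.DATA_TABLE
WHERE DATASOURCE_ID = @ID
ORDER BY [TIMESTAMP] DESC;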

phasma, 2011-12-09
@phasma

Take Oracle and don't worry. Well, or DB2.

Zorkus, 2011-12-09
@Zorkus

In general, I would say that if we leave prices aside and talk purely about data volumes, you are still far from the limit where relational databases stop coping with your load, provided of course that your data model is relational. A full Exadata V2-8 rack, for example, carries almost 5 terabytes of Flash Cache alone (and 100/330 terabytes of main storage, depending on whether you fit it with SCSI or SATA disks).

kmike, 2011-12-08
@kmike

With such volumes and such a data structure, could something like RRDtool be used, or some kind of Whisper?

mike114, 2011-12-09
@mike114

Teradata is used in banks; it easily digests huge volumes of information.

iryndin, 2014-07-09
@iryndin

Judging by the query, you need to get the latest (by time) record for each datasource. Why not periodically (once a minute, hour, or day) take the latest timestamp for each datasource and copy that row into a separate table? Then run your query against this (much smaller) table. The amount of data to scan drops, so the query runs faster. The raw data can either be thrown away after N months of retention, or dumped to files and tape if it might still be needed later.
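
A minimal sketch of what such a reduced table and its periodic refresh could look like in MS SQL, assuming the DATA_TABLE structure from the question; the table name, the MERGE approach and running it from a SQL Agent job are illustrative assumptions, not something given in the thread:

CREATE TABLE dbo.DATA_LATEST (
    DATASOURCE_ID uniqueidentifier NOT NULL PRIMARY KEY,
    [TIMESTAMP]   datetime2(3)     NOT NULL,
    [VALUE]       float            NOT NULL
);

-- Run periodically (every minute/hour/day), e.g. from a SQL Agent job.
-- In practice the inner SELECT would be limited to rows added since the
-- previous refresh rather than scanning the whole DATA_TABLE.
MERGE dbo.DATA_LATEST AS tgt
USING (
    SELECT DATASOURCE_ID, [TIMESTAMP], [VALUE]
    FROM (
        SELECT DATASOURCE_ID, [TIMESTAMP], [VALUE],
               ROW_NUMBER() OVER (PARTITION BY DATASOURCE_ID
                                  ORDER BY [TIMESTAMP] DESC) AS rn
        FROM dbo.DATA_TABLE
    ) AS ranked
    WHERE ranked.rn = 1
) AS src
ON tgt.DATASOURCE_ID = src.DATASOURCE_ID
WHEN MATCHED AND src.[TIMESTAMP] > tgt.[TIMESTAMP] THEN
    UPDATE SET [TIMESTAMP] = src.[TIMESTAMP], [VALUE] = src.[VALUE]
WHEN NOT MATCHED THEN
    INSERT (DATASOURCE_ID, [TIMESTAMP], [VALUE])
    VALUES (src.DATASOURCE_ID, src.[TIMESTAMP], src.[VALUE]);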

InChaos, 2019-05-08
@InChaos

MS SQL 2016, database 6.5 TB, growing by 500 GB/month, 10-50 new records per second, constant reads, on average 1500-2000 transactions/sec.
RAM 32 GB, 8 cores.
The server is barely working: 5-7% CPU on average, peaks up to 70% but rarely, average IO wait around 30 ms.
You just need to design the database properly, knowing the expected load and the operations it will have to perform, not the usual way: create a table, call it a day, and then everything chokes at a mere 100 GB.
Besides that, queries against the database need to be properly optimized, and indexed.
