Which DBMS should I choose to store one big table?
There will be one large table: several million to tens of millions of records, with few fields, each of them fairly small. Judging by previous implementations (on MySQL InnoDB), its size on disk will be anywhere from a few gigabytes to a few tens of gigabytes. The primary key is composite, assembled from several natural keys (ISO 3166 country code, year, and so on).
It must be possible to run arbitrary selections over it reasonably quickly, with ordinary logical filters on any of the fields; indexes on the primary-key columns are highly desirable for fast lookups (I stress these points because, it seems, not every NoSQL store has them). Records are almost never modified, only added, usually in large batches. There are few clients (this is not high-load), but extra speed would not hurt.
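For concreteness, the shape of the table is roughly what the sketch below shows (all names here are made up, and SQLite over JDBC is just a stand-in for illustration, not a choice I have settled on):

```scala
import java.sql.DriverManager

// A sketch of the table described above; all names (records, country, year, value)
// are hypothetical. SQLite via the org.xerial sqlite-jdbc driver is only an example.
object SchemaSketch extends App {
  val conn = DriverManager.getConnection("jdbc:sqlite:records.db")
  val st   = conn.createStatement()

  // Composite primary key assembled from natural keys (ISO 3166 code, year, ...).
  st.execute(
    """CREATE TABLE IF NOT EXISTS records (
      |  country TEXT    NOT NULL,  -- ISO 3166 country code
      |  year    INTEGER NOT NULL,
      |  value   REAL,
      |  PRIMARY KEY (country, year)
      |)""".stripMargin)

  // A secondary index for filters that do not start with the leading key column.
  st.execute("CREATE INDEX IF NOT EXISTS idx_records_year ON records(year)")
  conn.close()
}
```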
It is desirable that the same database (in the sense of the files it is stored in) can easily be moved from computer to computer and used under different operating systems (in MySQL, say, this seems hard to achieve).
The main programming language in the project is Scala; the main platform is 32-bit x86 (so the available RAM is, as far as I understand, generally limited to two or three gigabytes).
I am considering both SQL and NoSQL options. My NoSQL experience is absolute zero, so please don't judge too harshly.
Thanks in advance for the tips.
Data volumes are customarily measured in mega-, giga-, tera-, and petabytes, not in rows. 10⁷ rows is not a problematic amount for any database, even SQLite (see the sketch after this list). As long as the data fits in RAM and you don't plan to scale the solution out to 2-3 servers, take whatever is nearest and dearest to you. Don't choose MySQL if copying its data from computer to computer is hard for you.
The choice is affected much more by clustering (data that is rarely touched is better moved out of RAM onto disk),
replication (failover),
backups, migrations, and so on. As for dragging one little table from place to place... pff)
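Regarding SQLite: it is trivial to drive from Scala over plain JDBC, and the whole database is a single file that can be copied between machines and operating systems as-is. A minimal sketch, reusing the hypothetical records table from the question (file name and filter values are made up):

```scala
import java.sql.DriverManager

object PortableFile extends App {
  // The entire database is the single file records.db; copy it to another
  // machine or OS and open it the same way.
  val conn = DriverManager.getConnection("jdbc:sqlite:records.db")
  val st = conn.prepareStatement(
    "SELECT country, year, value FROM records WHERE country = ? AND year >= ?")
  st.setString(1, "DE")
  st.setInt(2, 2000)
  val rs = st.executeQuery()
  while (rs.next())
    println(s"${rs.getString("country")} ${rs.getInt("year")} ${rs.getDouble("value")}")
  conn.close()
}
```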
A little about NoSQL: the main feature of these databases (at least of DynamoDB, which I work with) is the ability to scale as the number of records grows, distributing the load across several DB servers. Selecting by an arbitrary filter directly will be very inefficient, close to a full enumeration; you would have to build additional "index" tables. So if the database is going to grow and you want it to keep working at roughly the same speed, NoSQL in the cloud would be a good fit; but as I understand it, everything here is local and essentially single-user, in which case look at SQL databases.
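To make the query-vs-scan distinction concrete, here is a sketch using the AWS SDK for Java v2 from Scala (table and attribute names are made up): a Query touches only the items under a given key, while a filtered Scan still reads the whole table and discards non-matching items afterwards.

```scala
import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.{AttributeValue, QueryRequest, ScanRequest}
import scala.jdk.CollectionConverters._

object QueryVsScan extends App {
  val ddb = DynamoDbClient.create()

  // Efficient: a Query reads only the items under the given partition key.
  val byKey = QueryRequest.builder()
    .tableName("records") // hypothetical table
    .keyConditionExpression("#c = :c")
    .expressionAttributeNames(Map("#c" -> "country").asJava)
    .expressionAttributeValues(
      Map(":c" -> AttributeValue.builder().s("DE").build()).asJava)
    .build()

  // Close to a full enumeration: a Scan with a filter still reads every item
  // (and bills for every read) before the filter is applied.
  val byFilter = ScanRequest.builder()
    .tableName("records")
    .filterExpression("#a > :a")
    .expressionAttributeNames(Map("#a" -> "amount").asJava)
    .expressionAttributeValues(
      Map(":a" -> AttributeValue.builder().n("100").build()).asJava)
    .build()

  println(ddb.query(byKey).count())
  println(ddb.scan(byFilter).count())
}
```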
The whole table will not fit in RAM, whichever way you look at it (I have clarified the question; thanks for the comments). As for SQLite, frankly, I don't believe it will perform well even on, say, a 5-gigabyte file. If desired, the table can easily be split meaningfully into several parts (which I would rather not do, since Occam advises against it). Replication, migration, and backups are, roughly speaking, not needed (by the nature of the task: the stored data has very low value, since production can always be suspended and everything restored from the primary source, and the number of clients is also low (<100)). What I care about is the ability to query by conditions on columns, speed (working with a multi-gigabyte table while using no more than two gigabytes of RAM, preferably one, so that it runs comfortably on a laptop without getting in the way of other processes), and ease of development. When I looked at comparisons of NoSQL DBMSs, I was confused by the fact that some are column-based (which, as I understand it, is what I need), some are document-based (where, as I understand it (though I suspect not quite correctly), queries on columns are not possible), and some are key-value. Judging by the diagram imgur.com/kyahZ I like Vertica best, but as far as I understand it is paid and not really community-supported.
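On the RAM constraint: whichever SQL database ends up being used, results can be streamed over JDBC instead of materialized, so memory use stays flat regardless of table size. A sketch against MySQL (connection details and names are made up; the Integer.MIN_VALUE fetch size is MySQL Connector/J's streaming convention, while other drivers take an ordinary positive fetch size):

```scala
import java.sql.{DriverManager, ResultSet}

object StreamRows extends App {
  val conn = DriverManager.getConnection("jdbc:mysql://localhost/db", "user", "pass")

  // TYPE_FORWARD_ONLY + CONCUR_READ_ONLY + fetchSize = Integer.MIN_VALUE tells
  // MySQL Connector/J to stream rows one by one instead of buffering them all.
  val st = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
  st.setFetchSize(Integer.MIN_VALUE)

  val rs = st.executeQuery("SELECT country, year, value FROM records WHERE year >= 2000")
  while (rs.next()) {
    // process one row at a time; the working set does not grow with the table
  }
  conn.close()
}
```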
In my opinion, nothing here calls for changing the DBMS.
The description fits an SQL solution almost perfectly.
With document-based NoSQL you will certainly end up with a larger volume for the same data, because the structure of the document is stored inside every document.
I won't say anything about column-based NoSQL, but it is unlikely to store the data more compactly than an SQL DBMS does.
Accordingly, it can only become slower because of the larger volume; plus, given your lack of experience, unforeseen problems may surface.
I think the best option is simply to optimize the existing solution properly and sort out moving it between machines, which does not look particularly difficult, and there are instructions for it.
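For instance, "optimize properly" might start with checking whether a typical filter actually uses an index, and adding one if it does not. A sketch over JDBC (connection details and names are made up):

```scala
import java.sql.DriverManager

object CheckPlan extends App {
  val conn = DriverManager.getConnection("jdbc:mysql://localhost/db", "user", "pass")
  val st = conn.createStatement()

  // EXPLAIN reports which index MySQL picks in its "key" column;
  // NULL there means a full table scan.
  val plan = st.executeQuery(
    "EXPLAIN SELECT * FROM records WHERE country = 'DE' AND year BETWEEN 2000 AND 2010")
  while (plan.next()) println(plan.getString("key"))

  // InnoDB already indexes the composite primary key, so extra indexes are
  // only needed for filters on non-key columns, e.g.:
  st.execute("CREATE INDEX idx_records_value ON records(value)")
  conn.close()
}
```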
1-10 million records is peanuts; MySQL, Mongo, or even a plain file, stored properly, will cope with it.
That amount of data is nothing to speak of.