What is the best way to store such a dataset?

P

prostoprofan2021-03-17 19:50:38

MySQL

prostoprofan, 2021-03-17 19:50:38

Hello, I am looking for an answer to my question, because I do not know how to do it correctly in this situation.
There is a ready-made project with hundreds of active users who upload data in the format
Line 1
Line 2
Line 3, etc.

The amount of data that is loaded daily is from 100 to 10,000 lines. These lines are a certain set of data that needs to be saved somewhere, now organized (nothing better invented by past developers) storage in files. The bottom line is that these lines are periodically extracted 1-2 pieces, sometimes more, and sometimes less, you can do it all at once, it's not so important.
The problem is that sometimes (it is still not clear why exactly) the extraction fails, that is, it seems that the file is empty with data, but in fact the data is there and, as a result, there is a double "extraction" of data, for example, there were 100 pieces , it was necessary to extract 4pcs, it will extract 8pcs, the first 4 will be lost. It does not depend on what operations are currently taking place on the server (SQL fetch), MB backup of files or something else, it does not matter (this has already been investigated for more than a year, there is a problem and they could not solve it).

So what needs to be done now (to store these lines, and there are a lot of them and they weigh 5-10MB each), it is necessary to count these lines every time an increase or decrease occurs so that on the side of the verifier it is clear whether the figure has decreased or increased ( so that in the future admins will understand whether they need to do something with the data or not).
If you use MySQL, then in what data type to store it? DB won't go? Is there something better for storing such a data format?

Oh yes, it is important that you can unload this very data at any time without long waits, roughly speaking, as PHP file_get_contents does

ps Now File_get_contents gets the contents of the file through the file's lock keys (a queue is created so that another process cannot do anything in parallel). I would be glad to hear suggestions, the task seems to be easy, but everything is ok in the code, there are even additional checks for taking lines, but they do not help in solving the problem ...
I will add, the server is 8 cores, 16 threads, SSD (Raid 0 mirroring), Lots of free RAM (100GB+)

Reply

Answer the question

In order to leave comments, you need to log in

5 answer(s)

L

Lazy @BojackHorseman MySQL, 2021-03-17
Tag

TEXT
won't "go", whatever that means.

R

rPman, 2021-03-17
@rPman

5-10MB data! it is not necessary to the database, it is bad!
I strongly recommend, if possible, transfer the processing of strings to the client, with a probability of 99% it will not be difficult if at all required (and apparently you just stupidly read the file and give it in the request body).
Then it will be possible to transfer multi-threaded data sending to a web server, where this is implemented as efficiently as possible, and all bugs have been licked out for years. Those. publish your files statically in the web server, directly the entire directory or create symbolic links in a separate public directory (bad practice but working) for the files you need right now and with a name that includes the session ID (if you need to limit access for authorized users, create / delete a symlink - the operation is instant, especially if you do it on a ramdisk, if you feel like it, you can find / write a module to the web server that controls the rewrite / location rule).
It turns out that instead of serving files, the backend should simply tell clients which links they need to open in order to get these files (for example, identifiers - they are also file names)

A

AUser0, 2021-03-18
@AUser0

That is, you have the only plug - the "disappearance" of data from files? Which in fact turns out to be not disappearance, because the data is in place? Then do not suffer with adaptation in a DB, however solve a question with "false disappearance" of the data.
PS At least check that the blocking is really blocking.

R

Roman Mirilaczvili, 2021-03-17
@2ord

The bottom line is that these lines are periodically extracted 1-2 pieces, sometimes more, and sometimes less, you can do it all at once, it's not so important.

You need to know by what principle those same lines are extracted.
Like grep line by line?
If yes, then the LIKE search is not very efficient.
Or are there dates, timestamps?
Then it is better to store in a DBMS with a datetime column.
Maybe a key-value?
Then either two columns in a relational DBMS or k/v storage.
Or is it unstructured data, just text? For example, a log (log) can have some structure and is split into its component parts by parsing. Graylog, FluentD, ...
Based on the description of the problem, it feels like the DBMS will replace this whole garden, but you need to know what data is inside and how it is used.
I would suggest writing more about the data in the question, if possible.

A

Alexey Smirnov, 2021-03-19
@aleksmir

Write in MySQL. The most correct thing. There are no losses. InnoDB engine.
The structure of the tables depends on the source data.
Write how many files you have now. And what kind of growth - i.e. how much are added daily. Then it will be possible to estimate the structure.
The second question is how to select these rows? By text search or something else? By the headings of the filenames or by the whole text?