When writing data to a file, does it stay in the OS cache so it can be read from the cache?
There is a large file to which data is only ever written in complete units. As I understand it, the OS caches this data (that is, it does not flush it to disk immediately, but keeps it in memory).
So, two questions:
1. If we read back the same data (i.e., we pass the offset, which equals the file length before the write, and the length of the data just written), will the OS recognize that this is exactly the data that was just written and cached, and serve it from the cache?
2. The same question, but for the case where new data was appended before the just-written data is read back.
Here is why I ask. There is an application that stores mini-files glued together into one large megafile (this is to get around inode limits). Mini-files are always written by a single stream at the end of the megafile, and are never modified or removed from it. And, as practice shows, it is most often the newly written mini-file that is needed. And it will be needed once; after that, much less often. So it seems logical to cache it at write time, not at read time (especially since it is already in RAM when it is written). And after it has been read, it seems more logical, on the contrary, to evict it from the cache. Does it make sense to keep it cached by application means (for example, in tmpfs), or will the OS do this itself through its own cache?
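The append-and-read-back-by-offset scheme described above can be sketched as follows. This is an illustrative sketch, not the asker's actual code: the file name `megafile.bin` and the helper names are made up, but the offset arithmetic (offset = file length before the write) matches the question.

```python
import os

MEGAFILE = "megafile.bin"  # illustrative name, not from the original post

def append_minifile(path, payload):
    """Append payload at the end of the file; return (offset, length)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        # Current end of file == file length before this write; that pair
        # (offset, length) is the key used to read the mini-file back later.
        offset = os.lseek(fd, 0, os.SEEK_END)
        os.write(fd, payload)
    finally:
        os.close(fd)
    return offset, len(payload)

def read_minifile(path, offset, length):
    """Read a mini-file back by its recorded offset and length."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # pread reads at an explicit offset without moving the file position.
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)

off, ln = append_minifile(MEGAFILE, b"hello minifile")
data = read_minifile(MEGAFILE, off, ln)
```

Whether that `read_minifile` call is served from RAM is exactly what the two questions above are about.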
The important thing to understand here: there is no spoon.
From Linux's point of view, the data you supposedly write to disk is just pages in memory that are marked as non-anonymous (meaning there is a file on disk associated with them) and dirty (meaning these pages must eventually be flushed to disk).
Moreover, when that flush actually happens depends on a bunch of factors. The programmer can influence it in one direction only:
- he can require the kernel to flush the data to the drive and return control only after the drive has reported that the data has been written.
That is, the programmer can force the data to be flushed to disk earlier, but cannot force it to be flushed later.
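The "flush now and wait for the drive to confirm" mechanism mentioned above is `fsync` (or `fdatasync`). A minimal sketch, with an illustrative file name; note that a successful `fsync` makes the pages clean, it does not evict them from the cache:

```python
import os

# Write data and force it to stable storage before returning.
# os.fsync blocks until the kernel has pushed the file's dirty pages
# (and metadata) to the drive; os.fdatasync skips metadata that is not
# needed to retrieve the data, which can be cheaper.
fd = os.open("durable.bin", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
os.write(fd, b"must survive a crash")
os.fsync(fd)   # returns only after the write is acknowledged by the drive
os.close(fd)
```

Without this call, the kernel flushes dirty pages on its own schedule, which is exactly why "when does my data actually hit the disk" has no fixed answer.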
After the pages are flushed to disk, the OS marks them as clean. That means they can be reclaimed for something else. Run htop and look at the yellow part of the memory bar: those are exactly these cache pages, and there can be a lot of them.
For small files, it may be more efficient to map the file into memory (mmap) instead of fiddling with manual write/read calls.
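A minimal sketch of that mmap approach, with an illustrative file name. The mapping is backed by the same page-cache pages, so repeated access goes to RAM without an extra copy into a user buffer:

```python
import mmap

# Create a small file, then map it read-only. Reads through the mapping
# are served directly by the page cache.
with open("small.bin", "wb") as f:
    f.write(b"mapped bytes")

with open("small.bin", "rb") as f:
    # length 0 means "map the whole file"
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        data = bytes(m)  # no explicit read() call needed
```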
And now the important question: how do you check whether the data remains in the cache after it was written? Easily:
- clear the cache: sync; echo 1 > /proc/sys/vm/drop_caches
- check that the yellow part of the memory bar in htop has disappeared
- run dd and write a gigabyte to disk: dd if=/dev/urandom of=test.raw bs=8M count=128 status=progress
- see whether a yellow gigabyte has appeared
- you can then read this file back and observe an amazing speed of more than a gigabyte per second (I get 6, which clearly indicates the read came from RAM). This last check is more convincing on an HDD, because a good SSD can deliver comparable speed on its own.
Objectively, this test shows that yes, the written data does stay in memory.
But when that data gets evicted depends on many factors (for example, when memory runs low, the page cache is the first thing to be sacrificed). So it may be worth using other strategies as well - mmap, for example.
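For the asker's "needed once right after writing, then rarely" pattern, Linux also lets the application hint the page cache directly via posix_fadvise. A sketch under assumptions: the file name is illustrative, `os.posix_fadvise` is Linux-only (hence the guard), and `POSIX_FADV_DONTNEED` is only a hint - the kernel may keep or drop the pages as it sees fit:

```python
import os

fd = os.open("once.bin", os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
os.write(fd, b"read-once payload")

# Read the data back once...
data = os.pread(fd, 17, 0)

# ...then advise the kernel that the cached pages for this range are
# no longer needed. Dirty pages are not dropped, so flush them first.
if hasattr(os, "posix_fadvise"):  # available on Linux, Python 3.3+
    os.fsync(fd)
    os.posix_fadvise(fd, 0, 17, os.POSIX_FADV_DONTNEED)

os.close(fd)
```

This is the closest standard mechanism to the "evict it from the cache after reading" idea from the question, without building a separate cache in the application.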
The assumption is shaky, and here is why.
You are ignoring the OS's standard threading and file-locking machinery while trying to get a reproducible result.
It may well work, but if the kernel changes, your work becomes worthless.