Tell us about your experience with file systems for small files?
Once again the project has run into a gigantic number of small files; there are trillions of them. They are small, up to a megabyte, more often around 100 KB.
ext4 slows down terribly on a 20 TB partition with these files.
Tuning ext4 (journaling options, barriers) does not change the situation this time; the speed of disk operations is wildly low.
In general this is always the case with a large number of files, but here there are really a lot of them: opening a directory containing 30,000 subdirectories can take a second, or even tens of seconds, which is of course unacceptable.
Where to go and how to live?
I have experience with MongoDB GridFS, but it works even slower. It does scale, but buying 20 such servers when everything fits on one is hardly financially justified.
Who uses which file systems to store small files?
How do you tune the FS for this?
I lay my files out the same way; identical files then only need to be stored once.
Back when I kept everything in one folder, I simply couldn't open it, and even if I did, there was nothing I could do there. And that wasn't terabytes, just some 10 GB.
You can move all files into a directory structure with 256 subdirectories at each level.
1st level of nesting - 256 folders
2nd level of nesting - 256^2 folders
...
nth level - 256^n folders
You can take the md5 hash of a file:
md5sum filename - 9673a892a7d8c1c9ac598ebd06e3fb58
then build the directory path from it, taking 2 characters per level:
/96/73/a8/filename
Thus a three-level structure can hold about 4 billion files, with the final folders holding 256 files each on average.
For a trillion files, make four levels.
It's one thing to read a folder with 256 entries, and quite another when it has several tens of thousands; the speed differs by orders of magnitude.
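A minimal bash sketch of the layout described above (the /data storage root is an assumption):

#!/bin/bash
# place a file into a three-level directory tree derived from its md5
f="$1"                                  # file to store
h=$(md5sum "$f" | cut -c1-32)           # e.g. 9673a892a7d8c1c9ac598ebd06e3fb58
dir="/data/${h:0:2}/${h:2:2}/${h:4:2}"  # -> /data/96/73/a8
mkdir -p "$dir"
cp "$f" "$dir/"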
Oh brother! You've entered the pain zone... It is, alas, the best option :-( unix.stackexchange.com/questions/28756/what-is-the...
And no, we couldn't tune ext4 in any meaningful way; all we did was turn off atime at mount time.
You can also try btrfs, but it didn't work out for us...
Here are some benchmarks (not ours, but ours look similar).
Using Linux kernel version 3.1.7:

            create   rewrite   seq read   rand read   delete
Btrfs         53 s       6 s        4 s       312 s    373 s
ext4          46 s      18 s       29 s       272 s     12 s
ReiserFS      62 s     321 s        6 s       246 s     41 s
XFS           68 s     430 s       37 s       367 s     36 s
How and why do you open these directories?
I ask because
time ls -f -1 | wc -l
937070
real 0m1.240s
user 0m0.632s
sys 0m0.680s
time ls -1 | wc -l
937076
real 0m25.873s
user 0m24.978s
sys 0m0.940s
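Most of the difference is sorting: ls -f skips it, while plain ls sorts almost a million names (locale-aware collation, which is expensive) before printing anything. A sort-free count can also be done with find; a small sketch, the directory path is hypothetical:

find /path/to/dir -mindepth 1 -maxdepth 1 | wc -l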
From work experience (with a smaller number of files, but still): about 15 people working daily over the network with a publisher's advertising dump. We specifically ran tests for a week with real content, that is, we cloned the entire dump and measured performance.
For small files, reiserfs 3 still cannot be replaced by anything.
xfs and jfs are very good for large files, i.e. media content; xfs is slightly faster with them.
Beyond that, you can only optimize the hardware: hardware RAID1 on SSD plus manual separation by file type where possible.
If you have ext4, then the problem comes from the journal. If you watch disk I/O (iostat -kx 1, for example), you will see jbd2 eating it all up.
1) You can simply remove the journal:
tune2fs -O ^has_journal /dev/sdX
where sdX is your disk with a partition (i.e. sda2, for example).
Contrary to popular opinion, nothing bad will happen in a hosting context without a journal (assuming you have a relatively normal server in a relatively normal DC).
2) You can install proper server-grade disks.
For example, an Intel S3610; if the load is not very heavy, an Intel S3500 or a Seagate 600 Pro will most likely also be enough (though I don't recommend the Seagate 600 Pro, at the moment there is no point in buying it).
upd: about the 20 TB. In general there should be no problems if it is ZFS (raidz2, for example) plus an L2ARC cache. Or build it on SSDs (server-grade ones like the S3610, or ordinary ones but behind LSI controllers).
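A rough sketch of the ZFS layout mentioned above (the pool name, disk names and the cache device are assumptions):

zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
zpool add tank cache /dev/nvme0n1    # L2ARC on a fast SSD/NVMe device
zfs set atime=off tank               # skip access-time updates, helps small-file workloads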
Do the files have anything in common with each other? A common structure, headers of some kind...
Maybe it's easier to extract all the information from them and put it into a database.
And for portability, write a script that will generate such a file on export.
As an example, a docx can hold the information in 10 KB, and as soon as you export it to PDF, a 10-fold increase in size is realistic.
At a minimum, you need to mount noatime!
A long time ago we lived on reiserfs (because of the limit on the number of inodes in ext2), but it was buggy and the slowdowns got worse over time. reiserfs had an advantage while our files were smaller than 1 KB. We started moving to ext3; by then there were already a lot of small files averaging about 1 KB, so on ext3 we began setting the block size to 1 KB and increasing the number of inodes. Then the files got larger and the disks got more capacious, and we stopped changing the block size. Now it's just stock ext4 with default block/inode settings, mounted with defaults,noatime.
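For reference, a sketch of that kind of tuning (the device name, mount point and inode ratio are assumptions; -b sets the block size, -i the bytes-per-inode ratio):

mkfs.ext4 -b 1024 -i 2048 /dev/sdb1    # 1 KB blocks and more inodes for very small files
echo '/dev/sdb1 /data ext4 defaults,noatime 0 2' >> /etc/fstab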
Any FS starts to slow down over time (hello to those who think defragmentation isn't needed on Linux). Moreover, a FS can show one set of results in tests, even with realistic data volumes, and after a year of real work the rankings can look completely different.
In the kernel there are locks on directory objects during file lookups, so the more files/directories inside a directory, the slower it gets. Solution: split into tiers by a hash of the file name (see the answer by @65536).
The second trick: re-upload the data every six months. If there are several partitions, you rotate through them, reformatting each in turn. If it's one large partition, you need a spare server.
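A sketch of the periodic re-upload trick, assuming two partitions mounted at /mnt/old and /mnt/new (hypothetical devices and paths):

mkfs.ext4 /dev/sdc1              # reformat the spare partition
mount /dev/sdc1 /mnt/new
rsync -aH /mnt/old/ /mnt/new/    # recopy everything; files land compactly laid out
# then switch the mounts over and reformat the old partition next time round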
Maybe I'm wrong, but:
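// note: creating or writing a .phar archive requires phar.readonly=0 in php.ini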
$phar = new Phar('images.phar');
$phar->addFile('img.jpg', 'img.jpg');
echo file_get_contents('phar://images.phar/img.jpg');
In my experience it is reiserfs that handles small files well.
The correct answer is aggregation. No file system with POSIX semantics will work well with this many files.
As a landmark example, one can cite Ceph's move to RocksDB in its BlueStore storage backend.
PS Most likely the topic starter has become convinced of this over the past year. :)
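To illustrate aggregation in the crudest possible way: pack a directory of small files into one large archive plus an index, so the filesystem only ever sees a few big files (the paths and names below are assumptions; a real system would use a proper key-value store such as RocksDB):

tar -cf bundle-0001.tar -C /data/small_files .     # many small files become one big file
tar -tf bundle-0001.tar > bundle-0001.index        # keep a list of what is inside
tar -xOf bundle-0001.tar ./img.jpg > /tmp/img.jpg  # read one file back via stdout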