linux
Puma Thailand, 2015-09-29 15:02:33

Tell us about your experience with file systems for small files?

Once again a project has run into a gigantic number of small files, trillions of them. They are small, up to a megabyte, more often around 100 KB.
ext4 slows down horribly on a 20 TB partition with these files.
No amount of ext4 tuning (journaling options, barriers and so on) changes the situation this time; the speed of disk operations is wildly low.
In general this always happens with a large number of files, but here there are a huge number of them, and opening a directory with 30,000 subdirectories in it can take a second or even tens of seconds, which of course is not acceptable.
Where to go and how to live?
I have experience with MongoDB GridFS, but it works even slower. It does scale, but buying 20 servers like that when everything fits on one is hard to justify financially.
What file systems do people use to store small files?
How do you tune the file system for this?


17 answers
65536, 2015-09-29
@65536

I lay the files out in nested subdirectories by a hash of the file; as a bonus, identical files only have to be stored once.
Back when everything was stored in one folder, you simply couldn't enter it, and even if you got in there was nothing you could do there. And those weren't terabytes, just some 10 GB.

Alexander Ryabtsev, 2015-10-08
@dad1

You can move all files into a directory structure with 256 subdirectories at each level.
1st level of nesting - 256 folders
2nd level of nesting - 256^2 folders
...
nth level - 256^n folders
Take the md5 hash of the file:
md5sum filename - 9673a892a7d8c1c9ac598ebd06e3fb58
then build the directory path from it, taking two characters per subgroup:
/96/73/a8/filename
Thus a three-level structure can accommodate about 4 billion files, with the final folders holding 256 files on average.
For a trillion files, make four levels.
It's one thing to read a folder with 256 entries and quite another when there are several tens of thousands; the speed differs by orders of magnitude.
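A rough shell sketch of this layout (the /data/incoming and /data/store paths here are made up for illustration):

f=/data/incoming/filename
h=$(md5sum "$f" | cut -c1-6)                # first three byte pairs of the hash, e.g. 9673a8
d=/data/store/${h:0:2}/${h:2:2}/${h:4:2}    # /data/store/96/73/a8
mkdir -p "$d" && mv "$f" "$d/"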

Alexey Cheremisin, 2015-09-29
@leahch

Oh, brother! You've entered the pain zone... It is, alas, the best there is :-( unix.stackexchange.com/questions/28756/what-is-the...
And ext4 can't really be tuned in any meaningful way; the only thing we did was turn off atime at mount time.
You can also try btrfs, but it didn't work out for us...
Here are some benchmarks (not ours); we got similar numbers.

Using Linux Kernel version 3.1.7
Btrfs:
    create:    53 s
    rewrite:    6 s
    read sq:    4 s
    read rn:  312 s
    delete:   373 s

ext4:
    create:    46 s
    rewrite:   18 s
    read sq:   29 s
    read rn:  272 s
    delete:    12 s

ReiserFS:
    create:    62 s
    rewrite:  321 s
    read sq:    6 s
    read rn:  246 s
    delete:    41 s

XFS:
    create:    68 s
    rewrite:  430 s
    read sq:   37 s
    read rn:  367 s
    delete:    36 s
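
For reference, turning off atime as mentioned above is just a mount option; a minimal sketch, assuming the data partition is /dev/sdb1 mounted at /data:

mount -o remount,noatime /data
# or permanently, via /etc/fstab:
/dev/sdb1  /data  ext4  defaults,noatime  0  2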

neol, 2015-09-29
@neol

How and why do you open these directories?
I ask because

time ls -f -1 | wc -l
937070

real	0m1.240s
user	0m0.632s
sys	0m0.680s

but
time ls -1 | wc -l
937076

real	0m25.873s
user	0m24.978s
sys	0m0.940s

(-f makes ls skip sorting, and the sort is what eats the time here.)
ext4, with noatime as the only non-default mount option.
The FS itself doesn't really slow down, though I only have a few million files taking up a few gigabytes.
I've never seen hundreds of millions of files on one partition, though, so maybe it's not the FS?
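For the same reason, anything that streams directory entries without sorting them stays fast even on huge directories; an illustrative alternative (GNU find assumed):

find . -mindepth 1 -maxdepth 1 -printf '.' | wc -c    # count directory entries without sorting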

Dmitry, 2015-10-08
@deemytch

From work experience (with a smaller number of files, but still): about 15 people working daily over the network with a publisher's advertising dump. We specifically ran tests for a week with real content, that is, we cloned the entire dump and measured performance.
For small files, reiserfs 3 still cannot be replaced by anything.
xfs and jfs are very good for large files, i.e. media content; xfs is slightly faster with them.
Beyond that, you can only optimize the hardware: hardware RAID1 on SSDs, plus manually distributing files by type where possible.

knutov, 2015-10-08
@knutov

If you have ext4, the problem comes from the journal. Look at the disk activity (iostat -kx 1, for example) and you will see jbd2 eating up all the I/O.
1) You can simply remove the journal (a sketch follows at the end of this answer),
where sdX is your disk partition (sda2, for example).
Contrary to popular opinion, nothing bad will happen without a journal in a hosting context (assuming you have a relatively normal server in a relatively normal DC).
2) You can put in proper server-grade disks.
That means, for example, the Intel S3610; without very heavy loads an Intel S3500 or a Seagate 600 Pro will most likely also be enough (though I don't recommend the Seagate 600 Pro, at the moment there is no point in buying it).
upd: about the 20 TB. In general there should be no problems if it is ZFS (raidz2, for example) plus an L2ARC cache. Or build it on SSDs (server-grade like the S3610, or ordinary ones, but with LSI controllers).
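
A sketch of the journal removal from item 1), using tune2fs (assuming the partition is /dev/sda2 mounted at /data; the filesystem must be unmounted while the journal is dropped):

umount /data
tune2fs -O ^has_journal /dev/sda2    # drop the ext4 journal
e2fsck -f /dev/sda2                  # check the filesystem before mounting it again
mount /data
iostat -kx 1                         # watch whether the write load has dropped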

Eldar Musin, 2015-10-02
@eldarmusin

Don't the files have anything in common with each other? A common structure, headers of some sort...
Maybe it's easier to extract all the information from them and put it into a database,
and for portability write a script that generates such a file on export.
As an example, a docx can hold its information in 10 KB, and as soon as you export it to PDF a tenfold increase in size is quite realistic.

ilnarb, 2015-10-08
@ilnarb

At a minimum, you need to mount with noatime!
A long time ago we lived on reiserfs (because of the limit on the number of inodes in ext2), but it was buggy and the slowdowns grew worse over time. reiserfs had the advantage that our files were smaller than 1 KB back then. We then started moving to ext3; by that time the small files averaged about 1 KB, so on ext3 we set the block size to 1 KB and increased the number of inodes. Later the files became larger and the disks more capacious, and we stopped changing the block size. Now it's just stock ext4 with default block/inode settings, mounted with defaults,noatime.
Any FS starts to slow down over time (hello to those who think defragmentation is not needed on Linux). Moreover, a FS can show one set of results in tests, even with realistic data volumes, and after a year of real work the standings can be completely different.
In the kernel there are all sorts of locks on directory objects during file lookups, so the more files/directories inside a directory, the slower it gets. Solution: split into tiers by a hash of the file name (see the answer by 65536 @65536).
The second trick: re-pour the data every six months. If there are several partitions, you rotate through them in a circle, reformatting each one in turn; if there is one large partition, you need a spare server to pour the data onto.

Optimus Pyan, 2015-09-29
@marrk2

Maybe I'm wrong, but:

$phar = new Phar('images.phar');
$phar->addFile('img.jpg', 'img.jpg');
echo file_get_contents('phar://images.phar/img.jpg');

Well, you understand))

Alexander Melekhovets, 2015-09-29
@Blast

Have you tried fiddling with vfs_cache_pressure?
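
For context, vfs_cache_pressure controls how eagerly the kernel reclaims the dentry/inode cache; a sketch of lowering it so directory entries stay cached longer (the value 50 is only an illustration):

sysctl vm.vfs_cache_pressure                          # default is 100
sysctl -w vm.vfs_cache_pressure=50                    # apply at runtime
echo 'vm.vfs_cache_pressure=50' >> /etc/sysctl.conf   # persist across reboots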

Nadz Goldman, 2015-09-30
@nadz

xfs

Pavel, 2015-10-01
@pbt39

Isn't that what the database is for?

irvinzz, 2015-10-08
@irvinzz

In my experience, it's precisely reiserfs that handles small files well.

Sergey Kamenev, 2015-10-08
@inetstar

reiserfs 3

mirosas, 2015-12-08
@mirosas

Hasn't SSD solved this problem by now?

poige, 2017-09-11
@poige

The correct answer is aggregation. No file system with POSIX semantics will work well with this many files.
A landmark example is Ceph's switch to RocksDB in its "BlueStore" storage backend.
P.S. Most likely the topic starter has become convinced of this himself over the past year. :)

neochar, 2018-12-20
@neochar

be-n.com/spw/you-can-list-a-million-files-in-a-dir...
