Database design
Fahrain, 2012-10-28 10:50:28

What is the best way to store many files - in a database or as files?

I am building a service that will store several versions of each file, and it is assumed there will be a lot of the initial files. At the moment I am counting on a couple of thousand files (eventually), but in the future the number may grow.
Hence the questions:
- Where is it better to store all these files: as a BLOB/text in MySQL, or as separate files on disk? The files will not be served to clients directly; they are first processed by a special PHP script, and only the result of the processing is returned. The result may differ per client (or it may be the same).
- What is the best way to back it all up later? I suspect that backing up all these thousands of files will be hard if they are not stored in the database...

13 answers
edogs, 2012-10-28
@Fahrain

In fact, the two implementation options differ so little that the best advice is: prepare both now, use files first, and measure performance once there are a lot of files.
In 99% of cases files are better, because access to them is direct and obvious, with no database involved; the database is in any case an extra layer.
With backups the situation is twofold: on the one hand, files are more convenient to back up incrementally with ordinary file tools; on the other hand, you can make a one-off database backup simply by copying the table file, without having to gather a pile of individual files.
In terms of speed and load: if this is a single server, files will of course be faster (just spread them across folders, and in any case do not put more than 1000 in any one of them; see the sharding sketch at the end of this answer). But if you have several servers, a separate database server for the files can have certain advantages, since access to a database over the network is a little more straightforward (although if you have an admin, it hardly matters).
Files, other things being equal, definitely end up in the OS cache better; on the other hand, cache pollution is easier to control in a database.
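A minimal sketch of the folder sharding described above, assuming a hash-prefix layout (the root path, helper name, and two-level depth are illustrative, not anything from the thread):

```python
import hashlib
import os

def sharded_path(root: str, filename: str, depth: int = 2) -> str:
    """Spread files over nested subfolders keyed by a hash prefix,
    so no single directory accumulates thousands of entries."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    parts = [digest[2 * i:2 * i + 2] for i in range(depth)]  # e.g. ["ab", "cd"]
    directory = os.path.join(root, *parts)
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename)

# Usage: write to sharded_path("/var/data/files", "doc-42.html")
# instead of dropping everything into one flat folder.
```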

Vitaly Zheltyakov, 2012-10-28
@VitaZheltyakov

The one advantage of storing files in a database that you really need to pay attention to is concurrent access: the DBMS correctly handles simultaneous write access to the same record. If concurrent access is unlikely, files are the better choice.
The advantages of files:
- Backing up files does not take much longer, and you can restore individual files instead of loading an entire dump.
- Opening and reading a file is faster than fetching a record from the database (even when sockets are used).
- File caching is handled automatically by the OS and the web server (you can also use an opcache for finer control). With a DBMS, on the other hand, caching large files hurts performance by pushing simple queries out of the cache.
- The number of files can be very large without the server losing performance. I once kept about 12,000 files in one directory and the server read them without any delay. Of course, opening that folder by hand was problematic.
- Sphinx searches through files without any trouble.
But with all the advantages of files, concurrent access can spoil the whole thing, so start from that (see the locking sketch below).
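A minimal sketch of one way to serialize concurrent writers on plain files, using an advisory lock (Unix-only fcntl; the helper name is illustrative and this is an assumption, not something from the answer):

```python
import fcntl

def replace_contents(path: str, data: bytes) -> None:
    """Serialize writers with an exclusive advisory lock,
    emulating the per-record write handling a DBMS gives you for free."""
    with open(path, "ab") as f:                 # create if missing, do not truncate yet
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)  # blocks until no other writer holds it
        try:
            f.truncate(0)                       # safe to rewrite only after locking
            f.write(data)
            f.flush()
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)
```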

Sergey, 2012-10-29
@seriyPS

When I was faced with the task of storing 2 million HTML files, I tried a lot of things. Mind you, my archive was built once and then served read-only through the Tornado web server.
I settled on SQLite + gzip: I created a table with the fields (name, blob) and compressed each HTML file with gzip (a sketch follows at the end of this answer).
I disabled synchronous writes in SQLite to speed up the initial load.
I managed to try bsd btree, bsd hash, gdbm, json-lines, csv, and a plain hierarchy of files on disk. I wanted to try Tokyo Cabinet but could not find drivers for Python.
bsd btree is roughly comparable to SQLite in speed, but takes more disk space and is less flexible. json-lines takes much more space; csv (and json too) cannot be gzipped and does not support access by key.
A plain set of files on the FS is extremely inconvenient for backups and hard to work with; recursive deletion, for example, takes several hours.
gzip vs no gzip: definitely gzip! Not only is "compress with gzip and write to disk" faster than just "write to disk", you also save space.
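A minimal sketch of the SQLite + gzip scheme described here, using Python's standard sqlite3 and gzip modules (the table and column names are illustrative):

```python
import gzip
import sqlite3

conn = sqlite3.connect("pages.db")
conn.execute("PRAGMA synchronous = OFF")  # the speed-up mentioned above; risky on power loss
conn.execute("CREATE TABLE IF NOT EXISTS pages (name TEXT PRIMARY KEY, body BLOB)")

def put(name: str, html: str) -> None:
    # Compress each document individually so it can still be fetched by key.
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)",
                 (name, gzip.compress(html.encode("utf-8"))))

def get(name: str) -> str:
    row = conn.execute("SELECT body FROM pages WHERE name = ?", (name,)).fetchone()
    return gzip.decompress(row[0]).decode("utf-8")

put("page1.html", "<html>...</html>")
conn.commit()
print(get("page1.html"))
```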

Vampiro, 2012-10-28
@Vampiro

- Strange; the entire Internet stores files on disk and nothing bad happens. If MySQL could store them somewhere other than on disk, there might be some bonus in it. Otherwise it is just extra overhead on every request.
- I have never had a problem backing up files: there are plenty of ready-made utilities. Besides, if you plan to back up the database not by copying its files but by raising a slave, that is even more hassle.

Vampiro, 2012-10-28
@Vampiro

:) Do you think that if you squeeze a 4 GB movie into MySQL, it will then be faster or easier to download? Why would it be? It will take up more space in the database, so the backup volume will be larger too, and downloading it will be harder.
In general, finding changed files and doing an incremental backup on files is easier than fetching from MySQL, archiving the dump, and downloading it.

serso, 2012-10-28
@serso

Actually, the difference is small.
Let me just list the advantages of the database:
1. Files stored in the database are not susceptible to infection by viruses
2. Full programmatic control over the files (serving files according to privileges, etc.)
3. (Oracle) Full-text search (for example, in XML)

Alexey Huseynov, 2012-10-28
@kibergus

If you have HTML, i.e. text files, and especially if they change little from version to version, then you should consider storing them in the file system and keeping the old versions in git (see the sketch at the end of this answer).
If the files change a lot, then I would just use the FS.
But if there are really a lot of files and you need to spread the load over several nodes, I would think about a key-value store.
MySQL is definitely not the tool for this, though.
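A minimal sketch of driving git from code to version files, assuming the files already live inside a git work tree (the repo path, file path, and helper name are illustrative):

```python
import subprocess

def commit_version(repo: str, rel_path: str, message: str) -> None:
    """Snapshot the current state of one file; older versions stay in git history."""
    subprocess.run(["git", "-C", repo, "add", rel_path], check=True)
    subprocess.run(["git", "-C", repo, "commit", "-m", message], check=True)

# Usage: after rewriting /srv/archive/pages/a1.html on disk:
# commit_version("/srv/archive", "pages/a1.html", "update a1.html")
# Note: git commit exits non-zero if the file did not actually change.
```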

Fahrain, 2012-10-28
@Fahrain

Is the magic number of 1000 files per folder justified?

artyomst, 2012-10-28
@artyomst

As an option: GridFS (sketch below).
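A minimal sketch of GridFS via pymongo, assuming a MongoDB instance running on localhost (the database name is illustrative):

```python
import gridfs
from pymongo import MongoClient

db = MongoClient()["filestore"]   # assumes a local MongoDB instance
fs = gridfs.GridFS(db)

file_id = fs.put(b"<html>...</html>", filename="page1.html")  # chunked automatically
print(fs.get(file_id).read())                                 # read it back by id
```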

vsespb, 2012-10-28
@vsespb

I agree that it is better not to store them in the database.
In addition to what has already been listed, I want to note that:
- if the files are stored in the database, the database cache fills up with files rather than with records from other tables, i.e. there is no way to prioritize and divide memory between files and data;
- if you keep the files on disk and the meta-information about them in the database, you need to be careful: there is no transactional integrity between the two (some people even prefer not to mix the database and the disk and store all the data on disk only, but that should be decided case by case). A write-ordering sketch follows below.
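A minimal sketch of one common write ordering under that constraint (file first, metadata second), with illustrative table and helper names; this is an assumption about how to mitigate the problem, not something prescribed in the answer:

```python
import os
import sqlite3

conn = sqlite3.connect("meta.db")
conn.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, size INTEGER)")

def save(path: str, data: bytes) -> None:
    # Write the file first: an orphaned file is harmless and can be swept
    # up later, while a metadata row pointing at nothing breaks every read.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
    os.replace(tmp, path)  # atomic rename, so readers never see a half-written file
    with conn:             # commit the metadata only after the file is in place
        conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, len(data)))
```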

Fahrain, 2012-10-29
@Fahrain

In general, thanks everyone, I get the picture now :) For now I will go with files, and later, depending on the situation (quantity/load), I will decide whether to bother further or leave it as it is.

DMakeev, 2012-10-29
@DMakeev

"Recursive deletion takes several hours" - did you delete them over FTP?

Vladislav Zolotukhin, 2015-04-22
@doctorzer0

It all depends on the task and on the file system on the server. For example, the Linux kernel limits the number of subdirectories on ext2 and ext3 to no more than 32,000 (EXT2_LINK_MAX and EXT3_LINK_MAX). So if your system is intended to run on various platforms and you cannot foresee what it will be installed on, I advise storing the files in the database itself. In addition, this avoids unnecessary problems with mandatory access control.
