OpenVZ
Yaroslav, 2018-08-19 16:46:30

How to reduce a backup when ~90% of it is distribution files?

If we back up a typical virtual machine, most of the backup is standard files like /bin/ls, identical across millions of systems (and even within one company they are the same on many machines).
The obvious idea is to shrink the archives: look at each file, take its hash, and check it against some central index. If the hash occurs many times, we simply drop the file from the archive (recording that a file with such-and-such a hash was at that path). When unpacking, we replace these hash records with the real files (for example, fetching them from a service by hash, or downloading a .deb package that contains a file with that hash).
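Roughly, a minimal Python sketch of the packing side (the known_hashes source, the root path, and the manifest name are illustrative, not from any existing tool):

    import hashlib
    import json
    import os

    # Hashes considered "well known" (e.g. files shipped in distro packages).
    # In a real tool this would be loaded from a central index or .deb metadata.
    known_hashes = set()

    def sha256_of(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        return h.hexdigest()

    def plan_backup(root):
        """Split files into ones to archive and ones to restore by hash."""
        to_store, by_hash = [], []
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                try:
                    digest = sha256_of(path)
                except OSError:
                    continue  # unreadable file; skip it in this sketch
                if digest in known_hashes:
                    # Don't archive the content, only remember path -> hash.
                    by_hash.append({"path": path, "sha256": digest})
                else:
                    to_store.append(path)
        return to_store, by_hash

    to_store, by_hash = plan_backup("/srv/vm-root")  # illustrative path
    # The manifest goes into the archive; files in to_store are tarred as usual.
    with open("restore-manifest.json", "w") as f:
        json.dump(by_hash, f, indent=2)

Restoring would walk the manifest and fetch each hash from the central service before unpacking the rest.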
Is there any software or service for this?
PS
Yes, sometimes incremental backups partially solve this, or overlayfs with LXC for virtual machines. But I'm interested in a solution at the level of archives.
Update:
I ended up rolling my own tool, the hashget utility, for simple deduplication.
Article on Habré: Reduce backups by 99.5% with hashget

4 answers
Artem @Jump, 2018-08-19

Deduplication in the storage system that holds the archives, or an archiving system with built-in deduplication. For example, I keep a couple of dozen VHDs of test Windows virtual machines on a small SSD. They run snappily because it's an SSD, and they fit on a disk much smaller than the sum of the VHD sizes.

But I'm interested in a solution at the level of archives
An example of such an archiver is zpaq: besides plain compression it does deduplication and supports remote archives.
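Basic zpaq usage looks like this (archive name and path are illustrative):

    zpaq a backup.zpaq /srv/vm-root   # "add": appends a new deduplicated version
    zpaq l backup.zpaq                # list versions and contents
    zpaq x backup.zpaq                # extract the latest version

The archive is journaling, so each add stores only the blocks that changed since the previous version.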
If the hash occurs many times, we simply drop the file from the archive (recording that a file with such-and-such a hash was at that path).
What you described is file-level deduplication. It has been known for a long time, but it's inefficient and hardly anyone needs it.
Block-level deduplication is what's used now: small chunks are deduplicated regardless of which file they belong to.
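A minimal sketch of the block-level idea in Python, with fixed-size chunks and an in-memory dict standing in for the block store (real tools such as borg or restic use content-defined chunking with a rolling hash, so an insert near the start of a file doesn't shift every later boundary):

    import hashlib

    CHUNK_SIZE = 64 * 1024  # 64 KiB fixed blocks, for simplicity

    store = {}  # hash -> chunk bytes; stands in for the dedup block store

    def dedup_file(path):
        """Return a file as a list of chunk hashes, storing each new chunk once."""
        recipe = []
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                digest = hashlib.sha256(chunk).hexdigest()
                store.setdefault(digest, chunk)  # kept once, however many references
                recipe.append(digest)
        return recipe

    def restore_file(recipe, path):
        """Rebuild a file from its chunk-hash recipe."""
        with open(path, "wb") as f:
            for digest in recipe:
                f.write(store[digest])

Two VM images that share the same /bin/ls end up referencing the same stored chunks, so the shared bytes are kept only once.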

Ambrosian, 2018-08-19
@Ambrosian

Differential backup.

Dmitry, 2018-08-19
@Tabletko

Full, differential, and incremental schemes at the level of a single backup chain. But that doesn't account for duplication across different backups; deduplication will help you there, though you need to be careful with it.
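The incremental part at the file level is essentially this (a naive Python sketch; real tools keep proper state instead of trusting an mtime cutoff):

    import os

    def incremental_candidates(root, last_backup_ts):
        """Yield files modified since the previous backup (naive mtime check)."""
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                try:
                    if os.path.getmtime(path) > last_backup_ts:
                        yield path
                except OSError:
                    pass  # file vanished or is unreadable; skip it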

CityCat4, 2018-08-19
@CityCat4

If we're talking about backing up, say, virtual machines, then I really liked Nakivo Backup. Deduplication is done by the hypervisor; the first backup is essentially full, and all the rest (depending on how you set it up) are incremental. It handles virtual machines only.
