Data archiving
Ramid Salmanov, 2019-01-01 17:41:23

What data compression algorithm should be used when archiving log files?

The task is to write a log file archiver. There are many data compression algorithms, and each has its own specification. How do I determine which compression algorithm should be used to archive a file that contains many repeated elements?

4 answers
Zettabyte, 2019-01-01
@Zettabyte

Of the "traditional" Linux, BZIP2 compresses best.
It will be even better to compress LZMA (7-Zip), in it you set the maximum size of the dictionary for which you have enough memory, set the word size to the maximum.
Also test PPMd with the same settings up, this algorithm is also in 7-Zip. I have not used it for a long time and did not read anything, but before it was good for texts and logs.
The 4th WinRAR had a preset specifically for texts and logs, check it out as well (I don't know if it was preserved in the 5th version).
Well, read MaximumCompression, there will surely be an algorithm that will easily drive everyone higher, but will work for half a day.
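
If you end up scripting this, here is a minimal sketch of the large-dictionary idea, assuming Python and its stdlib lzma module (the same LZMA2 algorithm 7-Zip uses); the file names and the 64 MiB dictionary size are example values, and nice_len is liblzma's counterpart of 7-Zip's "word size":

import lzma
import shutil

# Example filter chain: preset 9 as the baseline, with an oversized
# dictionary and the maximum match length (nice_len caps at 273).
filters = [{"id": lzma.FILTER_LZMA2, "preset": 9,
            "dict_size": 64 * 1024 * 1024, "nice_len": 273}]

with open("app.log", "rb") as src, \
        lzma.open("app.log.xz", "wb", filters=filters) as dst:
    shutil.copyfileobj(src, dst)  # stream, so the whole log never sits in RAM

Bigger dictionaries only help while the compressor has enough memory, so scale dict_size to your machine.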

Saboteur, 2019-01-02
@saboteur_kiev

"The task is to write a log file archiver"
"There are many data compression algorithms and each has its own specification.
How to determine which data compression algorithm should be used to archive a file that has a lot of repeated elements?"
In your task, I do not see the requirements to write an ideal log file archiver.
In addition, writing at least the simplest archiver is already a non-trivial task for a beginner, and judging by the question, you are a beginner.
What I can suggest is my guess what your teacher wanted, but for this you need to understand the principles of dictionary compression algorithms - when you analyze a file, find something similar in it, make a dictionary of similar parts and compress using this dictionary.
So - in the case of compressing several files, you can use the dictionary of the first file to compress the second, and so on.
In archivers, this is called a solid archive.
Its advantage is its smaller size.
The disadvantage is that when unpacking one specific file (for example, the last one), you will have to unpack all the previous ones.
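
A minimal sketch of that preset-dictionary idea, assuming Python and its stdlib zlib module (plain DEFLATE rather than a real solid archive, but the principle is the same); the file names are hypothetical:

import zlib

# Two consecutive days of the same (hypothetical) log.
log1 = open("app-2019-01-01.log", "rb").read()
log2 = open("app-2019-01-02.log", "rb").read()

# Baseline: compress the second log on its own.
plain = zlib.compress(log2, 9)

# Prime the compressor with a dictionary taken from the first log.
# DEFLATE's window is 32 KiB, so only the dictionary's last 32 KiB
# can actually be referenced.
zdict = log1[-32768:]
comp = zlib.compressobj(9, zdict=zdict)
primed = comp.compress(log2) + comp.flush()
print(len(plain), len(primed))  # primed is usually smaller for similar logs

# Decompression must supply the same dictionary, which is exactly the
# solid-archive drawback: you need the earlier data before the later data.
decomp = zlib.decompressobj(zdict=zdict)
assert decomp.decompress(primed) == log2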

Roman Mirilaczvili, 2019-01-01
@2ord

Typically, the gzip compression algorithm is used.
bzip2 compresses better but is slower, and xz (LZMA) compresses best of all.
If you are on Linux, there is the logrotate program, so you don't have to invent anything: it compresses with gzip and rotates the logs.
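
To see the ratio/speed trade-off on your own logs, here is a quick comparison sketch using only the Python stdlib (the log file name is hypothetical):

import bz2, gzip, lzma, time

with open("app.log", "rb") as f:
    data = f.read()

for name, compress in [("gzip", gzip.compress),
                       ("bzip2", bz2.compress),
                       ("xz", lzma.compress)]:
    start = time.perf_counter()
    size = len(compress(data))
    elapsed = time.perf_counter() - start
    print(f"{name:6} {size / len(data):7.1%} of original in {elapsed:.2f} s")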

dmshar, 2019-01-01
@dmshar

If this is a real task, then writing something yourself is pointless. Take ready-made archivers, test them on real log files, compare, draw conclusions, and put the best one into production.
If this is a learning task, meant just to teach you how archivers are implemented, then study the existing algorithms and implement any of them. Your educational implementation will be worse than the available commercial ones anyway, so no one in their right mind would compare them.
