Algorithms
rPman, 2012-04-13 08:33:16

Algorithms for data compression using an external dictionary and their implementation

Almost all existing lossless compression algorithms (for arbitrary binary data) work by analyzing the data and building some auxiliary information (block frequency tables, recoding dictionaries, etc.), which is then stored block by block in the archive itself (it may also be updated dynamically as compression proceeds).

But what if, instead of keeping this dictionary inside the archive, it were stored separately (and transmitted separately over the network, independently of the main data stream), or, say, built into the archiver itself for various types of data? The data transmission case looks especially attractive: it could greatly improve the compression ratio.
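For instance, zlib can already be given a preset dictionary that never appears in the compressed stream; both sides have to obtain it out of band. A minimal Python sketch (the contents of PRESET_DICT here are just an invented example):

```python
import zlib

# A dictionary both sides agree on out of band (here: fragments that occur
# often in the messages). It is NOT stored in the compressed output.
PRESET_DICT = b'{"type": "status", "user_id": , "timestamp": , "payload": }'

def compress_with_dict(data: bytes) -> bytes:
    comp = zlib.compressobj(level=9, zdict=PRESET_DICT)
    return comp.compress(data) + comp.flush()

def decompress_with_dict(blob: bytes) -> bytes:
    decomp = zlib.decompressobj(zdict=PRESET_DICT)
    return decomp.decompress(blob) + decomp.flush()

msg = b'{"type": "status", "user_id": 42, "timestamp": 1334300000, "payload": "ok"}'
packed = compress_with_dict(msg)
assert decompress_with_dict(packed) == msg
print(len(msg), "->", len(packed))  # small messages shrink noticeably thanks to the preset dictionary
```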

Often a network communication session between applications consists of a small number of message types whose formats are identical and only the data (field values) differ. Compressing each message separately is practically pointless (hardware compression at the link level, or VPNs, already do this reasonably well), but compressing the communication session as a whole can increase the compression ratio by an order of magnitude.
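To make the difference concrete, here is a small Python sketch (with invented sample messages) comparing per-message compression against a single zlib stream shared across the whole session, where later messages can reference earlier ones:

```python
import zlib

messages = [
    b'{"type": "tick", "symbol": "EURUSD", "bid": 1.3101, "ask": 1.3103}',
    b'{"type": "tick", "symbol": "EURUSD", "bid": 1.3102, "ask": 1.3104}',
    b'{"type": "tick", "symbol": "EURUSD", "bid": 1.3099, "ask": 1.3101}',
] * 100

# Per-message compression: each message is packed independently.
per_message = sum(len(zlib.compress(m, 9)) for m in messages)

# Session-wide compression: one compressor for the whole stream,
# flushed after every message so it can be sent immediately.
comp = zlib.compressobj(9)
session = 0
for m in messages:
    session += len(comp.compress(m) + comp.flush(zlib.Z_SYNC_FLUSH))

print("per message:", per_message, "bytes; whole session:", session, "bytes")
```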

The simplest example (I implemented it in one of my applications; it turned out quite useful and convenient) uses the diff algorithm (more precisely bzdiff, which is essentially the same diff but with the result compressed by bzip2).

There is a series of messages (for simplicity, assume they all have the same format).
* select the first (or any) message as a template and send it to the other node, marking it as a new template
* compare each subsequent message with the current template using the diff algorithm
* send the resulting patch together with the template identifier
* on the other node, restore the message from the referenced template and the received patch
* if the template is missing on the remote node, send the message itself with a new identifier
When a message is compared with a template, several templates can be used (keep, for example, the most recently used ones, or sort them by how often they match), and the one that yields the smallest patch is chosen. Changes to the list of stored templates can be throttled in order to minimize how often new or modified templates have to be distributed; there is a lot of room for experimentation here. A rough sketch of the scheme is given below.
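The sketch assumes the third-party bsdiff4 package as a stand-in for bzdiff; the class name, eviction policy, and wire format are just placeholders:

```python
import bsdiff4  # pip install bsdiff4; binary diff whose patches are bzip2-compressed

class TemplateCompressor:
    """Keeps a small set of templates and encodes messages as patches against them."""

    def __init__(self, max_templates=8):
        self.templates = {}          # template_id -> template bytes
        self.max_templates = max_templates
        self.next_id = 0

    def encode(self, message: bytes):
        # Try every stored template and keep the smallest patch.
        best = None
        for tid, template in self.templates.items():
            patch = bsdiff4.diff(template, message)
            if best is None or len(patch) < len(best[1]):
                best = (tid, patch)

        # If no template helps (or none exist yet), register the message as a new template.
        if best is None or len(best[1]) >= len(message):
            tid = self.next_id
            self.next_id += 1
            if len(self.templates) >= self.max_templates:
                self.templates.pop(next(iter(self.templates)))  # evict the oldest template
            self.templates[tid] = message
            return ("template", tid, message)    # the receiver must store this template too
        return ("patch", best[0], best[1])

    def decode(self, kind, tid, payload: bytes) -> bytes:
        if kind == "template":
            self.templates[tid] = payload
            return payload
        return bsdiff4.patch(self.templates[tid], payload)
```

Sender and receiver would each keep an instance; in a real protocol the eviction of old templates would of course have to be synchronized between the two sides.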

In this algorithm the set of templates plays the role of an external dictionary. This is clearly a special case; in a more general form the dictionary and the compression algorithms can be more elaborate, and more detailed information can be exchanged between the nodes.

Specifically, this algorithm was used to compress text data from websites (the messages were new versions of pages from a limited set of sites). Needless to say, the size of the transmitted message shrank to roughly the size of the changes on the page; the compression ratio was a couple of orders of magnitude, and naturally no algorithm compressing the messages in isolation could give such a result.

The algorithms are almost certainly highly parallelizable (i.e., cheaper computing power such as GPUs can be used), so it is possible to build a system for compressing data in a communication channel (especially when the channel is unavoidably narrow).

So, the question: am I reinventing the wheel, or is this field still unplowed? Are there ready-made solutions and libraries (ideally open source, since the application niches may be narrow and require corresponding patches, plus flexibility in choosing platforms)?

Question number two: is there anyone who needs to reduce the load on communication channels (when they are scarce or expensive, such as satellite links)?


2 answers
egorinsk, 2012-04-13

Have you looked at Google's SPDY? It seems to compress some of its data using exactly such a predefined dictionary.

ntz, 2012-04-13

Ah, reinventing the wheel.
Most compression algorithms are based on the Huffman and Shannon-Fano algorithms; it is just that their "dictionary" (the code tree) is stored along with the compressed data, so the archive ends up self-contained.
As for data transfer, in application development there are libraries such as Apache Thrift and Google Protobuf, which let you describe the data format in advance, generate the corresponding structures, and then send compact binary messages that carry only the necessary minimum of data.
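As a rough illustration of the "format agreed in advance" idea (this is neither Thrift nor Protobuf, just Python's struct module with an invented message layout):

```python
import struct

# Both sides agree on this layout ahead of time, so field names and types
# are never transmitted -- only the values (hypothetical message layout).
TICK_FORMAT = struct.Struct("<Hdd")  # message type (uint16), bid (float64), ask (float64)

def pack_tick(msg_type: int, bid: float, ask: float) -> bytes:
    return TICK_FORMAT.pack(msg_type, bid, ask)

def unpack_tick(blob: bytes):
    return TICK_FORMAT.unpack(blob)

wire = pack_tick(1, 1.3101, 1.3103)
print(len(wire), "bytes on the wire")  # 18 bytes instead of a verbose text/JSON message
assert unpack_tick(wire) == (1, 1.3101, 1.3103)
```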
