Algorithms
chilicactus, 2020-09-10 04:53:06

Are there algorithms for compressing random data with a finite alphabet?

So, there is a random string of a given length over a given alphabet. The length and the alphabet of the original string can be chosen freely. The task is to reduce the number of characters without changing the alphabet, i.e. if the input is a string of 256 digits from 0 to 9, the output must also consist only of digits, just shorter.

I tried the classics like RLE, LZ and Huffman. I tried implementing them myself, tried ready-made implementations, and tried varying the string parameters, but at best I got the same length; usually the size only grew. Either I am missing something, or the input data simply has too much entropy.

1) Do such algorithms even exist?
2) If yes, then I would like an implementation example, in any language.
3) And a couple of books on the topic for the really dumb ones :)


5 answer(s)
Rsa97, 2020-09-10
@chilicactus

Random data is the worst case for compression. Almost all lossless compression algorithms rely on finding patterns and repeated sequences, and a random sequence has neither.

Griboks, 2020-09-10
@Griboks

I would like to notify geniuses like Rsa97 and Adamos.
For example, take the original random sequence 3333221 (7 digits in total).
Now apply the well-known run-length encoding algorithm: 432211 (6 digits in total). Hooray, we managed to compress the "incompressible".
The remaining question is when this algorithm is effective: when the skewness of the symbol distribution is large in absolute value.
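For illustration, a minimal Python sketch of this digit RLE, with the extra assumption that run lengths are capped at 9 so every count fits in a single digit of the same alphabet:

```python
def rle_digits(s: str) -> str:
    """Run-length encode a digit string as (count, digit) pairs.
    Runs longer than 9 are split so each count stays a single digit."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i] and j - i < 9:
            j += 1
        out.append(str(j - i) + s[i])
        i = j
    return "".join(out)


def rle_decode(s: str) -> str:
    """Inverse of rle_digits: read back the (count, digit) pairs."""
    return "".join(s[k + 1] * int(s[k]) for k in range(0, len(s), 2))


assert rle_digits("3333221") == "432211"
assert rle_decode("432211") == "3333221"
```

On the skewed sequence above this saves one digit; on a typical uniformly random digit string it roughly doubles the length, because most runs have length 1.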
P.S. Some may say that my sequence is not random. If so, I suggest those experts continue it to 20 digits. Otherwise the sequence is random by definition.
P.P.S. Some people take "random" to mean a uniformly distributed random variable. In that case I suggest Babushkin's archiver, which for 256 digits will finish pretty quickly. Or just enumerate all strings of length 256 and write down the index of the original one.

ComodoHacker, 2020-09-10
@ComodoHacker

General-purpose compression algorithms will not be able to compress random data (see Rsa97's answer). But if you know the specifics of the data and where it comes from, something may still work out, for example you may uncover some patterns after all.

Wataru, 2020-09-12
@wataru

The restriction to a fixed output alphabet is very awkward if the alphabet size is not a power of two (as with the digits 0 to 9). Many compression algorithms rely on being able to emit the smallest possible unit of information, a single bit; without that they compress inefficiently. And on random data without any structure and with high entropy everything works badly anyway.
I advise you to look at the Burrows-Wheeler transform and then try adding RLE or LZ on top of it. Maybe your data will compress well with that.
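A naive Python sketch of the transform and its inverse, sorting all rotations (O(n² log n), fine for a 256-character string but not for long inputs; the returned row index would also have to be stored alongside the output):

```python
def bwt(s: str) -> tuple[str, int]:
    """Burrows-Wheeler transform: sort all rotations of s, return the
    last column plus the row index of the original string."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations), rotations.index(s)


def ibwt(last: str, idx: int) -> str:
    """Inverse BWT: repeatedly prepend the last column and re-sort."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return table[idx]


s = "3141592653589793"
last, idx = bwt(s)
assert ibwt(last, idx) == s
```

The transform itself is only a permutation, so the output stays inside the same digit alphabet; it pays off only if RLE or LZ afterwards finds longer runs than in the original string.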
A conversion in the spirit of Base64 encoding/decoding would also help a lot here. Say you have k characters in the alphabet; then each character carries log_2(k) bits, so a string of N characters contains N*log_2(k) bits of information. Round that number up and emit that many bits. This is really just a conversion from the base-k number system to binary. It will be slow on large strings, because it is not obvious to me how to do the conversion quickly for an arbitrary k other than by repeatedly dividing a big number by 2 with remainder; only when k is a power of two can it be converted block by block, as quickly as Base64.
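For example, here is a minimal Python sketch of this conversion for the digit alphabet from the question (for a general alphabet of size k you would first map each symbol to an index 0..k-1); it uses big-integer division, i.e. the slow generic route rather than a fast block-wise one:

```python
def digits_to_bytes(s: str) -> bytes:
    """Interpret the digit string as one big base-10 number and pack it
    into bytes; the caller must keep len(s) to restore leading zeros."""
    n = 0
    for ch in s:
        n = n * 10 + int(ch)
    return n.to_bytes((n.bit_length() + 7) // 8 or 1, "big")


def bytes_to_digits(b: bytes, num_digits: int) -> str:
    """Inverse of digits_to_bytes: unpack into exactly num_digits digits,
    restoring any leading zeros."""
    n = int.from_bytes(b, "big")
    out = []
    for _ in range(num_digits):
        n, d = divmod(n, 10)
        out.append(str(d))
    return "".join(reversed(out))


s = "00987654321098765432109876543210"   # 32 digits, leading zeros on purpose
packed = digits_to_bytes(s)              # 13 bytes instead of 32 ASCII bytes
assert bytes_to_digits(packed, len(s)) == s
```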
Then you can compress this bit string with any algorithm you like (split it into blocks of, say, 8 bits and run Huffman or LZ over them). Afterwards you convert the compressed bit string back into the base-k number system.
You can also combine plain text compression with writing an arbitrary bit string in your alphabet. For example, after the BW transform you run LZ over the digit text. For LZ to be efficient it must be able to emit arbitrary bit strings, so in memory you separately collect the new literal characters that terminate each new reference (digits, in your example) and, separately, the bit string of references. Then you convert that bit string into the base-k number system and write it in front of the plain characters (encoding its length somehow in the first few characters as a header).
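For illustration, a rough end-to-end sketch of this pipeline in Python, with zlib standing in for "any algorithm you like": deflate the digit text, then write the compressed bytes back as digits; the 0x01 sentinel byte is just a simple way to keep leading zero bytes of the deflated stream from being lost in the integer conversion:

```python
import zlib


def compress_digits(s: str) -> str:
    """Deflate the digit text, then re-encode the compressed bytes
    as a digit string (bytes -> big integer -> decimal digits)."""
    deflated = zlib.compress(s.encode("ascii"), 9)
    n = int.from_bytes(b"\x01" + deflated, "big")  # 0x01 sentinel
    out = []
    while n:
        n, d = divmod(n, 10)
        out.append(str(d))
    return "".join(reversed(out))


def decompress_digits(c: str) -> str:
    """Inverse of compress_digits."""
    n = int(c)
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")[1:]  # drop sentinel
    return zlib.decompress(raw).decode("ascii")


data = "31415926" * 32            # 256 digits with obvious repetition
encoded = compress_digits(data)
assert decompress_digits(encoded) == data
print(len(data), "->", len(encoded))
```

On this highly repetitive input the result comes out much shorter; on a truly random 256-digit string it will come out longer than the original, exactly as the other answers predict.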
