Recommend a decoder for encodings unknown to science
Please recommend an open-source Java library or console utility (anything else is fine too) that can recover text garbled by incorrectly applied encodings, for example KOI8-R -> UTF-8 -> Windows-1251. Simply put, an analogue of the Lebedev decoder, only one that runs on a server. Thanks in advance.
If anything else really is fine, then I'll describe how such a decoder works:
We take the text, break it into words, and look up the first few of them, decoded under different encodings, in ispell dictionaries. As soon as a couple of words match, we have found the right encoding.
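A minimal sketch of that lookup in Python, under a few assumptions: the four-entry `ENCODINGS` list, the tiny `DICTIONARY` set (a stand-in for a real ispell word list), and the names `candidates` and `recover` are all made up for illustration, and only a single wrong re-decoding is undone.

```python
from itertools import permutations

# Assumption: a short list of encodings commonly mixed up in Russian text.
ENCODINGS = ["koi8_r", "cp1251", "cp866", "utf_8"]

# Assumption: stand-in for an ispell word list loaded from a dictionary file.
DICTIONARY = {"привет", "мир", "кодировка", "текст"}


def candidates(garbled):
    """Yield (read_as, actually, text) for each re-decoding that succeeds."""
    for read_as, actually in permutations(ENCODINGS, 2):
        try:
            # Undo the mistake: get the bytes back, then decode them properly.
            text = garbled.encode(read_as).decode(actually)
        except UnicodeError:
            continue  # this pair cannot represent the text at all
        yield read_as, actually, text


def recover(garbled, sample_words=5, min_hits=2):
    """Return the first candidate whose leading words are found in the dictionary."""
    for read_as, actually, text in candidates(garbled):
        words = text.lower().split()[:sample_words]
        hits = sum(1 for w in words if w in DICTIONARY)
        if hits >= min_hits:
            return read_as, actually, text
    return None


# KOI8-R bytes that were mistakenly read as Windows-1251.
garbled = "привет мир".encode("koi8_r").decode("cp1251")
print(recover(garbled))
```

Stopping at the first candidate with two dictionary hits mirrors the "a couple of words matched" rule; a real tool would need a much larger dictionary and probably a stricter threshold.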
There are several ways to improve on the idea:
1) Use only the first 6 letters of each word.
2) Use frequency-analysis data to obtain a sorted list of encoding transformations to try.
3) Use chains for the list of encodings (looking for frequently occurring syllables).
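Two of these improvements can be sketched in Python: matching on 6-letter prefixes and trying chains of transformations. This is only one possible reading of the list. It interprets "chains" as several re-decoding steps applied in a row, and it does not implement frequency analysis or syllable statistics: the order of `STEPS` is arbitrary here, where a real tool would sort it by how often each transformation occurs. All names (`STEPS`, `PREFIXES`, `apply_chain`, `score`, `recover_chain`) are hypothetical.

```python
from itertools import permutations, product

ENCODINGS = ["utf_8", "cp1251", "koi8_r", "cp866"]

# Each step means "the text was read as X but was actually Y".
# Assumption: arbitrary order; frequency data would be used to sort this list.
STEPS = list(permutations(ENCODINGS, 2))

# Assumption: stand-in for an ispell word list, reduced to 6-letter prefixes.
WORDS = {"привет", "мир", "кодировка", "текст"}
PREFIXES = {w[:6] for w in WORDS}


def apply_chain(text, chain):
    """Apply each (read_as, actually) step in order; None if any step fails."""
    for read_as, actually in chain:
        try:
            text = text.encode(read_as).decode(actually)
        except UnicodeError:
            return None
    return text


def score(text, sample_words=5):
    """Count leading words whose first 6 letters match a dictionary prefix."""
    words = text.lower().split()[:sample_words]
    return sum(1 for w in words if w[:6] in PREFIXES)


def recover_chain(garbled, max_depth=2):
    """Try all chains up to max_depth; return (score, chain, text) or None."""
    best = (0, None, None)
    for depth in range(1, max_depth + 1):
        for chain in product(STEPS, repeat=depth):
            text = apply_chain(garbled, chain)
            if text is None:
                continue
            s = score(text)
            if s > best[0]:  # strict comparison keeps the shortest, earliest chain
                best = (s, chain, text)
    return best if best[0] else None


# Text garbled twice: KOI8-R read as Windows-1251, then UTF-8 read as Windows-1251.
garbled = ("привет мир".encode("koi8_r").decode("cp1251")
           .encode("utf_8").decode("cp1251"))
print(recover_chain(garbled))
```

The search is exhaustive, so its cost grows quickly with chain depth and the number of encodings; sorting the steps by frequency and stopping early is where improvement 2 would pay off.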