How can I automatically detect that text is in the wrong encoding?

Satisfied IT, 2018-11-14 11:22:27

C++ / C#

There is a database that a third-party program writes to, and my task is to pull data from it for reports. Everything is written and works, except for one annoyance: the text is periodically saved in the table in the wrong encoding. That is, a value sometimes looks like this: Microsoft PowerPoint - Презентация ремонты, and sometimes like this:

Р—Р°РєСѓРїРєР° РЅРѕСЏР±СЂСЊ СЂР°СЃС…РѕРґРЅРёРєРё.docx

and it is fixed by the usual re-encoding from 1251 (Windows-1251) to UTF-8.
The question is: how can I automatically determine that the text is stored incorrectly, other than by checking it for characters such as °ЂЃ? Maybe there is another, smarter way?

2 answer(s)

Ivan Arxont, 2018-11-15
specialist @borisdenis

https://www.codeproject.com/Articles/17201/Detect-...

d-stream, 2018-11-14
@d-stream

For example, you can look at how encoding auto-detection is implemented in Far Manager, or search for something similar. Such detectors usually store the character codes that are statistically typical of each encoding: they read the file until the statistics become more or less unambiguous and then assume an encoding. Far detects the encoding successfully in most cases.
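As a rough illustration of the statistical idea (this is not Far Manager's actual code), the sketch below tells CP1251 from KOI8-R for raw single-byte Russian text. It relies on one known fact: Russian lowercase letters occupy bytes 0xE0..0xFF in CP1251 and 0xC0..0xDF in KOI8-R, and ordinary text is mostly lowercase. The names `CyrillicGuess` and `guessSingleByteCyrillic`, the minimum of 4 letters, and the 2:1 ratio are all assumptions made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class CyrillicGuess { Unknown, Cp1251, Koi8R };

// Counts the bytes whose value lies in [lo, hi].
static std::size_t countInRange(const std::string& bytes,
                                unsigned char lo, unsigned char hi) {
    std::size_t n = 0;
    for (unsigned char b : bytes) {
        if (b >= lo && b <= hi) ++n;
    }
    return n;
}

// Guesses the encoding of raw single-byte Russian text from byte statistics.
// Assumption: the text is mostly lowercase, so the range holding most of the
// high bytes is the lowercase range of the real encoding.
CyrillicGuess guessSingleByteCyrillic(const std::string& bytes) {
    const std::size_t lowHalf  = countInRange(bytes, 0xC0, 0xDF);  // KOI8-R lowercase
    const std::size_t highHalf = countInRange(bytes, 0xE0, 0xFF);  // CP1251 lowercase

    // Illustrative thresholds: need a few letters and a clear 2:1 majority.
    if (lowHalf + highHalf < 4) return CyrillicGuess::Unknown;
    if (highHalf > 2 * lowHalf) return CyrillicGuess::Cp1251;
    if (lowHalf > 2 * highHalf) return CyrillicGuess::Koi8R;
    return CyrillicGuess::Unknown;
}
```

A real detector would keep frequency tables for many encodings and for letter pairs, not just two byte ranges; the sketch only shows the "read until the statistics are unambiguous" principle on the smallest possible case.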
Or, when you have a hint, for example you know that the text is in Russian, you can simply count how many characters fall within the set of Russian letters under each of several candidate transcodings and keep the variant that scores best.
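For the strings from the question, the same "try a transcoding and score the result" idea can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes the value arrives as UTF-8 and that the damage is exactly "UTF-8 bytes were read as CP1251". It encodes the text back to CP1251 bytes, checks that those bytes form valid UTF-8 (a round-trip check, which is a stricter test than counting letters alone), and then accepts the candidate only if its share of Russian letters is higher than in the original. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Unicode code points for CP1251 bytes 0x80..0xBF (0 marks the unassigned byte 0x98).
static const std::uint16_t kCp1251High[64] = {
    0x0402, 0x0403, 0x201A, 0x0453, 0x201E, 0x2026, 0x2020, 0x2021,
    0x20AC, 0x2030, 0x0409, 0x2039, 0x040A, 0x040C, 0x040B, 0x040F,
    0x0452, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x0000, 0x2122, 0x0459, 0x203A, 0x045A, 0x045C, 0x045B, 0x045F,
    0x00A0, 0x040E, 0x045E, 0x0408, 0x00A4, 0x0490, 0x00A6, 0x00A7,
    0x0401, 0x00A9, 0x0404, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x0407,
    0x00B0, 0x00B1, 0x0406, 0x0456, 0x0491, 0x00B5, 0x00B6, 0x00B7,
    0x0451, 0x2116, 0x0454, 0x00BB, 0x0458, 0x0405, 0x0455, 0x0457,
};

// Decodes UTF-8 into code points; returns false on malformed or truncated input.
// Simplified on purpose: overlong forms and surrogates are not rejected.
static bool decodeUtf8(const std::string& s, std::vector<std::uint32_t>& out) {
    out.clear();
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        std::uint32_t cp = 0;
        if (b < 0x80)                { extra = 0; cp = b; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; }
        else return false;                       // stray continuation or invalid lead byte
        if (i + extra >= s.size()) return false; // sequence runs past the end
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}

// Maps one code point to its CP1251 byte; returns false if CP1251 lacks the character.
static bool toCp1251Byte(std::uint32_t cp, unsigned char& out) {
    if (cp < 0x80) { out = static_cast<unsigned char>(cp); return true; }
    if (cp >= 0x0410 && cp <= 0x044F) {          // А..я map linearly to 0xC0..0xFF
        out = static_cast<unsigned char>(0xC0 + (cp - 0x0410));
        return true;
    }
    for (int i = 0; i < 64; ++i) {
        if (kCp1251High[i] != 0 && kCp1251High[i] == cp) {
            out = static_cast<unsigned char>(0x80 + i);
            return true;
        }
    }
    return false;
}

// Share of Russian letters (А..я, Ё, ё) among the non-ASCII code points.
static double russianShare(const std::vector<std::uint32_t>& cps) {
    std::size_t russian = 0, nonAscii = 0;
    for (std::uint32_t cp : cps) {
        if (cp < 0x80) continue;
        ++nonAscii;
        if ((cp >= 0x0410 && cp <= 0x044F) || cp == 0x0401 || cp == 0x0451) ++russian;
    }
    return nonAscii ? static_cast<double>(russian) / nonAscii : 0.0;
}

// Returns true and fills `fixed` when `text` looks like UTF-8 misread as CP1251.
bool tryRepairMojibake(const std::string& text, std::string& fixed) {
    std::vector<std::uint32_t> original, candidate;
    if (!decodeUtf8(text, original)) return false;
    std::string bytes;
    for (std::uint32_t cp : original) {
        unsigned char b = 0;
        if (!toCp1251Byte(cp, b)) return false;  // not representable: not this kind of damage
        bytes.push_back(static_cast<char>(b));
    }
    if (!decodeUtf8(bytes, candidate)) return false;
    if (russianShare(candidate) <= russianShare(original)) return false;
    fixed = bytes;
    return true;
}
```

Applied to the question's `Р—Р°РєСѓРїРєР°`, this should produce `Закупка`, while already-correct Russian text and plain ASCII are left alone: correct Cyrillic turns into CP1251 bytes that are not valid UTF-8, and ASCII has no Russian letters to gain. One limitation to keep in mind is that byte 0x98 is unassigned in CP1251, so a damaged capital И (UTF-8 bytes D0 98) may not survive storage, and such strings can fail the check.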