C++ / C#
Satisfied IT, 2018-11-14 11:22:27

How to automatically detect that the text is in the wrong encoding?

There is a database that a third-party program writes to, and my task is to pull data from it for reports. Overall everything is written and works, except for one annoyance: from time to time, text is saved to the table in the wrong encoding, so that it looks like this:

Microsoft PowerPoint - Презентация ремонты

or like this:

Закупка ноябрь расходники.docx

and it is fixed by the usual re-encoding from Windows-1251 to UTF-8.
The question is: how can I automatically determine that text was stored incorrectly, other than by checking it for characters such as °, Ђ and Ѓ? Maybe there is another, smarter way?


2 answers
Ivan Arxont, 2018-11-15
specialist @borisdenis

https://www.codeproject.com/Articles/17201/Detect-...

d-stream, 2018-11-14
@d-stream

For example, you can look at how encoding auto-detection is implemented in Far Manager, or search for something similar. Such detectors usually store tables of statistically characteristic character codes: they read the file until the statistics become more or less unambiguous and then assume an encoding. Far determines the encoding quite successfully in most cases.
Alternatively, when there is a hint, for example that the file starts with Russian text, you can simply count how many characters fall within the Russian alphabet under each of several transcoding options and pick the option with the highest count.
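
To make the "try a transcoding option and see whether it looks right" idea concrete, here is a rough C++ sketch of one variant. Instead of counting letters, it reverses the suspected conversion and checks whether the result is valid UTF-8. It rests on assumptions that are not stated in the question: that the damage is UTF-8 bytes having been decoded as Windows-1251 (which would explain characters like °, Ђ and Ѓ), that the column value arrives as a UTF-8 `std::string`, and that byte 0x98, which Windows-1251 leaves unassigned, was decoded as U+0098. The names `repairMojibake`, `decodeUtf8` and `toCp1251` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace {

// Code points for Windows-1251 bytes 0x80..0xBF. Bytes 0xC0..0xFF map
// linearly onto U+0410..U+044F and are handled arithmetically below.
// Assumption: 0x98 (unassigned in Windows-1251) was decoded as U+0098.
const char32_t kCp1251High[64] = {
    0x0402, 0x0403, 0x201A, 0x0453, 0x201E, 0x2026, 0x2020, 0x2021,
    0x20AC, 0x2030, 0x0409, 0x2039, 0x040A, 0x040C, 0x040B, 0x040F,
    0x0452, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x0098, 0x2122, 0x0459, 0x203A, 0x045A, 0x045C, 0x045B, 0x045F,
    0x00A0, 0x040E, 0x045E, 0x0408, 0x00A4, 0x0490, 0x00A6, 0x00A7,
    0x0401, 0x00A9, 0x0404, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x0407,
    0x00B0, 0x00B1, 0x0406, 0x0456, 0x0491, 0x00B5, 0x00B6, 0x00B7,
    0x0451, 0x2116, 0x0454, 0x00BB, 0x0458, 0x0405, 0x0455, 0x0457,
};

// Minimal UTF-8 decoder: checks lead/continuation byte structure only
// (it does not reject overlong forms or surrogates).
bool decodeUtf8(const std::string& s, std::vector<char32_t>& out) {
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;                       // stray continuation byte
        if (i + len > s.size()) return false;    // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return true;
}

// Map a code point back to its Windows-1251 byte, or -1 if it has none.
int toCp1251(char32_t cp) {
    if (cp < 0x80) return static_cast<int>(cp);
    if (cp >= 0x0410 && cp <= 0x044F) return 0xC0 + static_cast<int>(cp - 0x0410);
    for (int i = 0; i < 64; ++i)
        if (kCp1251High[i] == cp) return 0x80 + i;
    return -1;
}

}  // namespace

// Returns true and fills `repaired` if `text` looks like UTF-8 that was
// mis-read as Windows-1251; returns false if the text should be left alone.
bool repairMojibake(const std::string& text, std::string& repaired) {
    std::vector<char32_t> codePoints;
    if (!decodeUtf8(text, codePoints)) return false;

    std::string bytes;
    bool sawNonAscii = false;
    for (char32_t cp : codePoints) {
        const int b = toCp1251(cp);
        if (b < 0) return false;          // not representable in Windows-1251
        if (b >= 0x80) sawNonAscii = true;
        bytes.push_back(static_cast<char>(b));
    }
    if (!sawNonAscii) return false;       // pure ASCII: nothing to repair

    // Correctly stored Russian text turns into bytes 0xC0..0xFF in a row,
    // which is not valid UTF-8, so it is rejected here.
    std::vector<char32_t> check;
    if (!decodeUtf8(bytes, check)) return false;

    repaired = bytes;
    return true;
}
```

In a report query you could call `repairMojibake` on each text value and use the repaired string only when it returns true. This is a heuristic rather than a proof: a short string such as a capital letter followed by `№` can pass by accident, so combining it with the letter-counting check described above is reasonable if false positives matter.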
