How to correctly guess the character encoding in a file?
Good afternoon.
On my site I need to detect the encoding of a plain text file.
Right now it is implemented in the simplest way:
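$filecontent = file_get_contents($path . '/' . $this->filename);
// Try UTF-8 first, in strict mode; Windows-1251 accepts almost any byte
// sequence, so it only makes sense as the fallback.
$c = mb_detect_encoding($filecontent, ['UTF-8', 'Windows-1251'], true);
if ($c === 'Windows-1251') {
    // Convert legacy Cyrillic text to UTF-8.
    $filecontent = iconv('Windows-1251', 'UTF-8', $filecontent);
}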
For Cyrillic, the following algorithm can be used:
1. Detect UTF-16. The text usually starts with the Unicode byte order mark (BOM), U+FEFF, which also tells big endian from little endian.
If there is no BOM but the bytes at even positions (counting from zero) are mostly 0x00 (for Latin) or 0x04 (for Cyrillic), the text is UTF-16 big endian; if those bytes sit at odd positions, it is little endian.
2. Detect UTF-8. Each letter of the Russian alphabet takes two octets, the first of which is 0xD0 or 0xD1; Latin letters match ASCII.
3. Distinguish Windows-1251 from KOI8-R. Both use mainly codes 192-255 for Cyrillic letters, but in KOI8-R the lowercase letters come first, while in Windows-1251 the uppercase letters come first. If the text consists mainly of bytes 192-223 but sentences (after a period and a space) start with bytes 224-255, it is KOI8-R; if it is the other way round, it is Windows-1251. Character frequency analysis can also be used. Latin letters are the same as ASCII.
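Steps 1 and 2 could be sketched roughly like this. The function name `detectUnicodeEncoding` is made up for illustration, the 80% cutoff for "mostly" is my own assumption, and for step 2 I swapped in a general UTF-8 well-formedness check (`preg_match` with the `u` modifier) instead of looking only for 0xD0/0xD1 lead bytes, so treat it as a starting point rather than a finished detector.

```php
<?php
// Hypothetical helper: returns 'UTF-16BE', 'UTF-16LE', 'UTF-8', or null
// when none of the checks from steps 1 and 2 match.
function detectUnicodeEncoding(string $bytes): ?string
{
    // Step 1: the BOM U+FEFF is stored as FE FF (big endian) or FF FE (little endian).
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';
    }

    // No BOM: count 0x00 (Latin) and 0x04 (Cyrillic) bytes at even and odd offsets.
    $even = 0;
    $odd = 0;
    $length = strlen($bytes);
    for ($i = 0; $i < $length; $i++) {
        if ($bytes[$i] === "\x00" || $bytes[$i] === "\x04") {
            if ($i % 2 === 0) {
                $even++;
            } else {
                $odd++;
            }
        }
    }
    // Assumption: "mostly" means more than 80% of the two-byte units.
    $pairs = intdiv($length, 2);
    if ($pairs > 0 && $even > 0.8 * $pairs) {
        return 'UTF-16BE';
    }
    if ($pairs > 0 && $odd > 0.8 * $pairs) {
        return 'UTF-16LE';
    }

    // Step 2 (substituted check): with the "u" modifier preg_match() returns
    // false when the subject is not well-formed UTF-8. Pure ASCII also passes.
    if (preg_match('//u', $bytes) === 1) {
        return 'UTF-8';
    }

    return null; // probably a single-byte encoding, see step 3
}
```

A null result means the text is neither UTF-16 nor UTF-8 by these checks, so the single-byte comparison from step 3 would be the next thing to try.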
All other encodings are quite rare, although Chinese sites (and for some reason Google) like to encode Cyrillic in Big5.
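For step 3, here is a minimal sketch of the case-distribution idea only: ordinary text is mostly lowercase, so whichever byte range dominates is taken to hold the lowercase letters. The name `guessSingleByteCyrillic` is hypothetical, and the sentence-start check and frequency analysis mentioned above are left out, so all-caps text will fool it.

```php
<?php
// Hypothetical helper: guesses between KOI8-R and Windows-1251 by comparing
// how many bytes fall into 192-223 versus 224-255. Returns null on a tie.
function guessSingleByteCyrillic(string $bytes): ?string
{
    $low = 0;  // bytes 192-223: lowercase in KOI8-R, uppercase in Windows-1251
    $high = 0; // bytes 224-255: uppercase in KOI8-R, lowercase in Windows-1251

    // count_chars() in mode 1 returns [byte value => count] for bytes that occur.
    foreach (count_chars($bytes, 1) as $byte => $count) {
        if ($byte >= 0xC0 && $byte <= 0xDF) {
            $low += $count;
        } elseif ($byte >= 0xE0) {
            $high += $count;
        }
    }

    if ($low === $high) {
        return null; // no Cyrillic bytes at all, or no clear majority
    }

    // Assumption: the dominant range contains the lowercase letters.
    return $low > $high ? 'KOI8-R' : 'Windows-1251';
}
```

The returned name can then be passed to iconv() as the source encoding, the same way 'Windows-1251' is used in the question.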