How to correctly guess the character encoding in a file?

PHP

Dmitry, 2016-08-04 15:32:45

Good afternoon.
I need to implement encoding detection for plain text files on a site.
For now it is implemented in the simplest way:

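// Read the uploaded file and try to detect its encoding.
$filecontent = file_get_contents($path . '/' . $this->filename);
$c = mb_detect_encoding($filecontent, 'cp1251, UTF-8');
// Convert to UTF-8 only when Windows-1251 was detected.
if ($c === 'Windows-1251') {
    $filecontent = iconv('Windows-1251', 'UTF-8', $filecontent);
}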

But if the file is in an encoding other than Windows-1251, mb_detect_encoding() does not always determine which encoding it is.
How can I reliably determine the encoding in that case, so that the text can be converted to UTF-8?
For example, the file may be in Windows-1252, UTF-16, or anything else.
This is meant for users who know nothing about encodings and do not bother to set the encoding in their system or in Notepad. The user writes the text in whatever encoding happens to be in use, and the site, before writing to the database, has to determine which one it is and show the text correctly to the moderator in a textarea.
P.S. With the solution shown above, if the text starts with a digit and the encoding is not Windows-1251, the rest of the text is simply not displayed.

1 answer(s)

Vladimir Dubrovin, 2016-08-05
@slo_nik

For Cyrillic text, the following algorithm can be used:
1. Detect UTF-16. The text usually starts with the Unicode byte order mark (U+FEFF), which also distinguishes big endian from little endian. If there is no mark, but the bytes at even positions (counting from zero) are mostly 0x00 (for Latin) and 0x04 (for Cyrillic), it is UTF-16 big endian; if those bytes are at odd positions, it is little endian.
2. Detect UTF-8. Every Russian letter takes two octets, the first of which is 0xD0 or 0xD1; Latin characters match ASCII.
3. Distinguish Windows-1251 from KOI8-R. Both use mainly codes 192-255 for Cyrillic characters, but in KOI8-R the lowercase letters come first, while in Windows-1251 the uppercase letters come first. If the text consists mainly of characters 192-223 but sentences (after a period and a space) begin with characters 224-255, it is KOI8-R; if it is the other way round, it is Windows-1251. You can also use character frequency analysis. Latin characters match ASCII.
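The three steps above can be sketched in PHP roughly as follows. This is a minimal illustration, not a complete detector: the function name guessCyrillicEncoding() and the 80% threshold for "mostly" are assumptions made for this example. Step 2 validates the whole string as UTF-8 with preg_match('//u', ...) instead of checking only the 0xD0/0xD1 lead bytes, and step 3 only compares byte counts in the two ranges (assuming ordinary text is mostly lowercase) without looking at sentence starts.

```php
<?php
// Hypothetical helper for illustration: guesses the encoding of Russian text.
// Returns 'UTF-16BE', 'UTF-16LE', 'UTF-8', 'Windows-1251' or 'KOI8-R'.
function guessCyrillicEncoding(string $bytes): string
{
    // Step 1a: a byte order mark (U+FEFF) settles UTF-16 and its endianness.
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';
    }

    // Step 1b: no BOM, so count 0x00 / 0x04 bytes at even and odd offsets.
    $len = strlen($bytes);
    $even = 0;
    $odd = 0;
    for ($i = 0; $i < $len; $i++) {
        if ($bytes[$i] === "\x00" || $bytes[$i] === "\x04") {
            if ($i % 2 === 0) {
                $even++;
            } else {
                $odd++;
            }
        }
    }
    // Assumption: "mostly" means at least 80% of the positions of that parity.
    $threshold = 0.8 * $len / 2;
    if ($len >= 2 && $even >= $threshold) {
        return 'UTF-16BE'; // high byte first
    }
    if ($len >= 2 && $odd >= $threshold) {
        return 'UTF-16LE'; // high byte second
    }

    // Step 2: a string that is valid UTF-8 (plain ASCII included) is taken as UTF-8.
    if (preg_match('//u', $bytes) === 1) {
        return 'UTF-8';
    }

    // Step 3: compare the two halves of the 192-255 range.
    // KOI8-R keeps lowercase letters in 0xC0-0xDF, Windows-1251 in 0xE0-0xFF.
    $low = preg_match_all('/[\xC0-\xDF]/', $bytes);
    $high = preg_match_all('/[\xE0-\xFF]/', $bytes);

    return $low > $high ? 'KOI8-R' : 'Windows-1251';
}
```

With the mbstring extension, the result could then be passed on as the source encoding, for example mb_convert_encoding($bytes, 'UTF-8', guessCyrillicEncoding($bytes)). Very short texts, or texts typed mostly in capitals, can easily fool step 3.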
All other encodings are fairly rare, although the Chinese (and, for some reason, Google) like to encode Cyrillic in Big5.