How to determine the Unicode code point of a character in PHP?

PHP

Vitaly Zhuk, 2013-04-23 12:31:12

Hello, Habr community!
This is not the first time I have run into the task of determining the Unicode code point of a character. (In more detail: we parse some sites, and if the content contains Chinese ~~gibberish~~ hieroglyphs, we have to block that material.)
Question number 1:
What options are there for determining the code point without installing additional PHP extensions (we use PHP 5.4.9)?
We tried pear.php.net/package/Text_LanguageDetect, but it is not suitable at all: the error rate is far too high.
For now we use our own utility:

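/**
 * Utility for detecting the Unicode code point of a character
 */
class UnicodeOrdDetect
{
    /**
     * Detect the code point of a single character
     *
     * @param string $char     One character in the given encoding
     * @param string $encoding Source encoding, UTF-8 by default
     * @param bool   $hex      Return a hex string instead of an integer
     * @return int|string
     */
    public static function ord($char, $encoding = null, $hex = true)
    {
        // Default encoding
        if (null === $encoding) {
            $encoding = 'UTF-8';
        }

        // ASCII fast path: in UTF-8 only bytes 0..127 are complete
        // single-byte characters; anything higher starts a multi-byte sequence.
        if (127 >= ($ordChar = ord($char))) {
            return $hex === true ? dechex($ordChar) : $ordChar;
        }

        // Convert to big-endian UCS-4, i.e. the raw 32-bit code point
        $char = mb_convert_encoding($char, 'UCS-4BE', $encoding);

        list (, $ordChar) = unpack('N', $char);

        return $hex === true ? dechex($ordChar) : $ordChar;
    }
}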

To test it, we checked the results against the table at unicode-table.com/ and have not found any errors so far.
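For illustration, the same conversion can be used to test a character against a block from that table. This is only a minimal sketch: it assumes the input is UTF-8, that the mbstring extension is available, and that "Chinese hieroglyphs" means the CJK Unified Ideographs block (U+4E00..U+9FFF). The function names codePointOf and isCjkIdeograph are hypothetical.

```php
<?php
// Hypothetical helper: code point of the first character of a string,
// obtained with the mb_convert_encoding + unpack technique.
function codePointOf($char, $encoding = 'UTF-8')
{
    $first = mb_substr($char, 0, 1, $encoding);
    // UCS-4BE is the code point stored as a big-endian 32-bit integer
    $ucs4 = mb_convert_encoding($first, 'UCS-4BE', $encoding);
    $parts = unpack('N', $ucs4);
    return $parts[1];
}

// Assumption: only the CJK Unified Ideographs block is checked here.
function isCjkIdeograph($char)
{
    $cp = codePointOf($char);
    return $cp >= 0x4E00 && $cp <= 0x9FFF;
}

var_dump(isCjkIdeograph('中')); // bool(true)
var_dump(isCjkIdeograph('Ж')); // bool(false)
```

The sketch does not guard against an empty string; a real parser would need to check for that before calling unpack.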
Question number 2:
Is this a correct way to determine a Unicode code point, or are there better ones?
Thank you!

3 answer(s)

KEKSOV, 2013-04-23
@ZhukV

If you need to check UTF-8 characters against Unicode code points often and in bulk, I would do the following, depending on the specifics of the task:
1. If you need to keep only Russian and Latin letters, digits and punctuation, take the required Unicode tables from www.unicode.org/charts/
(in particular:
Letterlike symbols www.unicode.org/charts/PDF/U2100.pdf
Cyrillic characters www.unicode.org/charts/PDF/U0400.pdf , www.unicode.org/charts/PDF/U0500.pdf
Punctuation www.unicode.org/charts/PDF/U2000.pdf
And more punctuation www.unicode.org/charts/PDF/U0080.pdf )

Then convert the Unicode code points from these tables in advance into an array of UTF-8 characters.

The rules for converting Unicode code points to UTF-8 can be found at en.wikipedia.org/wiki/UTF-8

Bits  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
   7  U+007F           0xxxxxxx
  11  U+07FF           110xxxxx  10xxxxxx
  16  U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
  21  U+1FFFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
  26  U+3FFFFFF        111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
  31  U+7FFFFFFF       1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx

For practice, you can open Word and press Alt-X after a character: its Unicode value will be displayed. So, for our numero sign "№" the Unicode value is 2116 (hex).
This value falls under the rule (bit mask) U+FFFF 1110xxxx 10xxxxxx 10xxxxxx
Using calc, convert 2116 (hex) to binary: 10000100010110 (bin).
Insert our bits into the U+FFFF mask, padding on the left with zeros to 16 bits: 1110[0010] 10[000100] 10[010110]
Feed the resulting number 111000101000010010010110 back into calc and get E28496 (hex); this is the UTF-8 code of our character.
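The bit-mask steps above can also be sketched in code. This is a hedged illustration rather than a reference encoder: the function name codePointToUtf8 is hypothetical, it covers only the one- to four-byte forms (current UTF-8, per RFC 3629, stops at U+10FFFF), and it does not reject surrogates or out-of-range values.

```php
<?php
// Hypothetical helper: pack a code point into UTF-8 bytes by filling
// the "x" positions of the matching bit mask.
function codePointToUtf8($cp)
{
    if ($cp <= 0x7F) {
        return chr($cp);                               // 0xxxxxxx
    }
    if ($cp <= 0x7FF) {
        return chr(0xC0 | ($cp >> 6))                  // 110xxxxx
             . chr(0x80 | ($cp & 0x3F));               // 10xxxxxx
    }
    if ($cp <= 0xFFFF) {
        return chr(0xE0 | ($cp >> 12))                 // 1110xxxx
             . chr(0x80 | (($cp >> 6) & 0x3F))         // 10xxxxxx
             . chr(0x80 | ($cp & 0x3F));               // 10xxxxxx
    }
    return chr(0xF0 | ($cp >> 18))                     // 11110xxx
         . chr(0x80 | (($cp >> 12) & 0x3F))            // 10xxxxxx
         . chr(0x80 | (($cp >> 6) & 0x3F))             // 10xxxxxx
         . chr(0x80 | ($cp & 0x3F));                   // 10xxxxxx
}

// The numero sign from the worked example
echo strtoupper(bin2hex(codePointToUtf8(0x2116))), "\n"; // E28496
```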

This gives us the list of desired characters. Then, when processing the text, we check each of its characters against this set: if the character is found in the array, we need it.
If it is not found, it is dropped (at first you will have to check whether all the required characters were taken into account).
2. If instead you only need to cut out unwanted characters, build the list from their code pages: if a symbol is on this list, we don't need it. All symbols not on the list are considered good.
The list of characters should be stored as an associative array where the key is the character's UTF-8 byte sequence and the value is true or false. Checking the next character of the text is then almost instant: just look up the array value by key.
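A minimal sketch of such a lookup table, using the whitelist variant. The ranges chosen here (printable ASCII, the Cyrillic block, the numero sign) and the function names buildAllowedSet and isTextAllowed are assumptions for illustration; the conversion relies on the mbstring extension instead of manual bit packing.

```php
<?php
// Hypothetical helper: build an associative array keyed by UTF-8 characters.
// $ranges is a list of array(firstCodePoint, lastCodePoint) pairs.
function buildAllowedSet(array $ranges)
{
    $set = array();
    foreach ($ranges as $range) {
        for ($cp = $range[0]; $cp <= $range[1]; $cp++) {
            // pack('N') gives UCS-4BE; mbstring turns it into UTF-8 bytes
            $char = mb_convert_encoding(pack('N', $cp), 'UTF-8', 'UCS-4BE');
            $set[$char] = true;
        }
    }
    return $set;
}

// Returns true only if every character of $text is a key of $allowed.
function isTextAllowed($text, array $allowed)
{
    // An empty pattern with the u modifier splits between UTF-8 characters
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return false; // input was not valid UTF-8
    }
    foreach ($chars as $char) {
        if (!isset($allowed[$char])) {
            return false;
        }
    }
    return true;
}

$allowed = buildAllowedSet(array(
    array(0x20, 0x7E),       // printable ASCII
    array(0x0400, 0x04FF),   // Cyrillic
    array(0x2116, 0x2116),   // numero sign
));

var_dump(isTextAllowed('Привет, world № 5', $allowed)); // bool(true)
var_dump(isTextAllowed('Привет 中文', $allowed));        // bool(false)
```

Building the table costs one conversion per code point, but it is done once; each later check is a single isset() call.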

Vampiro, 2013-04-23
@Vampiro

Have you tried a simpler method,
something like iconv + preg_match('a-z|A-Z|0-9|....')?
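One way this idea might look in practice, as a rough sketch only: normalise the input to UTF-8 with iconv, then match it against a whitelist character class. The contents of the class (Latin letters, digits, the Cyrillic block, whitespace, a few punctuation marks) and the function name containsOnlyAllowed are assumptions, not something given in the thread.

```php
<?php
// Hypothetical helper: true if $text consists only of whitelisted characters.
function containsOnlyAllowed($text, $sourceEncoding = 'UTF-8')
{
    // Bring the input to UTF-8 so the u modifier can be used safely
    $utf8 = iconv($sourceEncoding, 'UTF-8', $text);
    if ($utf8 === false) {
        return false; // the input could not be converted
    }
    // \x{0400}-\x{04FF} is the Cyrillic block; the trailing "-" is a literal
    $pattern = '/^[a-zA-Z0-9\x{0400}-\x{04FF}\s.,!?;:()"\'-]*$/u';
    return preg_match($pattern, $utf8) === 1;
}

var_dump(containsOnlyAllowed('Hello, Мир!'));  // bool(true)
var_dump(containsOnlyAllowed('Hello 中文'));    // bool(false)
```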

Domini, 2013-04-23
@Domini

en2.php.net/manual/en/reference.pcre.pattern.modifiers.php
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern has been checked since PHP 4.3.5.

Together with ru2.php.net/manual/en/regexp.reference.unicode.php, isn't that what you need?
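A short sketch of how the u modifier and the Unicode script properties from that page could be combined for the task in the question. It assumes the text is already UTF-8; the function name containsHan is hypothetical. Note that the Han script also covers ideographs used in Japanese and Korean text, so this detects the script, not the language.

```php
<?php
// Hypothetical helper: true if the UTF-8 text has at least one Han ideograph.
function containsHan($text)
{
    // \p{Han} matches any character of the Han script; u enables UTF-8 mode.
    // preg_match returns false on invalid UTF-8, which is treated as "no match".
    return preg_match('/\p{Han}/u', $text) === 1;
}

var_dump(containsHan('Новости 中文 today')); // bool(true)
var_dump(containsHan('Новости today'));      // bool(false)
```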