How to determine the Unicode code point of a character (PHP)?
Hello, Habr community!
This is not the first time I have run into the task of determining the Unicode code point of a character. (In more detail: we parse some sites, and if the material contains Chinese squiggles, that is, hieroglyphs, we have to block it.)
Question 1:
What options are there for determining the code point without installing additional PHP extensions (we use 5.4.9)?
We tried pear.php.net/package/Text_LanguageDetect, but it is not suitable at all: the error rate is far too high.
For now we use our own utility:
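/**
 * Utility for finding the Unicode code point of a character
 */
class UnicodeOrdDetect
{
    /**
     * Return the code point of a single character
     *
     * @param string $char     a single character
     * @param string $encoding encoding of $char (UTF-8 by default)
     * @param bool   $hex      return a hex string instead of an integer
     * @return int|string
     */
    public static function ord($char, $encoding = null, $hex = true)
    {
        // Default encoding
        if (null === $encoding) {
            $encoding = 'UTF-8';
        }
        // A first byte of 0-127 is plain ASCII and is its own code point;
        // anything above that starts a multi-byte or non-ASCII character
        if (127 >= ($ordChar = ord($char))) {
            return $hex === true ? dechex($ordChar) : $ordChar;
        }
        // UCS-4BE stores the code point as a 32-bit big-endian integer
        $char = mb_convert_encoding($char, 'UCS-4BE', $encoding);
        list (, $ordChar) = unpack('N', $char);
        return $hex === true ? dechex($ordChar) : $ordChar;
    }
}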
If you need to compare UTF-8 characters against Unicode code points often and in bulk, then I would do the following, depending on the specifics of the task:
1. If you need to keep only Russian, Latin, digit, and punctuation characters, take the Unicode tables you need from www.unicode.org/charts/
(in particular:
Some symbols: www.unicode.org/charts/PDF/U2100.pdf
Cyrillic characters: www.unicode.org/charts/PDF/U0400.pdf , www.unicode.org/charts/PDF/U0500.pdf
Punctuation: www.unicode.org/charts/PDF/U2000.pdf
More punctuation: www.unicode.org/charts/PDF/U0080.pdf )
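As a rough illustration of the whitelist idea, the ranges covered by those charts can be put into a single PCRE character class. This is only a sketch: the function name isAllowedText is hypothetical, the input is assumed to be UTF-8, and the Basic Latin range is added on the assumption that ASCII letters, digits, and punctuation should also pass.

```php
<?php
// Hypothetical helper: true when $text (assumed UTF-8) contains only
// characters from the whitelisted Unicode blocks.
function isAllowedText($text)
{
    $allowed = '\x{0000}-\x{007F}'   // Basic Latin: letters, digits, ASCII punctuation (assumed wanted)
             . '\x{0080}-\x{00FF}'   // Latin-1 Supplement (chart U0080)
             . '\x{0400}-\x{052F}'   // Cyrillic and Cyrillic Supplement (charts U0400, U0500)
             . '\x{2000}-\x{206F}'   // General Punctuation (chart U2000)
             . '\x{2100}-\x{214F}';  // Letterlike Symbols (chart U2100)

    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_match() returns false on invalid UTF-8, which is rejected here too.
    return preg_match('/^[' . $allowed . ']*\z/u', $text) === 1;
}
```

Material for which isAllowedText() returns false contains at least one character outside the listed blocks (for example a hieroglyph) and could be blocked.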
Bits  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
7     U+007F           0xxxxxxx
11    U+07FF           110xxxxx  10xxxxxx
16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
21    U+1FFFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
26    U+3FFFFFF        111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
31    U+7FFFFFFF       1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
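The bit layout in this table is enough to get a code point without mbstring at all. Below is a minimal sketch, assuming well-formed UTF-8 input; the function name utf8CodePoint is hypothetical, and only sequences of up to four bytes are handled, since that is all current Unicode uses.

```php
<?php
// Hypothetical helper: decodes the first character of a UTF-8 string
// into its code point, or returns false if the bytes do not fit the table.
function utf8CodePoint($char)
{
    if ($char === '') {
        return false;
    }
    $first = ord($char[0]);

    if ($first < 0x80) {                      // 0xxxxxxx: 1 byte, 7 bits
        return $first;
    }
    if (($first & 0xE0) === 0xC0) {           // 110xxxxx: 2 bytes, 11 bits
        $length = 2;
        $code = $first & 0x1F;
    } elseif (($first & 0xF0) === 0xE0) {     // 1110xxxx: 3 bytes, 16 bits
        $length = 3;
        $code = $first & 0x0F;
    } elseif (($first & 0xF8) === 0xF0) {     // 11110xxx: 4 bytes, 21 bits
        $length = 4;
        $code = $first & 0x07;
    } else {
        return false;                         // continuation byte or 5/6-byte lead
    }
    if (strlen($char) < $length) {
        return false;                         // truncated sequence
    }
    for ($i = 1; $i < $length; $i++) {
        $byte = ord($char[$i]);
        if (($byte & 0xC0) !== 0x80) {        // every following byte must be 10xxxxxx
            return false;
        }
        $code = ($code << 6) | ($byte & 0x3F); // append the 6 payload bits
    }
    return $code;
}
```

The result is an integer that can be compared directly with the ranges from the Unicode charts; dechex() gives the usual hex form. The sketch does not reject overlong encodings or surrogates, so it is not a validator.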
Have you tried a simpler method,
like iconv + preg_match('a-z|A-Z|0-9|....')?
Aren't en2.php.net/manual/en/reference.pcre.pattern.modifiers.php and ru2.php.net/manual/en/regexp.reference.unicode.php what you need?
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern has been checked since PHP 4.3.5.
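For the original task (blocking material that contains hieroglyphs), the u modifier can be combined with a Unicode script property from the second link. A minimal sketch, assuming the text is already UTF-8 and that PCRE was built with Unicode property support; the function name containsHan is hypothetical:

```php
<?php
// Hypothetical helper: true when $text (assumed UTF-8) contains at least
// one character of the Han script (Chinese hieroglyphs, also Japanese kanji).
function containsHan($text)
{
    // \p{Han} matches any character whose Unicode script is Han;
    // the u modifier makes PCRE read pattern and subject as UTF-8.
    return preg_match('/\p{Han}/u', $text) === 1;
}
```

Note that preg_match() returns false when the subject is not valid UTF-8, so this sketch reports such input as having no hieroglyphs; text in another encoding would need converting first, for example with iconv as suggested above.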