How do I correctly detect Unicode characters in text using regular expressions (or otherwise) in .NET?

Unicode

GODiDS, 2014-10-16 13:14:16

So far the code is like this:

System.Text.RegularExpressions.Regex.IsMatch(text, "[" & ChrW(128) & "-" & ChrW(65535) & "]", System.Text.RegularExpressions.RegexOptions.IgnoreCase) ' text is the string being checked

ChrW is a VB function that returns the Unicode character with the given code (for some reason VB.NET refused to process values like \xFFFF correctly).
The code works fine except for one case: it flags the characters i (code 105) and I (code 73). I can't make sense of this behavior.
I never got much further than * with regular expressions, so maybe what I wrote is complete nonsense =)
To repeat: the question itself is in the title.

2 answer(s)

GODiDS, 2014-10-16
@GODiDS

I managed to figure it out:
Finding 1:
A similar pattern, but with the exclusion reversed, works fine.
So i and I are no longer matched as falling within the range 128-65535.
Finding 2:
A two-byte character can be specified by its hex code using \u, as in "[\u00FF-\uFFFF]"
Finding 3:
I had added System.Text.RegularExpressions.RegexOptions.IgnoreCase for no good reason. With this flag turned off, everything works as it should. Apparently "i" has at least three case variants in Unicode, at least one of which falls in the range "[\u00FF-\uFFFF]"
(although the reverse still doesn't work, so the question is still not fully resolved)
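
To make the findings above concrete, here is a minimal sketch of the check without RegexOptions.IgnoreCase, together with a regex-free variant for the "or otherwise" part of the question. The module and function names are made up for illustration, and the lower bound \u0080 (128) is taken from the question rather than the \u00FF used in Finding 2. Since .NET strings are sequences of UTF-16 code units, characters outside the BMP appear as surrogate pairs, which also fall inside this range.

```vb
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module NonAsciiCheck
    ' Hypothetical helper: True if the string has any character at or above U+0080.
    ' No RegexOptions.IgnoreCase here: case folding has no bearing on a code range test.
    Function ContainsNonAscii(value As String) As Boolean
        ' VB does not process backslash escapes, so \u0080 reaches the regex engine as-is.
        Return Regex.IsMatch(value, "[\u0080-\uFFFF]")
    End Function

    ' Regex-free variant: compare the code of every character against 127.
    Function ContainsNonAsciiNoRegex(value As String) As Boolean
        Return value.Any(Function(c) AscW(c) > 127)
    End Function

    Sub Main()
        Dim plain As String = "Iceland i"
        Dim accented As String = "caf" & ChrW(233)   ' "café", built with ChrW as in the question

        Console.WriteLine(ContainsNonAscii(plain))            ' False
        Console.WriteLine(ContainsNonAscii(accented))         ' True
        Console.WriteLine(ContainsNonAsciiNoRegex(plain))     ' False
        Console.WriteLine(ContainsNonAsciiNoRegex(accented))  ' True
    End Sub
End Module
```

Both helpers only look at character codes, so "i" and "I" are treated like any other ASCII letter.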

lam0x86, 2014-10-17
@lam0x86

When I see reports of problems with handling the characters "i" and "I" when the IgnoreCase flag is on, I immediately suspect that the comparison is being done using Turkish casing rules. In Turkish, the lowercase "i" is uppercased to "İ", and the capital "I" is lowercased to "ı" (pronounced like the Russian "ы"). To be honest, I didn't dig deeply into your problem, but maybe my comment will point you in the right direction.
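
The Turkish casing rule described above can be observed directly. This is a small sketch, assuming the tr-TR culture data is available on the machine; the module name is made up for illustration. It shows only the casing rule itself and does not establish that this is what happens inside the regex engine in the question.

```vb
Imports System
Imports System.Globalization
Imports Microsoft.VisualBasic

Module TurkishCaseDemo
    Sub Main()
        ' Assumption: the runtime has Turkish culture data (not globalization-invariant mode).
        Dim turkish As New CultureInfo("tr-TR")

        Dim upperOfI As Char = Char.ToUpper("i"c, turkish)   ' expected: İ (U+0130)
        Dim lowerOfI As Char = Char.ToLower("I"c, turkish)   ' expected: ı (U+0131)

        ' Print the code points in hex so the result does not depend on the console font.
        Console.WriteLine(AscW(upperOfI).ToString("X4"))
        Console.WriteLine(AscW(lowerOfI).ToString("X4"))

        ' For comparison, the invariant culture maps i to the plain ASCII I (U+0049).
        Console.WriteLine(AscW(Char.ToUpperInvariant("i"c)).ToString("X4"))
    End Sub
End Module
```

Both Turkish results are above 127, so they land inside the 128-65535 range from the question.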