How to replace utf8 characters?
Hello!
The text is full of Unicode characters. I need to "flatten" multibyte characters to their base letters, for example ā to a. What is the right way to do this?
As I understand it, "a" is the base character and the mark on top is an additional one. Perhaps the additional characters can somehow be cut off from the base one.
If I simply convert the string encoding, I get b?r instead of bar.
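The b?r result can be reproduced with a plain lossy conversion. The snippet below is only an illustration of the problem, assuming the input is the string "bār"; it is not the conversion the asker actually ran.

```python
text = "b\u0101r"  # "bār", with ā as the single code point U+0101

# ASCII has no slot for U+0101, so errors="replace" substitutes "?"
# instead of falling back to the base letter "a".
lossy = text.encode("ascii", errors="replace").decode("ascii")
print(lossy)
```

A plain encoding conversion knows nothing about base letters and marks, which is why the character is replaced wholesale.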
As I understand it, "a" is the base character and the mark on top is an additional one.
ā 257
U+0101
LATIN SMALL LETTER A WITH MACRON
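The values above can be checked from Python. This is a small sketch using only the standard `unicodedata` module; the variable names `composed` and `decomposed` are just illustrative. It also shows that the same visible letter may arrive either as one code point or as "a" plus a combining macron.

```python
import unicodedata

composed = "\u0101"      # ā as one precomposed code point
decomposed = "a\u0304"   # "a" followed by COMBINING MACRON

print(ord(composed), hex(ord(composed)))   # 257 0x101
print(unicodedata.name(composed))          # LATIN SMALL LETTER A WITH MACRON
print(len(composed), len(decomposed))      # 1 2

# Canonical composition (NFC) maps the two-code-point form to the single one.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```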
$result = iconv('Windows-1251', 'ASCII//TRANSLIT', $src);
$result = iconv('UTF-8', 'ASCII//TRANSLIT', $src);
You understand correctly.
It remains to understand why you need this. Perhaps you want to implement search, where looking for 'a' should also match 'ā'? I can't think of any other use...
If so, it is enough to normalize both the query and the text to NFKC (compatibility decomposition followed by canonical composition) when searching. The decomposition step follows the compatibility rules, so even glyphs that look nothing like each other will match when one can stand in for the other with the same meaning. Note that NFKC by itself keeps ā as a single composed character; for 'a' to match 'ā', the diacritic still has to be removed as described in the next paragraph.
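A rough sketch of what NFKC does and does not do for matching, using the standard `unicodedata.normalize`. The helper name `search_key` is hypothetical and only wraps the normalization call.

```python
import unicodedata

def search_key(text: str) -> str:
    """Hypothetical helper: bring text to NFKC before comparing."""
    return unicodedata.normalize("NFKC", text)

# Compatibility variants fold to the same key.
print(search_key("\ufb01le") == search_key("file"))   # "fi" ligature, True
print(search_key("\uff41") == search_key("a"))        # fullwidth a, True

# Both spellings of ā end up as the single code point U+0101.
print(search_key("a\u0304") == search_key("\u0101"))  # True

# NFKC on its own does not strip the macron.
print(search_key("\u0101") == search_key("a"))        # False
```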
If you just need to "take the lid off", use ordinary canonical decomposition, NFD. It breaks every character down into its component parts. Then go through the result once more and remove the diacritics: either keep only the categories Lu and Ll if the text contains nothing but cased letters, or simply drop everything in category Mn (nonspacing marks), which is enough.
Python example:
>>> import unicodedata
>>> unicodedata.decomposition(unicodedata.lookup('LATIN SMALL LETTER A WITH MACRON'))
'0061 0304'
>>> unicodedata.decomposition(unicodedata.lookup('LATIN SMALL LETTER A WITH TILDE'))
'0061 0303'
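Building on the decompositions shown above, here is a minimal sketch of the NFD-then-drop-Mn approach. The function name `strip_marks` is made up for illustration, and the final NFC step is an extra assumption to keep the result in composed form.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Remove nonspacing marks (category Mn) after canonical decomposition."""
    # NFD splits "ā" into "a" + U+0304 (COMBINING MACRON).
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except nonspacing marks.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever is left (a no-op for plain ASCII results).
    return unicodedata.normalize("NFC", stripped)

print(strip_marks("b\u0101r"))   # bar
print(strip_marks("\u00e3"))     # a
```

Two caveats follow from how the method works: letters with no canonical decomposition, such as ø, pass through unchanged, and letters that are distinct in their alphabet but decompose to base plus mark, such as Cyrillic й, lose the mark as well.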