How to replace UTF-8 characters?

PHP

Anatoly, 2017-06-19 16:38:28

Hello!
The text is full of Unicode characters. I need to "reset" the multibyte characters to their plain forms, for example ā to a. What is the right way to do this?
As I understand it, the character "a" is the base one, and the mark on top is an addition. Perhaps the additional marks can somehow be cut off from the base character.
If I simply convert the string encoding, the result is not bar but b?r.

3 answer(s)

Rsa97, 2017-06-19
@Rsa97

As I understand it, the character "a" is the base one, and the mark on top is an addition. Perhaps the additional marks can somehow be cut off from the base character.

You misunderstand. The character ã is a single letter used in Portuguese; in Unicode it is U+00E3 LATIN SMALL LETTER A WITH TILDE. It is far from certain that a plain a can correctly be used in its place.
What is the purpose of such a conversion? Isn't it easier to just work in UTF-8?
If it is really necessary, see this article: https://habrahabr.ru/post/45489/

Stalker_RED, 2017-06-19
@Stalker_RED

As I understand it, the character "a" is the base one, and the mark on top is an addition.

You misunderstood; this is a single character:

ā	257
U+0101
LATIN SMALL LETTER A WITH MACRON
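
To check this for any character yourself, here is a minimal Python sketch using only the standard library. The helper name describe is hypothetical and exists only for illustration; the character is written as the escape \u0101 so the example does not depend on how the source file is saved.

```python
import unicodedata

def describe(ch: str):
    """Hypothetical helper: return (code point, U+ notation, Unicode name, UTF-8 byte count)."""
    code = ord(ch)
    return code, f"U+{code:04X}", unicodedata.name(ch), len(ch.encode("utf-8"))

# U+0101 is the precomposed "a with macron": one code point, two bytes in UTF-8.
print(describe("\u0101"))
```

The string has length 1 in code points even though it takes two bytes in UTF-8, which is why byte-oriented handling sees it as "multibyte".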

You can look it up here, for example: xahlee.info/comp/unicode_index.html
You can use iconv, but you need to know the source encoding.

// If the source string is in Windows-1251:
$result = iconv('Windows-1251', 'ASCII//TRANSLIT', $src);
// If the source string is in UTF-8:
$result = iconv('UTF-8', 'ASCII//TRANSLIT', $src);

Dmitry, 2017-06-19
@TrueBers

You understand correctly.
It remains to understand why you need it. Perhaps you want to implement search, so that a query for 'a' also matches 'ā'? I can't think of any other use...
If so, start by normalizing both the text and the query to NFKC, that is, compatibility decomposition followed by canonical composition. It decomposes according to compatibility rules, so even glyphs that do not look like the original will match, namely when one character can be replaced by a completely different one that has the same meaning. Keep in mind that NFKC by itself leaves the diacritics in place: ā stays ā.
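
As a rough illustration of what NFKC does and does not fold, here is a small Python sketch. The function name search_key is hypothetical, and the sample strings are my own choices, written as escapes so the code points are unambiguous.

```python
import unicodedata

def search_key(text: str) -> str:
    """Hypothetical helper: normalize text to NFKC before comparing."""
    return unicodedata.normalize("NFKC", text)

# Compatibility variants fold together:
print(search_key("\ufb01le") == "file")           # ligature U+FB01 becomes "fi"
print(search_key("\uff21\uff22\uff23") == "ABC")  # fullwidth letters become ASCII
# Precomposed and decomposed spellings of the same letter also meet:
print(search_key("a\u0304") == search_key("\u0101"))
# But the diacritic itself survives, so this prints False:
print(search_key("\u0101") == search_key("a"))
```

So in this sketch NFKC makes different spellings of the same text comparable, while matching ā against a plain a still needs the diacritics removed separately.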
If you need to "just remove the cap", use the ordinary canonical decomposition, NFD. It breaks every character down into its component parts. Then go through the result once more and remove the diacritics: either keep only the categories Lu and Ll if the text is clean, or simply drop everything in category Mn.
Python example:

>>> import unicodedata
>>> unicodedata.decomposition(unicodedata.lookup('LATIN SMALL LETTER A WITH MACRON'))
'0061 0304'
>>> unicodedata.decomposition(unicodedata.lookup('LATIN SMALL LETTER A WITH TILDE'))
'0061 0303'
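
Putting the NFD-then-drop-Mn idea together, a minimal sketch could look like the following. The function name strip_marks is hypothetical, and the final NFC step is my own addition so that anything NFD split apart for other reasons (Hangul syllables, for example) is put back together.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Hypothetical helper: remove combining marks (category Mn) after NFD."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    # Recompose whatever is left; this does not bring the removed marks back.
    return unicodedata.normalize("NFC", kept)

print(strip_marks("b\u0101r"))        # bar
print(strip_marks("S\u00e3o Paulo"))  # Sao Paulo
print(strip_marks("\u0142"))          # U+0142 has no decomposition, so it is unchanged
print(strip_marks("\u0439"))          # Cyrillic U+0439 loses its breve and becomes U+0438
```

The last two lines show the limits: letters without a canonical decomposition pass through untouched, and letters that really are distinct in their language get flattened anyway, so whether this is acceptable depends on the text.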