How is UTF-8 put into a char?

Alan Kabisov, 2017-02-19 17:36:30

C++ / C#

I'm wondering how a UTF-8 character fits into the char type, which in C, for example, is 1 byte in size, while a Unicode character, logically and according to Wikipedia, can take from 1 to 6 bytes. Embarrassing as it is, I just can't understand how Unicode fits into a char. How, for example, are Russian letters displayed in the console? Can someone clarify this for me?

4 answers

Dmitry, 2017-02-19
@TrueBers

It doesn't fit.
C doesn't support handling UTF-8 strings. That requires third-party libraries that can do normalization, calculate the length of a string in abstract characters rather than code points, and so on. Neither C nor C++ supports this out of the box.
You can, of course, stuff a UTF-8 stream into an array of chars, but no native string manipulation function will handle it correctly. Even the string length won't come out right: it will be counted in bytes, not characters.
So the answer is simple: in C/C++, use a third-party library to work with UTF-8.
And yes, never use wchar_t except where you can't avoid it, such as in third-party library APIs. wchar_t is a dumb language-design crutch, which even the creators of those languages have recognized.

abcd0x00, 2017-02-20
@abcd0x00

There is Unicode: a huge table of all the characters in the world. Every character has its own number there, and these numbers never change. There are several encodings for Unicode, one of which is UTF-8 (the others are UTF-16 and UTF-32). An encoding is a mapping between byte sequences and characters: one byte sequence corresponds to one character. So in UTF-8 one byte sequence maps to one Unicode character (by its number), and conversely one Unicode character (its number) maps to one UTF-8 byte sequence. In other words, you can convert in both directions.
So you read a sequence of bytes, turn it into a single number by a fixed algorithm, and then look that number up in Unicode to find the character.
The UTF-8 encoding itself (the conversion rule) is very simple: take the first byte, which tells you how many more bytes to read. Then take those bytes, drop the marker bits at the start of each one, and join the remaining bits into one continuous sequence, which forms a number. That number is then looked up in Unicode.
You will also want to understand what cp1251 is. It is an encoding too, but it has nothing to do with Unicode. Instead of Unicode it uses a different table (a very small one of 256 characters), so one byte is enough to hold the code of any character in it. That table includes its own Cyrillic alphabet, which is why a Cyrillic letter fits in one byte there.

Antony, 2017-02-19
@RiseOfDeath

You answered your own question: one character takes from 1 to 4 chars (the original UTF-8 design allowed up to 6, but RFC 3629 limits it to 4).

rassant, 2020-07-20
@rassant

How do you convert a number to a letter?
For example, in Unicode 'ф' is 1092.
How can this number be turned into a letter? wchar_t doesn't work, and of course char doesn't work either.