UTF-8 - what is the 0 after the ones in the leading byte for?

DariV, 2021-05-31 21:27:46

Unicode

If a character is encoded as a single byte, its most significant bit is 0 for ASCII compatibility. If a character is encoded in 2-4 bytes, the top 2-4 bits of the leading byte are set to 1, followed by a 0. What is that 0 for, if in theory the ones alone are enough to determine the character's boundaries?

1 answer(s)

galaxy, 2021-06-01
@DariV

What is the 0 for, if in theory the ones alone are enough to determine the boundaries of a character?

For convenience. From the start byte of a sequence (the bytes encoding one character) you can determine its length in bytes: the number of leading 1 bits equals the number of bytes in the sequence. The zero marks the end of the run of leading ones. Without it, you could not tell from the start byte alone how many of its high 1 bits belong to the length prefix and how many are data (you would have to keep reading bytes until you hit the next start byte or a single-byte character).
That is, suppose you see this sequence (the second byte is only for illustration; pay attention to the first):
11110001 10xxxxxx
Without the terminating zero, is this two bytes encoding the Unicode character 110001xxxxxx? Or three bytes encoding the character 10001xxxxxx...? Or four, encoding 0001xxxxxx...?
You could not tell without reading ahead to the start of the next character.
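
To make this concrete, here is a rough Python sketch of both halves of the argument. The function names are made up for illustration. `utf8_sequence_length` counts the leading ones up to the first 0, which is how the real encoding announces its length; it does not check for overlong or out-of-range lead bytes. `readings_without_terminator` models the hypothetical scheme from the question, where no 0 follows the ones, and lists the data bits the lead byte would carry under each possible length.

```python
def utf8_sequence_length(lead):
    """Return the byte count announced by a UTF-8 lead byte (simplified)."""
    if not 0 <= lead <= 0xFF:
        raise ValueError("lead must be a single byte value")
    if lead & 0b1000_0000 == 0:
        return 1  # 0xxxxxxx: single-byte (ASCII) character

    # Count leading 1 bits; the first 0 bit terminates the run.
    ones = 0
    mask = 0b1000_0000
    while lead & mask:
        ones += 1
        mask >>= 1

    if ones == 1:
        raise ValueError("10xxxxxx is a continuation byte, not a lead byte")
    if ones > 4:
        raise ValueError("UTF-8 lead bytes have at most four leading ones")
    return ones


def readings_without_terminator(lead):
    """Hypothetical scheme with no 0 terminator (NOT real UTF-8).

    Maps each candidate sequence length to the data bits that would be
    left in the lead byte if that many leading ones were the prefix.
    """
    bits = format(lead, "08b")
    run = len(bits) - len(bits.lstrip("1"))  # length of the leading run of ones
    return {n: bits[n:] for n in range(2, min(run, 4) + 1)}


if __name__ == "__main__":
    lead = 0b11110001
    print(utf8_sequence_length(lead))         # real UTF-8: one unambiguous answer
    print(readings_without_terminator(lead))  # no terminator: several candidates
```

With the terminating zero the first call has exactly one answer for 11110001. Without it, the second call returns three equally plausible readings of the same byte, which is the ambiguity described above.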