How do I correctly truncate a UTF-8 encoded std::string in C++?

C++ / C#

DiIce, 2016-01-07 18:37:43

I have a std::string that contains UTF-8 encoded text (Russian and English letters, digits).
How can I correctly truncate it, or copy part of it into a new variable, limited to, say, 10 characters?

3 answer(s)

Stanislav Makarov, 2016-01-08
@DiIce

Use ICU (International Components for Unicode).

Mercury13, 2016-01-07
@Mercury13

Do you mean 10 Unicode characters or 10 UTF-8 bytes?
Either way, UTF-8 bytes fall into three categories:
• Leading (start a character): 0x00…0x7F and 0xC0…0xF4. Strictly speaking, 0xC0 and 0xC1 never occur in valid UTF-8, but treating them as leading bytes is harmless here.
• Continuation (never occur at the start of a character): 0x80…0xBF
• Forbidden: 0xF5…0xFF. For our purposes these can be treated as leading bytes too.
If the goal is to keep 10 characters, find the 11th leading byte and cut just before it.
If the goal is to keep at most 10 bytes and the 11th byte (s[10], if it exists, of course) is not a leading byte, keep dropping bytes from the end until a leading byte has been dropped.
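
The two cases above can be sketched roughly as follows. This is a minimal sketch, not a validated implementation: the function names are made up for illustration, the input is assumed to be reasonably well-formed UTF-8, and a "character" here means a Unicode code point, so a letter followed by a combining mark counts as two.

```cpp
#include <cstddef>
#include <string>

// True if the byte starts a new UTF-8 sequence, i.e. it is not a
// continuation byte of the form 10xxxxxx (0x80..0xBF).
// (Hypothetical helper name, not from any library.)
inline bool is_utf8_lead(unsigned char b) {
    return (b & 0xC0) != 0x80;
}

// Keep at most max_chars code points: cut just before the leading byte
// of code point number max_chars + 1.
std::string utf8_truncate_chars(const std::string& s, std::size_t max_chars) {
    std::size_t seen = 0;  // leading bytes counted so far
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_utf8_lead(static_cast<unsigned char>(s[i]))) {
            if (seen == max_chars) return s.substr(0, i);
            ++seen;
        }
    }
    return s;  // the string has max_chars code points or fewer
}

// Keep at most max_bytes bytes without splitting a multi-byte sequence.
std::string utf8_truncate_bytes(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t cut = max_bytes;
    // If s[cut] is a continuation byte, the cut would land inside a
    // sequence, so back up to the leading byte and drop it as well.
    while (cut > 0 && !is_utf8_lead(static_cast<unsigned char>(s[cut]))) --cut;
    return s.substr(0, cut);
}
```

For the question as asked, utf8_truncate_chars(s, 10) would give the first 10 code points. If user-perceived characters (grapheme clusters) matter, a library such as ICU is the safer route.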

Oleg Tsilyurik, 2016-01-07
@Olej

How to cut it correctly or partially copy it into a new variable, limiting it to, say, 10 characters?

Strictly correctly: there is no way, whatever you do with the raw bytes will be a trick.
To work correctly with localized strings, convert them to wide strings (std::wstring, wchar_t). Then you can do everything with them in the usual way: get the length, search for characters, trim, pad...
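
As a rough illustration of the wide-string approach, here is a sketch that decodes UTF-8 into one element per code point, so that substr() counts characters. Note the substitution: it uses std::u32string (char32_t) rather than std::wstring, because wchar_t is only 16 bits on Windows and cannot hold every code point in one element. The function names are hypothetical, and the decoder assumes well-formed input (it does no validation).

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into one char32_t per code point. Assumes well-formed
// input: overlong forms, surrogates and truncated tails are not checked.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (; extra > 0 && i < in.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Encode code points back into UTF-8.
std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With these helpers, taking the first 10 characters could look like utf32_to_utf8(utf8_to_utf32(s).substr(0, 10)); length, search and padding then work per code point on the std::u32string as well.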
P.S. If you're not too lazy, you can dig up ready-made code examples here:
Problems of programming in C language
Problems of programming in C++ language