Character encoding
Sergey Sokolov, 2020-11-26 23:43:29

Why does "й" (short i) take 4 bytes in file names on a UTF-8 Linux file system?

A CentOS 7.8 server holds files with Cyrillic names, for example Юрий.jpg.
In this file's name the first three letters take 2 bytes each, but for some reason "й" takes 4:

%D0%AE%D1%80%D0%B8%D0%B8%CC%86.jpg

Yet when the same name arrives in a web request, every letter, including "й", takes 2 bytes:

%D0%AE%D1%80%D0%B8%D0%B9.jpg

(Both examples were produced with PHP's urlencode().)

Given a web request with the parameter "Юрий", I need to find the corresponding local file. What is the best way to deal with this ambiguity in how some letters are encoded?

For now I'm thinking of renaming the files, replacing the long sequence for "й" with the normal 2 bytes. Apparently the files were transferred to the server in a way that altered some characters. If I create a new file with a Russian name on the server, there is no problem: each character takes 2 bytes.

But it is not clear which other characters were affected. It is unlikely to be just "й".


2 answers
Sergey Sokolov, 2020-11-27
@sergiks

Thanks to hint000 for the explanation and the pointer to the Unicode normalization forms; the table there includes an example with "й". That is how I learned about NFD and NFC.
As a result, in PHP, when I look up a file with a Cyrillic name from the received parameter, I first normalize the parameter to NFD (the form the local files turned out to be in) using the Normalizer class:

// The Normalizer class is provided by the PHP intl extension.
Normalizer::normalize($cyrillic_name, Normalizer::FORM_D);
// converts the name "Юрий" from
// "%D0%AE%D1%80%D0%B8%D0%B9"
// to
// "%D0%AE%D1%80%D0%B8%D0%B8%CC%86", which is the form the local files turned out to use.
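
If it is not certain that every local file is in NFD (new files created on the server would be in NFC, as the question notes), one option is to normalize both sides before comparing. The following is only a sketch under that assumption: the helper name findFileByNormalizedName and its parameters are hypothetical, and it needs the intl extension.

```php
<?php
// Hypothetical helper (not from the original post): returns the path of the
// file in $dir whose name matches $requested once both are normalized to NFC,
// or null if there is no such file.
function findFileByNormalizedName($dir, $requested)
{
    $wanted = Normalizer::normalize($requested, Normalizer::FORM_C);
    if (!is_string($wanted) || !is_dir($dir)) {
        return null; // invalid UTF-8 in the parameter, or no such directory
    }

    $entries = scandir($dir);
    if ($entries === false) {
        return null; // directory could not be read
    }

    foreach ($entries as $entry) {
        if ($entry === '.' || $entry === '..') {
            continue;
        }
        // Names that are not valid UTF-8 do not normalize to a string
        // and therefore never match.
        if (Normalizer::normalize($entry, Normalizer::FORM_C) === $wanted) {
            return $dir . DIRECTORY_SEPARATOR . $entry;
        }
    }

    return null;
}
```

Because the result is always built from a name that scandir() returned, the request parameter is never used as a path itself. The cost is one directory scan per lookup, which may be too slow for very large directories.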

Roman Mirilaczvili, 2020-11-27
@2ord

Because in those file names "й" is stored as two parts: "и" plus a combining mark placed above it (U+0306, combining breve, bytes CC 86), which are rendered on screen as the single character "й".
And there is nothing wrong with that.
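
The two spellings can be compared byte by byte. This is a small illustrative sketch, not part of the original answer; the variable names are arbitrary, and the Normalizer calls assume the intl extension is installed.

```php
<?php
// "й" written as a single code point, U+0439 (the NFC form).
$composed = "\xD0\xB9";
// "и" (U+0438) followed by the combining breve U+0306 (the NFD form).
$decomposed = "\xD0\xB8\xCC\x86";

echo bin2hex($composed), "\n";   // d0b9
echo bin2hex($decomposed), "\n"; // d0b8cc86

// The byte strings differ, so a plain comparison fails...
var_dump($composed === $decomposed); // bool(false)

// ...but each converts to the other under Unicode normalization.
var_dump(Normalizer::normalize($composed, Normalizer::FORM_D) === $decomposed); // bool(true)
var_dump(Normalizer::normalize($decomposed, Normalizer::FORM_C) === $composed); // bool(true)
```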
