[Solved] Bash, regexp and Cyrillic?

Misty Hedgehog, 2015-05-16 11:45:44

linux

Good day, %username%.
We have a bash script that runs on various platforms, from the old Debian GNU/Linux 5.0 (Linux 2.6.32.11, bash 3.2.39) to Red Hat 4.8.2 (Linux 3.10.0, bash 4.2.46). The script takes a string as input (as a parameter or from STDIN), and that string can contain all sorts of things. The string is processed, the excess is cut out, and the result is inserted into a JSON request and sent on. But I have run into a problem that I cannot solve at the moment, and it is as follows:

I need a regular expression that cuts out all characters except Latin letters, Cyrillic letters, digits, and punctuation marks.

And everything would be fine, except that on a number of our operating systems Cyrillic in the source code is met with hostility. That is, the script works until it becomes necessary to edit it. After an attempt to edit it, because of a construct of the form:
string=${string//[^0-9A-Za-zА-Яа-яЁё]/_};
(namely, because of А-Яа-яЁё), saving the open file, for example in nano, becomes problematic. In my opinion, the most logical solution is to replace the Cyrillic characters themselves with their codes, but how? Attempts like \430-\44f, \u430-\u44f, and \x430-\x44f fail. Looking at the hexdump output and the character codes, we get the following picture:

printf 'abcd' | hexdump -C; exit 0;
$ ./test.sh
00000000  61 62 63 64                                       |abcd|
00000004

printf 'абвг' | hexdump -C; exit 0;
$ ./test.sh test
00000000  d0 b0 d0 b1 d0 b2 d0 b3                           |........|
00000008

printf %x "'а"; echo " "; printf %x "'я"; exit 0;
$ ./test.sh test
430
44f

printf %x "'a"; echo " "; printf %x "'z"; exit 0;
$ ./test.sh test
61
7a
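
The two views above fit together: printf %x prints the Unicode code point (430), while hexdump shows the UTF-8 encoding of that code point (d0 b0). A minimal sketch of the conversion, assuming two-byte code points only (U+0080 to U+07FF, which covers the Cyrillic letters used here). The helper name cp_to_utf8_escape is hypothetical and not part of the original script:

```bash
#!/usr/bin/env bash

# Hypothetical helper: print the \xHH\xHH escapes for a hex code point
# in the two-byte UTF-8 range (U+0080..U+07FF).
cp_to_utf8_escape() {
    local cp=$(( 16#$1 ))
    # Two-byte UTF-8 layout: 110xxxxx 10xxxxxx
    printf '\\x%02x\\x%02x\n' \
        $(( 0xC0 | (cp >> 6) )) \
        $(( 0x80 | (cp & 0x3F) ))
}

cp_to_utf8_escape 430   # а
cp_to_utf8_escape 44f   # я
```

This suggests why \x430 cannot work as written: an \x escape describes a single byte, and the character needs two.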

To sum up my question:

What should a regular expression look like (usable in a pure bash environment if possible) that matches every character except Latin letters, Cyrillic letters, digits, and punctuation, given that the Cyrillic range must be written as a range of character codes rather than as the characters themselves?

Thanks in advance to the community.
// this question has also been asked on stackoverflow.com

3 answer(s)

Shetani, 2015-05-16
@paramtamtam

Try replacing the Cyrillic with the ranges \xd0\x90-\xd0\xbf and \xd1\x80-\xd1\x8f, plus Ё as \xd0\x81 and ё as \xd1\x91:

string=${string//[^0-9A-Za-z\xd0\x90-\xd0\xbf\xd1\x80-\xd1\x8f\xd0\x81\xd1\x91]/_};
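
One caveat worth checking before relying on this: inside an ordinary bash pattern a backslash only quotes the next character, so \xd0 is read as the letters x, d and the digit 0 rather than as a byte. The escapes are only decoded inside ANSI-C quoting, $'...'. A small sketch that compares both forms with the same ranges; the variable names and the sample string are made up for illustration:

```bash
#!/usr/bin/env bash

s='ab <бвг>'

# Compare ranges by code value rather than by locale collation order.
LC_COLLATE=C

# Plain backslashes: \x is just a literal "x", so no Cyrillic range is formed
# and the Cyrillic letters get replaced.
plain=${s//[^0-9A-Za-z\xd0\x90-\xd0\xbf\xd1\x80-\xd1\x8f\xd0\x81\xd1\x91]/_}

# ANSI-C quoting decodes \xHH into real bytes before the pattern is built.
cyr=$'\xd0\x90-\xd0\xbf\xd1\x80-\xd1\x8f\xd0\x81\xd1\x91'
quoted=${s//[^0-9A-Za-z${cyr}]/_}

printf 'plain:  %s\nquoted: %s\n' "$plain" "$quoted"
```

Only the second form keeps the Cyrillic letters; the script file itself stays pure ASCII apart from the test string.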

Power, 2015-05-16
@Power

It is possible like this:

message_text='qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM абв..эюяАБВ..ЭЮЯ [email protected]#$%^&*()_"`'"'"
string="<\!DOCTYPE html><html><body>$message_text</body></html>"

cyrillic=$'\xd0\x90-\xd0\xaf\xd0\xb0-\xd1\x8f\xd0\x81\xd1\x91' # 'А-Яа-яЁё' in UTF-8
old_collate=$LC_COLLATE
LC_COLLATE=C # otherwise there may be non-obvious effects (for example, "À" would be treated as equal to "A")
eval "string=\${string//[^0-9A-Za-z${cyrillic}]/_}" # be careful with eval; in this case it is fine, it runs the line string=${string//[^0-9A-Za-zА-Яа-яЁё]/_}
LC_COLLATE=$old_collate
echo "$string"
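
A possible variation on the snippet above, offered as a sketch rather than a drop-in replacement: as far as I can tell, bash expands ${cyrillic} inside the pattern of ${var//pattern/replacement} on its own, so eval may not be needed. The function name sanitize and the argument-or-STDIN handling are assumptions based on the description in the question:

```bash
#!/usr/bin/env bash

# Hypothetical wrapper: takes the string as $1, or reads it from STDIN.
sanitize() {
    local input cyrillic old_collate result
    if [ "$#" -gt 0 ]; then
        input=$1
    else
        input=$(cat)
    fi

    # 'А-Яа-яЁё' written as UTF-8 byte escapes, so the file stays ASCII-only.
    cyrillic=$'\xd0\x90-\xd0\xaf\xd0\xb0-\xd1\x8f\xd0\x81\xd1\x91'

    # Same locale precaution as above, restored afterwards.
    old_collate=$LC_COLLATE
    LC_COLLATE=C
    result=${input//[^0-9A-Za-z${cyrillic}]/_}
    LC_COLLATE=$old_collate

    printf '%s\n' "$result"
}

sanitize 'ab <бвг>'
```

Like the snippets above, this keeps only letters and digits; any punctuation that must survive would have to be added to the bracket expression.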

ShamblerR, 2015-05-18
@ShamblerR

1. One option is POSIX character classes, though there is no guarantee they will work either, since they are fairly new.
2. Install Cyrillic support on the system; as a rule, that takes no time at all.
3. Do the exclusion the other way round.
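
To illustrate the first option, a minimal sketch using POSIX character classes. It assumes the script runs under a UTF-8 locale; under the C locale the Cyrillic letters would not count as [:alnum:]. Note also that [:alnum:] matches letters of any script the locale knows, not only Latin and Cyrillic. The function name keep_alnum_punct is hypothetical:

```bash
#!/usr/bin/env bash

# Hypothetical helper: replace everything that is neither alphanumeric
# nor punctuation (as defined by the current locale) with "_".
keep_alnum_punct() {
    local s=$1
    printf '%s\n' "${s//[^[:alnum:][:punct:]]/_}"
}

keep_alnum_punct 'abc, бвг 123!'
```

The source file contains no Cyrillic in the pattern itself, which is what the question asks for, at the cost of a looser definition of "letter".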