[Solved] Bash, regexp and Cyrillic?
Good day, %username%.
We have a bash script that runs on a range of platforms, from the old Debian GNU/Linux 5.0 (Linux 2.6.32.11, bash 3.2.39) to Red Hat 4.8.2 (Linux 3.10.0, bash 4.2.46). The script takes a string as input (as a parameter or on STDIN) that can contain almost anything. The string is processed, the excess is cut out, and the result is inserted into a JSON request and sent on. I have run into a problem that I cannot solve at the moment:
I need a regular expression that cuts out every character except Latin letters, Cyrillic letters, digits and punctuation marks.
string=${string//[^0-9A-Za-zА-Яа-яЁё]/_};
printf 'abcd' | hexdump -C; exit 0;
$ ./test.sh
00000000 61 62 63 64 |abcd|
00000004
printf 'абвг' | hexdump -C; exit 0;
$ ./test.sh test
00000000 d0 b0 d0 b1 d0 b2 d0 b3 |........|
00000008
printf %x "'а"; echo " "; printf %x "'я"; exit 0;
$ ./test.sh test
430
44f
printf %x "'a"; echo " "; printf %x "'z"; exit 0;
$ ./test.sh test
61
7a
What should a regular expression (usable in pure bash, if possible) look like that matches every character except Latin letters, Cyrillic letters, digits and punctuation, given that the Cyrillic range has to be written as a range of character codes rather than as the characters themselves?
Try replacing the Cyrillic letters with the UTF-8 byte ranges \xd0\x90-\xd0\xbf and \xd1\x80-\xd1\x8f, plus Ё as \xd0\x81 and ё as \xd1\x91:
string=${string//[^0-9A-Za-z\xd0\x90-\xd0\xbf\xd1\x80-\xd1\x8f\xd0\x81\xd1\x91]/_};
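The byte pairs in this answer follow from the UTF-8 two-byte encoding of the code points printed in the question (430 for а, 44f for я). A minimal sketch of that arithmetic is below; the function name utf8_escape is hypothetical, and it only handles code points in the two-byte range U+0080..U+07FF, which covers the Cyrillic letters discussed here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the UTF-8 bytes of a code point as \xNN escapes.
# Only the two-byte range U+0080..U+07FF is handled.
utf8_escape() {
    local cp=$(( $1 ))
    if (( cp < 0x80 || cp > 0x7FF )); then
        echo "code point outside the two-byte range: $1" >&2
        return 1
    fi
    # Lead byte 110xxxxx carries the top 5 bits,
    # continuation byte 10xxxxxx carries the low 6 bits.
    printf '\\x%02x\\x%02x\n' $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
}

utf8_escape 0x430   # а: prints \xd0\xb0
utf8_escape 0x44f   # я: prints \xd1\x8f
utf8_escape 0x401   # Ё: prints \xd0\x81
```

The first result matches the d0 b0 pair that hexdump shows for а in the question.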
It can be done like this:
message_text='qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM абв..эюяАБВ..ЭЮЯ [email protected]#$%^&*()_"`'"'"
string="<\!DOCTYPE html><html><body>$message_text</body></html>"
cyrillic=$'\xd0\x90-\xd0\xaf\xd0\xb0-\xd1\x8f\xd0\x81\xd1\x91' # 'А-Яа-яЁё' in UTF-8
old_collate=$LC_COLLATE
LC_COLLATE=C # otherwise there can be non-obvious effects (for example, "À" would be treated as equal to "A")
eval "string=\${string//[^0-9A-Za-z${cyrillic}]/_}" # be careful with eval; in this case it is fine, it executes the line string=${string//[^0-9A-Za-zА-Яа-яЁё]/_}
LC_COLLATE=$old_collate
LC_COLLATE=$old_collate
echo "$string"
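A variant without eval might look like the sketch below. It relies on bash expanding an unquoted variable inside the pattern of ${var//pattern/replacement}, and it uses a subshell function body instead of saving and restoring LC_COLLATE. The function name sanitize is hypothetical. The sketch assumes LC_CTYPE names a UTF-8 locale and LC_ALL is unset (LC_ALL would override LC_COLLATE), and it has not been checked on bash 3.2.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Assumes a UTF-8 LC_CTYPE and an unset LC_ALL.
sanitize() (
    # The body runs in a subshell, so this change does not leak to the caller.
    LC_COLLATE=C
    # UTF-8 bytes for the ranges А-Я and а-я plus the letters Ё and ё.
    cyrillic=$'\xd0\x90-\xd0\xaf\xd0\xb0-\xd1\x8f\xd0\x81\xd1\x91'
    s=$1
    # ${cyrillic} is deliberately left unquoted inside the pattern,
    # so its "-" characters still form ranges.
    printf '%s\n' "${s//[^0-9A-Za-z${cyrillic}]/_}"
)

sanitize 'abc <b>123</b>'   # prints abc__b_123__b_
```

Like the original, this replaces punctuation too; any punctuation that should survive would have to be added to the bracket expression.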