Regular Expressions
0ldn0mad, 2020-06-07 00:25:40

Why do regular expressions work this way?

In Python, regular expressions work as expected. For example:

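import re

# Sample text in Russian: "If you want to build a ship, don't gather people, plan,
# divide up the work, and fetch tools. Infect people with a longing for the endless sea.
# Then they will build the ship themselves."
s = '''Если ты хочешь построить корабль, не надо созывать людей, планировать, делить работу, доставать инструменты.
Надо заразить людей стремлением к бесконечному морю. Тогда они сами построят корабль.'''
pattern = r'\w+'
matches = re.findall(pattern, s)  # in Python 3, \w is Unicode-aware for str patterns
if matches:
    print(matches)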

Outputs as expected:
['Если', 'ты', 'хочешь', 'построить', 'корабль', 'не', 'надо', 'созывать', 'людей', 'планировать', 'делить', 'работу', 'доставать', 'инструменты', 'Надо', 'заразить', 'людей', 'стремлением', 'к', 'бесконечному', 'морю', 'Тогда', 'они', 'сами', 'построят', 'корабль']

I do the same thing in the program, and the output makes no sense:
\w [a-z0-9] Letters and digits
[Screenshot: 5edc091f452b6192908680.png]
It doesn't match any letters or words at all.
Then I change it to an uppercase W:
\W [^a-z0-9] Anything except letters and digits
[Screenshot: 5edc09a0bb09a035601032.png]
And here it matches both letters and digits. What's wrong?

2 answers
dodo512, 2020-06-07
@dodo512

pcre.org/original/doc/html/pcrepattern.html#SEC2

Unicode property support
Another special sequence that may appear at the start of a pattern is (*UCP). This has the same effect as setting the PCRE_UCP option: it causes sequences such as \d and \w to use Unicode properties to determine character types, instead of recognizing only characters with codes less than 128 via a lookup table.

To make \w match more than just the Latin alphabet, add (*UCP) at the start of the pattern:
(*UCP)\w+
[Screenshot: 5edc399de65ad852879103.png]
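
The difference can be reproduced in Python as a rough analogy. This is only a sketch: Python's re module is not PCRE and does not accept (*UCP). The re.ASCII flag is used here as a stand-in for PCRE's default table-based behavior, and omitting it stands in for (*UCP). The sample string is made up for the illustration.

```python
import re

# Hypothetical sample: Cyrillic words followed by an ASCII word and a number.
text = "Надо заразить людей стремлением к бесконечному морю. Sea 2020"

# Default for str patterns in Python 3: \w uses Unicode character properties,
# comparable to what (*UCP) enables in PCRE.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], comparable to PCRE without UCP.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# With ASCII-only semantics, \W (the complement of \w) matches the Cyrillic text.
ascii_non_words = re.findall(r"\W+", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)       # ['Sea', '2020']
print(ascii_non_words)
```

In ASCII-only mode \w skips every Cyrillic letter while \W picks them up, which is the same symptom as in the screenshots from the question.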

Saboteur, 2020-06-07
@saboteur_kiev

"in the program"

What program is this, and who wrote its regular expression implementation?
Questions like this should go to its author.
