How to sort regular expressions by the degree of "specificity" in relation to a particular string?

Python

nirvimel, 2016-10-03 02:01:17

I have many filters, each defined by a regular expression, and a string that can match the regular expressions of some subset of those filters. Every matching filter must be applied to the object that the string represents. The order of application matters, because later filters can overwrite the results of earlier ones. The order has to be built so that more general filters come before more specific ones (the specific ones should be able to override the results of the general ones). To do this, the regular expressions of the filters have to be sorted by their "specificity" with respect to a particular string. For example, the string "i want to leave" matches the filters 1) .*; 2) [\w\s]*; 3) (leave|want|i|to|\s)* and they should be applied in exactly that order.
This could be solved by knowing which pattern token each character of the string was matched by and which class that token belongs to: 1) "any" character (.); 2) a character from a set in square brackets; 3) a specific literal character. Then it would be possible to sort by several keys: 1) the number of exactly matched characters; 2) the number of characters matched by sets. Alternatively, each match type could be assigned a "weight" and the filters sorted by the sum of the weights.
The problem is that I do not know of a regexp library that, in addition to a plain yes/no match check, also reports which tokens matched which characters and what types those tokens are.
I would be grateful for any thoughts on the subject. I am interested either in a concrete implementation (relevant to this problem) in the languages listed in the tags, or in the bare idea in any imperative language.

1 answer(s)

DaneSoul, 2016-10-03
@DaneSoul

How do you define the specificity of filters with respect to a single string?
For a single string, a filter either passes it or rejects it.
So the task really makes sense for sets of strings: a less specific filter passes more strings and a more specific one passes fewer. Based on that, you can assign the filters relative weights and rank them.
If the task is nevertheless exactly as stated, and specificity has to be evaluated for one particular string, you can try mutating the string in different ways to produce a set of strings, run that set through the filters, and so reduce the ranking problem to the one described above.
You can mutate a string in different ways:
- replace random letters with others
- swap letters
- remove or add letters or words
- swap words
In this way, you can build a whole test set from one string.
For example, we make 20 test strings from the original one, run them through the filters, and count how many of the new strings each filter accepts. That count gives the specificity of each filter: the fewer mutants a filter accepts, the more specific it is.
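
The mutation idea above could be sketched in Python roughly as follows. This is only an illustration, not a tested solution: the names mutate and rank_filters, the replacement alphabet, the fixed seed, and the default of 200 mutants (instead of the 20 mentioned above, to reduce ties) are all assumptions made for the example.

```python
import random
import re
import string

# Assumed replacement alphabet: letters, digits, punctuation and a space.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "


def mutate(text, rng):
    """Return one randomly mutated copy of text (hypothetical helper)."""
    op = rng.choice(["replace", "swap", "delete", "insert", "swap_words"])
    if op == "swap_words":
        words = text.split(" ")
        if len(words) >= 2:
            i, j = rng.sample(range(len(words)), 2)
            words[i], words[j] = words[j], words[i]
        return " ".join(words)
    chars = list(text)
    if not chars:
        return rng.choice(ALPHABET)
    i = rng.randrange(len(chars))
    if op == "replace":
        chars[i] = rng.choice(ALPHABET)
    elif op == "swap" and len(chars) >= 2:
        j = min(i, len(chars) - 2)  # swap the pair starting at j
        chars[j], chars[j + 1] = chars[j + 1], chars[j]
    elif op == "delete":
        del chars[i]
    elif op == "insert":
        chars.insert(i, rng.choice(ALPHABET))
    return "".join(chars)


def rank_filters(patterns, text, n_mutants=200, seed=0):
    """Order patterns from most general to most specific for text.

    A pattern that accepts more mutants of text is treated as more general.
    Returns the ordered list and the per-pattern acceptance counts.
    """
    rng = random.Random(seed)  # fixed seed so the ranking is repeatable
    samples = [mutate(text, rng) for _ in range(n_mutants)]
    counts = {
        p: sum(1 for s in samples if re.fullmatch(p, s)) for p in patterns
    }
    ordered = sorted(patterns, key=lambda p: counts[p], reverse=True)
    return ordered, counts


if __name__ == "__main__":
    filters = [r"(leave|want|i|to|\s)*", r"[\w\s]*", r".*"]
    order, hits = rank_filters(filters, "i want to leave")
    for pattern in order:
        print(hits[pattern], pattern)
```

The result is statistical: the exact counts depend on the seed and on the mutation operators chosen, and filters that accept the same number of mutants keep their input order. With enough mutants, the three filters from the question should come out in the order .*, [\w\s]*, (leave|want|i|to|\s)*.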