Answer the question
In order to leave comments, you need to log in
Algorithm to quickly check if a string matches patterns?
In a simplified form, the task is as follows: there are a large number of patterns of zeros and ones. The symbol * at the end of the pattern is possible. ( For example: 1, 10, 1011* )
And there are a large number of lines. For each string, you need to check if the string is "covered" by any of the patterns.
Question: how to do it most efficiently? Surely there is some known algorithm or data structure for this task.
------------------------------
UPD: I tried to implement a prefix tree. It seems to work correctly, but, unfortunately, too slowly. I would be grateful if someone could take a look at the code -- maybe I wrote something incorrectly/inefficiently: pastebin.com/DKZzhvXA
Answer the question
In order to leave comments, you need to log in
A simple solution can be implemented through a prefix tree. We stuff all the templates, and then we compare them by line from the set.
If you want to do it one-time and on a large amount of data, I would do this. Let's define the usual lexicographic order on the strings (if there is a string and its prefix, then the prefix is less). For each pattern with an asterisk {s}*, match two strings {s} and {s}$, where $ is greater than 0 and 1. For a pattern without a star, just put {s} and {s}&, where & < 0 and 1. And now, starting from the first character, we start numerical sorting on the lines obtained from the template + the lines being checked.
In the resulting sequence, the patterns form a bracket sequence. Anything inside the parentheses matches the corresponding pattern.
On the positive side, this is easier to do when strings don't fit into memory. The number of bits per comparison can be limited by the length of the longest pattern.
It is possible to build a tree or a state machine from templates, it will be enough to run each of the lines once. The state machine is a little more complicated, but it can be reduced to a minimal form, which will speed up the comparison.
You need to build a finite state machine, which will receive a stream of characters as input, and the output of which will be signs of a template.
In short, you need to build a compiler to compile templates into a state machine.
Didn't find what you were looking for?
Ask your questionAsk a Question
731 491 924 answers to any question