How to check if a string matches a list of patterns (LIKE)?

K

Konstantin2013-02-27 16:46:44

MySQL

Konstantin, 2013-02-27 16:46:44

Has a table with 100k+ templates (different variations of '%text%'). It is necessary to check the rows (let's say 10k) using MySQL tools for compliance with at least one of the patterns (true/false).

What algorithm would you recommend, where to dig?

Reply

Answer the question

In order to leave comments, you need to log in

3 answer(s)

H

Hint, 2013-02-27
@bergsteiger

SELECT 1 FROM table WHERE 'hello world' LIKE pattern LIMIT 1
Where is the table table with the pattern column (patterns) and the string being tested is 'hello world'.

R

rPman, 2013-02-27
@rPman

May I join the question by expanding it to:
There are a very large number of rows (in general, with binary data, of course it would be better). These lines are very similar! The size of strings varies from a few bytes to several tens of kilobytes.
the best that can be said about these lines is, roughly speaking, these are different messages according to a certain number of patterns (their number is also unknown in advance, but also large, approximately comparable to log(n)). There is no way to get these templates in advance (the data source is independent), moreover, the templates change over time, i.e. new ones appear and old ones disappear.
The task is, with some approximation (we are not talking about maximum efficiency), to detect similar messages, or in terms of the information about these lines described above, to identify a pattern for each message. The level of efficiency can be determined by the patch size of some diff algorithm (the same bzdiff).
If the problem is solved in a rush, then for each next message it is to compare with each previous one (you can take some window in time), calculate the distance (patch diff size) and take the smallest one.
Further, you can optimize based on the assumption that if message A is at a distance from B and if C is at approximately the same distance from B, then the distance between A and C will be approximately the same, which means that if you store the matrix of distances between messages (not for all but only tested ones), then the distance for untested pairs can be calculated from neighboring 'neighbors'. You can even go through the archive and find out the coefficient/error that accumulates if you use this above assumption (A->B)&(C->B) = (A->C) repeatedly for A->D, A->E on based on the same calculated B->D and B->E or even D->E ... in general, in order to get at least N*log (M) instead of the complexity of N*M where N is the number of messages, M is the window size,

A

agathis, 2013-02-28
@agathis

Here we step on slippery ground, I saw mysql once about ten years ago :)
Based on general considerations, however (head-on solution):
SELECT UNIQUE s.string FROM strings s JOIN templates t ON s.string LIKE t.template
temporary table.
You can experiment with the join condition, for example, use regexp_like (or what is it called in mysql?)