B
B
BadThings2020-03-27 07:51:42
Algorithms
BadThings, 2020-03-27 07:51:42

What algorithm to use to find one of the 200k+ substrings in a string?

The task is to find text IDs (part numbers) of goods in the line with the description of the goods. For example from the line

Накопитель SSD Samsung SATA III 500Gb MZ-76E500BW 860 EVO 2.5"

pull out MZ-76E500BW

Those IDs that can be found are in the SQLite database. others do not need

to be recognized the string is very short and it doesn't make sense to optimize for <256 characters.

Or am I wrong about something? What is the best way to do it?

Answer the question

In order to leave comments, you need to log in

3 answer(s)
S
Sergey Pankov, 2020-03-27
@BadThings

BadThings , here's what I can offer you:

  • Break your strings into individual "word" candidates for search. Look for words separately in the table.
  • "Normalize" the numbers in the table (not in the relational sense, but in the uppercase sense, removing ambiguous delimiters).
  • Index the table.
  • Formulate stop criteria for words, for example, by length, the presence of some symbols that are not typical for the number. To do this, you can calculate the database statistics (min, max, set of char, etc.).
  • Morphize the words you are looking for, for example, in the word "123-0X" the number "zero" or the letter "O" of some alphabet, "X" or the Cyrillic letter "Kher" is not clear . We'll have to build combinations of ambiguities and look for them all. But it's not a problem.
  • Have an in-memory cache limited by size. In the cache, you need to keep only words with the maximum search frequencies in the database. This cache can be made persistent and loaded into memory before searching. The main calculation is that frequently occurring words that are not in the database will be cached.

So from the line
Накопитель SSD Samsung SATA III 500Gb MZ-76E500BW 860 EVO 2.5"

"SSD", "SATA", "III", "500GB", "860", "EVO", "2.5" - will not be included in the search due to the minimum length restriction;
"Drive", "Samsung" - will get into the cache with information that they are not in the database.
The remaining words, which will not be so many, will be morphed and searched in the database with logarithmic complexity.
I think everything will work just instantly. In any case, a local persistent cache of words that do not exist in the database can throw any brakes in the context of your task.

L
longclaps, 2020-03-27
@longclaps

Or

X
xmoonlight, 2020-03-28
@xmoonlight

In the database, create a column next to it and fill it in advance by pulling the part number from the row. Then, just search on that column. You can create a trigger that will populate this column automatically when new products are added.
/(?:[0-9A-Z-]+[-]{0,1}){8}/u

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question