Search Engine Optimization
pqgg7nwkd4, 2019-08-02 18:04:08

How to implement an IntelliJ IDEA-like search?

In IntelliJ IDEA, you can search for identifiers (Shift, Shift) by typing the beginnings of their words:
For example, to find ArrayList you can type al, ali, arrli, li, ...
Or to find LinkedHashMap: lh, lhmap, hm, ...
Suppose I can split identifiers into words. How do I build an index, and how do I search it quickly?
UPD: Let me clarify and rephrase the question a bit:
There is a list of 100,000 sentences. Each sentence contains 1-4 words.
Let's say one of the sentences is "zeleny les" (Russian for "green forest").
The user types a query string, and the matching sentences have to be found in the whole list. A sentence matches the query if the query is a concatenation of the beginnings of the words in the sentence (words may be skipped).
For example, these queries match "zeleny les" (the matched parts are in brackets):
z ([z]eleny les)
zl ([z]eleny [l]es)
zles ([z]eleny [les])
zelle ([zel]eny [le]s)
ze ([ze]leny les)
le (zeleny [le]s)
But these do not match:
zes
el
es
The question: how do I implement such a filter?


2 answer(s)
maaGames, 2019-08-03
@pqgg7nwkd4

ArrayList is split into separate words at capital letters, underscores, and any other characters the language allows in identifiers, including digits. This gives two words: "array" and "list". Case should be dropped or ignored when searching (or taken into account, if you prefer).
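A minimal sketch of this splitting step in Python. The helper name split_identifier and the regular expression are illustrative assumptions, not part of the answer: here a run of digits is treated as a word of its own, and an all-caps run such as HTTP is kept together.

```python
import re

# Assumed word pattern: an all-caps run, a (capitalized) lowercase run,
# or a run of digits. Underscores and other separators are skipped.
_WORD = re.compile(r'[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+')


def split_identifier(name):
    """Split an identifier into lowercase words."""
    return [w.lower() for w in _WORD.findall(name)]


print(split_identifier('ArrayList'))      # ['array', 'list']
print(split_identifier('LinkedHashMap'))  # ['linked', 'hash', 'map']
```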
Then you generate the combinations for all the words, as described in the comment above, automatically rather than by hand:
a + l = al
ar + l = arl
arr + l = arrl
...
array + l = arrayl
...
If the identifier has more than two words, you do this for all of them. For each abbreviation, you record which identifier it was derived from (many different identifiers can correspond to one abbreviation).
This may look scary and memory-hungry, but a program has a finite number of identifiers, and they are quite short by machine standards.
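The approach above could be sketched in Python roughly as follows. The names abbreviations and build_index are hypothetical, the identifiers are assumed to be already split into lowercase words, and an empty prefix is allowed so that a word can be skipped (this is what lets "li" find ArrayList).

```python
from collections import defaultdict
from itertools import product


def abbreviations(words):
    """Yield each distinct non-empty concatenation of word prefixes.

    The empty prefix is included for every word, which means "skip it".
    """
    prefix_lists = [[w[:i] for i in range(len(w) + 1)] for w in words]
    seen = set()
    for combo in product(*prefix_lists):
        key = ''.join(combo)
        if key and key not in seen:
            seen.add(key)
            yield key


def build_index(identifiers):
    """Map every abbreviation to the set of identifiers it was derived from."""
    index = defaultdict(set)
    for name, words in identifiers.items():
        for key in abbreviations(words):
            index[key].add(name)
    return index


# Assumed input: identifier -> its words, already split and lowercased.
identifiers = {
    'ArrayList': ['array', 'list'],
    'LinkedHashMap': ['linked', 'hash', 'map'],
}
index = build_index(identifiers)
print(sorted(index['li']))     # ['ArrayList', 'LinkedHashMap']
print(sorted(index['lhmap']))  # ['LinkedHashMap']
```

A lookup is then a single dictionary access on the lowercased query. The price is memory: the number of keys per identifier is at most the product of (word length + 1) over its words, so this suits short identifiers better than long sentences.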

longclaps, 2019-08-02
@longclaps

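import re

names = ['ArrayList', 'LinkedHashMap']

for s in 'al', 'ali', 'arrli', 'li', 'lh', 'lhmap', 'hm':
    # Put \w* between the query letters: 'al' becomes the pattern a\w*l.
    # Note: this is a case-insensitive subsequence match, so letters are not
    # required to start a word (e.g. 'rr' would also match ArrayList).
    f = re.compile(r'\w*'.join(s), flags=re.I).search
    print(f'{s:5}:', list(filter(f, names)))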
Output:
al   : ['ArrayList']
ali  : ['ArrayList']
arrli: ['ArrayList']
li   : ['ArrayList', 'LinkedHashMap']
lh   : ['LinkedHashMap']
lhmap: ['LinkedHashMap']
hm   : ['LinkedHashMap']
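If the query has to consist strictly of word beginnings, as the question describes, a small backtracking matcher is one possible alternative to the regular expression. This is only a sketch: the function name matches is made up, the words are assumed to be already split and lowercased, and the word beginnings are assumed to appear in the same order as the words.

```python
def matches(query, words):
    """True if query is a concatenation of prefixes of words, taken in order.

    Any word may be skipped (it contributes an empty prefix).
    """
    def go(q, i):
        if not q:
            return True           # the whole query has been consumed
        if i == len(words):
            return False          # query left over, but no words remain
        word = words[i]
        n = 0
        # Try every non-empty prefix of the current word that q starts with.
        while n < len(word) and n < len(q) and q[n] == word[n]:
            n += 1
            if go(q[n:], i + 1):
                return True
        # Otherwise skip this word entirely.
        return go(q, i + 1)

    return go(query.lower(), 0)


print(matches('arrli', ['array', 'list']))  # True
print(matches('rr', ['array', 'list']))     # False
```

Unlike the pattern above, this rejects "rr" for ArrayList, because "rr" is not the beginning of any word. With 1-4 short words per entry, scanning a list of 100,000 entries this way is a linear pass over the list.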
