B
B
BonBon Slick2021-08-22 22:31:39
PHP
BonBon Slick, 2021-08-22 22:31:39

Collapsing identical consecutive characters of a string into one like spaces in HTML?

An example of a database entry
User -> name -> vasyyyaaaaaa
Example of a search query by name
Find a user with the name : vasya

When searching for a fuzzy match and using pg_tgrm , vasya's query has a high % string difference. But if you collapse the username to an acceptable
vasyyaa
and compare it with
vasya , the
similarity will be many times higher.

So here's how to turn the collapse in a SQL query and programmatically up to a certain number of characters, transform the conditions with vasyyyaaaaaa into vasya because the number of identical consecutive characters is more than 3+. 2 identical in a row is allowed to have.

Answer the question

In order to leave comments, you need to log in

1 answer(s)
A
alexalexes, 2021-08-22
@alexalexes

select string_agg(c.letter, '') as result
from (select b.letter, lag(b.letter) over (order by rownum) letter_before
        from (select row_number() over () as rownum, letter
                from (select unnest(string_to_array('vvvvvaassssyyyaaaaaaa', null)) as letter) as a
              ) as b
      ) as c
where c.letter_before is null or c.letter <> c.letter_before

It is possible to make a function to get a unique sequence of characters without repetition in SQL.
The query has the following steps:
a) Getting a selection of strings by character-by-character splitting of the word.
b) Sample b contains the numbering of characters rownum.
c) Using the lag function and the rownum column (to select the sort direction), we get the previous character of the word in the current line.
d) Select only those lines that contain different previous characters and collect the characters back into the string result.
Improved version:
select d.user_id, string_agg(d.letter, '') as result
from (select c.*, case when c.letter_before is null or c.letter_after is null or c.letter <> c.letter_before or c.letter <> c.letter_after then 1 else 0 end marker 
from
(select b.user_id, b.rownum, b.letter, lag(b.letter) over (partition by b.user_id order by b.rownum) letter_before, lead(b.letter) over (partition by b.user_id order by b.rownum) letter_after
        from (select a.user_id, row_number() over (partition by a.user_id) as rownum, a.letter
                from (select 1 user_id, unnest(string_to_array('vasyyyaaaaaa', null)) as letter
                      union all
                      select 2 user_id, unnest(string_to_array('pettyaaaa', null)) as letter
                     ) as a
              ) as b
 ) c
order by c.user_id, c.rownum) d
where d.marker = 1
group by d.user_id

a) Contains 2 users.
b) Added tracking of the leading character lead.
c) marker gives 1 in case of "head" or "tail" of the same sequence of characters.
d) only the head or tail of the sequence (if any) is taken from the character sequence, which gives a limit of <3 repetitions.
At all levels, the use of window functions allows you to get the values ​​of functions within the user_id.
PS: This solution demonstrates that it is possible to work within the standard SQL syntax without using stored procedures (with tinkering with ears in the form of window functions), but this does not mean that the solution is optimal, it may be worth writing a store using classic loops with cursors on splitting the username into characters.

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question