How to find a partial match of strings?

Python

Sergey Sergey, 2019-01-17 23:36:42

I have a list of strings; here is an example of one:

0 .. tubes were used to burn the steel ladle. Replacement of the funnel 18m 8sl. Spilled completely.

I need to determine whether "tubes were used to burn" is contained in this line.
BUT: there may be errors or typos.
How can I do this?

2 answer(s)

Ruslan Gilfanov, 2019-01-18
@Sergey1712

Without additional libraries and using the Damerau-Levenshtein distance, you can do something like this:

import re


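def get_substrings(string):
    """Split a string into words."""
    # findall avoids the empty strings that re.split yields when the
    # string starts or ends with a non-word character.
    return re.findall(r'\w+', string)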


def get_distance(s1, s2):
    """Damerau-Levenshtein distance (restricted variant: optimal string alignment)."""
    d, len_s1, len_s2 = {}, len(s1), len(s2)
    for i in range(-1, len_s1 + 1):
        d[(i, -1)] = i + 1
    for j in range(-1, len_s2 + 1):
        d[(-1, j)] = j + 1
    for i in range(len_s1):
        for j in range(len_s2):
            if s1[i] == s2[j]:
                cost = 0
            else:
                cost = 1
            d[(i, j)] = min(
                d[(i - 1, j)] + 1,
                d[(i, j - 1)] + 1,
                d[(i - 1, j - 1)] + cost)
            if i and j and s1[i] == s2[j - 1] and s1[i - 1] == s2[j]:
                d[(i, j)] = min(d[(i, j)], d[i - 2, j - 2] + cost)
    return d[len_s1 - 1, len_s2 - 1]


def check_substring(search_request, original_text, max_distance):
    """Check whether every word of the request fuzzily occurs in the text."""
    request_words = get_substrings(search_request)
    text_words = get_substrings(original_text)

    # Each request word needs at least one text word within max_distance.
    # any() stops at the first hit, so a word that matches several text
    # words is counted only once.
    return all(
        any(get_distance(request_word, text_word) <= max_distance
            for text_word in text_words)
        for request_word in request_words)


# Sample data from the question (in English: "tubes were used to burn the
# steel ladle. Replacement of the funnel 18m 8sl. Spilled completely").
search_request = 'трубок использовали для прожигания'
original_text = 'трубок использовали для прожигания стальковша.Замена воронки 18м 8сл. Разлита полностью'

result = check_substring(search_request, original_text, max_distance=2)

print(result)  # True if found, otherwise False

You can adapt this to your needs. Keep in mind, though, that computing the Damerau-Levenshtein distance is inherently expensive (the table has one cell per pair of characters of the two strings), especially in a pure Python implementation. Searching for a phrase in several megabytes of text, for example, can take a long time.
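If you stay in pure Python, two cheap tricks may reduce the cost: skip the distance computation when the word lengths alone already differ by more than the allowed number of edits, and cache results for word pairs that recur. The sketch below is only an illustration; the names `osa_distance` and `close_enough` are made up for this example, and the distance is the restricted (optimal string alignment) variant rewritten with rolling rows instead of a dictionary.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # repeated word pairs are computed only once
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    prev2, prev = None, list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # adjacent transposition
        prev2, prev = prev, cur
    return prev[len(b)]


def close_enough(word, candidate, max_distance):
    """Return True if the two words are within max_distance edits."""
    # The distance is never smaller than the difference in lengths,
    # so this check can reject a pair without filling the table.
    if abs(len(word) - len(candidate)) > max_distance:
        return False
    return osa_distance(word, candidate) <= max_distance
```

For example, `close_enough("tubes", "tubse", 1)` is true because swapping two adjacent letters counts as a single edit, while `close_enough("for", "burning", 2)` is rejected by the length check alone.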
To speed up the distance computation, you can use a C-based implementation for Python: https://github.com/gfairchild/pyxDamerauLevenshtein
There are also less accurate but faster algorithms for comparing two strings:
https://habr.com/ru/post/114997/
PyPI and GitHub should have libraries with ready-made implementations of the most popular ones.
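One ready-made option that needs no installation is `difflib.SequenceMatcher` from the standard library. Note that this swaps in a different measure: it returns a similarity ratio between 0 and 1, not a Damerau-Levenshtein distance, so results will not match the code above exactly. The sketch below is a rough illustration; the function name `fuzzy_contains`, the threshold of 0.8 and the English sample strings are assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher


def similar(a, b, threshold):
    """Return True if the similarity ratio of a and b reaches the threshold."""
    matcher = SequenceMatcher(None, a, b)
    # real_quick_ratio() and quick_ratio() are cheap upper bounds on ratio(),
    # so most non-matching pairs are rejected before the expensive call.
    return (matcher.real_quick_ratio() >= threshold
            and matcher.quick_ratio() >= threshold
            and matcher.ratio() >= threshold)


def fuzzy_contains(query, text, threshold=0.8):
    """Return True if every word of the query has a similar word in the text."""
    query_words = re.findall(r'\w+', query.lower())
    text_words = re.findall(r'\w+', text.lower())
    return all(
        any(similar(query_word, text_word, threshold)
            for text_word in text_words)
        for query_word in query_words)


# Hypothetical sample with a typo ("burnning") in the query.
print(fuzzy_contains("tubes used for burnning",
                     "tubes were used for burning the steel ladle"))
```

The threshold needs tuning on real data: short words leave little room for typos under a ratio-based measure, so a single value may be too strict for three-letter words and too loose for long ones.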

al_gon, 2019-01-18
@al_gon

How to analyze string similarity and check for plagiarism?