Finding duplicate pieces of text?

Python

Igor Statkevich, 2020-01-28 14:26:19

Good afternoon. Maybe someone has experience with this: I need an algorithm, or at least a hint about which direction to dig in.
I have large email threads that need to be anonymized, which means removing signatures such as the following from the message text:

Vasya Pupkin
Manager
phone 674847585748
address 523645653

One option would be to collect everything into a single text and isolate the repeating pieces, which would roughly correspond to the signatures. But how do I find exactly those repeating multi-line pieces of text?
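
The idea described above (pool the messages and look for multi-line pieces that repeat) can be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested solution: the function name `repeated_line_blocks` and its parameters are hypothetical, messages are assumed to be plain strings, and only exact repeats of consecutive non-empty lines are detected.

```python
from collections import Counter


def repeated_line_blocks(messages, min_lines=2, max_lines=6, min_messages=2):
    """Return blocks of consecutive lines that occur in several messages.

    Hypothetical helper: each block is a tuple of stripped, non-empty lines.
    """
    counts = Counter()
    for text in messages:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        seen = set()  # count each block at most once per message
        for n in range(min_lines, max_lines + 1):
            for i in range(len(lines) - n + 1):
                seen.add(tuple(lines[i:i + n]))
        counts.update(seen)

    frequent = {block for block, c in counts.items() if c >= min_messages}

    def contained(small, big):
        """True if `small` appears as a contiguous run inside `big`."""
        k = len(small)
        return any(big[i:i + k] == small for i in range(len(big) - k + 1))

    # Keep only maximal blocks: drop any block that sits inside a longer one.
    maximal = [
        b for b in frequent
        if not any(len(o) > len(b) and contained(b, o) for o in frequent)
    ]
    return sorted(maximal, key=len, reverse=True)


# Made-up sample messages, modelled on the signature format in the question.
messages = [
    "Hi, the report is attached.\n\nVasya Pupkin\nManager\n"
    "phone 674847585748\naddress 523645653",
    "Can we meet tomorrow?\n\nVasya Pupkin\nManager\n"
    "phone 674847585748\naddress 523645653",
    "Thanks, got it.\n\nIvan Ivanov\nManager\n"
    "phone 673844589748\naddress 513665053",
]
blocks = repeated_line_blocks(messages)
for block in blocks:
    print("\n".join(block))
    print("---")
```

Because the comparison is exact, a signature only shows up if the same person wrote at least `min_messages` of the letters; signatures that share a layout but differ in name and numbers would need fuzzier matching.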

2 answers

Denis, 2020-01-28
@asteroid_den

Hello. I'm not entirely sure, but regular expressions might help here. Try working with them; it might do the job.
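
A minimal sketch of what the regex route might look like, assuming signatures sit at the end of a message and contain lines shaped like the "phone <digits>" / "address <digits>" lines in the question. The pattern, the `strip_signature` name and the `lines_above` parameter are all assumptions made for illustration; real signatures would likely need broader patterns.

```python
import re

# Assumed pattern, modelled on the sample signature in the question:
# a keyword (Russian or English) followed by a run of digits.
CONTACT_LINE = re.compile(
    r"^\s*(телефон|адрес|phone|address)\s+\d+\s*$", re.IGNORECASE
)


def strip_signature(text, lines_above=2):
    """Cut the message at the start of a trailing signature, if one is found.

    Hypothetical helper: the signature is taken to start up to `lines_above`
    non-empty lines (name, job title) before the first contact line.
    """
    lines = text.rstrip().splitlines()
    first = next(
        (i for i, ln in enumerate(lines) if CONTACT_LINE.match(ln)), None
    )
    if first is None:
        return text  # nothing that looks like a signature

    start = first
    # Walk upwards over the name / title lines directly above the contacts.
    while start > 0 and first - start < lines_above and lines[start - 1].strip():
        start -= 1
    return "\n".join(lines[:start]).rstrip()


sample = "Привет!\n\nАлиса Аксенова\nМенеджер\nтелефон 678942581948\nадрес 520671655\n"
print(strip_signature(sample))
```

Everything from the detected start of the signature to the end of the message is dropped, so this only suits letters where nothing meaningful follows the signature, and a message body that runs straight into the contact lines without a name and title above them would lose its last lines.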

Umpiro, 2020-01-28
@Umpiro

Maybe the textblob library will help. Specifically, see this part of its documentation: "Tutorial: Building a Text Classification System". As an example:

#!/usr/bin/python
from textblob import TextBlob
from textblob.classifiers import NaiveBayesClassifier

train1 = '''
Вася Пупкин
Менеджер
телефон 674847585748
адрес 523645653
'''
train2 = '''
Иван Иванов
Менеджер
телефон 673844589748
адрес 513665053
'''
train3 = '''
Николас Медведев
Менеджер
телефон 674947581748
адрес 526641655
'''
# Labelled training data: 'pos' marks a signature block, 'neg' marks ordinary text.
train = [
    (train1, 'pos'),
    (train2, 'pos'),
    (train3, 'pos'),
    ('С уважением, от команды Хабра!', 'neg'),
    ('Купите наших котиков?', 'neg'),
    ('Скидки 120% но Aliexpress.', 'neg'),
]
test = '''
Привет!
Алиса Аксенова
Менеджер
телефон 678942581948
адрес 520671655
'''
# Train the classifier, then classify each sentence of the test message.
cl = NaiveBayesClassifier(train)
blob = TextBlob(test, classifier=cl)
for s in blob.sentences:
    print("'{}' - {}".format(s, s.classify()))

This will output:

'Привет!' - neg
'Алиса Аксенова
Менеджер
телефон 678942581948
адрес 520671655' - pos