Finding duplicate pieces of text?
Good afternoon. Perhaps someone has experience with this: I need an algorithm, or at least a hint about which direction to dig in.
I have large email threads that need to be anonymized, which means removing signatures such as this one (name, job title, phone, address) from the message text:
Вася Пупкин
Менеджер
телефон 674847585748
адрес 523645653
Hello. I'm not entirely sure, but regular expressions may help here. Try them; they might be enough if the signatures follow a predictable layout.
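To make the regular-expression idea concrete, here is a minimal Python sketch. It assumes every signature has exactly the layout of the sample in the question: a name line, a job-title line, a "телефон" (phone) line with digits and an "адрес" (address) line with digits. The names SIGNATURE_RE and strip_signature are made up for this illustration, and real mail will most likely need a looser pattern.

```python
import re

# Assumed layout (taken from the sample in the question):
#   <name>
#   <job title>
#   телефон <digits>
#   адрес <digits>
SIGNATURE_RE = re.compile(
    r"^[^\n]+\n"                   # name line
    r"[^\n]+\n"                    # job-title line
    r"телефон[ \t]+\d+[ \t]*\n"    # phone line
    r"адрес[ \t]+\d+[ \t]*$\n?",   # address line, plus its newline if present
    re.MULTILINE,
)


def strip_signature(text: str) -> str:
    """Remove every block that matches the assumed signature layout."""
    return SIGNATURE_RE.sub("", text)


if __name__ == "__main__":
    message = (
        "Привет!\n"
        "Алиса Аксенова\n"
        "Менеджер\n"
        "телефон 678942581948\n"
        "адрес 520671655\n"
    )
    print(strip_signature(message))
```

The pattern is anchored on the phone and address lines because they are the most regular part of the sample; the two lines before them are removed on the assumption that they are always the name and the job title.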
The textblob library may help. In particular, see this part of its documentation: "Tutorial: Building a Text Classification System". As an example:
#!/usr/bin/env python3
from textblob import TextBlob
from textblob.classifiers import NaiveBayesClassifier

# Three sample signatures (name, job title, phone, address) used as positive examples.
train1 = '''
Вася Пупкин
Менеджер
телефон 674847585748
адрес 523645653
'''
train2 = '''
Иван Иванов
Менеджер
телефон 673844589748
адрес 513665053
'''
train3 = '''
Николас Медведев
Менеджер
телефон 674947581748
адрес 526641655
'''
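# 'pos' marks a signature block, 'neg' marks ordinary message text.
train = [
    (train1, 'pos'),
    (train2, 'pos'),
    (train3, 'pos'),
    ('С уважением, от команды Хабра!', 'neg'),
    ('Купите наших котиков?', 'neg'),
    ('Скидки 120% но Aliexpress.', 'neg'),
]

# Test message: a greeting followed by a signature that is not in the training set.
test = '''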
Привет!
Алиса Аксенова
Менеджер
телефон 678942581948
адрес 520671655
'''
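
# Train the classifier, then label each sentence of the test message.
# blob.sentences needs the NLTK tokenizer data installed by
# `python -m textblob.download_corpora`.
cl = NaiveBayesClassifier(train)
blob = TextBlob(test, classifier=cl)
for s in blob.sentences:
    print("'{}' - {}".format(s, s.classify()))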
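Output:
'Привет!' - neg
'Алиса Аксенова
Менеджер
телефон 678942581948
адрес 520671655' - pos
The greeting is labelled neg (ordinary text) and the signature block is labelled pos, so the pos sentences are the ones to remove.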
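Since the question title asks about duplicate pieces of text, another option that needs no extra library is to look for lines that repeat at the end of several messages. The sketch below is only an illustration under one strong assumption: the messages passed in all come from the same sender, so the signature is the longest run of trailing lines they share. The function names are invented for this example.

```python
from typing import List


def _clean_lines(text: str) -> List[str]:
    """Split a message into lines, ignoring surrounding blank lines and spaces."""
    return [line.strip() for line in text.strip().splitlines()]


def common_trailing_lines(messages: List[str]) -> List[str]:
    """Return the longest run of trailing lines shared by every message."""
    if len(messages) < 2:
        return []  # nothing to compare against
    split = [_clean_lines(m) for m in messages]
    shared: List[str] = []
    # Walk backwards from the last line while all messages agree.
    for group in zip(*(reversed(lines) for lines in split)):
        if len(set(group)) != 1:
            break
        shared.append(group[0])
    shared.reverse()
    return shared


def strip_common_signature(messages: List[str]) -> List[str]:
    """Remove the shared trailing lines (the assumed signature) from each message."""
    signature = common_trailing_lines(messages)
    if not signature:
        return ["\n".join(_clean_lines(m)) for m in messages]
    n = len(signature)
    return ["\n".join(_clean_lines(m)[:-n]) for m in messages]


if __name__ == "__main__":
    signature = "Вася Пупкин\nМенеджер\nтелефон 674847585748\nадрес 523645653\n"
    mails = ["Привет!\n" + signature, "Отчёт готов.\n" + signature]
    print(common_trailing_lines(mails))
    print(strip_common_signature(mails))
```

This only catches signatures that are repeated verbatim. Messages that are identical from top to bottom would be stripped completely, so in practice it is safer to cap the number of lines treated as a signature.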