Machine learning
Tuxman, 2021-10-25 23:53:01

How can one recognize, in mail or news messages, where the greeting is, where the signature is, and where the meaningful content of the message is?

There are large volumes of email conversations and newsgroup feeds that consist of chains of replies. Grouping options such as by topic or by reply tree become less obvious when the number of messages is large. I am thinking about presenting mail/news as a feed in which one could read all the replies without wading through multiple lines of greetings, signatures, etc.

The first thing that is easy to do without machine learning is to detect the quotes of the original message and the reply text around them, and to limit the corpus to just that text; the risk is missing contextually significant text.
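As a rough illustration of that non-ML first step, here is a minimal sketch that separates quoted lines from newly written ones. It assumes plain-text bodies where quotes start with ">" and the attribution line looks like "On <date>, <name> wrote:"; both patterns and the name `split_reply` are assumptions for the example, and real mail varies a lot.

```python
import re

# Assumption: quoted lines begin with one or more ">" markers.
QUOTE_LINE = re.compile(r"^\s*>")
# Assumption: the attribution line has the common English form "On ... wrote:".
ATTRIBUTION = re.compile(r"^\s*On .+ wrote:\s*$")


def split_reply(body):
    """Return (new_lines, quoted_lines) for a plain-text message body."""
    new_lines, quoted = [], []
    for line in body.splitlines():
        if QUOTE_LINE.match(line) or ATTRIBUTION.match(line):
            quoted.append(line)
        else:
            new_lines.append(line)
    return new_lines, quoted
```

Keeping the quoted lines instead of discarding them leaves room to restore context later if the new text alone turns out to be ambiguous.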

Given many example messages from each correspondent, it should be possible to build a model of that person's message template: which part is the greeting and the note that this is a reply to such-and-such a correspondent, and where the signature with its parting words begins, none of which matters much for the meaning.
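A very simple, non-ML version of such a per-correspondent template could count which lines keep recurring at the edges of one sender's messages. The sketch below is only that; the function names and the `edge` and `min_share` parameters are hypothetical choices, not taken from any library.

```python
from collections import Counter


def normalize(line):
    """Lowercase and collapse whitespace so near-identical lines compare equal."""
    return " ".join(line.lower().split())


def sender_boilerplate(bodies, edge=3, min_share=0.6):
    """Lines found in the first/last `edge` non-empty lines of at least
    `min_share` of one sender's messages (both thresholds are assumptions)."""
    counts = Counter()
    for body in bodies:
        lines = [normalize(l) for l in body.splitlines() if l.strip()]
        # A set, so each line counts at most once per message.
        counts.update(set(lines[:edge]) | set(lines[-edge:]))
    threshold = min_share * len(bodies)
    return {line for line, n in counts.items() if n >= threshold}


def strip_boilerplate(body, boilerplate, edge=3):
    """Drop boilerplate lines, but only near the start or end of the message."""
    lines = [l for l in body.splitlines() if l.strip()]
    kept = []
    for i, line in enumerate(lines):
        at_edge = i < edge or i >= len(lines) - edge
        if at_edge and normalize(line) in boilerplate:
            continue
        kept.append(line)
    return kept
```

For example, given three messages from one sender that all open with "Hello all," and close with "Best regards," and a name, those three lines end up in the boilerplate set and only the distinct middle line of each message survives.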

I would like to hear your opinion: which direction should I move in, and which libraries in Python, Golang, etc. should I look at?

3 answer(s)
rPman, 2021-10-26
@rPman

About 99% of the technical noise can be eliminated in two ways:
* Typical template text at the beginning and end of a message (greetings, "on such-and-such date you wrote", and so on). Filtering it has to be coded by hand, with filters on words and on their position in the document. The end of the message is harder, because signatures are inserted there; to identify them, tie these fragments to a specific user: whatever repeats in most of a user's messages is noise. Do not look for this noise in the middle of a message; it sits either at the beginning or at the end.
* Quotes and copies of earlier messages in replies. Find them by comparing content, not character by character but line by line, after discarding extra whitespace, quote markers, and possibly punctuation. Delete only a full quotation of a message, not a partial one, and only when it sits at the beginning or end of the message (after the template greetings have been removed). Instead of deleting it, you can leave a hyperlink in the final interface.
Ordinary mail clients simply fold such quotes (a short quote framed by the reply text should not be folded).
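The content-based quote matching could look roughly like the sketch below. It is one possible reading of the idea, not a finished implementation: the names `norm_lines` and `fold_full_quote` and the placeholder text are made up for the example, blank lines and lines consisting only of markers or punctuation are dropped, and a quote at the very start is only recognized when no attribution line precedes it.

```python
import re
import string

# Translation table that deletes ASCII punctuation.
_PUNCT = str.maketrans("", "", string.punctuation)


def norm_lines(text):
    """Normalized non-empty lines: quote markers, punctuation, case and
    extra whitespace are ignored."""
    out = []
    for line in text.splitlines():
        line = re.sub(r"^[\s>|]+", "", line)  # leading quote markers
        line = " ".join(line.translate(_PUNCT).lower().split())
        if line:
            out.append(line)
    return out


def fold_full_quote(reply, parent, placeholder="[quoted parent message]"):
    """Replace a FULL quote of `parent` at the start or end of `reply`
    with a placeholder; partial quotes are left untouched."""
    target = norm_lines(parent)
    # Pair each raw line with its normalized form (empty forms are skipped).
    pairs = [(raw, n) for raw in reply.splitlines() for n in norm_lines(raw)]
    norms = [n for _, n in pairs]
    k = len(target)
    if k and k <= len(norms):
        if norms[:k] == target:
            return [placeholder] + [raw for raw, _ in pairs[k:]]
        if norms[-k:] == target:
            return [raw for raw, _ in pairs[:-k]] + [placeholder]
    return [raw for raw, _ in pairs]
```

In a real interface the placeholder would presumably become the hyperlink or folded block mentioned above.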

Eugene Lerner, 2021-10-26
@ehevnlem

In general, this is a somewhat simplified text-understanding task. Build semantic vectors of words. Word2vec may be of interest: https://ru.wikipedia.org/wiki/Word2vec

Dimonchik, 2021-10-26
@dimonchik2013

https://pypi.org/project/readability-lxml/
but without ML the task cannot be solved elegantly