How can I remove semantically irrelevant noise from text, or split text into semantic blocks?

Meshutko, 2018-11-06 12:36:18

Python

Good afternoon!
Has anyone solved a similar problem (in Python or otherwise) and could share their experience?
I have a collection of transcripts of phone conversations with clients. (The audio was converted to text; only the text is available to work with right now.)
The transcripts turned out to be noisy and contain uninformative parts.
What is marked as the client's speech actually belongs to one of 3 categories:
1) The client's actual speech
2) A song from the ringback tone played while waiting for the subscriber to answer (Dima Bilan, Grigory Leps and other performers)
3) Answering-machine messages ("the subscriber is unavailable", "out of network coverage", etc.)
I need to extract the client's actual speech.
These blocks can appear in the text in any order. I tried compiling keyword dictionaries characteristic of each type of text and counting keyword frequencies, but this turned out to be time-consuming and not very accurate. For answering machines there is at least a limited set of keywords, but it is unclear what to do with songs.
Please advise: what methods or algorithms exist for cleaning text of noise, or for separating out parts of a text that differ in meaning, given that the characteristics of the noise are not fully formalized either?
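
For reference, the keyword-dictionary baseline described above might look roughly like the sketch below. The category names, the keyword sets, and the `classify_segment` helper are illustrative assumptions, not the actual dictionaries; it also assumes the transcript has already been cut into segments.

```python
import re
from collections import Counter

# Hypothetical keyword dictionaries; real ones would be compiled from the data.
KEYWORDS = {
    "answering_machine": {"subscriber", "unavailable", "network", "coverage"},
    "song": {"love", "heart", "night", "baby"},
}


def classify_segment(segment, keywords=KEYWORDS, default="client"):
    """Label a segment by the category whose keywords occur most often in it.

    Segments with no keyword hits at all fall back to `default`.
    """
    tokens = re.findall(r"\w+", segment.lower())
    counts = Counter()
    for label, vocab in keywords.items():
        counts[label] = sum(1 for tok in tokens if tok in vocab)
    # Pick the category with the most hits (ties go to the first category).
    label, hits = max(counts.items(), key=lambda kv: kv[1])
    return label if hits > 0 else default


print(classify_segment("The subscriber is unavailable or out of network coverage"))
print(classify_segment("Hello, I wanted to ask about my order"))
```

The weak point is visible right away: song lyrics have no closed vocabulary, so the "song" set can never be complete.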

2 answers

Danil, 2018-11-06
@Meshutko

Offhand:
A great example of what this should look like. Just try running the code from the example on your own data.

xmoonlight, 2020-01-21
@xmoonlight

"The actual speech of the client":
1. (+1) Mark all phrases with which the client begins the conversation.
2. (-1) Mark all opening phrases of answering-machine messages with the opposite sign.
3. (ALL - ([1]+[2])) Whatever remains will be music and songs.
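
The three steps above could be sketched roughly as follows. This is one literal reading of the scheme, not a tested solution: the opener lists and the `mark_phrases` / `client_speech` helpers are hypothetical, and the code assumes the transcript can be split into phrases on punctuation or line breaks, which raw speech-to-text output may not allow.

```python
import re

# Hypothetical opening phrases; real lists would be collected from the transcripts.
CLIENT_OPENERS = ("hello", "good afternoon")
AUTORESPONDER_OPENERS = ("the subscriber", "out of network coverage")


def mark_phrases(text):
    """Return (mark, phrase) pairs: +1 client, -1 autoresponder, 0 the rest."""
    phrases = [p.strip() for p in re.split(r"[.!?\n]+", text) if p.strip()]
    marked = []
    for phrase in phrases:
        lowered = phrase.lower()
        if lowered.startswith(AUTORESPONDER_OPENERS):
            mark = -1  # step 2: answering-machine phrase
        elif lowered.startswith(CLIENT_OPENERS):
            mark = 1   # step 1: client starts talking
        else:
            mark = 0   # step 3: the remainder is treated as music and songs
        marked.append((mark, phrase))
    return marked


def client_speech(text):
    """Keep only the phrases marked +1."""
    return [phrase for mark, phrase in mark_phrases(text) if mark == 1]


sample = (
    "The subscriber is unavailable. Never let me go tonight. "
    "Hello, I am calling about my order."
)
print(mark_phrases(sample))
print(client_speech(sample))
```

Note that in this form only phrases that start with a known opener get marked; client phrases later in the conversation would land in the remainder, so a real implementation would need to carry the +1 mark forward until the next -1 or song block.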