How can I remove semantic noise from a text, or split the text into semantic blocks?
Good afternoon!
Has anyone solved a similar problem (in Python or in general)? Could you share your experience?
I have a collection of transcripts of phone conversations with clients. (The audio was converted to text, and right now only the text is available to work with.)
The transcripts turned out to be noisy and contain uninformative parts.
What is marked as the client's speech actually falls into one of 3 categories:
1) The client's speech itself
2) Lyrics of the song played as a ringback tone while waiting for the subscriber to answer (Dima Bilan, Grigory Leps and other performers)
3) Answering machine messages ("the subscriber is unavailable", "out of network coverage", etc.)
I need to extract the client's actual speech.
These blocks can appear in the text in any order. I tried compiling keyword dictionaries characteristic of each type of text and counting keyword frequencies, but this is time-consuming and not very accurate. Answering machines at least use a limited set of keywords, but I have no idea what to do about the songs.
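For illustration, the dictionary approach can be sketched roughly like this. The keyword sets, the `classify` function and the `threshold` value are made-up placeholders, not the real dictionaries. The song category has no dictionary at all, so song lyrics simply fall through to "unknown".

```python
import re
from collections import Counter

# Hypothetical keyword dictionaries; real ones would be compiled from the transcripts.
KEYWORDS = {
    "autoresponder": {"subscriber", "unavailable", "network", "coverage", "voicemail"},
    "client": {"order", "delivery", "price", "contract", "invoice"},
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def keyword_scores(segment):
    """For each category, return the fraction of tokens found in its dictionary."""
    tokens = tokenize(segment)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty segments
    return {
        category: sum(counts[word] for word in words) / total
        for category, words in KEYWORDS.items()
    }

def classify(segment, threshold=0.2):
    """Pick the best-scoring category, or "unknown" if no score reaches the threshold."""
    scores = keyword_scores(segment)
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"
```

The weak point is visible right away: every new song or unusual wording needs another manual dictionary entry.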
Which methods or algorithms can clean noise out of a text, or separate the parts of a text that differ in meaning, given that the characteristics of the noise cannot be fully formalized either?
Offhand:
Great example of what it should look like. Just try to implement the code from the example using your data.
To extract the client's actual speech:
1. (+1) Mark every phrase with which the client starts speaking.
2. (-1) With the opposite sign, mark every phrase with which an autoresponder message starts.
3. (ALL - ([1]+[2])) Whatever remains is music and songs.
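A minimal sketch of the three steps above, assuming the transcript has sentence punctuation to split on and that a phrase can be recognized by how it starts. The opener lists and the function names (`split_phrases`, `mark_phrase`, `label_transcript`) are hypothetical and would have to be built from your own data.

```python
import re

# Hypothetical opener lists; collect the real ones from your transcripts.
CLIENT_OPENERS = ("hello", "good afternoon", "speaking")
AUTORESPONDER_OPENERS = ("the subscriber", "out of network coverage", "please leave a message")

def split_phrases(text):
    """Split a transcript into phrases on sentence-ending punctuation."""
    return [p.strip() for p in re.split(r"[.!?]+", text) if p.strip()]

def mark_phrase(phrase):
    """Return +1 for a client opener, -1 for an autoresponder opener, 0 otherwise."""
    lowered = phrase.lower()
    if lowered.startswith(AUTORESPONDER_OPENERS):  # str.startswith accepts a tuple
        return -1
    if lowered.startswith(CLIENT_OPENERS):
        return 1
    return 0  # step 3: unmarked phrases are treated as music and songs

def label_transcript(text):
    """Return a list of (mark, phrase) pairs for the whole transcript."""
    return [(mark_phrase(p), p) for p in split_phrases(text)]

sample = (
    "The subscriber is unavailable. I know you will not come back tonight. "
    "Hello, who is this? Good afternoon, I ordered the delivery."
)
labeled = label_transcript(sample)
client_speech = [phrase for mark, phrase in labeled if mark == 1]
```

This only marks phrases that themselves start with a known opener. A client phrase that follows a marked one without an opener of its own would land in the "music" bucket, so in practice the mark probably has to be carried forward until the next marker.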