big data
MIsternik, 2015-08-08 11:07:44

How can I filter out documents the classifier was not trained on before classifying the text?

The classifier is trained to distinguish a fixed set of topics and knows nothing about any others. Given a document on an unfamiliar topic, it will still assign one of the known labels, and that label will be wrong.
I suppose I could build a common vector for each known topic and, before classifying, compare the document vector with each topic vector; if the deviation exceeds some threshold, the document is treated as off-topic.
But the vocabulary is large, and long texts can contain many different words, so I doubt this approach will work well.
Are there any better suggestions?
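
For illustration, the vector-and-threshold idea from the question might look roughly like the sketch below. It is only a sketch under assumptions: plain word counts as the document vector, cosine similarity as the comparison, and an arbitrary threshold of 0.2 that would need tuning on real data. The names `vectorize`, `build_centroids` and `is_known` are hypothetical, not from any library.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_centroids(docs_by_topic):
    """One summed count vector per topic.

    Cosine similarity ignores vector length, so the sum points in the
    same direction as the mean and no division is needed.
    """
    centroids = {}
    for topic, docs in docs_by_topic.items():
        total = Counter()
        for doc in docs:
            total.update(vectorize(doc))
        centroids[topic] = total
    return centroids


def is_known(text, centroids, threshold=0.2):
    """True if the text is close enough to at least one topic vector.

    The default threshold is an assumed placeholder, not a recommended value.
    """
    vec = vectorize(text)
    best = max((cosine(vec, c) for c in centroids.values()), default=0.0)
    return best >= threshold


# Toy training data, invented purely for this example.
training = {
    "sports": [
        "the team won the football match",
        "the player scored a goal in the match",
    ],
    "finance": [
        "the bank raised interest rates",
        "stock market prices fell on the bank news",
    ],
}
centroids = build_centroids(training)
```

With this toy data, `is_known("the football team scored a goal", centroids)` passes the filter, while `is_known("recipe for chocolate cake with butter", centroids)` is rejected because it shares no words with either topic. Frequent words such as "the" inflate the similarity, which is exactly the weakness the question worries about.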


1 answer(s)
⚡ Kotobotov ⚡, 2015-08-08
@angrySCV

Well, if you want to remove non-essential text: collect the words that are used most often across all documents and remove them from every text first, so that only the most topic-specific words remain.
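
One possible reading of this suggestion, as a rough sketch: treat a word as "common" if it appears in a large share of the documents, then drop those words from each text. The 80% cutoff and the function names `common_words` and `strip_common` are assumptions made for the example, not part of the answer.

```python
from collections import Counter


def common_words(docs, min_doc_ratio=0.8):
    """Words that occur in at least min_doc_ratio of the documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    cutoff = min_doc_ratio * len(docs)
    return {word for word, n in doc_freq.items() if n >= cutoff}


def strip_common(text, common):
    """Return the lowercased text with every common word removed."""
    return " ".join(w for w in text.lower().split() if w not in common)


# Toy corpus, invented purely for this example.
docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "the bird sang in the tree",
]
common = common_words(docs)
cleaned = [strip_common(doc, common) for doc in docs]
```

Here only "the" appears in enough documents to count as common, so `cleaned[0]` becomes "cat sat on mat". On a real corpus the cutoff would need tuning, and the cleaned texts could then be vectorized and compared against the known topics.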
