Bayesian classifier, category selection problem

data mining

dtm, 2013-03-09 18:39:18

The situation is as follows:
There is a classifier that sorts documents into several categories (say 'good', 'bad' and 'unknown').
Everything is calculated according to the formula (up to the normalizing factor Pr(Document), which is the same for every category)

Pr(Category | Document) ∝ Pr(Document | Category) × Pr(Category)

Pr(Category) is the probability that a random document falls into this category, calculated by the formula

number of documents in this category / total number of documents

The problem is that after training, one of the categories ended up with 4 times more documents than the others, so any document being classified falls into that category. If the number of samples in the categories is approximately the same, everything works as it should (which, in principle, is not surprising).
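To make the effect concrete, here is a minimal sketch of how the prior enters the decision. The counts and likelihood values are hypothetical numbers chosen for illustration, not taken from the actual classifier, and the function names are made up for this example.

```python
import math

def class_priors(doc_counts):
    """Pr(Category) = documents in this category / total number of documents."""
    total = sum(doc_counts.values())
    return {cat: n / total for cat, n in doc_counts.items()}

def classify(log_likelihoods, priors):
    """Pick the category with the highest log Pr(Document | Category) + log Pr(Category)."""
    scores = {cat: log_likelihoods[cat] + math.log(priors[cat]) for cat in priors}
    return max(scores, key=scores.get)

# Hypothetical training counts: 'good' has 4 times more documents than the others.
counts = {"good": 400, "bad": 100, "unknown": 100}
priors = class_priors(counts)

# Hypothetical likelihoods for a document whose words slightly favor 'bad'.
log_likelihoods = {
    "good": math.log(0.010),
    "bad": math.log(0.020),
    "unknown": math.log(0.005),
}

# With priors taken from the skewed counts, the large category wins.
predicted = classify(log_likelihoods, priors)

# With equal priors, the likelihood decides.
uniform = {cat: 1 / len(counts) for cat in counts}
predicted_uniform = classify(log_likelihoods, uniform)

print(predicted, predicted_uniform)
```

In this toy case the document is twice as likely under 'bad', but the prior for 'good' is four times larger, so the product still favors 'good'.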

Question: how do I deal with this?

2 answer(s)

dtm, 2013-03-09
@dtm

I assume the most correct option is to equalize the number of samples across the categories by carefully sampling the training data, but maybe there is another way?
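Equalizing the categories could look roughly like this. It is a sketch of plain random undersampling, assuming the training data is a list of (document, category) pairs; the `undersample` helper and the sample data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def undersample(samples, seed=0):
    """Randomly trim every category down to the size of the smallest one.

    `samples` is a list of (document, category) pairs.
    """
    by_category = defaultdict(list)
    for doc, cat in samples:
        by_category[cat].append(doc)

    smallest = min(len(docs) for docs in by_category.values())
    rng = random.Random(seed)

    balanced = []
    for cat, docs in by_category.items():
        # Draw `smallest` documents from this category without replacement.
        balanced.extend((doc, cat) for doc in rng.sample(docs, smallest))
    rng.shuffle(balanced)
    return balanced

# Hypothetical training set where 'good' is 4 times larger than the others.
training = (
    [(f"good-{i}", "good") for i in range(400)]
    + [(f"bad-{i}", "bad") for i in range(100)]
    + [(f"unknown-{i}", "unknown") for i in range(100)]
)
balanced = undersample(training)
print(Counter(cat for _, cat in balanced))
```

The price is that part of the largest category is thrown away, so this only makes sense when the smallest category still has enough documents to train on.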

Ololesha Ololoev, 2013-03-10
@alexeygrigorev

I agree with the previous advice. Try to select the training sample so that it reflects the distribution of the general population, and train the classifier on it.
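One way this could be done, assuming you have some estimate of each category's share in the general population: draw the training sample category by category in those proportions. The shares below are invented for illustration and `sample_to_distribution` is a hypothetical helper.

```python
import random
from collections import Counter

def sample_to_distribution(docs_by_category, target_share, size, seed=0):
    """Draw a training sample whose category shares follow `target_share`.

    `docs_by_category` maps category -> list of documents,
    `target_share` maps category -> expected share in the population.
    """
    rng = random.Random(seed)
    sample = []
    for cat, share in target_share.items():
        k = round(size * share)
        pool = docs_by_category[cat]
        if k > len(pool):
            raise ValueError(f"not enough documents in category {cat!r}")
        # Sample without replacement so no document is repeated.
        sample.extend((doc, cat) for doc in rng.sample(pool, k))
    rng.shuffle(sample)
    return sample

# Hypothetical labeled pool, skewed towards 'good'.
pool = {
    "good": [f"good-{i}" for i in range(400)],
    "bad": [f"bad-{i}" for i in range(100)],
    "unknown": [f"unknown-{i}" for i in range(100)],
}
# Hypothetical population shares (an assumption, estimate them from real data).
shares = {"good": 0.5, "bad": 0.3, "unknown": 0.2}

training_sample = sample_to_distribution(pool, shares, size=200)
print(Counter(cat for _, cat in training_sample))
```

Priors computed from such a sample then match the assumed population shares instead of whatever proportions the labeled pool happened to have.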