Neural networks
dreammaker, 2012-11-16 12:16:59

How to normalize diverse data for input to a neural network?

We want to build something like a spam and problematic-content filter for a project. If it were just a matter of keeping or deleting the text, there would be no difficulty. But in practice the input has to include not only the material's title and body text, but also the section number, data on the region it came from, and a number of other, often numeric, parameters.

If we allow texts up to 1000 characters, a region number like 1002 or a section number like 23 will simply get lost against the sheer volume of text. At the same time, we need a normalization "formula" into which new parameters can be substituted as needed, without having to change the network itself very much.

For continuous learning we are looking at self-organizing Kohonen maps. The question remains: how do we normalize such diverse data?

Thanks in advance!


2 answers
dtestyk, 2012-11-17
@dreammaker

You can try taking the empirical cumulative distribution function of each parameter, or of a small group of parameters. The function's output is a value between zero and one, and each output value occurs approximately equally often.
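The idea above can be sketched as a rank-based (empirical CDF) normalizer. This is a minimal illustration, not from the original answer; the function name and sample values are made up:

```python
from bisect import bisect_right

def ecdf_normalize(values):
    """Map each value to the fraction of samples <= it (its empirical
    CDF). The output lies in (0, 1] regardless of the raw scale, so a
    region number and a text length become directly comparable, and
    the outputs are spread roughly uniformly over the interval."""
    ordered = sorted(values)
    n = len(values)
    return [bisect_right(ordered, v) / n for v in values]

# Example: region numbers on an arbitrary scale become values in (0, 1].
print(ecdf_normalize([1002, 3, 517, 1002, 88]))
```

Note that the mapping is defined by the training sample: new inputs should be normalized against the same sorted reference list, and ties map to the same output value.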

mithraen, 2012-11-16
@mithraen

Have a look at SpamAssassin: it runs a large set of tests and combines their weighted results.
Text analysis is just one of them.
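The weighted-tests approach can be sketched roughly as follows. The test names, weights, and threshold here are hypothetical placeholders, not SpamAssassin's actual rules or scores:

```python
# Hypothetical tests and weights; real SpamAssassin rules and scores differ.
# Each test is a predicate on the message plus a weight added when it fires.
TESTS = {
    "contains_spam_words": (lambda msg: "buy now" in msg.lower(), 2.5),
    "too_many_links":      (lambda msg: msg.count("http://") > 3, 1.5),
    "suspicious_region":   (lambda msg: False, 1.0),  # plug in your own check
}

def spam_score(msg, threshold=3.0):
    """Sum the weights of all tests that fire; flag as spam past a threshold."""
    score = sum(weight for test, weight in TESTS.values() if test(msg))
    return score, score >= threshold
```

The advantage for the asker's problem is that each heterogeneous input (text, region, section) gets its own test with its own weight, and new parameters can be added as new tests without restructuring the rest.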
