How to train a convolutional neural network for text vectorization?

A

Alf1622016-11-16 14:32:20

Python

Alf162, 2016-11-16 14:32:20

I want to use CNN to get vector representation of text. Those. first, each word is converted to a vector (for example, using word2vec). And then you need to pass such text through a convolutional network to get a vector representation of the entire text. So, it is not clear how to train such a network. If we allow for text classification or sentiment analysis, it is clear that the network gives an answer that we compare with what we have, etc., but how to train the network when there is no way out as such?

Reply

Answer the question

In order to leave comments, you need to log in

3 answer(s)

D

Dmitry Dart, 2016-11-16
@gobananas

I would go further and ask why we vectorize text. You indicated, for example, sentiment analysis. But sentiment analysis can be carried out by different methods , see sentiment analysis. And the method with vectors seems to me not the most convenient here, because the output is likely to be many separate graph-trees, i.e. forest, which means we will deal with unconnected graphs.
For this task, it is easier to use dictionaries then.
Further, why graphs, and for example not... bitram-trigram analysis? Bigrams may not give the required accuracy, but the number of trigrams, although quite large, is of course, and most importantly, significantly less than the number of possible words (more aztips.blogspot.ru/2009/04/blog-post_12.html)We made a list of trigrams in negative texts, in positive and neutral texts, you analyze and classify everything. If the accuracy does not satisfy, then apply a couple of methods.

V

Vlad_Fedorenko, 2016-11-16
@Vlad_Fedorenko

Here you need to lower the dimensionality of the input data. Why CNN? Why not PCA, t-SNE, autoencoders?
Downsizing for the sake of downsizing is a dubious undertaking, because you probably use it to prepare data for solving some problem. Here, according to the quality of the solution to this problem, select the grid parameters

I

ivodopyanov, 2016-11-19
@ivodopyanov

I myself am now interested in this problem - doc2vec algorithms. In general, it seems to me that this task should be solved by using seq2seq as the actual autoencoder - we feed the same data to the input and output, force the network to distill the phrase into a vector and back into a phrase, and we get inside some vector representation that you can work with. For example, serve in seq2seq next level.
In general, several interesting articles about Hierarchical RNN have been published over the past year. Another interesting issue is the use of char-based RNN or CNN for word embedding. In this case, the problem with OOV words should disappear, plus various spelling errors should be processed.