Python
drlafa, 2017-08-01 23:48:58

How to prepare text data in Keras for training an Encoder-Decoder LSTM network (sequence-to-sequence)?

Suppose the input text sequence for the neural network is X_train: ["Hi", "how", "doing", "?"]
And the answer to this sequence is Y_train: ["Everything", "excellent"]
How do I properly prepare this data — tokenize it and convert it into matrices — so that it can be fed to the neural network during training?
model.fit(X_train, Y_train)
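A minimal sketch of that preparation, in pure Python so the data flow is visible (in practice, keras.preprocessing.text.Tokenizer and keras.preprocessing.sequence.pad_sequences do the same job; the helper names below are illustrative):

```python
# Turn word sequences into padded integer matrices, the format model.fit() expects.

X_train = [["Hi", "how", "doing", "?"]]
Y_train = [["Everything", "excellent"]]

def build_vocab(sequences):
    """Map each distinct word to a positive integer; 0 is reserved for padding."""
    vocab = {}
    for seq in sequences:
        for word in seq:
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def encode(sequences, vocab, maxlen):
    """Replace words with their indices and right-pad with 0 to a fixed length."""
    rows = []
    for seq in sequences:
        ids = [vocab[w] for w in seq]
        rows.append(ids + [0] * (maxlen - len(ids)))
    return rows

vocab = build_vocab(X_train + Y_train)
maxlen = max(len(s) for s in X_train + Y_train)
X = encode(X_train, vocab, maxlen)   # [[1, 2, 3, 4]]
Y = encode(Y_train, vocab, maxlen)   # [[5, 6, 0, 0]]
```

The resulting integer matrices can be fed to an Embedding layer, or expanded into one-hot vectors for a plain LSTM input.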

Answer the question

In order to leave comments, you need to log in

2 answers
xdgadd, 2017-08-02
@drlafa

Google NLP, text preprocessing, and word embeddings. There are many approaches, and the right one depends on the model architecture, your task, and the type and quality of the text. Keras also has embedding layers, through which you can pass ready-made vectors, e.g. word2vec or one-hot encodings. P.S. For more flexibility, I recommend trying Lasagne or TensorFlow. Keras is good when a problem needs to be solved quickly, with a minimum of code and theory; for experiments and learning, it is better to use tools that are closer to the hardware (less abstract).
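To make the embedding-layer idea concrete, here is what such a layer does, sketched as a plain lookup table (in Keras this is keras.layers.Embedding(input_dim=vocab_size, output_dim=dim), whose weights can be initialized from pretrained word2vec vectors; the numbers below are made up for illustration):

```python
# An embedding layer is a lookup table from word index to a dense vector.
embedding_matrix = [
    [0.0, 0.0, 0.0],   # index 0: padding
    [0.1, 0.2, 0.3],   # index 1: e.g. "Hi"
    [0.4, 0.5, 0.6],   # index 2: e.g. "how"
]

def embed(indices):
    """Replace each integer index with its vector (one sequence -> list of vectors)."""
    return [embedding_matrix[i] for i in indices]

print(embed([1, 2, 0]))  # [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.0, 0.0, 0.0]]
```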

Alexander Pozharsky, 2017-08-02
@alex4321

1. Tokenize — e.g. with nltk.tokenize.
2. Then it is probably worth removing stop words — e.g. those from nltk.corpus.stopwords.
3. Stemming will probably be useful too — e.g. nltk.stem.
4. Then replace the words (or roots, after stemming) with some kind of embedding that maps each word to a vector — e.g. word2vec. Or use your own embedding, for example (though this approach is obviously expensive in terms of memory):
4.1. Build a dictionary containing all the words of the training sample.
4.2. Assign each word a number; the text is then represented by a one-dimensional array of numbers.
4.3. Replace each number N with a one-hot vector whose N-th element is 1 and the rest are 0.
4.4. Add an embedding layer at the input of the network and a reverse (projection) layer at the output.
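The steps above can be sketched in pure Python. In practice, nltk.word_tokenize, nltk.corpus.stopwords, and nltk.stem would handle steps 1-3 properly; the tiny stand-ins below (toy stopword list, whitespace tokenizer) just illustrate the data flow through steps 4.1-4.3:

```python
STOPWORDS = {"how", "is", "the"}  # toy list; nltk.corpus.stopwords is far larger

def preprocess(text):
    tokens = text.lower().split()                      # 1. tokenize (crude)
    return [t for t in tokens if t not in STOPWORDS]   # 2. drop stop words

corpus = ["hi how doing", "everything excellent"]
processed = [preprocess(t) for t in corpus]

# 4.1-4.2: dictionary of all training words, each assigned a number
vocab = {w: i for i, w in enumerate(sorted({w for s in processed for w in s}))}

# 4.3: one-hot vector for each number (memory-hungry, as noted above)
def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

print(vocab)           # {'doing': 0, 'everything': 1, 'excellent': 2, 'hi': 3}
print(one_hot("hi"))   # [0, 0, 0, 1]
```

Step 4.4 then corresponds to an Embedding layer on the input side and a Dense layer with a softmax over the vocabulary on the output side.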
