Python
Astrohas, 2017-09-07 06:34:12

What would you recommend for comparing two sentences in meaning?

Hello, dear Toster users! In a project built around tests, I need a semantic (meaning-based) comparison of two short (2-5 word) sentences. What can you advise?
So far I'm thinking of canonicalizing the words with pymorphy2, unifying them into a single form via a synonym database, and then comparing the results (a sketch of this step follows below).
I would like to know your experience in this area.
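
As a minimal sketch of the canonicalization step the question describes, assuming pymorphy2 is installed; the SYNONYMS table here is a made-up stand-in for a real synonym database:

```python
import pymorphy2

morph = pymorphy2.MorphAnalyzer()

# Made-up synonym table: each surface lemma is mapped to one canonical lemma.
SYNONYMS = {"автомобиль": "машина", "авто": "машина", "идти": "ехать"}

def canonicalize(sentence):
    """Lemmatize each word with pymorphy2, then unify synonyms to one canonical form."""
    lemmas = [morph.parse(word)[0].normal_form for word in sentence.lower().split()]
    return [SYNONYMS.get(lemma, lemma) for lemma in lemmas]

print(canonicalize("Красивая машина ехала быстро"))
# e.g. ['красивый', 'машина', 'ехать', 'быстро']
```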


1 answer
SolidMinus, 2017-09-08
@Astrohas

Unification works well as text preprocessing. That is, you build a mapping <K, V>, where K is a numeric ID of a synonym class (all synonyms of one word fall into the same class) and V is the set of words belonging to that class (encoded as Unicode strings, as I understand it in this case). The inverse problem, finding the corresponding K for a given word V, is solved almost like a dictionary lookup, only in reverse. A sentence is then transformed into a sequence of K_i values, and this unified sequence, as you rightly said, is what gets analysed. Before the analysis, the unified vector should be mapped into a space of fixed dimension, so that all sentences effectively have the same length. You can simply pad with zeros, for example, so that every vector has 5 components (the maximum sentence length).
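
A small sketch of that encoding step; the MAX_LEN of 5 follows the answer, while the synonym-class table and its IDs are purely illustrative:

```python
# Hypothetical synonym-class table: all synonyms in a class share one numeric ID (K).
SYNONYM_CLASSES = {
    "машина": 1, "автомобиль": 1, "авто": 1,
    "ехать": 2, "идти": 2,
    "быстро": 3, "скоро": 3,
}

MAX_LEN = 5  # maximum sentence length, per the answer

def encode(lemmas):
    """Map each lemma to its synonym-class ID and zero-pad to a fixed length."""
    ids = [SYNONYM_CLASSES.get(lemma, 0) for lemma in lemmas][:MAX_LEN]  # 0 = unknown word
    return ids + [0] * (MAX_LEN - len(ids))                              # zero-padding

print(encode(["машина", "ехать", "быстро"]))  # [1, 2, 3, 0, 0]
print(encode(["авто", "идти", "скоро"]))      # [1, 2, 3, 0, 0] -- same unified vector
```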
There are two options: the first is more powerful but more complex, the second is simpler.
1) LSTM networks. Why LSTM? Because this type of RNN is the best suited to sequence analysis.
The output defines a semantic class. That is, we build an RNN with 5 inputs and as many outputs as there are semantic classes; it produces a k-dimensional vector of class probabilities, and argmax(output) is our class. This is a typical multiclass classification problem, just solved with an RNN. If you are not comfortable with RNNs, you can use an ordinary MLP, but the output will be poor, because each element of the sequence depends on the previous states; we don't have sentences like "hello no yes bye bye eh car".
This requires pre-training on a large, manually labelled database: such-and-such a vector belongs to such-and-such a class (a sketch follows below).
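
A minimal Keras sketch of such a classifier, assuming the sentences are already encoded as fixed-length class-ID sequences and labelled with semantic classes by hand; the vocabulary size and class count are placeholders:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

MAX_LEN = 5                 # padded sentence length
NUM_SYNONYM_CLASSES = 1000  # placeholder: size of the synonym-class vocabulary
NUM_SEMANTIC_CLASSES = 20   # placeholder: number of manually labelled meaning classes

model = Sequential([
    Embedding(NUM_SYNONYM_CLASSES + 1, 32),             # +1 so the padding ID 0 has a row
    LSTM(64),                                            # reads the 5-step sequence
    Dense(NUM_SEMANTIC_CLASSES, activation="softmax"),   # probability over semantic classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# X: (n_samples, MAX_LEN) int array of class-ID sequences, y: (n_samples,) class labels
# model.fit(X, y, epochs=10, batch_size=32)

# argmax over the softmax output gives the predicted semantic class:
# np.argmax(model.predict(np.array([[1, 2, 3, 0, 0]])), axis=-1)
```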
2) You can take a simpler path, without neural networks, and use the norm of the difference between two vectors: the smaller the norm, the closer the sentences are in meaning. After all, the numerical sequence of a sentence is a vector in an n-dimensional space; in our case, after padding, a 5-dimensional one. The norm generalizes distance to higher dimensions: the difference of the two sentence vectors is a third vector whose length is the distance between them. Various metrics can be used, whichever you prefer; I would take the Minkowski metric with p = 2 (i.e. the Euclidean distance).
No pre-training is required, and no real complexity either, just school math (a sketch follows below). But sentences like, for example:
"Today I went to school again" and "Tomorrow I will go on a business trip again" may come out as identical in meaning. This is what Maxim Chernyatevich was getting at: with a synonym database you can only do the simplest analysis, because after normalization by synonyms into one vector such sentences will most likely be completely equal.
