PHP
Evgeny Sofonov, 2014-06-29 16:01:53

What algorithm or library should I use to group words by meaning?

I have never had to sort words (phrases) by similarity before.
The number of words/phrases to sort is in the thousands.
Here is a sample set of phrases:
a person's problem
a person's problems
a person's problem solving
a person's problem solving
a person's problem solving
a person's problems
a person's problems solving
a person's problems solving
a person's problems solving
a person's problems
questions of a person's problems
a question of a person's problems
an answer of a person's problems
answers
a person has problems
a person has a problem
a person no problem
man no problem
I need to find similarities in meaning among these lines and sort them into 3-4 groups of phrases. Have you seen solutions in PHP or Python anywhere? Are there any ready-made libraries? Thanks in advance for your help.
What is the solution for?
To analyze the texts collected by a parser, sort them into categories, and compile reports by parameter.

3 answer(s)
Viktor Vsk, 2014-06-29
@viktorvsk

It is necessary to find similarities in these lines in meaning

That's like asking: "Have you come across ready-made Photoshop plug-ins that would make thousands of photos from hundreds of genres look beautiful?"
If all your phrases are of this type, compute the "distances" between them and group them by distance range.
The distance can be computed, for example, from the number of identical letters in the two phrases or from their exact sequence.
For example, the distance between "a person has no problems" and "a person has no problem" is 1, while between "a person has no problems" and "a person's problem solution" it is either 0 or roughly the length of the string (the longer or the shorter one) minus the number of letters in the word "person".
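To make the distance idea above concrete, here is a minimal Python sketch. It is only one possible reading of "number of identical letters": the distance is the letter count of the longer phrase minus the letters the two phrases share, with spaces ignored. The function names, the greedy grouping strategy, and the `max_distance` threshold are all assumptions made for illustration, not something specified in the answer.

```python
from collections import Counter


def _letters(phrase: str) -> str:
    """Drop spaces so only the letters (and punctuation) are compared."""
    return phrase.replace(" ", "")


def shared_letters(a: str, b: str) -> int:
    """Count characters the two phrases have in common (multiset intersection)."""
    return sum((Counter(_letters(a)) & Counter(_letters(b))).values())


def letter_distance(a: str, b: str) -> int:
    """Characters of the longer phrase that have no counterpart in the other."""
    longest = max(len(_letters(a)), len(_letters(b)))
    return longest - shared_letters(a, b)


def group_by_distance(phrases, max_distance):
    """Greedy grouping (an assumed strategy): each phrase joins the first
    group whose first member is within max_distance, else starts a new group."""
    groups = []
    for phrase in phrases:
        for group in groups:
            if letter_distance(phrase, group[0]) <= max_distance:
                group.append(phrase)
                break
        else:
            groups.append([phrase])
    return groups


if __name__ == "__main__":
    sample = [
        "a person has no problems",
        "a person has no problem",
        "solving a person's problem",
    ]
    for group in group_by_distance(sample, max_distance=3):
        print(group)
```

With this definition the first two sample phrases are at distance 1, as in the example above. The threshold has to be tuned by hand, and the result depends on the order of the phrases, so treat this as a starting point rather than a real clustering method.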
Or try building your own grammars and facts, with blackjack and Tomita.
P.S. In general, this smells of SEO, and if so, don't mask unpleasant odors, eliminate them (c)

Dmitry Fondomakin, 2014-06-30
@defond

It is necessary to find similarities in these lines by meaning and sort

Dear author of the question, viktorvsk answered you correctly: the question is posed incorrectly.
As I understand it from the data you provided, the sorting is to be done by meaning. But what is the meaning? In my opinion, there are several options in this case: the meaning could be grouping by the word "problem", by the word "task", or by "there is / is not a solution". Which of these matters to you is not clear from the question.
viktorvsk is also right that the easiest option is to search by distance. I combined two approaches: Levenshtein and Oliver.
Here is an implementation of Levenshtein distance. Take a look and try it; it is all very simple, and for words longer than 3 characters it gives very good results.
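As an illustration of combining the two measures, here is a hedged Python sketch. `levenshtein` is the standard dynamic-programming edit distance. `oliver_common` follows the usual description of Oliver's algorithm (the one behind PHP's `similar_text`): take the longest common substring, then recurse on the parts to its left and right. How the two were actually combined is not stated above, so averaging the two normalized scores in `combined_similarity` is purely an assumption, as are all the function names.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def oliver_common(a: str, b: str) -> int:
    """Matching characters, Oliver style: longest common substring first,
    then the same procedure on the left and right remainders."""
    if not a or not b:
        return 0
    best_len = best_i = best_j = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend the common run
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_i, best_j = i - best_len, j - best_len
        prev = curr
    if best_len == 0:
        return 0
    return (best_len
            + oliver_common(a[:best_i], b[:best_j])
            + oliver_common(a[best_i + best_len:], b[best_j + best_len:]))


def combined_similarity(a: str, b: str) -> float:
    """Average of both measures scaled to 0..1 (the averaging is an assumption)."""
    if not a and not b:
        return 1.0
    lev = 1 - levenshtein(a, b) / max(len(a), len(b))
    oli = 2 * oliver_common(a, b) / (len(a) + len(b))
    return (lev + oli) / 2


if __name__ == "__main__":
    print(combined_similarity("problem", "problems"))
    print(combined_similarity("problem", "answer"))
```

Both measures compare characters, not meaning, so "problem" and "task" will still score as unrelated. Pairs scoring above some hand-picked threshold can be treated as belonging to the same group.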
Or, as already advised, use ready-made solutions; Google turns up plenty of them.
Addition
Here is an implementation of the algorithms in PHP and other languages.
There was a Python implementation somewhere, but I can't find it at the moment. See also this article.
Good luck. :)

Rorg, 2014-09-07
@Rorg

these lines have similarities in meaning and sort into 3-4 groups of phrases

If I understand correctly, you should probably look toward neural networks. Alternatively, a Markov chain may work well (though it suits large texts better than short phrases).
