How to parse text into meaningful phrases (tags)?
Hello. I have a text parsing problem.
Let's say we have the description from an ad for the sale of a car.
"In my opinion, the best price-quality ratio for cars from 2006 to the present.
Good leather interior. Eight airbags!
Six-cylinder reliable engine.
Excellent stylish interior.
And it's good precisely because there are no modern problems and you are focused on driving. In my opinion, it's the best you can afford for 500.000-600.000 rubles.
People don't understand how much this car costs, especially girls.
Of the features, the one I liked most is the fully heated windshield! In fact, it is triplex glass reinforced with a metal mesh. So many stones have hit the windshield, and not the slightest chip.
In general, all the glass on it is thick.
Rear overhang 30 centimeters! I.e., you can drive up to any curb!
Excellent soundproofing. Compared to Mercedes, it is at the level of the E-class.
In general, all those I met along the way who drive Japanese cars said: with this car, you can see the metal! Those who got behind the wheel said: it drives!"
From the text we can understand that:
1) Interior - leather
2) Engine - 6 cylinders
3) Airbags - 8
and similar logical conclusions.
As input I have millions of descriptions; as output I would like to get a set of parsed "tags" such as "leather interior" and the like.
What technologies and tools can be used for this, and what should I read on the topic? How should I approach the problem correctly? What is the name for what I want to do?
The recognition should not rely on any predefined parameters: I want all tags to be discovered automatically, say, based on similar data in other ads.
If the phrase "leather interior" is found in 100 ads, it is marked as a separate tag and can then be used to distinguish those ads from the others. After that, in semi-automatic mode, I would indicate that the phrase "the interior is made entirely of leather" also corresponds to the "leather interior" tag, and so on.
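To make the semi-automatic part concrete, here is a rough sketch of what I have in mind, not a finished design. The alias table `ALIASES` and the function names `normalize` and `tags_for` are made up for illustration; the table would be filled in by hand as new phrasings are reviewed.

```python
import re

# Hypothetical alias table: tag -> phrasings a reviewer has mapped to it.
ALIASES = {
    "leather interior": [
        "leather interior",
        "interior is made entirely of leather",
    ],
}

def normalize(text):
    # Lowercase and collapse everything except word characters to single spaces.
    return " ".join(re.findall(r"\w+", text.lower()))

def tags_for(ad, aliases=ALIASES):
    # Pad with spaces so phrases only match on whole-word boundaries.
    padded = f" {normalize(ad)} "
    return sorted(
        tag
        for tag, phrases in aliases.items()
        if any(f" {normalize(p)} " in padded for p in phrases)
    )

print(tags_for("The interior is made entirely of leather!"))  # ['leather interior']
```

This only does exact phrase matching after normalization, so word-order changes or inflected forms would still need new aliases or some stemming step.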
My colleagues suggested splitting the text into bigrams and trigrams (2 words, 3 words) and writing them to the database. All subsequent ads would then be processed the same way, and the matching bigrams picked out in them.
Does anyone have experience with similar systems? Please share.
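As I understand the n-gram idea, it would look roughly like the sketch below. This is only my reading of the suggestion, using plain Python and an in-memory counter instead of a database. The names `tokenize`, `ngrams`, `candidate_tags` and the `min_ads` threshold are my own assumptions.

```python
import re
from collections import Counter

def tokenize(sentence):
    # Lowercased words; hyphenated words such as "six-cylinder" stay whole.
    return re.findall(r"\w+(?:-\w+)*", sentence.lower())

def ngrams(tokens, n):
    # All runs of n consecutive tokens, joined back into a phrase.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidate_tags(ads, sizes=(2, 3), min_ads=2):
    # Count in how many ads each n-gram occurs (once per ad at most).
    doc_freq = Counter()
    for ad in ads:
        seen = set()
        # Split on sentence punctuation so n-grams do not cross sentences.
        for sentence in re.split(r"[.!?\n]+", ad):
            tokens = tokenize(sentence)
            for n in sizes:
                seen.update(ngrams(tokens, n))
        doc_freq.update(seen)
    # Keep only phrases that appear in at least min_ads different ads.
    return {phrase: count for phrase, count in doc_freq.items() if count >= min_ads}

ads = [
    "Good leather interior. Eight airbags!",
    "Leather interior, heated windshield.",
    "Six-cylinder engine.",
]
print(candidate_tags(ads))  # {'leather interior': 2}
```

With millions of ads the threshold would be something like the 100 mentioned above, and I suspect a lot of frequent but meaningless phrases ("in my opinion") would pass it, so some stop-word filtering would probably be needed.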
Thank you!