How to approach such a task of classification and search?

Z

zod ggs2017-06-06 17:30:53

Algorithms

zod ggs, 2017-06-06 17:30:53

Task - it is necessary to break it down into its component parts (name, weight, quantity, manufacturer) and (OR) link the two sets together. One is a record of the names accepted by the supplier (and there are 2000 suppliers, and all have different rules), and the second is already brought to the normalized name, in compliance with all the rules.
The first set (7 million records), an example of names:
Banana 1000 pieces box 1c Banana 1000 pieces
100 kilograms FrutProd
Banana N1000 pack. 100kg Fruit company
Banana 1000 crate 1 centner
Figa 2000 pcs. box 0.6c
FrutProd Figa 2000sh 60kg
Figa 2000 box 0.06 tons
Figa N2000 0.06t
Second set (100k normalized items), example:
Banana 1000 pieces box 100kg
Fig 2000 pieces box 60kg The
good thing is that most of the second set was MULTIPLE connected with most of the first set.
Those. variations of the name "Banana 1000 pieces box 100kg" in the first set of 50-100 pieces, and 20-30 of them are already associated with the second set.
There is already a connected sample for training on 2.7 million records.
The old algorithm (in short - the number of occurrences of letter pairs and numbers) uses a very bold index (at the moment more than 450 million records and growth of 5-7 million records per week) and is already failing in terms of speed and resource consumption.
Colleagues suggested that I should look at machine learning algorithms and ANNs. What do you recommend?

Reply

Answer the question

In order to leave comments, you need to log in

2 answer(s)

D

Dimonchik, 2017-06-06
@dimonchik2013

approach something like this - build a search for the name, then search for the package, then the quantity and measurements
the name, most likely, is almost always easy
then, paired with it, it’s easy to classify the number (it’s boxes or pieces or packages)
some functions can be used to prepare (“c” not included in the box and packaging, etc.)
further determine which contain all the components and which do not,
well, look at the current algorithm, maybe it’s easier to optimize it, look at
Tamita parser (xs how is it for this, but you can build your own there)

S

Sergey, 2017-06-06
@begemot_sun

The task is seen in the construction of a certain multidimensional metric for the input strings.
That. the distance between the already classified line and the line that should be associated with this classified one should be minimal.
The construction of this metric should be done by a recurrent neural network.
How to do all this, you better not ask.