What method or algorithm can be used to harmonize/normalize a reference directory?

data mining

Nik, 2015-11-10 15:58:46

Which algorithms and methods can be used to solve the following problem?
A company has a single nomenclature directory of items, maintained by several departments. The departments' editions of the directory contain entries for the same item that differ in how much detail they give.
For example, one edition has an entry like Bolt M6 GOST 123-34 ...., another has Bolt M6 0.45 0.8yu ..., and a third has Bolt M6 0.4..08mm ...
These entries describe the same item at different levels of detail. The task is to find such duplicates and enrich each entry with the full set of attributes that appear across all editions of the directory.
The example is simplified; in reality the "same" entries may look like this:
Hot-formed boiler pipe B 20 TU 14-3-460

3 answer(s)

Roman Mirilaczvili, 2015-11-10
@2ord

I agree with alexxandr that this can hardly be done without manual work.
Employees will have to be involved in normalizing the directory.
In general, for this task it is worth lexically parsing the strings to extract attributes and determining each attribute's class (for example, "14-3-460" is the item's catalog index).
Some attributes can be given a higher priority, and the item name is then determined by them.
To detect duplicate names, use the Levenshtein distance algorithm.
Using a dictionary (which you will need to have), find abbreviations, replace them with their full forms, and merge equivalent attributes into one.
Other tools worth bringing in: fuzzy search, fuzzy logic, (probabilistic?) classifiers.
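
A rough sketch of the Levenshtein and dictionary steps in Python, not a finished solution. The abbreviation dictionary, the function names, the 0.9 threshold, and the choice to compare only the first two tokens (the "name" part) are all assumptions made for illustration; a real directory would need its own dictionary and a more careful choice of which attributes to compare.

```python
# Hypothetical abbreviation dictionary; a real one has to be compiled by hand.
ABBREVIATIONS = {"hf": "hot-formed", "blr": "boiler"}


def normalize(name: str) -> list[str]:
    """Lowercase, split on whitespace, and expand known abbreviations."""
    return [ABBREVIATIONS.get(t, t) for t in name.lower().split()]


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def merge_attributes(a: list[str], b: list[str]) -> list[str]:
    """Union of the tokens of two entries, keeping first-seen order."""
    merged = list(a)
    for token in b:
        if token not in merged:
            merged.append(token)
    return merged


if __name__ == "__main__":
    a = normalize("Bolt M6 GOST 123-34")
    b = normalize("Bolt M6 0.45 0.8yu")
    # Assumption: the first two tokens carry the item name.
    if similarity(" ".join(a[:2]), " ".join(b[:2])) >= 0.9:
        print(merge_attributes(a, b))
```

Comparing whole strings would score these two entries low because their details differ, which is why the sketch compares only the name part before merging the rest.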

alexxandr, 2015-11-10
@alexxandr

Either way, the result will be inconsistent, and you will have to process it manually anyway.
You can compute distance vectors for the product names, where each element of the vector is, as you might guess, a word; pairs of names that score above a certain threshold are most likely the same product.
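
One possible reading of the word-vector idea, sketched in Python: each name becomes a bag-of-words vector and two names are compared by cosine similarity. The answer does not name a specific measure, so cosine similarity, the function names, and the 0.5 threshold are assumptions here.

```python
import math
from collections import Counter


def word_vector(name: str) -> Counter:
    """Bag-of-words vector: each distinct word is one dimension."""
    return Counter(name.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors, in [0, 1]."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def likely_same(name1: str, name2: str, threshold: float = 0.5) -> bool:
    """True when the names share enough words; the threshold is a guess."""
    score = cosine_similarity(word_vector(name1), word_vector(name2))
    return score >= threshold


if __name__ == "__main__":
    print(likely_same("Bolt M6 GOST 123-34", "Bolt M6 0.45 0.8yu"))
```

The threshold has to be tuned on real data, and borderline pairs are exactly the ones that would go to manual review.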

Roman Sedykh, 2016-01-21
@mrRomkin

Read my answer to a similar question: https://toster.ru/answer?answer_id=743052
In your case, though, the rules that classify the words of a name by feature type also need a weight for each feature, for example so that "14-3-460" in the name you gave carries a higher weight. Catch patterns like this with regular expressions (three consecutive groups of 1-3 digits separated by a hyphen/dash/minus), and so on.
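
A minimal Python sketch of weighted feature rules, under stated assumptions. The feature names, the weights, the extra rules for thread size and plain numbers, and the weighted-overlap score are all invented for illustration; only the three-group digit pattern comes from the description above. The second name in the usage example is hypothetical.

```python
import re

# (feature type, pattern, weight); the weights are illustrative guesses.
RULES = [
    # Three groups of 1-3 digits joined by hyphen, en dash, em dash, or minus.
    ("catalog_index",
     re.compile(r"^\d{1,3}(?:[-\u2013\u2014\u2212]\d{1,3}){2}$"), 5.0),
    ("thread_size", re.compile(r"^m\d+$"), 3.0),
    ("number", re.compile(r"^\d+(?:[.,]\d+)?$"), 1.0),
]
DEFAULT = ("word", 1.0)


def classify(token: str) -> tuple[str, float]:
    """Return (feature type, weight) for a lowercase token."""
    for feature, pattern, weight in RULES:
        if pattern.match(token):
            return feature, weight
    return DEFAULT


def weighted_overlap(name1: str, name2: str) -> float:
    """Weight of shared tokens divided by the weight of all tokens."""
    a = set(name1.lower().split())
    b = set(name2.lower().split())
    union = a | b
    if not union:
        return 0.0
    total = sum(classify(t)[1] for t in union)
    shared = sum(classify(t)[1] for t in a & b)
    return shared / total


if __name__ == "__main__":
    full = "Hot-formed boiler pipe B 20 TU 14-3-460"
    short = "Pipe 20 TU 14-3-460"  # hypothetical shorter variant
    print(classify("14-3-460"), weighted_overlap(full, short))
```

Because the catalog index carries most of the weight, two entries that share it score high even when many descriptive words differ.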