Algorithms
alexbbfg, 2015-09-21 14:35:51

How to find duplicate records by meaning?

Good day to all!
There is a list (nomenclature) of objects, stored as a table of names. Each name consists of several words and possibly an alphanumeric suffix. Some objects in the list are identical in meaning but spelled differently. For example:
Metal coupling 3-M,
Metal. coupling 3-M and
Metal coupling 5-B,
Here "Metal coupling 3-M" and "Metal. coupling 3-M" are the same object in meaning but differ in spelling.
So we need an algorithm that analyzes the list and finds objects that are identical in meaning (at least a 70 percent match).
What approaches can be used for this analysis? Are there perhaps ready-made solutions? We are also interested in theoretical material describing the principles; we will manage to write the code ourselves.
I will be glad of any help.

3 answer(s)
Optimus, 2015-09-21
Pyan @marrk2

1. First, define synonyms,
e.g. "Metal." = "metal." = "metal".
2. If the word "metal" is not important but the word "coupling" is, define a list of ignored words to delete before the analysis.
3. Is "3-M" important in this case or not?
4. In short: filter out what is unimportant, convert everything to one case, and look for exact matches, or group by the given characters. Extract the model into a separate string with a regular expression, and require two matches in the string being searched: one for the word and one for the model.
5. If the words are long and inflected, use the Porter stemmer, though it works poorly on short words.
That's all.
Example:
Metal coupling 3-M
Delete "metal" and convert to lowercase, which gives "coupling 3-m". Split on the space into two strings: "coupling" and "3-m".
Search in a loop:
String: "Metal washer 3-M". Convert to lowercase: "metal washer 3-m". Check for "coupling": no. Check for "3-m": yes. The string does not fit, because both the word and the model must match.
And use regular expressions as needed (here "муфт" is the stem of the Russian word for "coupling"):

/муфт[а-яё]+/ismu
/[0-9]-м/ismu // finds all models from 0-м to 9-м
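
The steps above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the ignored-word list, the model pattern (digits, a hyphen, one Latin letter) and the function names are hypothetical and written for the English examples, not taken from the answer.

```python
import re

# Assumption: words treated as unimportant; extend for your nomenclature.
IGNORED = {"metal"}
# Assumption: a model code looks like "3-m" (digits, hyphen, one letter).
MODEL_RE = re.compile(r"\b\d+-[a-z]\b")

def normalize(name):
    """Lowercase, pull out the model code, drop punctuation and ignored words."""
    text = name.lower()
    match = MODEL_RE.search(text)
    model = match.group(0) if match else None
    if match:
        # Cut the model code out so it is not counted as a word.
        text = text[:match.start()] + " " + text[match.end():]
    words = [w for w in re.findall(r"[a-z]+", text) if w not in IGNORED]
    return frozenset(words), model

def same_object(a, b):
    """Both the remaining words and the model code must coincide."""
    return normalize(a) == normalize(b)

print(same_object("Metal coupling 3-M", "Metal. coupling 3-M"))  # True
print(same_object("Metal coupling 3-M", "Metal washer 3-M"))     # False
```

Stripping punctuation makes "Metal." and "metal" identical here; real abbreviations (a shortened word standing for a longer one) would need an explicit synonym table instead.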

protven, 2015-09-21
@protven

Use similarity metrics to search for similar objects; the keywords are Jaccard similarity, shingling, and minhashing. You can read about them here: infolab.stanford.edu/~ullman/mmds/book.pdf , Chapter 3, "Finding Similar Items".
There is a Coursera course based on this book, with videos, live examples, and explanations: https://class.coursera.org/mmds-003 . The second week's material has just become available; I am taking this course right now. There is also a course from the Computer Science Center based on the same book that covers these topics: https://www.lektorium.tv/course/22822 . But I did not like that course: just when it got to the topic you need, the lecturer was clearly confused in her explanations.
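
To make the keywords concrete, here is a minimal sketch of Jaccard similarity over character shingles, applied to the names from the question. The shingle length `k=3`, the 0.7 threshold (taken from the "70 percent" in the question) and all function names are assumptions for illustration, not something the book or the course prescribes.

```python
from itertools import combinations

def shingles(name, k=3):
    """Set of character k-grams of the lowercased name, whitespace collapsed."""
    text = " ".join(name.lower().split())
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def similar_pairs(names, threshold=0.7, k=3):
    """Compare every pair of names; keep those at or above the threshold."""
    sets = [shingles(n, k) for n in names]
    return [
        (names[i], names[j])
        for i, j in combinations(range(len(names)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]

names = ["Metal coupling 3-M", "Metal. coupling 3-M", "Metal coupling 5-B"]
print(similar_pairs(names))
# [('Metal coupling 3-M', 'Metal. coupling 3-M')]
```

With these settings the first two names score 14/19 (about 0.74), while "Metal coupling 3-M" and "Metal coupling 5-B" score 13/19 (about 0.68), so the result is sensitive to the threshold. Comparing all pairs is quadratic in the list size; minhashing is the technique from the same chapter for estimating these similarities on large lists.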

Anton Fedoryan, 2015-09-21
@AnnTHony

Take the first 3 characters of each word, for example "met", "cou", "3-M", and look for occurrences of these characters in the rest of the text.
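
A minimal sketch of this prefix idea, assuming words are separated by whitespace and case is ignored; the function names are hypothetical.

```python
def prefixes(name, n=3):
    """First n characters of each whitespace-separated word, lowercased."""
    return [word[:n] for word in name.lower().split()]

def prefix_match(query, candidate, n=3):
    """True if every prefix of the query occurs somewhere in the candidate."""
    text = candidate.lower()
    return all(p in text for p in prefixes(query, n))

print(prefixes("Metal coupling 3-M"))                             # ['met', 'cou', '3-m']
print(prefix_match("Metal coupling 3-M", "Metal. coupling 3-M"))  # True
print(prefix_match("Metal coupling 3-M", "Metal coupling 5-B"))   # False
```

Because this is a plain substring search, a prefix such as "met" would also match inside unrelated words, so it is best treated as a coarse first filter.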
