PHP
Ruslan, 2017-11-14 09:46:11

How do I find duplicates in a product database?

The database contains 20,000 products from different categories. Each product has a brand, model, article number, category, characteristics, and description, but these fields are filled in inconsistently: not every product has them, and not every field is filled.
I need to find all duplicates of a product. Maybe someone has already done something similar, or can tell me in which direction to dig in order to determine the percentage of similarity between two products. In short, it is a search problem.

1 answer(s)
Pavel Belyaev, 2017-11-14
@mitrm

I dealt with this problem once. Without a human reviewing the results, there will be too many errors.
1. Name similarity: build a conversion table of synonyms (for example, "processor" and "cpu", "ram" and "memory") so that every variant maps to a single word, much like a lemma.
2. Look up each word of the left product's name in the right one. This gives a percentage of matching words, but take the word counts into account: if the left name has 3 words and all of them match, while the right name has 4, the similarity is not 100%.
3. Try to extract something like a model or article number with a mask, for example a fairly long sequence of digits. A match here gets the maximum weight.
4. Use additional attributes such as price: if one product costs 10k and the other 20k, they are probably different; if one costs 4.5 and the other 5, it is probably the same thing.
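
Steps 1 and 2 could be sketched roughly like this in Python. The `SYNONYMS` table and both function names are made up for illustration, and the score used here is the Jaccard index (shared words divided by all distinct words), which is just one way to make extra words on either side lower the similarity:

```python
import re

# Hypothetical synonym table: every variant maps to one canonical word.
SYNONYMS = {
    "processor": "cpu",
    "memory": "ram",
}

def normalize(name: str) -> set:
    """Lowercase the name, split it into words, and replace synonyms."""
    tokens = re.findall(r"\w+", name.lower())
    return {SYNONYMS.get(token, token) for token in tokens}

def name_similarity(left: str, right: str) -> float:
    """Shared words divided by all distinct words of both names (Jaccard index)."""
    a, b = normalize(left), normalize(right)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

print(name_similarity("Intel processor i5", "intel CPU i5"))   # 1.0
print(name_similarity("Intel CPU i5", "Intel CPU i5 boxed"))   # 0.75
```

The second call is the "3 words against 4" case from step 2: all three left words are found, but the extra word on the right keeps the score below 100%.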
In general, this has to be worked out in detail for a specific product profile; even Yandex Market sometimes returns wrong matches in its search.
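
As a rough illustration of steps 3 and 4, here is a Python sketch. The mask (a token of at least 5 characters containing at least 3 digits), the 25% price tolerance, and the weights in `duplicate_score` are all arbitrary assumptions that would have to be tuned on real data:

```python
import re

def article_codes(text: str) -> set:
    """Tokens that look like a model or article number under the assumed mask."""
    tokens = re.findall(r"[A-Za-z0-9-]+", text)
    return {
        token.upper()
        for token in tokens
        if len(token) >= 5 and sum(ch.isdigit() for ch in token) >= 3
    }

def prices_close(price_a: float, price_b: float, tolerance: float = 0.25) -> bool:
    """True if the cheaper price is within `tolerance` of the more expensive one."""
    low, high = sorted((price_a, price_b))
    if low <= 0:
        return False  # missing or invalid price: treat as not comparable
    return low / high >= 1 - tolerance

def duplicate_score(name_sim, left_text, right_text, left_price, right_price):
    """Combine the signals; the 0.9 floor and 0.5 penalty are illustrative only."""
    score = name_sim
    if article_codes(left_text) & article_codes(right_text):
        score = max(score, 0.9)  # a shared article number is the strongest signal
    if not prices_close(left_price, right_price):
        score *= 0.5             # very different prices make a duplicate less likely
    return score

print(article_codes("Samsung Galaxy S8 SM-G950F 64GB"))  # {'SM-G950F'}
print(prices_close(10000, 20000))                        # False
print(prices_close(4.5, 5))                              # True
```

Pairs that score above some threshold would then go to a person for review rather than being merged automatically.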
