Parsing
Pavel, 2015-01-05 17:48:47

How to identify the same product in different online stores?

I need to build a product catalog in which each product links to its listings in online stores, along the lines of price.ua, ava.ua, hotline.ua and similar services.
I have considered the following options:

  1. Load the data of a reference catalog, for example Yandex.Market. Then go through the products in each online store and try to match them against the reference products by title, after stripping marketing and SEO filler from the names.
  2. Go through the online stores one at a time. Take the first store, retrieve all its products and clean up the titles. Then take the second store and try to match its products against those of the previous store. If there is no match, add the product as a new entry.

I really dislike all this guesswork with names. Say the same laptop is listed on Yandex.Market and on Rozetka. Stripping words like "Notebook" and "Super price!" is not a problem; those are boilerplate parts of the title. But a model name that differs between stores is more serious ("E1-571G-33114G75Ma" vs "E1-571G-33114G75MAKS").
Are there more accurate ways to identify products, so that I don't have to rely on guessing by name?

3 answer(s)
Andrey Bezpalov, 2015-01-05
@Andrbez

Use the combination of manufacturer and article number (part number). In the example given, that is Acer + NX.M7CEU.036.
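A minimal sketch of that idea in Python, assuming each store listing has already been parsed into a dict. The field names `manufacturer`, `article` and `url` and the function names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def product_key(manufacturer: str, article: str) -> tuple[str, str]:
    """Build a composite key; case and surrounding spaces are ignored."""
    return (manufacturer.strip().casefold(), article.strip().upper())


def group_offers(offers: list[dict]) -> dict[tuple[str, str], list[str]]:
    """Group listing URLs from different stores under one product key."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for offer in offers:
        key = product_key(offer["manufacturer"], offer["article"])
        groups[key].append(offer["url"])
    return dict(groups)
```

Two listings whose titles differ but which both carry Acer and NX.M7CEU.036 end up under the same key, so no title comparison is needed for them.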

Sergey Nalomenko, 2015-01-05
@nalomenko

Yandex.Market, pricelist.ua and other product aggregators simply let online store owners upload their product data in a special XML format, which includes fields for the model code of each product.
For Yandex, this format is YML. For the others, look in the developer sections of each site.
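As a rough sketch, such a feed could be read with Python's standard XML parser. The sample feed below is invented, and the element names (`offer`, `vendor`, `vendorCode`, `model`, `url`) are written from memory of the YML format, so check them against the format description before relying on them.

```python
import xml.etree.ElementTree as ET

# Invented sample feed; element names are assumed from the YML format.
SAMPLE_FEED = """
<yml_catalog date="2015-01-05 17:00">
  <shop>
    <offers>
      <offer id="1001">
        <url>http://shop.example/acer-e1-571g</url>
        <vendor>Acer</vendor>
        <vendorCode>NX.M7CEU.036</vendorCode>
        <model>Aspire E1-571G-33114G75Ma</model>
      </offer>
    </offers>
  </shop>
</yml_catalog>
"""


def read_offers(feed_xml: str) -> list[dict]:
    """Extract the identifying fields of every <offer> in the feed."""
    root = ET.fromstring(feed_xml.strip())
    offers = []
    for offer in root.iter("offer"):
        offers.append({
            "vendor": offer.findtext("vendor", default="").strip(),
            "vendor_code": offer.findtext("vendorCode", default="").strip(),
            "model": offer.findtext("model", default="").strip(),
            "url": offer.findtext("url", default="").strip(),
        })
    return offers
```

Offers with an empty vendor code would still have to fall back to matching by name.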

Viktor Vsk, 2015-01-05
@viktorvsk

E1571G33114G75MA
E1571G33114G75MAKS
Punctuation marks and other "stop words" (1366x768 LED LAPTOP...) need to be hardcoded and stripped out right away. Letter case does not matter either, in principle. Whether to remove spaces is something you may need to experiment with.
Once the "cleaning" stage has reached an experimentally acceptable optimum, compute a similarity metric between the cleaned names and determine the threshold at which you consider two names to be identical. Again, this is found experimentally.
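A sketch of the cleaning and comparison steps in Python. The answer does not say which metric is meant, so Levenshtein edit distance is used here as an assumption; the stop-word list and the threshold of 2 are also placeholders that would have to be tuned experimentally.

```python
import re

# Placeholder stop words; the real list is built up experimentally.
STOP_WORDS = {"NOTEBOOK", "LAPTOP", "LED", "SUPER", "PRICE"}
# Screen resolutions such as 1366x768 are treated as noise as well.
RESOLUTION = re.compile(r"\d+X\d+")


def normalize(title: str) -> str:
    """Upper-case, drop punctuation, stop words and spaces."""
    cleaned = []
    for raw in title.upper().split():
        token = re.sub(r"[\W_]+", "", raw)  # strip hyphens, "!", etc.
        if not token or token in STOP_WORDS or RESOLUTION.fullmatch(token):
            continue
        cleaned.append(token)
    return "".join(cleaned)


def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def same_model(a: str, b: str, max_distance: int = 2) -> bool:
    """Treat two titles as the same product if they are close enough."""
    return levenshtein(normalize(a), normalize(b)) <= max_distance
```

With these settings the two names from the question normalize to E1571G33114G75MA and E1571G33114G75MAKS, which are two edits apart, so `same_model` accepts them. A threshold this loose may also merge genuinely different modifications of one model, which is why the range has to be checked on real data.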
