Data processing
danforth, 2017-02-16 22:08:57

How can I interlink articles based on an analysis of link relevance?

Hello!
I had an idea to set up automatic interlinking on my site. It works as follows:
1. Load an array of entities that can be linked to: posts, pages, categories.
2. Load an array of the texts of all these entities.
We get, for example, the following arrays:

<?php

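// Entities that can be link targets; 'tags' is optional.
$entities = [
    [
        'title' => 'Купить слона просто', // "Buying an elephant is easy"
        'url' => '/post/kak-kupit-slona/',
    ],
    [
        'title' => 'Как выбрать мобильный телефон: 10 советов', // "How to choose a mobile phone: 10 tips"
        'url' => '/post/vibrat-telefon-sovety/',
        'tags' => [
            'телефон', // phone
            'выбрать', // choose
            'советы',  // tips
        ],
    ],
    // ...
];

// Texts in which link candidates are searched for.
$texts = [
    // "... A lot of text. Choosing a mobile phone is not hard, the main thing is ..."
    ['text' => '... Очень много текста. Выбрать мобильный телефон не сложно, главное — ...'],
    // "... Also a lot of text. The ad text contained the headline: "Buy an elephant"... rest of the article"
    ['text' => '... Тоже много текста. При этом текст объявления содержал заголовок: "Купить слона"... продолжение статьи'],
];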

After that, we search each text for an occurrence of a title, and if one is found, we wrap that part of the text in a link. For example, the first text would become:
... Очень много текста. <a href="/post/vibrat-telefon-sovety/">Выбрать мобильный телефон</a> не сложно, главное — ...
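
A minimal sketch of this wrapping step, assuming the phrase to link has already been chosen (here a shortened form of the title, because the full title does not occur in the text). The helper name linkFirstOccurrence() is hypothetical, and the sketch does not check whether the phrase already sits inside another tag or link:

```php
<?php

/**
 * Wrap the first occurrence of $phrase in $text with a link to $url.
 * Hypothetical helper for illustration; not part of any library.
 */
function linkFirstOccurrence(string $text, string $phrase, string $url): string
{
    // preg_quote() escapes regex metacharacters in the phrase.
    // The "u" flag enables UTF-8 mode, "i" makes the match case-insensitive.
    $pattern = '/' . preg_quote($phrase, '/') . '/iu';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($url): string {
            // $m[0] is the matched text, kept in its original letter case.
            return '<a href="' . htmlspecialchars($url, ENT_QUOTES) . '">' . $m[0] . '</a>';
        },
        $text,
        1 // replace only the first match
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $text;
}

$text = '... Очень много текста. Выбрать мобильный телефон не сложно ...';
echo linkFirstOccurrence($text, 'выбрать мобильный телефон', '/post/vibrat-telefon-sovety/'), "\n";
```

The match is literal, so other word forms of the same phrase are not found. That is exactly the part where some relevance scoring is needed.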

In this way the entire site gets interlinked. The task is not trivial, because it has to meet the following criteria:
1) Determine the relevance of a piece of text, i.e. whether it fits the link. For example, the article "Review of quadcopters" is a perfect target for the text "How to choose a quadcopter", so the code should be able to recognize this at least roughly and assume that this is the right link (if the confidence threshold is not reached, the match can be sent to an editor for confirmation).
2) Be able to prioritize: the text "Most popular refrigerators" fits both a link to the "Refrigerators" product category and a link to the product page "Refrigerator LG 22BA12", but the first option should "win", since it is a category and therefore more important. The reverse also holds: if the text contains the exact product name, the link should lead to the product, not to the category.
3) A link must not lead to the page whose text is currently being processed. This point is easy to handle, but it is worth listing here.
4) The ability to evaluate relevance not only by title but also by tags. For example, if the article is called "Springboard - flight" and its tags are "rally, racing, motorsport, book", then the text fragment "Rally book" can be wrapped in a link. This is not entirely accurate, but the option is still needed.
5) ...
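
To make the criteria above more concrete, here is a rough sketch of how a candidate could be scored. Everything in it is an assumption for illustration only: the 'type' key on each entity, the weights, the threshold, the prefix-based word matching (a crude stand-in for real stemming) and the function names. Note that strtolower() does not lowercase Cyrillic in UTF-8, so Russian text would need mb_strtolower() from the mbstring extension instead.

```php
<?php

/** Split a string into lowercase word tokens; separators are anything but letters and digits. */
function tokens(string $s): array
{
    return preg_split('/[^\p{L}\p{N}]+/u', strtolower($s), -1, PREG_SPLIT_NO_EMPTY);
}

/** Crude stemming: two words match if one is a prefix of the other and the shorter has 4+ bytes. */
function wordsMatch(string $a, string $b): bool
{
    $n = min(strlen($a), strlen($b));
    return $n >= 4 && strncmp($a, $b, $n) === 0;
}

/** Share of fragment words that have a matching word among $candidateWords (0.0 to 1.0). */
function overlap(array $fragmentWords, array $candidateWords): float
{
    if ($fragmentWords === []) {
        return 0.0;
    }
    $hits = 0;
    foreach ($fragmentWords as $fw) {
        foreach ($candidateWords as $cw) {
            if (wordsMatch($fw, $cw)) {
                $hits++;
                break;
            }
        }
    }
    return $hits / count($fragmentWords);
}

/** Pick the best link target for $fragment, or null if no entity matches at all. */
function bestTarget(string $fragment, array $entities, string $currentUrl): ?array
{
    // Assumed numbers, meant to be tuned by hand.
    $typeWeight = ['category' => 1.2, 'product' => 1.0, 'post' => 1.0];
    $tagWeight = 0.5;   // a tag match counts for less than a title match
    $exactScore = 2.0;  // an exact title match beats any weighted partial match
    $threshold = 0.3;   // below this, an editor should confirm the link

    $best = null;
    $fw = tokens($fragment);
    foreach ($entities as $e) {
        if ($e['url'] === $currentUrl) {
            continue; // criterion 3: never link a page to itself
        }
        if ($fw !== [] && $fw === tokens($e['title'])) {
            $score = $exactScore; // criterion 2: an exact name wins over a category
        } else {
            $titleScore = overlap($fw, tokens($e['title']));
            $tagScore = $tagWeight * overlap($fw, tokens(implode(' ', $e['tags'] ?? [])));
            $score = max($titleScore, $tagScore) * ($typeWeight[$e['type'] ?? 'post'] ?? 1.0);
        }
        if ($score > 0 && ($best === null || $score > $best['score'])) {
            $best = ['url' => $e['url'], 'score' => $score, 'needsReview' => $score < $threshold];
        }
    }
    return $best;
}

$entities = [
    ['title' => 'Refrigerators', 'url' => '/category/refrigerators/', 'type' => 'category'],
    ['title' => 'Refrigerator LG 22BA12', 'url' => '/product/lg-22ba12/', 'type' => 'product'],
    ['title' => 'Springboard - flight', 'url' => '/post/springboard/', 'type' => 'post',
     'tags' => ['rally', 'racing', 'motorsport', 'book']],
];

print_r(bestTarget('Most popular refrigerators', $entities, '/post/some-page/'));
```

With this sample data, "Most popular refrigerators" matches the category and the product equally well by words, and the category weight breaks the tie. The fragment "Refrigerator LG 22BA12" goes to the product through the exact-match rule, and "Rally book" reaches the article only through its tags.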
P.S. I know about levenshtein and metaphone and have used them a couple of times. Are there any alternatives to these two functions/algorithms?
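
One alternative I am aware of is comparing strings by their character n-grams. PHP also ships similar_text() and soundex(), though those work on bytes rather than UTF-8 characters. Below is a small sketch of trigram similarity using the Jaccard index. The function names ngrams() and ngramSimilarity() are my own, and the sketch assumes the input is already lowercased:

```php
<?php

/** Return the distinct character n-grams of $s; the "u" flag splits by UTF-8 characters. */
function ngrams(string $s, int $n = 3): array
{
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    $grams = [];
    for ($i = 0; $i + $n <= count($chars); $i++) {
        $grams[] = implode('', array_slice($chars, $i, $n));
    }
    return array_values(array_unique($grams));
}

/** Jaccard index of the two n-gram sets: shared n-grams divided by all distinct n-grams. */
function ngramSimilarity(string $a, string $b, int $n = 3): float
{
    $ga = ngrams($a, $n);
    $gb = ngrams($b, $n);
    $union = count(array_unique(array_merge($ga, $gb)));
    if ($union === 0) {
        return 0.0; // both strings are shorter than $n characters
    }
    return count(array_intersect($ga, $gb)) / $union;
}

// 8 shared trigrams out of 9 distinct ones, i.e. 8/9.
echo ngramSimilarity('quadcopter', 'quadcopters'), "\n";
```

Unlike levenshtein, this score does not depend much on string length, but strings shorter than three characters always score 0.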
If anyone has had similar experience or has any thoughts on this, I'd be happy to hear them. I can't even figure out how to search for this on the English-language Internet; people have probably had such ideas and asked about them on forums. Links to articles and books on the subject are welcome. I feel this task is a bit beyond me for now, but I am determined and have been nurturing the idea for a long time.
Thanks.
