What are the ways to get the canonical URL when parsing a specific resource address?

Roman Mirilaczvili, 2019-04-09 22:29:05

HTML

A "spider" is given the URL of some resource, for example, _http_://www.example.com/blog/2019/mega-article
Other URL variants for the same resource are also possible:
_https_://www.example.com/blog/2019/mega-article
_http_://m.example.com/blog/2019/mega-article
If rel=canonical is present in the page's head, everything is clear: just extract that URL and you're done.
But what if rel=canonical is not specified?
Are there other ways to get the canonical URL? And if you still need one, how do you work around the situation?
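For the easy case, where rel=canonical is present, the extraction step could look roughly like the sketch below. It uses only the Python standard library; the names `CanonicalLinkParser` and `find_canonical` are made up for illustration, and the sketch assumes the HTML has already been downloaded and that the first canonical link found is the one to trust.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <link> tags matter, and only until the first match is found
        if tag != "link" or self.canonical_href is not None:
            return
        attr_map = dict(attrs)
        # rel can hold several space-separated tokens and is case-insensitive
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.canonical_href = attr_map["href"]


def find_canonical(html: str, page_url: str) -> Optional[str]:
    """Return the declared canonical URL, or None if the page declares none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    if parser.canonical_href is None:
        return None
    # A relative href is resolved against the URL the page was fetched from
    return urljoin(page_url, parser.canonical_href)
```

Called with the HTML of _http_://m.example.com/blog/2019/mega-article, this returns whatever URL the page itself declares, or `None` when there is no rel=canonical, which is exactly the case the question is about.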
Addition:

task description

There is a task in which an API service must receive a url as a parameter and return in response

ID representing the canonical URL of the given url

The catch is how to obtain the canonical URL if rel=canonical is absent. As far as I understand, all that remains then is to treat the original url as the canonical one. Right?

2 answer(s)

Alexander Denisov, 2019-04-10
@2ord

Could you rephrase the question or explain what you need this for?
Right now the question reads as "how do I get the canonical URL if it is not in the page code?"
If a page does not declare a canonical pointing to another URL, then the page itself is canonical by default.
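Applied to the API task from the question, that rule could be sketched as follows. This is only an illustration under assumptions: `normalize_url` and `canonical_id` are hypothetical names, the ID is assumed to be a SHA-256 hex digest (the task description does not say what kind of ID is expected), and the normalization is deliberately minimal.

```python
import hashlib
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, use "/" for an empty path, drop the fragment."""
    parts = urlsplit(url)
    # The whole netloc is lowercased, which assumes it carries no
    # case-sensitive userinfo (user:password@host)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def canonical_id(requested_url: str, declared_canonical: Optional[str]) -> str:
    """Hash the declared canonical URL, or the requested URL if none is declared."""
    # No rel=canonical on the page: the requested URL is treated as canonical
    chosen = declared_canonical or requested_url
    return hashlib.sha256(normalize_url(chosen).encode("utf-8")).hexdigest()
```

With this fallback, the http, https and m. variants from the question get three different IDs unless the pages declare a shared rel=canonical, because nothing else in the sketch says they are the same resource.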

Gip, 2019-04-10
@Giperoglif

Well, how could you work around it? If it isn't specified, it could be anything at all. And why do you personally need the canonical of a third-party site? That is purely that site's problem, not yours.