Getting rid of recursive transitions when crawling sites
I am writing a crawler that will visit certain sites and collect information about their pages, something like a miniature search-engine robot. On some sites a problem came up: there is, for example, a link that, when clicked, appends a certain GET parameter to the URL. Clicking the same link again changes the parameter. As a result, the robot gets stuck on this link and keeps walking over the same page. On the one hand this is logical: different URL, different page. On the other hand, the content is identical, and while recursing over such pages the robot will accumulate an infinite number of them until the URL length limit is hit.
For example, on mega74.ru: if you open "login and registration" in the upper right corner in a new tab, and then do the same on the page that opens, the URL keeps getting longer without end.
The same problem often occurs with the paginator on Bitrix sites built by careless programmers.
How can I get rid of this or, so to speak, build in some foolproofing and exclude such pages during crawling?
What if you try filtering out repeated variables in the address? I.e., on the site from the example, REQUESTED_FROM will be repeated as many times as it gets appended.
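A minimal sketch of that idea, assuming a Python crawler and using only the standard library; the parameter name REQUESTED_FROM comes from the example site and may differ elsewhere. Only the first occurrence of each query parameter name is kept, so the URL stops growing:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def drop_repeated_params(url: str) -> str:
    """Keep only the first occurrence of each query parameter name."""
    parts = urlsplit(url)
    seen = set()
    kept = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name not in seen:
            seen.add(name)
            kept.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(kept)))

# The second REQUESTED_FROM is discarded, so repeated clicks map to one URL.
print(drop_repeated_params(
    "https://example.com/login?REQUESTED_FROM=/a&REQUESTED_FROM=/b"))
```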
I don't know how to determine from the URL alone that a page has already been parsed. By content you can: for example, store a hash of the page's HTML in the database and, when parsing a new page, check whether that hash is already there.
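A rough sketch of that content-hash check, assuming Python and sqlite3 as the storage; in a real crawler this would be whatever database you already use, and the table and function names here are made up for illustration:

```python
import hashlib
import sqlite3

conn = sqlite3.connect("crawler.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS pages (content_hash TEXT PRIMARY KEY, url TEXT)"
)

def already_seen(url: str, html: str) -> bool:
    """Return True if a page with identical HTML has already been stored."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT url FROM pages WHERE content_hash = ?", (digest,)
    ).fetchone()
    if row:
        return True
    conn.execute("INSERT INTO pages (content_hash, url) VALUES (?, ?)", (digest, url))
    conn.commit()
    return False
```

Pages whose HTML differs only trivially (timestamps, counters) will still hash differently, so in practice you may want to hash only the extracted content rather than the raw markup.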
The approach is this: the parser must know the structure of the particular site it parses. When it takes a link, it knows exactly what the link leads to: a category, an item, and so on, i.e. the link type in general. When developing a parser for a specific site, look at the links with your own eyes (you will have to look at each type anyway). If there is a parameter that is irrelevant to the content, you will spot it immediately. Remove it from the URL with a regexp, or replace it with a fixed value, as in the sketch below.
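A hedged illustration of that per-site normalization, in Python and using the regexp route this answer suggests. The choice of REQUESTED_FROM as the parameter to strip is an assumption taken from the example site; the list has to be chosen per site by inspecting its links:

```python
import re

def strip_param(url: str, name: str) -> str:
    """Drop one query parameter by name before the URL is queued for crawling."""
    url = re.sub(rf"{re.escape(name)}=[^&]*&?", "", url)
    return url.rstrip("?&")  # tidy up a dangling "?" or "&" left by the removal

# Two links that differ only in REQUESTED_FROM normalize to the same URL,
# so the crawler visits the page once.
print(strip_param("https://example.com/login?REQUESTED_FROM=/a&page=2",
                  "REQUESTED_FROM"))
```

Running normalization on every discovered link before checking it against the "already visited" set is what actually breaks the recursion.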