How to avoid parsing "not found" pages that a website serves in place of a real 404?


Python

zyusifov11, 2020-11-28 12:53:21

How can I skip pages that should give a 404 error but that the site replaces with its own "not found" page?

if html.status_code == 200:

doesn't help, because the page technically exists (the server returns 200), but its content just says that the page was not found.


1 answer(s)


Developer, 2020-11-28
@samodum

If the status code is 200 rather than 404, the only option is to analyze the content of the page, or to compare its size with the size of the known "not found" page. There is no other way.
I would give the backend developers hell for this.
In general, if a 200 arrives, the page has to be downloaded anyway. 404 and all the other status codes exist so that the client can decide whether to render the content. With a 200, the server is vouching that the content is valid: go ahead and download it. And here it plays a trick on you: the content says that there is no content.
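A minimal sketch of the content check, assuming the site's "not found" page contains a recognizable phrase. The marker strings and the function name `looks_like_soft_404` are hypothetical, so swap in whatever wording the site actually prints:

```python
# Hypothetical marker phrases; replace them with the wording the target site
# really uses on its "not found" page.
NOT_FOUND_MARKERS = ("page not found", "no such page")


def looks_like_soft_404(page_text, markers=NOT_FOUND_MARKERS):
    """Return True if the page text contains any of the "not found" markers."""
    lowered = page_text.lower()  # compare case-insensitively
    return any(marker.lower() in lowered for marker in markers)
```

With the `html` response object from the question, the check would then look something like `if html.status_code == 200 and not looks_like_soft_404(html.text):`.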
Alternatively, you can record the size of the bogus page in bytes and then, for every 200 response, look at the page size. If it is approximately the same as the bogus page, do not process it and move on. The size may vary slightly because of dynamically loaded data (script URLs, banner links, etc. may change), but only slightly. And so we end up writing yet another workaround out of nowhere, all thanks to sloppy backend developers.
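The size comparison could be sketched roughly like this. The function name and the 5% tolerance are assumptions for illustration, not something the site dictates; the reference size is assumed to come from fetching one URL that is known not to exist and taking the length of its body:

```python
def is_soft_404_by_size(body, reference_size, tolerance=0.05):
    """Return True if len(body) is within `tolerance` (a fraction) of reference_size.

    body           -- raw bytes of the page just downloaded
    reference_size -- size in bytes of a known "not found" page (assumed to be
                      measured once beforehand)
    tolerance      -- allowed relative deviation; 0.05 is an arbitrary guess
    """
    if reference_size <= 0:
        return False  # no usable reference, so never treat the page as bogus
    return abs(len(body) - reference_size) <= reference_size * tolerance
```

With the `html` object from the question, that would be called as `is_soft_404_by_size(html.content, reference_size)`. The tolerance probably needs tuning per site: too tight and the dynamic bits slip through, too loose and real pages of a similar size get skipped.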