How to parse all pages of a site?

Python

SouLWorker, 2020-06-18 22:02:39

Let's say I have a site, site.site, and it has pages site.site/siter, site.site/1, and site.site/simn.
How can I iterate over all of them when I only have the main site.site link? My task involves a lot of pages, so listing them manually is not an option, and the page addresses differ from each other as unpredictably as in the example above.

2 answer(s)

Alexander, 2020-06-18
@SouLWorker

Suppose you knew nothing about parsing or programming. Given only site.site, how would you find out that it has a site.site/siter page?
The simplest approach is a recursive URL search across pages with a nesting limit. Open site.site, find every link there with href="([^"]+)", then open each link found and search it the same way. It is not efficient, but it works.
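
A minimal sketch of that idea, using only the standard library. The names fetch_html and crawl and the max_depth parameter are my own, not from any library. It walks the links level by level instead of with literal recursion, so each page is downloaded once. The regex is the one above, so it only sees double-quoted href attributes.

```python
import re
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen

HREF_RE = re.compile(r'href="([^"]+)"')


def fetch_html(url):
    """Download a page as text; return an empty string if the request fails."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def crawl(start_url, max_depth=2, fetch=fetch_html):
    """Collect same-host URLs found within max_depth link hops of start_url."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    frontier = [start_url]
    for _ in range(max_depth):  # the nesting limit
        next_frontier = []
        for url in frontier:
            for href in HREF_RE.findall(fetch(url)):
                # Resolve relative links and drop "#fragment" parts.
                link = urldefrag(urljoin(url, href)).url
                # Stay on the same host and skip pages already queued.
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return seen
```

Calling crawl("http://site.site/", max_depth=3) would return a set of URLs. Pages that are not linked from anywhere within the depth limit will not be found this way.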

ediboba, 2020-06-19
@ediboba

As a rule, if a site looks after its SEO, it will have a publicly accessible site.site/sitemap.xml or sitemap.html.
The name can differ; it can also be specified in the robots.txt file.
Find this file, parse the links out of it, and you have all the pages the site lists.
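
Roughly, that could look like the sketch below, again standard library only. The function names are made up for illustration, and it assumes an XML sitemap whose pages are listed in <loc> elements. If the file turns out to be a sitemap index, the <loc> values are further sitemaps and need one more pass.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urljoin
from urllib.request import urlopen


def sitemaps_from_robots(robots_txt):
    """Return the URLs given on 'Sitemap:' lines of a robots.txt body."""
    return re.findall(r"(?im)^sitemap:[ \t]*(\S+)", robots_txt)


def locs_from_sitemap(xml_text):
    """Return the text of every <loc> element, whatever its XML namespace."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


def download(url):
    """Download a URL and return its body as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def site_urls(base_url, download=download):
    """List page URLs from the sitemaps named in robots.txt.

    Falls back to /sitemap.xml when robots.txt is missing or names none.
    """
    try:
        robots = download(urljoin(base_url, "/robots.txt"))
    except (OSError, KeyError):
        robots = ""
    sitemaps = sitemaps_from_robots(robots) or [urljoin(base_url, "/sitemap.xml")]
    urls = []
    for sitemap_url in sitemaps:
        urls.extend(locs_from_sitemap(download(sitemap_url)))
    return urls
```

site_urls("http://site.site/") would then return the listed addresses without crawling any page. A sitemap.html would need the href approach instead, since it is an ordinary page of links.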