How to crawl a website in python?
I need to crawl a site whose pages are organized as list of categories -> category -> landing page. The landing page at the end is parsed with Grab. I need to parse every page nested under the categories. How can this be done, preferably in Python?
Parse the category URLs from the category list (with XPath, for example), follow each one, and collect the links to the landing pages. Then open each landing page and extract the data (RegExp, XPath, or something else).
I don't know how this is done in Grab, so I have only described the algorithm. Write the code around the algorithm and voila!
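The algorithm above can be sketched with the standard library alone. This is only an illustration under assumptions: the CSS classes `category` and `landing`, the `parse_landing` callback, and the `fetch` parameter are hypothetical names, and `html.parser` stands in for the XPath mentioned above so that nothing needs to be installed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes and attrs.get("href"):
            self.links.append(attrs["href"])


def extract_links(html, base_url, css_class):
    """Return absolute URLs of matching links found in html."""
    parser = LinkCollector(css_class)
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def fetch(url):
    """Download a page and return its text (assumes UTF-8 content)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, parse_landing, fetch=fetch):
    """Walk category list -> category -> landing page and parse each landing page."""
    results = []
    seen = set()  # a landing page may be linked from several categories
    index_html = fetch(start_url)
    # "category" and "landing" are assumed class names; adapt to the real markup.
    for category_url in extract_links(index_html, start_url, "category"):
        category_html = fetch(category_url)
        for landing_url in extract_links(category_html, category_url, "landing"):
            if landing_url in seen:
                continue
            seen.add(landing_url)
            results.append(parse_landing(landing_url, fetch(landing_url)))
    return results
```

Passing `fetch` as a parameter keeps the traversal separate from the download step, so the Grab-based parsing of the landing page can be plugged in through `parse_landing` without touching the loop.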
You can use the Grab:Spider module as follows:
1. Create an initial task that parses the category list page and finds the links to category pages.
2. For each link found, create a category task that looks for links to landing pages.
3. For each link found in the previous step, create a task that contains the logic for parsing the landing page.
An example can be found in the documentation. There is also an article on Habr.
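To show how the three steps chain together, here is a toy task runner written with the standard library. It is not Grab's API: `MiniSpider`, `add_task`, and the `task_<name>` handler convention are hypothetical names chosen only to imitate the task-per-page-type idea. It also assumes every link on a page is relevant, which a real site would need to filter.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class _Links(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.hrefs.append(href)


def links(html, base_url):
    """Return absolute URLs for all links in html."""
    parser = _Links()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


class MiniSpider:
    """Toy runner: a task is (name, url) and is handled by the method task_<name>."""

    def __init__(self, start_url, fetch):
        self.queue = deque([("initial", start_url)])
        self.fetch = fetch  # callable url -> html, supplied by the caller
        self.seen = set()
        self.items = []

    def add_task(self, name, url):
        # Skip tasks that were already scheduled for the same page.
        if (name, url) not in self.seen:
            self.seen.add((name, url))
            self.queue.append((name, url))

    def run(self):
        while self.queue:
            name, url = self.queue.popleft()
            handler = getattr(self, "task_" + name)
            handler(url, self.fetch(url))
        return self.items

    def task_initial(self, url, html):
        # Step 1: the category list page yields category tasks.
        for link in links(html, url):
            self.add_task("category", link)

    def task_category(self, url, html):
        # Step 2: each category page yields landing-page tasks.
        for link in links(html, url):
            self.add_task("landing", link)

    def task_landing(self, url, html):
        # Step 3: parse the landing page; here only the URL and size are stored.
        self.items.append((url, len(html)))
```

Calling `MiniSpider(start_url, fetch).run()` processes the queue until no tasks remain. The real landing-page parsing would replace the body of `task_landing`.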