How to crawl a website in Python?

Sergey Pavlov, 2014-07-07 21:15:10

Python

I need to crawl a site whose pages are organized as: list of categories -> category -> landing page. The landing page at the end is parsed with Grab. I need to parse all the landing pages nested in the categories. How can this be solved, preferably in Python?

3 answer(s)

JRazor, 2014-07-07
@JRazor

Parse the category URLs (with XPath, for example), follow them, and extract the links to the landing pages. Then go to each landing page and pull out the data (regular expressions, XPath, or something else).
I don't know how this is done in Grab, so I have only described the algorithm. Write the code around the algorithm and voilà!
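
The algorithm above might look roughly like the sketch below. It is not Grab code, and it swaps XPath for the standard library's `html.parser` so it runs without extra packages. The CSS class names `category` and `landing`, and the helpers `extract_links`, `fetch` and `crawl`, are hypothetical and would have to be adapted to the real site's markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes and attrs.get("href"):
            self.links.append(attrs["href"])


def extract_links(html, base_url, css_class):
    """Return absolute URLs of matching links found in the page."""
    parser = LinkCollector(css_class)
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def fetch(url):
    """Download a page and return its text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, parse_landing, fetch=fetch):
    """Walk category list -> category -> landing page, parsing each landing page once."""
    results = []
    seen = set()
    # Assumption: category links are marked with class="category",
    # landing page links with class="landing".
    for category_url in extract_links(fetch(start_url), start_url, "category"):
        for landing_url in extract_links(fetch(category_url), category_url, "landing"):
            if landing_url in seen:
                continue
            seen.add(landing_url)
            results.append(parse_landing(landing_url, fetch(landing_url)))
    return results
```

`parse_landing` is whatever function extracts the data from a landing page, so the existing Grab-based parsing could be called from there. Passing a different `fetch` makes it possible to try the traversal on canned pages before hitting the real site.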

eremeevdev, 2014-07-07
@eremeevdev

Scrapy is also a decent web scraping framework.

PoopZemli, 2014-07-10
@PoopZemli

You can use the Grab:Spider module as follows:
1. Create an initial task that parses the category list page and finds the links to the category pages.
2. For each link found, create a category task that looks for links to landing pages.
3. For each link found in the previous step, create a task that contains the logic for parsing the landing page.
An example can be found in the documentation. There is also an article on Habr.
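
The three steps above can be illustrated with a toy task queue in plain Python. This only mimics the pattern of named tasks dispatched to handler methods; it is not the Grab:Spider API. `MiniSpider`, `Task`, `fetch` and `find_links` are hypothetical names invented for this sketch.

```python
from collections import deque, namedtuple

# Hypothetical stand-in for a spider task: a handler name plus a URL.
Task = namedtuple("Task", ["name", "url"])


class MiniSpider:
    """Toy task-queue crawler: a task named X is handled by method task_X."""

    def __init__(self, fetch, find_links):
        self.fetch = fetch            # callable: url -> page
        self.find_links = find_links  # callable: (page, kind) -> list of URLs
        self.items = []

    def run(self, start_url):
        queue = deque([Task("initial", start_url)])
        seen = {start_url}
        while queue:
            task = queue.popleft()
            handler = getattr(self, "task_" + task.name)
            # Each handler returns the follow-up tasks it wants scheduled.
            for new_task in handler(self.fetch(task.url), task):
                if new_task.url not in seen:
                    seen.add(new_task.url)
                    queue.append(new_task)
        return self.items

    def task_initial(self, page, task):
        # Step 1: the category list page yields one task per category.
        for url in self.find_links(page, "category"):
            yield Task("category", url)

    def task_category(self, page, task):
        # Step 2: each category page yields one task per landing page.
        for url in self.find_links(page, "landing"):
            yield Task("landing", url)

    def task_landing(self, page, task):
        # Step 3: landing page logic; here the page is simply stored.
        self.items.append((task.url, page))
        return []
```

In a real spider, step 3 would run the existing landing page parser instead of storing the raw page, and the framework would handle downloading, retries and concurrency.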