Python
naruto_hokagi, 2018-04-27 09:22:25

How to go to the next page in scrapy?

Hello, I am writing a Scrapy news parser. It needs to start from the start URL, open each news article, extract the data, then move on to the next page and repeat. Right now only the first page is parsed, and the spider does not go any further:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class GuardianSpider(CrawlSpider):
    name = 'guardian'
    allowed_domains = ['theguardian.com']
    start_urls = ['https://www.theguardian.com/world/europe-news']

    rules = (
        # Article pages: send to the item callback.
        Rule(LinkExtractor(restrict_xpaths=("//div[@class='u-cf index-page']",),
                           allow=(r'https://www.theguardian.com/\w+/\d+/\w+/\d+/\w+',)),
             callback='parser_items'),
        # Pagination: follow to the next index page.
        Rule(LinkExtractor(restrict_xpaths=("//div[@class='u-cf index-page']",),
                           allow=(r'https://www.theguardian.com/\w+/\w+?page=\d+',)),
             follow=True),
    )
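
Worth noting in the second rule: in the pattern `\w+?page=\d+`, the `?` is parsed as a lazy regex quantifier rather than a literal question mark, and `\w` does not match the hyphen in section slugs such as `europe-news`, so pagination URLs like `.../europe-news?page=2` are likely never matched. A corrected pattern might look like this (an untested sketch, assuming pagination URLs of that shape):

    # Escape the literal '?' and allow hyphens in the section slug
    # (assumes pagination URLs like .../world/europe-news?page=2).
    Rule(LinkExtractor(restrict_xpaths=("//div[@class='u-cf index-page']",),
                       allow=(r'https://www\.theguardian\.com/\w+/[\w-]+\?page=\d+',)),
         follow=True),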

1 answer

Evgen, 2018-04-27
@Verz1Lka

In general, I would use a plain `Spider` (`BaseSpider`) rather than `CrawlSpider` and manually set the selectors for the next page and the news links.
Something like this:

def parse(self, response):
    # Follow each news link to the item callback.
    news_css = 'div.fc-item__container > a::attr(href)'
    for news_link in response.css(news_css).extract():
        # response.follow() builds the Request itself, so there is
        # no need to wrap it in scrapy.Request().
        yield response.follow(news_link, callback=self.parser_items)

    # Follow pagination links back into parse() itself.
    next_page_css = 'div.pagination__list > a::attr(href)'
    for nextpage_link in response.css(next_page_css).extract():
        yield response.follow(nextpage_link, callback=self.parse)

P.S. The code has not been tested, but I think the idea is clear. Usually spiders like this are easier to work with than broad-crawl `CrawlSpider` rules.
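
For completeness, a self-contained spider built on this approach might look like the sketch below. Like the snippet above, it is untested: the CSS selectors are the ones from the answer, `parser_items` is a stub with made-up fields, and the site's markup may have changed.

    import scrapy


    class GuardianSpider(scrapy.Spider):
        name = 'guardian'
        allowed_domains = ['theguardian.com']
        start_urls = ['https://www.theguardian.com/world/europe-news']

        def parse(self, response):
            # Index page: queue every article link for parser_items().
            news_css = 'div.fc-item__container > a::attr(href)'
            for news_link in response.css(news_css).extract():
                yield response.follow(news_link, callback=self.parser_items)

            # Queue the next index page back into parse().
            next_page_css = 'div.pagination__list > a::attr(href)'
            for nextpage_link in response.css(next_page_css).extract():
                yield response.follow(nextpage_link, callback=self.parse)

        def parser_items(self, response):
            # Stub extraction; the real fields depend on the article markup.
            yield {
                'url': response.url,
                'title': response.css('h1::text').get(),
            }

Scrapy deduplicates requests by default, so re-yielding pagination links that were already visited is harmless.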
