Python
naruto_hokagi, 2018-04-27 09:22:25

How to go to the next page in scrapy?

Hello, I am writing a Scrapy news parser. It needs to start from the start URL, open each news article, extract the data, then move on to the next page and repeat. Right now only the first page is parsed, and the spider does not go any further:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class GuardianSpider(CrawlSpider):
    name = 'guardian'
    allowed_domains = ['theguardian.com']
    start_urls = ['https://www.theguardian.com/world/europe-news']

    rules = (
        # Article pages: send to the item callback.
        Rule(LinkExtractor(restrict_xpaths=("//div[@class='u-cf index-page']",),
                           allow=(r'https://www.theguardian.com/\w+/\d+/\w+/\d+/\w+',)),
             callback='parser_items'),
        # Pagination: follow to the next index page.
        Rule(LinkExtractor(restrict_xpaths=("//div[@class='u-cf index-page']",),
                           allow=(r'https://www.theguardian.com/\w+/\w+?page=\d+',)),
             follow=True),
    )
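
Worth noting in the second rule: in the pattern `\w+?page=\d+`, the `?` is parsed as a lazy regex quantifier rather than a literal question mark, and `\w` does not match the hyphen in section slugs such as `europe-news`, so pagination URLs like `.../europe-news?page=2` are likely never matched. A corrected pattern might look like this (an untested sketch, assuming pagination URLs of that shape):

    # Escape the literal '?' and allow hyphens in the section slug
    # (assumes pagination URLs like .../world/europe-news?page=2).
    Rule(LinkExtractor(restrict_xpaths=("//div[@class='u-cf index-page']",),
                       allow=(r'https://www\.theguardian\.com/\w+/[\w-]+\?page=\d+',)),
         follow=True),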

1 answer

Evgen, 2018-04-27
@Verz1Lka

In general, I would use a plain `Spider` (`BaseSpider`) rather than `CrawlSpider` and manually set the selectors for the next page and the news links.
Something like this:

def parse(self, response):
    # Follow each news link to the item callback.
    news_css = 'div.fc-item__container > a::attr(href)'
    for news_link in response.css(news_css).extract():
        # response.follow() builds the Request itself, so there is
        # no need to wrap it in scrapy.Request().
        yield response.follow(news_link, callback=self.parser_items)

    # Follow pagination links back into parse() itself.
    next_page_css = 'div.pagination__list > a::attr(href)'
    for nextpage_link in response.css(next_page_css).extract():
        yield response.follow(nextpage_link, callback=self.parse)

P.S. The code has not been tested, but I think the idea is clear. Usually spiders like this are easier to work with than broad-crawl `CrawlSpider` rules.
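
For completeness, a self-contained spider built on this approach might look like the sketch below. Like the snippet above, it is untested: the CSS selectors are the ones from the answer, `parser_items` is a stub with made-up fields, and the site's markup may have changed.

    import scrapy


    class GuardianSpider(scrapy.Spider):
        name = 'guardian'
        allowed_domains = ['theguardian.com']
        start_urls = ['https://www.theguardian.com/world/europe-news']

        def parse(self, response):
            # Index page: queue every article link for parser_items().
            news_css = 'div.fc-item__container > a::attr(href)'
            for news_link in response.css(news_css).extract():
                yield response.follow(news_link, callback=self.parser_items)

            # Queue the next index page back into parse().
            next_page_css = 'div.pagination__list > a::attr(href)'
            for nextpage_link in response.css(next_page_css).extract():
                yield response.follow(nextpage_link, callback=self.parse)

        def parser_items(self, response):
            # Stub extraction; the real fields depend on the article markup.
            yield {
                'url': response.url,
                'title': response.css('h1::text').get(),
            }

Scrapy deduplicates requests by default, so re-yielding pagination links that were already visited is harmless.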
