Scrapy scraping the wrong page?

Python

NoXXik, 2020-07-05 21:12:34

To consolidate my knowledge of Scrapy, I decided to write a parser for CS:GO items on the Steam market. I ran into a problem: the first page is parsed, but when the second link is requested, the data from the first page arrives again. After looking at the logs, I realized that a link of the form https://steamcommunity.com/market/search?appid=730... becomes https://steamcommunity.com/market/search?appid=730, which is the first page. So the spider parses the same page twice, then reports that the request is a duplicate and stops.

DEBUG: Crawled (200) %20%20%20%205B%5D=any&category_730_StickerCapsule%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any%20%20%20%20&appid=730#p2_popular_des> (referer: None)
2020-07-05 20:56:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/market/search?q=&catego...%20%20%20%205B%5D=any&category_730_StickerCapsule%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any%20%20%20%20&appid=730>

class ItemParser(scrapy.Spider):
    name = 'steam_items'
    start_urls = ["""https://steamcommunity.com/market/search?q=&category_730_ItemSet%5B%5D=any&category_730_ProPlayer%
    5B%5D=any&category_730_StickerCapsule%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any
    &appid=730#p2_popular_desc"""]

    def parse(self, response):
        count_pages = 6
        page_num = 2
        items = PriceParserItem()

        items_rows = response.xpath('//*[@id="searchResultsRows"]').css("a.market_listing_row_link")

        for row in items_rows:
            name = row.css(".market_listing_item_name::text").extract()
            count = row.css(".market_listing_num_listings_qty::text").extract()
            nprice = row.css(".normal_price::text").extract()[2]
            sprice = row.css(".sale_price::text").extract()
            link = row.css("a::attr(href)").get()

            items['item_name'] = name
            items['item_count'] = count
            items['item_nprice'] = nprice
            items['item_sprice'] = sprice
            items['item_link'] = link
            yield items

            # https://steamcommunity.com/market/search?appid=730#p2_popular_desc
        if page_num < count_pages:
            next_page = """https://steamcommunity.com/market/search?q=&category_730_ItemSet%5B%5D=any&category_730_ProPlayer%
     5B%5D=any&category_730_StickerCapsule%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any
     &appid=730#p""" + str(page_num) + '_popular_desc'
            page_num += 1
            yield scrapy.Request(next_page, callback=self.parse)

1 answer(s)

Dimonchik, 2020-07-05
@NoXXik

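Everything after # (the URL fragment) is not sent to the server, so every "page" URL in your spider is the same request.
Sites like this are not parsed that way.
Open the browser's developer console and look at which requests the page actually makes, where they go, and where the data comes from.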
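
To make the point about # concrete, here is a minimal sketch using only the standard library. The names BASE_URL, url_sent_to_server and page_url are hypothetical helpers written for this illustration; they are not part of Scrapy. The sketch assumes the page URLs follow the #pN_popular_desc pattern from the question. It also builds the URL from adjacent string literals, because the %20%20%20%20 runs in the logs suggest that the triple-quoted string in the spider carried its line breaks and indentation into the query string.

```python
from urllib.parse import urldefrag

# Adjacent string literals are joined at compile time, so the URL contains
# none of the newlines or indentation a triple-quoted literal would embed.
BASE_URL = (
    "https://steamcommunity.com/market/search?q="
    "&category_730_ItemSet%5B%5D=any"
    "&category_730_ProPlayer%5B%5D=any"
    "&category_730_StickerCapsule%5B%5D=any"
    "&category_730_TournamentTeam%5B%5D=any"
    "&category_730_Weapon%5B%5D=any"
    "&appid=730"
)


def page_url(page_num: int) -> str:
    """Build the browser-style page URL from the question (hypothetical helper)."""
    return f"{BASE_URL}#p{page_num}_popular_desc"


def url_sent_to_server(url: str) -> str:
    """Return the URL without its fragment, i.e. the part a server can see."""
    return urldefrag(url).url


if __name__ == "__main__":
    # Every page number collapses to the same server-side URL.
    for n in (1, 2, 3):
        print(n, url_sent_to_server(page_url(n)) == BASE_URL)
```

Since all page numbers reduce to one URL, the page number has to travel in something the server actually receives, and the requests visible in the browser console show what that is for this site.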