Python
NoXXik, 2020-07-05 21:12:34

Scrapy scraping the wrong page?

To consolidate my knowledge of Scrapy, I decided to write a scraper for CS:GO items on the Steam Community Market. The first page is parsed fine, but when the spider requests the second page, it gets the data from the first page again. Looking at the logs, I realized that a link of the form https://steamcommunity.com/market/search?appid=730... is turned into https://steamcommunity.com/market/search?appid=730, which is the first page. So the spider ends up fetching the same page twice, Scrapy reports the second request as a duplicate, and the spider shuts down.

DEBUG: Crawled (200) %20%20%20%205B%5D=any&category_730_StickerCapsule%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any%20%20%20%20&appid=730#p2_popular_des
> (referer: None)
2020-07-05 20:56:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://steamcommunity.com/market/search?q=&catego...
%20%20%20%205B%5D=any&category_730_StickerCapsule%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any%20%20%20%20&appid=730>

class ItemParser(scrapy.Spider):
    name = 'steam_items'
    start_urls = ["""https://steamcommunity.com/market/search?q=&category_730_ItemSet%5B%5D=any&category_730_ProPlayer%
    5B%5D=any&category_730_StickerCapsule%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any
    &appid=730#p2_popular_desc"""]

    def parse(self, response):
        count_pages = 6
        page_num = 2
        items = PriceParserItem()

        items_rows = response.xpath('//*[@id="searchResultsRows"]').css("a.market_listing_row_link")

        for row in items_rows:
            name = row.css(".market_listing_item_name::text").extract()
            count = row.css(".market_listing_num_listings_qty::text").extract()
            nprice = row.css(".normal_price::text").extract()[2]
            sprice = row.css(".sale_price::text").extract()
            link = row.css("a::attr(href)").get()

            items['item_name'] = name
            items['item_count'] = count
            items['item_nprice'] = nprice
            items['item_sprice'] = sprice
            items['item_link'] = link
            yield items

            # https://steamcommunity.com/market/search?appid=730#p2_popular_desc
        if page_num < count_pages:
            next_page = """https://steamcommunity.com/market/search?q=&category_730_ItemSet%5B%5D=any&category_730_ProPlayer%
     5B%5D=any&category_730_StickerCapsule%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any
     &appid=730#p""" + str(page_num) + '_popular_desc'
            page_num += 1
            yield scrapy.Request(next_page, callback=self.parse)

1 answer(s)
Dimonchik, 2020-07-05
@NoXXik

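Everything after the # is not sent to the server, so URLs that differ only in that part are the same request as far as the site is concerned.
Sites like this have to be scraped differently.
Open the browser console (the Network tab) and look at which requests the page actually makes, and to which URLs.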
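
To illustrate the point about the #, here is a minimal sketch using only the standard library. It strips the fragment the way an HTTP client does before sending a request, and shows that the "page" URLs from the question all collapse into one. The second helper puts the page position into the query string instead; the parameter names start and count are hypothetical placeholders, and the real names have to be taken from the requests you see in the Network tab.

```python
from urllib.parse import parse_qsl, urldefrag, urlencode, urlsplit, urlunsplit

# Shortened form of the search URL from the question.
BASE = "https://steamcommunity.com/market/search?appid=730"


def url_sent_to_server(url):
    """Return the URL without its #fragment, which is all the server receives."""
    return urldefrag(url).url


def with_query_params(url, **params):
    """Return url with the fragment dropped and the given query parameters set."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update({key: str(value) for key, value in params.items()})
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))


# Pagination through the fragment: three different strings on the client side...
fragment_pages = [f"{BASE}#p{n}_popular_desc" for n in range(1, 4)]
# ...but only one distinct URL once the fragment is removed.
distinct_for_server = {url_sent_to_server(url) for url in fragment_pages}

# Pagination through the query string: the server can tell the pages apart.
# ASSUMPTION: "start" and "count" are hypothetical names, not a confirmed API.
query_pages = [with_query_params(BASE, start=i * 10, count=10) for i in range(3)]

if __name__ == "__main__":
    print(len(fragment_pages), "fragment URLs ->", len(distinct_for_server), "server URL")
    for url in query_pages:
        print(url)
```

Since every fragment URL reaches the server as the same address, a second request to it looks like a repeat of the first, which matches the duplicate that the question describes.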
