Scrapy: how to pass a link to a function so that it, in turn, sends it to the Selector?

Python

JRazor, 2014-02-12 09:58:40

Hello fellow toasters.
While building a spider, I ran into an interesting problem: I need to pass a link to a method so that the method, in turn, feeds the page behind that link to a Selector. I have already tried every Response and Request method Scrapy offers, but no data comes back.
Here is a piece of the code to illustrate:

    start_urls = [
        "http://www.site.ru/"
    ]

    # Parse start_urls and get the list of links to the catalogs
    def parse(self, response):
        sel = Selector(response)
        self.links = sel.xpath('//*[@id="col-01"]/div/div/ul/li/a/@href').extract()

    # Parse each catalog and get the list of links to the catalog's items
    def parse_catalog(self, response):
        sel = Selector(response)
        elements = sel.xpath('//*[@id="col-01"]/div[1]/ul[1]/'
                             'li[4]/div[2]/strong/text()').extract()[0]
        links_auto = sel.xpath('//div[@class="car-detail-list"]/a/@href').extract()

        # Send each link off for page parsing
        for link in links_auto:
            self.parse_page(link)

    def parse_page(self, link):
        response = ???(link)  # Process the link, but with what?
        self.sel = Selector(response)

1 answer(s)

WalterWhite, 2014-02-13
@JRazor

You don't need to call any methods on Request yourself. Scrapy will call everything that is needed, when it is needed.
A spider can get data (an Item) or a further path to follow (a Request) out of a page, or both at once. Your callback methods must return a sequence of Request and/or Item objects, for example by using yield.

from scrapy.http import Request
from scrapy.selector import Selector

def parse(self, response):
    sel = Selector(response)
    # from the start page, pull out the list of categories (socks, underwear, shirts...)
    for catalog_link in sel.xpath('// . . . . /@href').extract():
        # say that the page behind the link must be requested,
        # and that the result (a Response) must be handled in the callback method
        yield Request(url=catalog_link, callback=self.parse_catalog)

def parse_catalog(self, response):
    # the server's responses will land here
    sel = Selector(response)

    # if the category itself is of interest, describe it
    category = MyCategoryItem()
    category['name'] = sel.xpath( . . .                     # what it is called
    category['count'] =  . . .                              # how many goods it has
    . . .
    # and yield it from the method
    yield category

    # get the list of links to the individual shirts
    for page_link in sel.xpath('//. . . ./@href').extract():
        # yield a Request (not a Response) from the method
        yield Request(url=page_link, callback=self.parse_page)

def parse_page(self, response):
    . . .
    item = MyGoodsItem()
    . . .
    yield item
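
One caveat with the snippet above: the values pulled out of @href are often relative (for example "/cars/1"), and Scrapy's Request rejects a URL without a scheme. A minimal sketch of making them absolute before yielding a Request, using only the standard library. The helper name absolute_links and the sample hrefs are made up for illustration; inside a callback you would pass response.url as the base (newer Scrapy versions also offer response.urljoin(href) for the same purpose).

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on.

    absolute_links is a hypothetical helper, not part of Scrapy.
    """
    return [urljoin(page_url, href) for href in hrefs]


if __name__ == "__main__":
    # Sample hrefs are invented; in a spider they would come from
    # sel.xpath('...//@href').extract() and page_url from response.url.
    hrefs = ["/cars/1", "item.html", "http://other.example/x"]
    for url in absolute_links("http://www.site.ru/catalog/", hrefs):
        print(url)
```

Already-absolute links pass through unchanged, so it should be safe to run every extracted href through the same helper.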

But it would be easier to just read the documentation.
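
For intuition about why a callback only has to yield Request objects and never fetches anything itself, here is a toy model of the loop that consumes them. This is a simplified sketch, not Scrapy's actual engine: ToyRequest, ToyResponse, FAKE_SITE, ToySpider and crawl are all invented names, the "download" is a dictionary lookup, and real Scrapy works asynchronously with deduplication and scheduling that are left out here.

```python
from collections import deque


class ToyRequest:
    """Toy stand-in for a request: a URL plus the callback for its response."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class ToyResponse:
    """Toy stand-in for a downloaded page: its URL and the links found on it."""

    def __init__(self, url, links):
        self.url = url
        self.links = links


# Invented site map used instead of real HTTP downloads.
FAKE_SITE = {
    "http://example.com/": ["http://example.com/shirts"],
    "http://example.com/shirts": [
        "http://example.com/shirts/1",
        "http://example.com/shirts/2",
    ],
    "http://example.com/shirts/1": [],
    "http://example.com/shirts/2": [],
}


class ToySpider:
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Start page: ask for every catalog page.
        for link in response.links:
            yield ToyRequest(link, callback=self.parse_catalog)

    def parse_catalog(self, response):
        # A callback may yield data and further requests side by side.
        yield {"category": response.url}
        for link in response.links:
            yield ToyRequest(link, callback=self.parse_page)

    def parse_page(self, response):
        yield {"item": response.url}


def crawl(spider):
    """Pull requests off a queue, 'download' them, and run their callbacks."""
    queue = deque(ToyRequest(url, spider.parse) for url in spider.start_urls)
    items = []
    while queue:
        request = queue.popleft()
        response = ToyResponse(request.url, FAKE_SITE[request.url])
        for result in request.callback(response):
            if isinstance(result, ToyRequest):
                queue.append(result)   # more pages to visit later
            else:
                items.append(result)   # scraped data
    return items


if __name__ == "__main__":
    for item in crawl(ToySpider()):
        print(item)
```

The point of the model is that parse_page is never called by the spider's own code: the loop calls it once the page for a yielded request is available, which is the role Scrapy plays in the real spider.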