Python
Bjornie, 2017-10-28 16:23:21

How to count the number of times a page is visited in Scrapy?

Below is an example of working code in which I want to implement a captcha limit. At the moment, even after a successful captcha, Amazon sometimes returns the captcha page again, sometimes 15-20 times in a row. I have not been able to understand the reason for this, because for the most part the captcha solving (deathbycaptcha) works fine, so I decided to set a limit per ASIN.
But how can I determine that a given URL (or ASIN) has already been through the captcha solver? I tried several options to the best of my Python knowledge and ingenuity, but did not get the desired result. What can be done in my case? How can the two functions parse_item and get_captcha share state?
Example code is given below:

class AmazonproductspiderSpider(scrapy.Spider):
    # Read the file with ASINs and call parse_item via callback.
    def start_requests(self):
        with open('asin.txt') as file:
            for i in file:
                if len(i) > 1:
                    asin_from_file = i.strip()
                    yield scrapy.Request(
                        url='%s/gp/product/%s/' % (self.AMAZON_DOMAIN, asin_from_file),
                        callback=self.parse_item,
                        meta={
                            'asin_from_file': asin_from_file,
                            'country': self.country,
                        }
                    )
    def parse_item(self, response):
        captcha_form = response.xpath('//form[@action="/errors/validateCaptcha"]')
        # If a captcha is found in the response, the block below runs and get_captcha is called.
        if captcha_form:
            captcha_img = captcha_form.xpath('.//img/@src').extract_first()
            yield scrapy.Request(
                url=captcha_img,
                callback=self.get_captcha,
                dont_filter=True,
                meta={
                    'callback': self.parse_item,
                    'resp': response,
                    'proxy': response.meta['proxy']
                })
        else:
            # Otherwise processing continues and the fields I need are passed down the pipeline chain (all OK).
            pass
    # Captcha solving; everything here works as needed. But I want to limit the number of captchas per ASIN.
    def get_captcha(self, response):
        client = deathbycaptcha.SocketClient(self.DBC_USER, self.DBC_PWD)
        captcha_file = response.body
        try:
            balance = client.get_balance()
            captcha = client.decode(captcha_file, type=2)
            if captcha:
                print("[%s] CAPTCHA %s solved: %s" % ('url', captcha["captcha"], captcha["text"]))
                if '': # check if the CAPTCHA was incorrectly solved
                    client.report(captcha["captcha"])

            yield scrapy.FormRequest.from_response(
                response.meta['resp'],
                formdata={'field-keywords': captcha["text"]},
                callback=response.meta['callback'],
                dont_filter=True,
                meta={
                    'proxy': response.meta['proxy']
                })
            return
        except deathbycaptcha.AccessDeniedException:
            print("error: Access to DBC API denied, check your credentials and/or balance")
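One way to count retries without shared spider state is to carry a counter inside the request meta itself and bump it each time the captcha page reappears. A minimal sketch of that idea (plain dicts stand in for Scrapy's request meta here; the names `next_meta`, `captcha_attempts`, and `MAX_CAPTCHA_RETRIES` are hypothetical, not part of the code above):

```python
# Sketch: a per-request captcha retry counter carried in the request meta.
MAX_CAPTCHA_RETRIES = 2

def next_meta(meta):
    """Return a copy of `meta` with the captcha counter incremented,
    or None once the retry limit for this request chain is exhausted."""
    attempts = meta.get('captcha_attempts', 0) + 1
    if attempts > MAX_CAPTCHA_RETRIES:
        return None  # give up on this ASIN
    bumped = dict(meta)  # copy so the original meta is untouched
    bumped['captcha_attempts'] = attempts
    return bumped
```

In parse_item this would be called before yielding the get_captcha request, and a None result would mean the ASIN should be skipped rather than retried.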


1 answer(s)
kzoper, 2017-10-29

class AmazonproductspiderSpider(scrapy.Spider):
    visited_urls = {}
    # Read the file with ASINs and call parse_item via callback.
    def start_requests(self):

........

    def parse_item(self, response):
        captcha_form = response.xpath('//form[@action="/errors/validateCaptcha"]')
        # If a captcha is found in the response, the block below runs and get_captcha is called.
        if captcha_form:
            self.visited_urls[response.url] = self.visited_urls.get(response.url, 0) + 1
            if self.visited_urls[response.url] < 2:
                captcha_img = captcha_form.xpath('.//img/@src').extract_first()
                yield scrapy.Request(
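The counter above can also be written with collections.defaultdict so a missing key never raises. A minimal standalone sketch of the same shared-state pattern (the class name `CaptchaCounter` and the limit of 2 are assumptions, mirroring the answer's `< 2` check):

```python
from collections import defaultdict

class CaptchaCounter:
    """Hypothetical helper mirroring the answer's visited_urls dict:
    counts captcha hits per URL and says when to stop retrying."""
    def __init__(self, limit=2):
        self.limit = limit
        self.hits = defaultdict(int)  # url -> number of captcha pages seen

    def should_retry(self, url):
        self.hits[url] += 1
        return self.hits[url] < self.limit
```

In the spider this would live as an instance attribute, so both parse_item and get_captcha consult the same counts.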
