How to count the number of times a page is visited in Scrapy?
Below is working code in which I want to add a limit on captcha attempts. At the moment, even after a captcha is solved successfully, Amazon sometimes returns the captcha page again, occasionally 15-20 times in a row. I have not been able to work out why, because for the most part captcha solving (deathbycaptcha) works fine, so I decided to set a limit per ASIN.
But how can I tell that a given URL (or ASIN) has already been sent to the captcha solver? I tried several approaches, as far as my knowledge of Python and my ingenuity allowed, but did not get the result I wanted. What can be done in my case? How can the two functions parse_item and get_captcha share state?
Example code is given below:
import scrapy
import deathbycaptcha


class AmazonproductspiderSpider(scrapy.Spider):
    # Read the file with ASINs and request each product page, with parse_item as the callback.
    def start_requests(self):
        with open('asin.txt') as file:
            for i in file:
                if len(i) > 1:
                    # Assumed: each non-empty line of the file holds one ASIN.
                    asin_from_file = i.strip()
                    yield scrapy.Request(
                        url='%s/gp/product/%s/' % (self.AMAZON_DOMAIN, asin_from_file),
                        callback=self.parse_item,
                        meta={
                            'asin_from_file': asin_from_file,
                            'country': self.country,
                        }
                    )

    def parse_item(self, response):
        captcha_form = response.xpath('//form[@action="/errors/validateCaptcha"]')
        # If the response contains a captcha, download its image and hand it to get_captcha.
        if captcha_form:
            captcha_img = captcha_form.xpath('.//img/@src').extract_first()
            yield scrapy.Request(
                url=captcha_img,
                callback=self.get_captcha,
                dont_filter=True,
                meta={
                    'callback': self.parse_item,
                    'resp': response,
                    'proxy': response.meta['proxy']
                })
        else:
            # Otherwise parsing continues and the fields I need are passed on to the pipelines (all OK).
            pass

    # Captcha solving; everything here works as expected, but I want a limit on the number of captchas per ASIN.
    def get_captcha(self, response):
        client = deathbycaptcha.SocketClient(self.DBC_USER, self.DBC_PWD)
        captcha_file = response.body
        try:
            balance = client.get_balance()
            captcha = client.decode(captcha_file, type=2)
            if captcha:
                print("[%s] CAPTCHA %s solved: %s" % ('url', captcha["captcha"], captcha["text"]))
                if '':  # check if the CAPTCHA was incorrectly solved (placeholder, always False as written)
                    client.report(captcha["captcha"])
                # Submit the solved text through the captcha form of the original response.
                yield scrapy.FormRequest.from_response(
                    response.meta['resp'],
                    formdata={'field-keywords': captcha["text"]},
                    callback=response.meta['callback'],
                    dont_filter=True,
                    meta={
                        'proxy': response.meta['proxy']
                    })
                return
        except deathbycaptcha.AccessDeniedException:
            print("error: Access to DBC API denied, check your credentials and/or balance")
Answer:
class AmazonproductspiderSpider(scrapy.Spider):
    visited_urls = {}

    # Read the file with ASINs and call parse_item via the callback.
    def start_requests(self):
        ........

    def parse_item(self, response):
        captcha_form = response.xpath('//form[@action="/errors/validateCaptcha"]')
        # If the response contains a captcha, the block below runs and get_captcha is called.
        if captcha_form:
            # Count captcha pages per URL; .get() avoids a KeyError on the first visit.
            self.visited_urls[response.url] = self.visited_urls.get(response.url, 0) + 1
            if self.visited_urls[response.url] < 2:
                captcha_img = captcha_form.xpath('.//img/@src').extract_first()
                yield scrapy.Request(
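
One thing to watch with a counter keyed on response.url: after the captcha form is submitted, the URL of the next response may not match the original product URL, so the count can restart. Counting per ASIN is likely more stable, but in the question's code the FormRequest only forwards 'proxy' in meta, so 'asin_from_file' is lost after the first captcha. Below is a minimal sketch of both pieces in plain Python. CaptchaLimiter and captcha_meta are hypothetical helper names (not part of Scrapy), and the limit of 2 and the sample ASIN are arbitrary assumptions.

```python
from collections import Counter


class CaptchaLimiter:
    """Counts captcha pages per key (for example an ASIN) and enforces a cap.

    Hypothetical helper for illustration; not part of Scrapy.
    """

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self._seen = Counter()

    def allow(self, key):
        """Record one more captcha for `key`; return False once the cap is exceeded."""
        self._seen[key] += 1
        return self._seen[key] <= self.max_attempts

    def attempts(self, key):
        """Number of captcha pages recorded so far for `key`."""
        return self._seen[key]


def captcha_meta(meta):
    """Copy the meta keys that should survive the captcha round trip.

    The key names are the ones used in the question's code.
    """
    keep = ('asin_from_file', 'country', 'proxy')
    return {k: meta[k] for k in keep if k in meta}


if __name__ == '__main__':
    limiter = CaptchaLimiter(max_attempts=2)  # assumed limit
    for _ in range(3):
        # Prints True, True, False: the third captcha for this ASIN is refused.
        print(limiter.allow('B000000000'))  # made-up sample ASIN
```

In the spider this could look like: create `self.limiter = CaptchaLimiter(2)` once, call `self.limiter.allow(response.meta['asin_from_file'])` inside the `if captcha_form:` branch of parse_item, and stop (or log the ASIN) when it returns False. For that lookup to work on every pass, both the captcha image Request and the FormRequest in get_captcha would need to carry the original keys, for example by merging `captcha_meta(response.meta['resp'].meta)` into their meta dicts.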