Parsing Yandex.Search - how to send a captcha?
I need to get the links from Yandex search results.
I'll say up front: I know about Yandex.XML, but I need the "live" search results.
Everything works fine until the captcha appears.
import requests
from lxml import html as html_parser  # assumed import; the original post does not show it

# search, start and get_solved_captcha() are defined elsewhere in the asker's code
PROXY_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Ubuntu Chromium/55.0.2883.87 Chrome/55.0.2883.87 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8,uk;q=0.6,ru;q=0.4',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Upgrade-Insecure-Requests': '1'
}
s = requests.Session()
s.headers = PROXY_HEADERS
is_captcha = True
while is_captcha:
    current_url = "https://yandex.ru/search/?text={}&p={}".format(search, start)
    page = s.get(current_url)
    parsed = html_parser.document_fromstring(page.text)
    # If there is a captcha
    if parsed.cssselect('.form__captcha'):
        is_captcha = True
        captcha_src = parsed.cssselect('.form__captcha')[0].get('src')
        solved_captcha = get_solved_captcha(captcha_src, s)  # the captcha is solved correctly, I checked
        key = parsed.cssselect('.form__key')[0].get('value')
        retpath = parsed.cssselect('.form__retpath')[0].get('value')
        c_url = "http://yandex.ru/checkcaptcha"
        req_c = s.get(c_url, params={'key': key, 'retpath': retpath, 'rep': solved_captcha})
        # And here I always get a 200 response and the captcha input page again.
    else:
        is_captcha = False
You can try Selenium WebDriver and send the requests from a real browser. The captcha should then appear less often.
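A minimal sketch of the browser approach, assuming the selenium package and a Firefox driver are installed locally. The helper names build_search_url and fetch_with_browser are hypothetical, and the URL format is simply the one from the question. Nothing here guarantees that the captcha will not appear.

```python
from urllib.parse import urlencode


def build_search_url(query, page=0):
    """Build the search URL used in the question (text=..., p=...)."""
    return "https://yandex.ru/search/?" + urlencode({"text": query, "p": page})


def fetch_with_browser(url):
    """Load the page in a real browser and return the rendered HTML.

    Assumption: selenium and a Firefox driver are available locally.
    """
    # Imported here so the URL helper still works without selenium installed
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails
        driver.quit()


print(build_search_url("python lxml", 2))
```

The returned HTML can then be parsed the same way as in the question, for example fetch_with_browser(build_search_url("python lxml")) followed by html_parser.document_fromstring(...).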
First, Chrome, Chromium, and Safari send a unique request ID in their headers, so it is better to use a Firefox User-Agent. Beyond that, I can only recommend a scriptable browser such as CasperJS, PhantomJS, or SlimerJS. In theory, they imitate a real user's behavior more closely.
For example, here is how to parse Google search results:
docs.casperjs.org/en/latest/quickstart.html
Somewhere in the terms of use for Yandex search it is clearly stated that parsing the site is forbidden.
So you are doing this the "wrong" way, and Yandex will actively work against you.
I recently ran into a similar problem and decided to write down a short hint for mere mortals. It took me a day to figure out.
1. The response to the search request should be a captcha page. If it is, process it. Its address is in:
response.url
2. Go to this URL and "click" the "I'm not a robot" button by sending a POST request (Burp Suite helped me work this out):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# capture_url is the response.url from step 1
resp = requests.get(capture_url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, 'lxml')
url_part = soup.select_one('form.CheckboxCaptcha-Form').get('action')
link_for_post = urljoin(resp.url, url_part)  # resolve the form action against the captcha page itself
resp_post = requests.post(link_for_post)  # the captcha itself should be here
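The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested recipe: it parses the page with the standard library instead of BeautifulSoup, the helper names (is_captcha_url, find_form_action, captcha_post_url) and the sample page are made up, and the guess that the captcha page address contains "showcaptcha" comes from observation, not from any Yandex documentation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


def is_captcha_url(url):
    """Step 1: guess whether a response URL is the captcha page.

    Assumption: the captcha page path contains 'showcaptcha'.
    """
    return "showcaptcha" in urlparse(url).path


class _FormActionFinder(HTMLParser):
    """Remembers the action of the first <form> that has the wanted class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.action = None

    def handle_starttag(self, tag, attrs):
        if tag != "form" or self.action is not None:
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes:
            self.action = attr_map.get("action")


def find_form_action(html, css_class="CheckboxCaptcha-Form"):
    """Return the form's action attribute, or None if no such form exists."""
    finder = _FormActionFinder(css_class)
    finder.feed(html)
    return finder.action


def captcha_post_url(page_url, html):
    """Step 2: absolute URL to POST to, or None if there is no captcha form."""
    action = find_form_action(html)
    if action is None:
        return None
    return urljoin(page_url, action)


# Hypothetical captcha page, for illustration only
SAMPLE_URL = "https://yandex.ru/showcaptcha?retpath=abc"
SAMPLE_HTML = '<form class="CheckboxCaptcha-Form" method="POST" action="/checkcaptcha?key=123"></form>'

print(is_captcha_url(SAMPLE_URL))
print(captcha_post_url(SAMPLE_URL, SAMPLE_HTML))
```

In a real run, page_url and html would be response.url and response.text from the search request, and the resulting address would be passed to a POST call on the same session so that cookies are preserved.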