Python
Serhiy Romanov, 2017-03-23 18:29:50

Parsing Yandex.Search - how to send a captcha?

I need to get links from Yandex search results.
To be clear up front: I know about Yandex.XML, but I need the "live" search results.
Everything works fine until a captcha appears.

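import requests
import lxml.html as html_parser  # assumed: the original snippet does not show how html_parser is imported

PROXY_HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '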
                               'Ubuntu Chromium/55.0.2883.87 Chrome/55.0.2883.87 Safari/537.36',
                 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                 'Accept-Encoding': 'gzip, deflate',
                 'Accept-Language': 'en-US,en;q=0.8,uk;q=0.6,ru;q=0.4',
                 'Cache-Control': 'no-cache',
                 'Connection': 'keep-alive',
                 'Pragma': 'no-cache',
                 'Upgrade-Insecure-Requests': '1'
                 }
s = requests.Session()
s.headers = PROXY_HEADERS

is_captcha = True
while is_captcha:
    current_url = "https://yandex.ru/search/?text={}&p={}".format(search, start)
    
    page = s.get(current_url)
    parsed = html_parser.document_fromstring(page.text)

    # If there is a captcha
    if parsed.cssselect('.form__captcha'):
        is_captcha = True

        captcha_src = parsed.cssselect('.form__captcha')[0].get('src')
        solved_captcha = get_solved_captcha(captcha_src, s)  # the captcha is solved correctly, I checked
        key = parsed.cssselect('.form__key')[0].get('value')
        retpath = parsed.cssselect('.form__retpath')[0].get('value')

        c_url = "http://yandex.ru/checkcaptcha"
        req_c = s.get(c_url, params={'key': key, 'retpath': retpath, 'rep': solved_captcha})
        # And here I always get a 200 response and the captcha entry page again.
    else:
        is_captcha = False

Has anyone dealt with this? Please tell me what I'm doing wrong.

5 answers
g00dv1n, 2017-03-24
@SerhiyRomanov

You could try Selenium WebDriver and send the requests from a real browser. The captcha should then appear less often.
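
A minimal sketch of that idea, not a tested recipe for Yandex. The helper names (build_search_url, collect_links, run_with_firefox) and the catch-all "a[href]" selector are assumptions made for illustration; the real result markup has to be inspected to pick a narrower selector. The search URL is the one from the question.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://yandex.ru/search/"  # same endpoint as in the question


def build_search_url(query, page=0):
    """Build the search URL with the query safely percent-encoded."""
    return SEARCH_URL + "?" + urlencode({"text": query, "p": page})


def collect_links(driver, query, page=0, selector="a[href]"):
    """Open one results page in the browser and return the hrefs found on it.

    `selector` is a placeholder that matches every link on the page;
    narrow it after inspecting the real markup.
    """
    driver.get(build_search_url(query, page))
    # "css selector" is the string value behind selenium's By.CSS_SELECTOR.
    elements = driver.find_elements("css selector", selector)
    hrefs = (element.get_attribute("href") for element in elements)
    return [href for href in hrefs if href]


def run_with_firefox(query):
    """Drive a real Firefox window (needs selenium and Firefox installed)."""
    from selenium import webdriver  # imported lazily so the helpers above stay importable

    driver = webdriver.Firefox()
    try:
        return collect_links(driver, query)
    finally:
        driver.quit()
```

Calling run_with_firefox("some query") would open a browser window, load the first results page and return every link on it. A captcha can still appear, so the returned list may need the same captcha check as in the question.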

dummyman, 2017-03-24
@dummyman

First, Chrome, Chromium, and Safari send a unique request ID in their headers, so it is better to use a Firefox User-Agent. Beyond that, I can only recommend scriptable browsers such as CasperJS, PhantomJS, or SlimerJS. In theory, they emulate a human visitor more thoroughly.
For an example, see the Google results scraping walkthrough:
docs.casperjs.org/en/latest/quickstart.html

Dimonchik, 2017-03-23
@dimonchik2013

The captcha pops up because you look like a bot.
You need to pretend not to be one.

devel787, 2017-03-28
@devel787

Somewhere in the Yandex search terms of use it is clearly stated that parsing the site is forbidden.
So everything you are doing is "wrong", and Yandex will actively resist it.

herypank, 2021-04-08
@herypank

I recently ran into a similar problem and decided to write down a short hint for ordinary mortals. It took me a day to sort out.
1. Check whether the response is a captcha page; if it is, handle it. The address of the captcha page is in:
response.url
2. Request that URL and "press" the "I'm not a robot" button by sending a POST request (Burp Suite helped me work this out).

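from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# capture_url is the captcha page address taken from response.url in step 1
resp = requests.get(capture_url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, 'lxml')
url_part = soup.select_one('form.CheckboxCaptcha-Form').get('action')
link_for_post = urljoin(current_url, url_part)
resp_post = requests.post(link_for_post)  # the response here should contain the captcha itself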

3. Get the link to the captcha image and solve it with https://rucaptcha.com/software/python-rucaptcha
What I used:
1) https://docs.python-requests.org/en/master/ - for requests
2) https://docs.python-requests.org/en/master/ - for captcha parsing
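
Steps 1 and 2 above could be sketched with only the standard library, as below. This is an illustration under assumptions, not a verified description of Yandex's captcha flow: the "showcaptcha" path marker and the helper names (looks_like_captcha, captcha_post_url) are hypothetical, and only the CheckboxCaptcha-Form class comes from the snippet above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: the captcha page can be recognised by this fragment in the URL path.
CAPTCHA_PATH_MARKER = "showcaptcha"


def looks_like_captcha(final_url):
    """Step 1: decide from the final response URL whether a captcha was served."""
    return CAPTCHA_PATH_MARKER in urlparse(final_url).path


class _FormActionFinder(HTMLParser):
    """Remember the action of the first <form> carrying the given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.action = None

    def handle_starttag(self, tag, attrs):
        if tag != "form" or self.action is not None:
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.css_class in classes:
            self.action = attr_map.get("action")


def captcha_post_url(page_html, page_url, css_class="CheckboxCaptcha-Form"):
    """Step 2: return the absolute URL to POST to, or None if the form is absent."""
    finder = _FormActionFinder(css_class)
    finder.feed(page_html)
    if finder.action is None:
        return None
    # The action is resolved against the URL of the page that contained the form.
    return urljoin(page_url, finder.action)
```

With requests, the two helpers would be fed response.url and response.text. Returning None when the form is missing avoids the AttributeError that select_one(...).get(...) raises if the page layout changes.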
