How to bypass captcha in Python?

Python

Amigun, 2020-05-11 16:59:37

I'm writing a parser for a website. After it has been running for a while, the site starts serving a captcha.
First I tried substituting the User-Agent and the IP address (through a proxy) on every request the program makes to the site. That did not help.
Then I tried this: when the site serves a captcha, pause parsing for 1 hour and then continue. That didn't work either.
Then I tried the following: open the captcha page in a browser with Selenium, solve the captcha myself, have the program wait 10 minutes, and then continue. That didn't help either.
How can I bypass the captcha without services like Anticaptcha, where you pay for every captcha that someone solves for you?
For reference, I use requests and beautifulsoup for parsing (the classic combination).

1 answer(s)

Amigun, 2020-05-12
@Amigun

If you have the same problem I had, namely a captcha appearing while you parse a site, here is how I solved it.
Initially, I used the requests library (to send requests to the site) and bs4 (for the parsing itself).
First, I added a delay: if the program hits a captcha, it stops for 1 hour and then continues. It didn't work, not after an hour, not after two hours, not after 3 days.
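For reference, the pause-and-retry idea can be sketched roughly as follows. This is only an illustration under assumptions: the marker strings in CAPTCHA_MARKERS and the helper names looks_like_captcha and fetch_with_pause are made up for the example, and the real captcha page has to be inspected to pick a reliable check.

```python
import time

# Hypothetical markers: inspect the real captcha page and adjust these.
CAPTCHA_MARKERS = ("captcha", "g-recaptcha")


def looks_like_captcha(html):
    """Return True if the page text contains any of the assumed captcha markers."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


def fetch_with_pause(url, fetch, pause_seconds=3600, max_attempts=3, sleep=time.sleep):
    """Fetch url; on a captcha page, pause and retry up to max_attempts times.

    fetch is any callable that takes a URL and returns the page HTML as a string.
    Returns the HTML, or None if every attempt ended on a captcha page.
    """
    for attempt in range(max_attempts):
        html = fetch(url)
        if not looks_like_captcha(html):
            return html
        if attempt < max_attempts - 1:
            sleep(pause_seconds)  # wait before the next attempt
    return None
```

With requests, fetch could be something like `lambda u: requests.get(u, timeout=30).text`. As said above, on my site the pause alone never made the captcha go away.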
Then I thought of bringing in selenium: when the captcha appears, open that page with selenium, solve the captcha manually, close selenium, and try sending requests to the page through requests again. The captcha was still there.
Rewriting the parser entirely on selenium, without requests and bs4, is a good solution, but it is not always suitable. In my case it was too tedious, so I asked here.
Here I was given this suggestion: use sessions (requests.Session()), clear the cookies whenever the IP changes (by the way, I added libraries for changing the IP through a proxy, as well as a library for generating a fake User-Agent), and pass a Referer header. It may work for some, but it did not for me. The captcha was still there.
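The session suggestion could look roughly like this. It is a sketch, not the exact code I ran: the helper name rotate_identity, the proxy address, the User-Agent string, and the URLs are all placeholders chosen for the example.

```python
def rotate_identity(session, proxy_url, user_agent, referer):
    """Reset a requests.Session-like object so it looks like a new visitor."""
    session.cookies.clear()  # drop cookies tied to the previous IP
    session.proxies = {"http": proxy_url, "https": proxy_url}
    session.headers.update({"User-Agent": user_agent, "Referer": referer})
    return session


def example():
    """Not called automatically: needs the requests package and a working proxy."""
    import requests

    session = requests.Session()
    rotate_identity(
        session,
        proxy_url="http://203.0.113.10:8080",  # placeholder proxy address
        user_agent="Mozilla/5.0 (X11; Linux x86_64)",  # placeholder User-Agent
        referer="https://example.com/",  # placeholder Referer
    )
    return session.get("https://example.com/catalog", timeout=30)
```

Calling rotate_identity each time the proxy changes keeps cookies from one IP from being sent through another, which is the point of the suggestion.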
Well, here is the actual solution :)
I decided to replace the requests library with selenium. I connected to the site through it (I even opened a browser window) and got the page's HTML from the page_source property. By the way, be careful: as far as I understand, page_source gives you only the HTML of the page as the browser has it at that moment, not the JS and CSS files themselves. So if the site uses JS to generate content, you may get nothing useful unless that content has already loaded. Then, with bs4, I simply parsed the resulting HTML and extracted the data I needed. Yes, the captcha still appeared, but only once: I solved it manually right in the selenium window, and you could say I bypassed the captcha, since it did not pop up again for the rest of the parsing run.
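A minimal sketch of this approach, under assumptions: the names get_page_html and scrape, the "captcha" substring check, and the choice to extract page titles are only for illustration, and Chrome is assumed to be installed locally.

```python
def get_page_html(driver, url, is_captcha, wait_for_human=input):
    """Load url in the browser; if a captcha shows up, wait for a person to solve it."""
    driver.get(url)
    if is_captcha(driver.page_source):
        wait_for_human("Solve the captcha in the browser window, then press Enter: ")
    # page_source is read again so we get the page as it is after the captcha
    return driver.page_source


def scrape(urls):
    """Not called automatically: needs selenium, bs4, and a local Chrome install."""
    from bs4 import BeautifulSoup
    from selenium import webdriver

    def is_captcha(html):  # hypothetical check; adapt to the real site
        return "captcha" in html.lower()

    driver = webdriver.Chrome()  # one browser for the whole run keeps its cookies
    try:
        titles = []
        for url in urls:
            soup = BeautifulSoup(get_page_html(driver, url, is_captcha), "html.parser")
            titles.append(soup.title.get_text(strip=True) if soup.title else None)
        return titles
    finally:
        driver.quit()
```

The same browser window is reused for every URL, so a captcha solved once stays solved for that browser session, at least on a site that behaves like mine did.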
You can try all the methods described above; one of them may help. Each site needs its own tricks :)