How do I scrape sites correctly so as not to trigger a captcha?
I understand that for "correct scraping" the bot's behavior needs to resemble a human's. This can be done by adding headers and proxies in the code.
Are there other ways to reduce the risk of captchas or other blocking systems?
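For reference, here is a minimal sketch of the headers-and-proxy approach mentioned in the question, using the `requests` library. The User-Agent string and the proxy address are placeholders, not real values:

```python
import requests

session = requests.Session()
session.headers.update({
    # Present a regular browser UA instead of the default python-requests one
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
})

# Route traffic through a proxy (hypothetical address)
proxies = {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}

response = session.get("https://example.com/page/1", proxies=proxies, timeout=10)
print(response.status_code)
```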
Contact the site owners for proper access to the data, through an API.
If they won't grant such access, then don't go pilfering change from other people's pockets; find yourself a worthier occupation.
In the general case, the appearance of a captcha cannot be prevented. You need to understand that captchas are not shown only to bots: they are shown to any visitor of the site when certain conditions are met. It is simply harder for a human to hit those conditions in normal use of the site, and even when a captcha does appear, a human solves it easily, whereas for a bot it is a real obstacle.
For example, I once scraped a site that showed a captcha after exactly 500 pages. Most likely, if I had sat in a browser and clicked through 500 pages in half an hour, I would have seen the captcha too.
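One way to stay under thresholds like that is simply to pace requests the way a human would. This is a minimal sketch under the assumption of a per-session page limit like the 500 pages in the example above; the URL, delays, and the 400-page checkpoint are illustrative, not known values:

```python
import random
import time
import requests

session = requests.Session()

for page in range(1, 1001):
    resp = session.get(f"https://example.com/catalog?page={page}", timeout=10)
    # ... parse resp.text here ...

    time.sleep(random.uniform(2, 6))   # irregular delay between pages
    if page % 400 == 0:                # long break before the observed limit
        time.sleep(random.uniform(300, 600))
```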
In any case, the captcha will most likely appear from time to time. But that does not really matter, because there are heaps of services that solve them for pennies. Usually that is exactly what people do.
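The usual pattern with such services looks roughly like the sketch below: detect the captcha page, send the challenge to the service, submit the answer. Here `solve_captcha` is a hypothetical stand-in for whatever client your chosen service actually provides, and the URLs and detection logic are assumptions for illustration:

```python
import requests

def solve_captcha(image_bytes: bytes) -> str:
    """Placeholder: upload the image to a solving service, poll for the answer."""
    raise NotImplementedError("wire up your service's client here")

session = requests.Session()
resp = session.get("https://example.com/page/501", timeout=10)

if "captcha" in resp.text.lower():  # crude detection of the captcha page
    img = session.get("https://example.com/captcha.png", timeout=10).content
    answer = solve_captcha(img)     # the service solves it for pennies
    resp = session.post("https://example.com/captcha",
                        data={"answer": answer}, timeout=10)
```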
I reason like this:
1. If the site was designed from the start with an API (or another system for obtaining its content), paid or free, for its users, then use it! (See the sketch after this list.)
2. If the site offers nothing of the kind and, moreover, actively tries to protect its content, then why are you poking around there at all? People trying to ride into paradise on someone else's back are a dime a dozen.
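For point 1, a minimal sketch of what "use the API" means in practice, under the assumption of a documented REST endpoint with token authentication; the endpoint, token, and response shape are hypothetical:

```python
import requests

API_URL = "https://example.com/api/v1/items"  # hypothetical documented endpoint
TOKEN = "your-api-key"                        # issued by the site owners

resp = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"page": 1},
    timeout=10,
)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item)
```

No headers to fake, no captchas to dodge: with sanctioned access the whole problem from the question disappears.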