How to scrape a site?
I need to scrape tasks from the site znanija.com.
Each task has a link like https://znanija.com/task/{task_number}. I make GET requests, incrementing the task number each time, i.e.:
import requests

for i in range(2, max_i):  # start at 2, since the first task on the site has that number
    r = requests.get(f"https://znanija.com/task/{i}", headers={"user-agent": "..."})
    # ...
Most free proxies have plenty of problems, especially with anonymization: most of their users are trying to hide their identity from their ISP, not from the target server.
In addition, an outgoing request may fail with a timeout on your side while the server does not care: it received the request and instantly issued a response; the problem lies in the timeout you set.
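For example, an explicit and fairly generous timeout makes it easier to tell your own timeout from a genuine server-side failure (the URL and the timeout values below are illustrative, not taken from the question):

import requests

url = "https://znanija.com/task/2"  # illustrative task URL
try:
    r = requests.get(
        url,
        headers={"user-agent": "..."},
        timeout=(5, 30),  # (connect timeout, read timeout) in seconds
    )
except requests.exceptions.Timeout:
    # The server may well have processed the request already; only our wait expired.
    print("timed out on our side; raise the timeout or retry later")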
Personally, when faced with similar tasks, I set very long delays (up to a minute) and, after receiving a temporary ban, went quiet for three to four hours (this requires some patience, but it is the simplest solution to the problem).
In all likelihood, random delays and polling the list out of order slow down the detection of a scraper bot, but this hypothesis should be tested separately.
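A rough sketch of both ideas (large random delays with a multi-hour pause after a ban, and polling the task list out of order). The assumption that a temporary ban shows up as HTTP 403/429, as well as max_i and the delay values, are illustrative and should be checked against the real site:

import random
import time

import requests

max_i = 1000  # illustrative upper bound on task numbers
task_ids = list(range(2, max_i))
random.shuffle(task_ids)  # poll the list out of order

for task_id in task_ids:
    r = requests.get(
        f"https://znanija.com/task/{task_id}",
        headers={"user-agent": "..."},
        timeout=30,
    )
    if r.status_code in (403, 429):
        # Temporary ban (assumed status codes): go quiet for three to four hours.
        time.sleep(random.uniform(3 * 3600, 4 * 3600))
        continue
    # ... parse r.text here ...
    time.sleep(random.uniform(10, 60))  # random delay of up to a minute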
You can also try to pretend to be Google's crawler bot (though this is only an idea).
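A minimal sketch of that idea, sending Googlebot's published User-Agent string; keep in mind that a site can verify real Google crawlers via reverse DNS, so the header alone may not be enough:

import requests

# Googlebot's published User-Agent string; everything else here is illustrative.
googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
r = requests.get(
    "https://znanija.com/task/2",
    headers={"user-agent": googlebot_ua},
    timeout=30,
)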
And free anonymizing HTTPS proxies that actually work and do not redirect you to a page inviting you to buy a subscription are about as rare as unicorns.