Python
Vitaly, 2019-06-11 01:24:14

How to reduce the number of 5xx and 4xx errors when parsing?

Task: detect when a free appointment slot appears in a site's booking calendar. Nobody knows when a slot will open up (maybe tomorrow, maybe in a month).
I wrote a Python scraper with MechanicalSoup. I started testing with 4 proxies and 4 different user agents (picked at random at startup), first with a delay between requests of 2 minutes plus a random 1 to 60 seconds, then with a delay of 5 minutes plus a random 1 to 60 seconds.
Result: whatever delay I use, the server periodically returns either 502 Bad Gateway or 403 Forbidden. When I tested with a 30-second delay, also with 4 proxies, after a while every request came back as 502 Bad Gateway. The server runs nginx.
How does the server work? Does it count the requests from one IP over some period and ban the IP once a limit is exceeded?
But then why do I still get 502 when I send requests only every 5 minutes?
Am I missing some setting when pretending to be a browser?

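import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser(
    soup_config={'features': 'lxml'},
    raise_on_404=False,
    # Adjacent string literals are joined, so no stray whitespace ends up in the header.
    user_agent=(
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/74.0.3729.131 Safari/537.36"
    ),
)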
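
For reference, the delay schedule described above (a fixed base plus a random 1 to 60 seconds) could be sketched roughly as follows. The function name `next_delay` and its parameters are made up for illustration and are not taken from the original script.

```python
import random


def next_delay(base_seconds, jitter_min=1, jitter_max=60):
    """Return the pause before the next request: a fixed base plus random jitter.

    Hypothetical helper; random.randint includes both endpoints.
    """
    return base_seconds + random.randint(jitter_min, jitter_max)


# The two schedules from the question: 2 minutes and 5 minutes, each plus 1-60 s.
for base in (120, 300):
    print(base, next_delay(base))
```

In a polling loop the returned value would be passed to `time.sleep` between requests.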


2 answer(s)
Saboteur, 2019-06-11
@saboteur_kiev

That depends on the site, not on the scraper.
There is no way to predict what kind of protection against "intruders" they have devised and put in place.
The site may even simply go down on its own at regular intervals.
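
Since the cause cannot be determined from the client side, one possible way to cope is to treat these responses as temporary and wait longer after each failure. Below is a minimal sketch of that idea, assuming a hypothetical `fetch` callable that performs one request and returns the HTTP status code. Treating 403 as retryable is an assumption too; on some sites it may mean a permanent ban.

```python
import time

# The two statuses reported in the question; retrying on 403 is an assumption.
RETRYABLE = {403, 502}


def fetch_with_backoff(fetch, max_attempts=5, base_delay=60.0, sleep=time.sleep):
    """Call fetch() until it returns a non-retryable status or attempts run out.

    fetch is a hypothetical zero-argument callable returning an HTTP status code.
    The pause doubles after every failed attempt: base_delay, 2 * base_delay, ...
    """
    status = None
    for attempt in range(max_attempts):
        status = fetch()
        if status not in RETRYABLE:
            return status
        # Do not sleep after the final attempt; just report the last status.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status
```

The `sleep` parameter exists only so the pauses can be replaced in a test; in normal use the default `time.sleep` applies.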

Andrey_Dolg, 2019-06-11
@Andrey_Dolg

You have 16 user-agent/IP combinations, a 2-minute delay, and a client without cookies, so this is still a good result.
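
The figure of 16 comes from pairing each of the 4 proxies with each of the 4 user agents. A small sketch of that pairing, with placeholder values standing in for the real proxies and user-agent strings (none of these names come from the question):

```python
import random
from itertools import product

# Placeholder values for illustration only.
PROXIES = ["proxy-a", "proxy-b", "proxy-c", "proxy-d"]
USER_AGENTS = ["ua-1", "ua-2", "ua-3", "ua-4"]

# Every (proxy, user agent) pair: 4 * 4 = 16 combinations.
COMBINATIONS = list(product(PROXIES, USER_AGENTS))
print(len(COMBINATIONS))  # 16

# Pick one pair at random, e.g. once per run or once per request.
proxy, user_agent = random.choice(COMBINATIONS)
```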
